Methodology

How SynthBench measures synthetic survey respondent quality — from the metrics we use to the datasets we validate against.

Diagnosis without measurement

Why a benchmark pinned to real human response distributions is the missing companion piece to the cross-provider homogenization critique.

Romasanta, Thomas, and Levina (HBR, March 2026) ran 15,000 strategic-advice queries across seven frontier LLMs and found that responses converged on the same trend-aligned recommendations regardless of scenario context — trendslop, a shared attractor across providers rather than an idiosyncrasy of any one model. The critique lands. But it is a diagnosis without a measurement: cross-model consensus is only evidence of failure if you can point to the distribution the models should have matched, and on open-ended strategic questions no such ground truth exists.

SynthBench is the measurement. Every question we evaluate carries a human response distribution — from Pew American Trends Panel, SubPOP subpopulations, and GlobalOpinionQA cross-country data — and we score how far each provider's output lands from that distribution, overall and per demographic slice.

Two orthogonal diagnostics

Ground-truth fidelity

Model vs. human distribution

Mean Jensen-Shannon divergence against real survey responses, rolled up into the SynthBench Parity Score (SPS). This is the primary SynthBench metric — measured end-to-end in the sections below.

Cross-provider concordance

Model vs. model distribution

Pairwise JSD between raw-LLM providers on the same items. Low cross-provider JSD with high human-alignment means real population variance is being tracked; low cross-provider JSD with low human-alignment is the trendslop signature. See the Cross-Provider JSD Matrix →

The rest of this page documents the metrics, datasets, baselines, scoring protocol, cost model, and related literature behind those two diagnostics.

How We Score

The SynthBench Parity Score (SPS) is the average of five sub-metrics (six in later phases). Each measures a distinct dimension of how well a model reproduces human survey behavior on a 0–1 scale.

SPS = 0.2 × P_dist + 0.2 × P_rank + 0.2 × P_cond + 0.2 × P_sub + 0.2 × P_refuse

P_dist

Distributional Parity

How closely the model's answer percentages match real human survey responses.

  • Poor: Completely different distributions
  • Fair: Some overlap but systematic divergence
  • Good: Close match on most questions
  • Excellent: Nearly identical to human distributions
Mathematical detail

P_dist = 1 - mean(JSD) across all question-demographic pairs, where JSD is Jensen-Shannon divergence.
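The formula can be sketched directly, assuming each distribution is a vector of option probabilities. This is a minimal illustration, not the benchmark's actual implementation; the base-2 log keeps JSD in [0, 1]:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence with base-2 logs, so values fall in [0, 1]."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    m = 0.5 * (p + q)
    def kl(a, b):
        a, b = np.clip(a, eps, None), np.clip(b, eps, None)
        return float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def p_dist(pairs):
    """P_dist = 1 - mean(JSD) over (model, human) question-demographic pairs."""
    return 1.0 - float(np.mean([jsd(model, human) for model, human in pairs]))
```

Identical distributions score 1.0; disjoint distributions score 0.0.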

P_rank

Rank-Order Parity

Whether the model ranks response options in the same order as humans, even if exact percentages differ.

  • Poor: Reversed or random ordering
  • Fair: Gets the top option right but scrambles the rest
  • Good: Mostly correct ordering with minor swaps
  • Excellent: Perfect rank agreement with humans
Mathematical detail

P_rank = (1 + mean(tau_b)) / 2, where tau_b is Kendall's tau-b on probability rankings, normalized to [0, 1].
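A pure-Python sketch of tau-b and the P_rank roll-up. In practice one would likely reach for `scipy.stats.kendalltau`; the hand-rolled helper below is illustrative and handles ties via the standard tau-b denominator:

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b: concordance with a tie correction in the denominator."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings: excluded from every term
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

def p_rank(pairs):
    """P_rank = (1 + mean tau_b) / 2, normalized to [0, 1]."""
    taus = [kendall_tau_b(m, h) for m, h in pairs]
    return (1.0 + sum(taus) / len(taus)) / 2.0
```

Perfect rank agreement gives tau_b = 1 and P_rank = 1.0; a fully reversed ordering gives tau_b = -1 and P_rank = 0.0.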

P_cond

Conditioning Fidelity

When told "respond as a 65-year-old conservative," does the model actually shift its answers to match that demographic?

  • Poor: Personas have no effect on output
  • Fair: Some demographic sensitivity but inconsistent
  • Good: Meaningful shifts that track real demographic differences
  • Excellent: Conditioning precisely reproduces demographic patterns
Mathematical detail

P_cond = mean(max(0, align_conditioned(G) - align_default(G))) across all demographic groups G.

P_sub

Subgroup Consistency

Whether accuracy is even across all demographic groups, or if some populations are systematically underserved.

  • Poor: Wildly uneven across groups
  • Fair: Accurate for majorities, poor for minorities
  • Good: Modest variation across groups
  • Excellent: Equally accurate for all demographics
Mathematical detail

P_sub = 1 - CV(group_scores), where CV is the coefficient of variation (std / mean) of per-group P_dist.

P_refuse

Refusal Calibration

Whether the model declines to answer at rates matching real human refusal patterns.

  • Poor: Refusal rates completely off (answers everything or refuses everything)
  • Fair: Gets the direction right but magnitudes are off
  • Good: Close calibration on most questions
  • Excellent: Matches human refusal patterns precisely
Mathematical detail

P_refuse = 1 - mean(|R_provider - R_human|) across all question-demographic pairs.
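The three formulas above (P_cond, P_sub, P_refuse) are straightforward aggregations. A minimal sketch, assuming per-group alignment scores and refusal rates have already been computed; the use of the population standard deviation in the CV is an assumption:

```python
import statistics

def p_cond(align_conditioned, align_default):
    """Mean positive lift of conditioned over default alignment per group G."""
    lifts = [max(0.0, align_conditioned[g] - align_default[g])
             for g in align_conditioned]
    return sum(lifts) / len(lifts)

def p_sub(group_scores):
    """1 - coefficient of variation (std / mean) of per-group P_dist.
    Population std is an assumption; the benchmark may use the sample std."""
    return 1.0 - statistics.pstdev(group_scores) / statistics.fmean(group_scores)

def p_refuse(provider_rates, human_rates):
    """1 - mean absolute refusal-rate gap across question-demographic pairs."""
    gaps = [abs(r - h) for r, h in zip(provider_rates, human_rates)]
    return 1.0 - sum(gaps) / len(gaps)
```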

P_theme

Thematic Parity (Phase 2+)

For open-ended responses, whether the model's themes and reasoning align with human qualitative patterns.

  • Poor: Off-topic or generic reasoning
  • Fair: Hits some themes but misses key ones
  • Good: Covers most human themes with reasonable proportions
  • Excellent: Themes and reasoning indistinguishable from human responses
Mathematical detail

LLM-as-judge evaluation: theme relevance, theme distribution accuracy, and reasoning quality scored on a rubric.

Baselines: What to Compare Against

A model that scores 0.70 SPS sounds good — until you realize the majority-class baseline scores 0.45. Baselines give meaning to raw scores by anchoring the scale.

Baseline               SPS     Role
Random Baseline        ~0.31   floor
Majority-Class         ~0.45   low
Population-Average     ~0.52   mid
Unconditioned LLM      ~0.58   high
Human Ceiling          ~0.99   ceiling
Temporal Drift Floor   N/A     ceiling

Meaningful evaluation range: Unconditioned LLM (~0.58) to Human Ceiling (~0.99)

Normalized position ("Range %")

Raw SPS values for strong models compress into a narrow band (~0.82-0.85) that is visually hard to separate on a 0-1 bar. The leaderboard also reports a normalized position — where a row sits inside the meaningful evaluation range bounded below by the raw-LLM baseline and above by the Human Ceiling.

Range % = (SPS − P_unconditioned) / (P_ceiling − P_unconditioned)

P_unconditioned is the SPS of the raw-LLM baseline for the same underlying model on the same dataset (the "just prompt the model" reference). Raw-LLM rows therefore resolve to 0% — they are the reference. Product rows (e.g. conditioned SynthPanel variants) show their lift above the corresponding raw model.

P_ceiling is the dataset's aggregate Human Ceiling (see below). Rows without a resolvable raw-LLM baseline (statistical baselines, multi-model ensembles) display "—" for Range % and should be compared via raw SPS instead.

Raw SPS remains the primary column for academic comparability. Range % is a supplementary display that amplifies product lift — a 0.01 SPS gain on a narrow range becomes a double-digit percentage of headroom closed.
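The normalization reduces to a few lines, including the null case for rows without a resolvable raw-LLM baseline. A sketch with illustrative names:

```python
def range_pct(sps, sps_unconditioned, sps_ceiling):
    """Position of a row inside [raw-LLM baseline, Human Ceiling], as a fraction.
    Returns None (displayed as a dash) when no raw-LLM baseline resolves."""
    if sps_unconditioned is None or sps_ceiling is None:
        return None
    return (sps - sps_unconditioned) / (sps_ceiling - sps_unconditioned)
```

A raw-LLM row scores 0.0 by construction; a product row halfway between its raw baseline and the ceiling scores 0.5.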

OpinionsQA: subgroup ceiling distribution

Per-(wave, attribute, group) ceilings across 1,046 subgroups from 15 Pew ATP waves. The wave-aggregate ceiling (0.9995, used for P_dist) is a different quantity and does not bound P_sub.

min 0.8184 · p25 0.9776 · median 0.9955 · p75 0.9981 · max 0.9997

Lowest-ceiling subgroups

  • ATP W36 · POLPARTY_RACE · Democrat × Refused 0.8184
  • ATP W82 · RACE · Other 0.8234
  • ATP W36 · POLPARTY_RACE · Refused × White 0.8326
  • ATP W29 · POLPARTY_RACE · Refused × White 0.8340
  • ATP W27 · POLPARTY_RACE · Republican × Black 0.8375

Quality flags (Cochran 1977): 585 high · 93 medium · 368 low — low-n subgroups are retained but flagged so low reliability is not conflated with poor model fit.

Used by the leaderboard when scoring P_sub; compare P_dist against the wave-aggregate ceiling instead.

Random Baseline

SPS ~0.31

Uniform random distribution over all non-refusal options. For a question with k options, each gets P = 1/k.

What it tests: The absolute floor. Any provider scoring at or below this is adding negative value.

Majority-Class

SPS ~0.45

Always picks the most popular human answer. Assigns P = 1.0 to the mode, 0.0 to everything else.

What it tests: Whether a model does better than just echoing the most common response. Scores well on consensus questions, poorly on divisive ones.

Population-Average

SPS ~0.52

Uses the overall population distribution (ignoring demographics) for every group. The same answer regardless of who is being simulated.

What it tests: Isolates the value of demographic conditioning. The gap between this baseline and a conditioned provider is the conditioning premium.
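The two simplest statistical baselines can be constructed in a few lines (the population-average baseline additionally needs the pooled human distribution, so it is omitted here); a sketch, not the benchmark's actual code:

```python
def random_baseline(k):
    """Uniform distribution over k non-refusal options: P = 1/k each."""
    return [1.0 / k] * k

def majority_class_baseline(human_dist):
    """All mass on the human modal answer: P = 1.0 for the mode, 0.0 elsewhere."""
    mode = max(range(len(human_dist)), key=human_dist.__getitem__)
    return [1.0 if i == mode else 0.0 for i in range(len(human_dist))]
```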

Unconditioned LLM

SPS ~0.58

Raw model output with no persona or demographic conditioning. This is the "just prompt ChatGPT" approach many researchers currently use.

What it tests: The competitive baseline. Every dedicated synthetic respondent product must beat this to justify its existence.

Human Ceiling

SPS ~0.99

Agreement between independent halves of the human survey panel, measured via multinomial-bootstrap split-half reliability. The theoretical maximum score a model can achieve.

What it tests: A provider scoring above this is overfitting or exploiting artifacts. Sets the upper bound for meaningful evaluation.

Methodology details (for researchers)

What it is

Split-half reliability via multinomial bootstrap. Given observed counts c = [c_1, ..., c_k] with total n per question × subpop, we treat p̂ = c/n as the multinomial MLE and draw two independent half-samples of size ⌊n/2⌋ from Multinomial(n/2, p̂). We compute JSD between the two empirical distributions, repeat B times, and report Ceiling = 1 − mean(JSD). Published values use B=1000, the full bootstrap budget; a vectorized multinomial-and-JSD path keeps this feasible at publish time. Reproducible with bootstrap seed 42.
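The procedure above can be sketched as a runnable loop (a compact JSD helper is inlined for self-containment; the seed and B follow the published protocol, but this is an illustration rather than the benchmark's vectorized implementation):

```python
import numpy as np

def _jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence with base-2 logs
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(np.clip(a, eps, None)
                            * np.log2(np.clip(a, eps, None) / np.clip(b, eps, None))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def split_half_ceiling(counts, B=1000, seed=42):
    """Ceiling = 1 - mean JSD between two multinomial half-samples drawn from p-hat."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, float)
    n = int(counts.sum())
    p_hat = counts / n               # multinomial MLE from observed counts
    half = n // 2                    # floor(n/2) per half-sample
    jsds = [_jsd(rng.multinomial(half, p_hat) / half,
                 rng.multinomial(half, p_hat) / half) for _ in range(B)]
    return 1.0 - float(np.mean(jsds))
```

With wave-sized n, the half-samples agree almost perfectly, which is why the published ceilings sit near 0.999; shrink the counts and the ceiling drops, exactly as the Spearman-Brown argument predicts.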

Why it matters

It is the theoretical maximum for any evaluation built on this human data. Humans disagree with each other; no model can be more consistent with the survey than the survey is with itself. Per-subpopulation ceilings genuinely differ — Spearman-Brown says reliability scales with √n — so small subpops have materially lower ceilings than large ones.

Per-dataset protocol

OpinionsQA: computed within-wave (never cross-wave — that measures drift, not reliability), aggregated weighted by n_questions per wave. SubPOP: per-subpop (22 subpopulations) + weighted aggregate. GlobalOpinionQA: per-country + regional aggregates, weighted by actual (country, question) coverage to avoid over-weighting US data.

Two granularities: aggregate vs subgroup

P_dist is measured at wave granularity (~4.5k respondents/wave), so it compares against the wave-aggregate ceiling (~0.9995). P_sub is measured at (wave × attribute × group) granularity — typical n is 50-500, not 4,500 — so the aggregate ceiling materially overstates achievable headroom. OpinionsQA now emits a per-subgroup ceiling distribution and uses the median as the reference for P_sub; the wave-aggregate is kept for P_dist. A model at P_sub = 0.88 is not sitting 0.12 below a true ceiling of 0.9995 — its real headroom is the gap to the subgroup median, which is meaningfully smaller and shrinks further on low-n groups where the ceiling itself drops to ~0.95.

Measured values (v0.1)

OpinionsQA: 0.9995 (raw counts from 15 Pew ATP waves, n≈56k/wave). SubPOP: 0.9954 (22 subpops, inferred n=500 per subpop, quality flag 'medium'). GlobalOpinionQA: 0.9972 (per-country, inferred n=1000, quality flag 'medium'). All values are reproducible with bootstrap seed 42. The high ceilings reflect large within-wave sample sizes; the realistic gap between models and humans is most visible on small, contentious subpops.

Sample-size quality flags (Cochran 1977)

High: raw subpop n ≥ 400 — use directly. Medium: 200 ≤ n < 400 — report with CI caveat. Low: n < 200 — report with warning; do not use as a gating threshold. SubPOP and GlobalOpinionQA ceilings are flagged 'medium' because sample sizes are inferred from published probabilities, not shipped as raw counts.

Survey-weight caveat

Ceiling is computed from raw category counts. Pew applies survey weights; ignoring them could shift the ceiling by 1-3% on demographically skewed subgroups. The raw-count approach is conservative — a weighted ceiling would be slightly tighter, meaning our published ceilings are a mild upper-bound on achievable reliability.

Citations

Methodology: Santurkar et al. 2023 (arxiv:2303.17548); Durmus et al. 2023 (arxiv:2306.16388); Suh et al. 2025 — SubPOP (arxiv:2502.16761); Pew Research Center Methodology. Statistical foundations: Spearman (1910) & Brown (1910) — Spearman-Brown prophecy; Efron (1979) — bootstrap; Lin (1991) — Jensen-Shannon divergence; Cochran (1977) — Sampling Techniques, for the n=400 rule-of-thumb.

Temporal Drift Floor

SPS N/A

A baseline-adjacent metric separate from the Human Ceiling: JSD between human distributions for the same-wording question across different Pew ATP waves. Pew repeats ~15-20% of items across waves for trend tracking.

What it tests: Quantifies how much real-world opinions shift year-over-year on repeated items. Useful framing for P_refuse and longitudinal claims — drift is a property of reality, not of a model.

Methodology details (for researchers)

What it is

For every OpinionsQA question stem that appears in two or more waves, we compute JSD between its human distributions across waves. We report mean JSD overall, broken down by year-gap (1, 2, 3, 5 years), with bootstrap 95% CI over question pairs.
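The cross-wave computation can be sketched as follows, assuming a hypothetical input shape of `{question_stem: {year: distribution}}` for repeated items (the bootstrap CI step is omitted for brevity):

```python
import numpy as np
from itertools import combinations
from collections import defaultdict

def _jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence with base-2 logs
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(np.clip(a, eps, None)
                            * np.log2(np.clip(a, eps, None) / np.clip(b, eps, None))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def drift_by_year_gap(wave_dists):
    """Mean cross-wave JSD for every repeated question, keyed by year gap."""
    gaps = defaultdict(list)
    for dists in wave_dists.values():
        # every pair of waves in which this question stem appears
        for (y1, p), (y2, q) in combinations(sorted(dists.items()), 2):
            gaps[y2 - y1].append(_jsd(p, q))
    return {gap: float(np.mean(v)) for gap, v in sorted(gaps.items())}
```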

Why it is separate from the Human Ceiling

Split-half reliability must be within-wave: cross-wave JSD conflates sampling noise with real opinion shift. Temporal Drift Floor captures only the latter. A model that perfectly predicts 2022 opinions using 2017 training data would still score imperfectly against 2017 humans — the drift floor tells you how much of that gap is unavoidable.

Status

Computed for OpinionsQA only (requires same-wording repeated questions, which Pew ATP uniquely ships). Not applicable to SubPOP or GlobalOpinionQA snapshots. Values emitted to leaderboard.json under baselines.temporal_drift.

See also: drift-by-year-gap visualization on the Findings page

Datasets

SynthBench validates models against real human survey data. Each dataset provides ground-truth response distributions from representative populations.

OpinionsQA

The primary ground-truth dataset. Multiple-choice questions covering political, social, and economic attitudes from nationally representative US surveys.

Questions: 1,498 (300 core)
Source: Pew American Trends Panel
Scope: US population
Coverage: 2017–2022 (15 survey waves)
Demographics: 56 groups across 11 attributes

SubPOP

Extended dataset focused on demographic subgroup variation. Tests whether models can differentiate between fine-grained population segments.

Questions: 3,362
Source: 22 US subpopulations
Scope: US demographic subgroups
Coverage: Cross-sectional
Demographics: 11 attributes (age, education, gender, ideology, party, income, religion, attendance, region, marital status, citizenship)

GlobalOpinionQA (Phase 2)

Cross-cultural attitudes dataset for Phase 2 expansion. Tests whether models capture cultural differences in opinion beyond US demographics.

Questions: 2,556
Source: Pew Global Attitudes
Scope: 138 countries
Coverage: Cross-national
Demographics: Country-level cultural variation

Private holdout split

Every holdout-enabled dataset is partitioned deterministically into an 80% public subset and a 20% private subset. Per-question human distributions for the private subset are suppressed from every public artifact — the site, the run-detail JSON, and the question-explorer pages — so a submitter can't reverse-engineer the private answer key by reading public files.

Submissions must include results for every question, public and private. We (the SynthBench maintainers) score the private subset locally against the hidden distribution and publish the resulting sps_private alongside sps_public. Rows whose public/private SPS delta stays within a tolerance of 0.05 earn a ✓ verified badge; rows outside the tolerance are ⚠ flagged for review.

The partition is derived from a SHA-256 hash of dataset_name + ":" + question_key. That makes it stable across runs, machines, and Python versions without needing a seed file. We do not publish which questions land in the private subset — doing so would defeat the anti-fabrication property.
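A sketch of the deterministic partition: the hash input matches the documented rule, but the exact mapping of the digest to an 80/20 split is an assumption:

```python
import hashlib

def is_private(dataset_name, question_key, private_frac=0.20):
    """Deterministic holdout membership from SHA-256(dataset_name + ":" + question_key).
    Stable across runs, machines, and Python versions: no seed file needed."""
    digest = hashlib.sha256(f"{dataset_name}:{question_key}".encode()).digest()
    # Map the first 8 digest bytes to [0, 1) and threshold at private_frac.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < private_frac
```

Because the digest is uniform, roughly 20% of question keys land in the private subset, and membership never changes between runs.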

Why bother. Two independent motivations converge on the same mechanism. First, future LLMs may train on SynthBench itself; holding out 20% slows that recursion. Second (and more important after the public flip): once per-question human distributions are visible, a bad actor could fabricate a submission JSON that exactly matches them. The private subset is a cheap cheating detector — a fabricator has no signal on the hidden keys, so their public and private SPS diverge sharply.

How a Benchmark Run Works

Every provider goes through the same evaluation pipeline. Here is what happens when you run synthbench run.

  1. Select dataset and question set
     Choose a suite: Core (300 questions, ~1h) for iteration or Full (1,498 questions, ~6h) for publication-grade results.

  2. Generate samples for each question
     For each question-persona pair, generate N independent samples from the model. Default: 30 samples. Publication-grade: 100 samples.

  3. Parse responses into distributions
     Map each model response to a survey option. Compute P(option_i) = count(option_i) / N. Logprob-capable providers return distributions directly.

  4. Compare against human ground truth
     For each question-demographic pair, compute JSD between the model distribution and the human distribution from real survey data.

  5. Compute per-question metrics
     Calculate all five sub-metrics (P_dist, P_rank, P_cond, P_sub, P_refuse) at the question level.

  6. Aggregate with confidence intervals
     Average across questions. Bootstrap resampling produces 95% confidence intervals for each metric and the composite SPS.
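Step 3 of the pipeline is a simple counting operation. A minimal sketch, where option indices and upstream failure handling are illustrative:

```python
from collections import Counter

def responses_to_distribution(parsed, k):
    """Map N parsed responses (option indices 0..k-1) to P(option_i) = count_i / N.
    Parse failures are assumed to have been logged and dropped upstream."""
    counts = Counter(parsed)
    n = len(parsed)
    return [counts.get(i, 0) / n for i in range(k)]
```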

Sampling

  • Minimum 30 samples per question-persona pair
  • 100 samples recommended for publication
  • Wilson score intervals quantify estimation uncertainty
  • Parse failures are logged and excluded from counts
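The Wilson score interval mentioned above quantifies how uncertain a sampled proportion is at a given N; a standard closed form, shown here at the default 95% level:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a sampled option proportion (95% by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)
```

At 30 samples the interval around a 50% option is roughly ±0.17, which is why 100 samples are recommended for publication-grade runs.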

Replication

  • Multiple independent runs recommended
  • Convergence data validates stability
  • Run-to-run variance reported alongside point estimates

Fairness

  • Temperature fixed at provider default (not optimized)
  • Each provider tested via its native interface
  • Same question sets and evaluation pipeline for all providers

Run Validity Filtering

Silent API failures — budget exhaustion, missed model aliases, provider fallbacks — can return responses that parse cleanly but carry no real signal, producing a run whose per-question distributions are all perfectly uniform. Existing parse-failure tracking does not catch this case because the response did parse. We filter these runs at publish time so they never reach the leaderboard.

A run is excluded when all three of the following hold:

Excluded runs are listed in leaderboard.json#excluded_runs for transparency, and operators can run synthbench scan-invalid locally against any results directory. The rule is deliberately strict: we would rather miss a borderline case than false-flag a legitimate run whose distributions happen to be flat.

Cost computation

Cost is a derived quantity on SynthBench: providers return measured token counts at runtime, and the publish pipeline multiplies those by a dated pricing snapshot to produce the $/100Q numbers you see on the leaderboard.

What we measure

Every benchmark run records per-call token usage from the provider API — input tokens, output tokens, and (for Anthropic) cache-creation and cache-read tokens. Tokens are a measured quantity returned by the provider, not an estimate. The runner aggregates per-question usage into a run-level total stored alongside scores in the raw run JSON.

Runs on providers that do not return usage metadata (e.g. Ollama local inference) record no token counts, and downstream cost is reported as null rather than guessed.

What we derive

At publish time, we multiply recorded tokens by per-model pricing from synthpanel to derive four fields per leaderboard row: cost_usd (total spend for the run), cost_per_100q (spend normalized to 100 questions), cost_per_sps_point (spend per SPS point achieved), and is_cost_estimated (true if any pricing dimension was inferred rather than measured).

Derivation happens in synthbench/publish.py. Tokens stay immutable in raw run JSON; cost is recomputed every time we republish, so price-list updates flow through without rerunning any benchmark.
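The token-to-cost derivation is a per-dimension multiply-and-sum. A sketch with illustrative field names (the actual schema in `synthbench/publish.py` may differ):

```python
PRICE_KEYS = ("input", "output", "cache_creation", "cache_read")

def derive_cost(tokens, rates_per_mtok):
    """cost_usd = measured tokens x snapshot rates ($ per million tokens).
    Returns None when usage metadata was never recorded (e.g. local inference)."""
    if tokens is None or rates_per_mtok is None:
        return None
    return sum(tokens.get(k, 0) * rates_per_mtok.get(k, 0.0) / 1_000_000
               for k in PRICE_KEYS)
```

Because tokens stay immutable and cost is recomputed at publish time, a pricing bump changes only `rates_per_mtok`, never the raw run JSON.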

Pricing snapshot

The exact pricing table used to derive costs is serialized into leaderboard.json under a top-level pricing_snapshot block. It includes a snapshot_date and per-provider rates (input, output, cache-creation, and cache-read cost per million tokens). Readers can reproduce every displayed cost from the snapshot and the raw token counts.

Rates track the public price list documented on provider pricing pages (e.g. https://www.anthropic.com/pricing). The snapshot is updated when synthpanel's cost.py constants are bumped.

Self-hosted and unknown models

For self-hosted models (Ollama, local inference) and any provider whose pricing is not tracked in synthpanel, cost_usd is reported as null. We deliberately do not impute a cost — hardware, electricity, and amortization vary too widely to produce a number readers can fairly compare against API-priced rows.

Null cost rows sort last in the leaderboard's $/100Q column and are excluded from the Cost-vs-SPS Pareto chart on the Findings page.

Ensemble cost

An ensemble run's cost is the sum of its constituent runs' costs. If every constituent has a tracked cost, the ensemble's cost_usd is the sum; if any constituent is null (self-hosted or unknown), the ensemble's cost_usd is null rather than a partial total.

This mirrors how ensembles aggregate elsewhere on the site: a composite number is only emitted when every input to that number is known.
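The null-propagation rule reduces to a few lines (illustrative):

```python
def ensemble_cost(constituent_costs):
    """Sum of constituent run costs; any unknown (None) constituent makes the total None."""
    if any(c is None for c in constituent_costs):
        return None
    return sum(constituent_costs)
```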

Pre-tracking runs

Runs produced before per-call token capture landed in the runner have no usage metadata and therefore cost_usd is null. We do not backfill an estimate; the dash in the $/100Q column tells readers honestly that the number was never measured for that row.

A run's cost can only be recovered by re-running it on the current runner. Raw run JSONs are append-only.

Roadmap

SynthBench expands in phases, each adding new ground-truth datasets and metric dimensions.

Phase 1: US Opinion Parity (Current)

Establish the core benchmark with US survey data. Validate metrics against human baselines and publish initial leaderboard.

Datasets: OpinionsQA, SubPOP
Metrics: P_dist, P_rank, P_cond, P_sub, P_refuse

Phase 2: Global Cultural Parity

Extend evaluation to 138 countries. Test whether models capture cultural differences in opinion beyond US demographics.

Datasets: GlobalOpinionQA expansion
Metrics: + cross-cultural P_dist

Phase 3: Open-Ended Qualitative

Move beyond multiple choice. Evaluate free-text responses for thematic accuracy using automated judge models.

Datasets: Custom open-ended question sets
Metrics: + P_theme (LLM-as-judge)

Phase 4: Temporal Stability

Track how model accuracy changes over time. Measure whether models keep pace as public opinion shifts.

Datasets: Longitudinal tracking panels
Metrics: + temporal drift metrics

Citation

If you use SynthBench in your research, please cite:

@software{synthbench2026,
  title   = {SynthBench: Open Benchmark for Synthetic Survey Respondents},
  author  = {DataViking-Tech},
  year    = {2026},
  url     = {https://github.com/DataViking-Tech/synthbench},
  version = {0.1.0}
}