Methodology
How SynthBench measures synthetic survey respondent quality — from the metrics we use to the datasets we validate against.
Diagnosis without measurement
Why a benchmark pinned to real human response distributions is the missing companion piece to the cross-provider homogenization critique.
Romasanta, Thomas, and Levina (HBR, March 2026) ran 15,000 strategic-advice queries across seven frontier LLMs and found that responses converged on the same trend-aligned recommendations regardless of scenario context — trendslop, a shared attractor across providers rather than an idiosyncrasy of any one model. The critique lands. But it is a diagnosis without a measurement: cross-model consensus is only evidence of failure if you can point to the distribution the models should have matched, and on open-ended strategic questions no such ground truth exists.
SynthBench is the measurement. Every question we evaluate carries a human response distribution — from Pew American Trends Panel, SubPOP subpopulations, and GlobalOpinionQA cross-country data — and we score how far each provider's output lands from that distribution, overall and per demographic slice.
Two orthogonal diagnostics
Ground-truth fidelity
Model vs. human distribution
Mean Jensen-Shannon divergence against real survey responses, rolled up into the Synthetic Panel Score (SPS). This is the primary SynthBench metric — measured end-to-end in the sections below.
Cross-provider concordance
Model vs. model distribution
Pairwise JSD between raw-LLM providers on the same items. Low cross-provider JSD with high human alignment means providers agree because they all track real population variance; low cross-provider JSD with low human alignment is the trendslop signature. See the Cross-Provider JSD Matrix →
The rest of this page documents the metrics, datasets, baselines, scoring protocol, cost model, and related literature behind those two diagnostics.
How We Score
The Synthetic Panel Score (SPS) is the average of five sub-metrics (six in later phases). Each measures a distinct dimension of how well a model reproduces human survey behavior on a 0–1 scale.
P_dist
Distributional Parity: How closely the model's answer percentages match real human survey responses.
- poor
- Completely different distributions
- fair
- Some overlap but systematic divergence
- good
- Close match on most questions
- excellent
- Nearly identical to human distributions
Mathematical detail
P_dist = 1 - mean(JSD) across all question-demographic pairs, where JSD is Jensen-Shannon divergence.
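As a concrete sketch of the formula above (function names are illustrative, not the benchmark's actual code; log base 2 keeps JSD in [0, 1] so the complement is a valid score):

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence with log base 2, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def p_dist(pairs):
    """P_dist = 1 - mean JSD over (model, human) distribution pairs."""
    return 1 - sum(jsd(p, q) for p, q in pairs) / len(pairs)

# A perfect match scores 1.0; fully disjoint distributions score 0.0.
print(p_dist([([0.6, 0.4], [0.6, 0.4])]))  # 1.0
print(p_dist([([1.0, 0.0], [0.0, 1.0])]))  # 0.0
```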
P_rank
Rank-Order Parity: Whether the model ranks response options in the same order as humans, even if exact percentages differ.
- poor
- Reversed or random ordering
- fair
- Gets the top option right but scrambles the rest
- good
- Mostly correct ordering with minor swaps
- excellent
- Perfect rank agreement with humans
Mathematical detail
P_rank = (1 + mean(tau_b)) / 2, where tau_b is Kendall's tau-b on probability rankings, normalized to [0, 1].
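A pure-stdlib sketch of the mapping (in practice one would reach for `scipy.stats.kendalltau`; the O(n²) loop here is fine for short option lists, and the tie-handling follows the standard tau-b definition):

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b with tie correction, O(n^2)."""
    c = d = tx = ty = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both series: excluded everywhere
            elif dx == 0:
                tx += 1               # tied only in x
            elif dy == 0:
                ty += 1               # tied only in y
            elif dx * dy > 0:
                c += 1                # concordant pair
            else:
                d += 1                # discordant pair
    return (c - d) / math.sqrt((c + d + tx) * (c + d + ty))

def p_rank(model_probs, human_probs):
    """P_rank = (1 + tau_b) / 2, mapping agreement from [-1, 1] onto [0, 1]."""
    return (1 + kendall_tau_b(model_probs, human_probs)) / 2

print(p_rank([0.5, 0.3, 0.2], [0.6, 0.25, 0.15]))  # 1.0 — identical ordering
print(p_rank([0.2, 0.3, 0.5], [0.6, 0.25, 0.15]))  # 0.0 — fully reversed
```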
P_cond
Conditioning Fidelity: When told "respond as a 65-year-old conservative," does the model actually shift its answers to match that demographic?
- poor
- Personas have no effect on output
- fair
- Some demographic sensitivity but inconsistent
- good
- Meaningful shifts that track real demographic differences
- excellent
- Conditioning precisely reproduces demographic patterns
Mathematical detail
P_cond = mean(max(0, align_conditioned(G) - align_default(G))) across all demographic groups G.
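In code form (argument names are illustrative; each alignment score would be something like 1 − JSD against that group's human distribution):

```python
def p_cond(align_conditioned, align_default):
    """P_cond = mean over groups G of max(0, align_cond(G) - align_default(G)).

    Negative lift is clipped to zero, so a persona that hurts one group
    cannot be netted against gains on another.
    """
    return sum(
        max(0.0, align_conditioned[g] - align_default[g]) for g in align_conditioned
    ) / len(align_conditioned)

# Conditioning helps group A (+0.10) and hurts group B (-0.05): only A counts.
print(p_cond({"A": 0.90, "B": 0.70}, {"A": 0.80, "B": 0.75}))
```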
P_sub
Subgroup Consistency: Whether accuracy is even across all demographic groups, or if some populations are systematically underserved.
- poor
- Wildly uneven across groups
- fair
- Accurate for majorities, poor for minorities
- good
- Modest variation across groups
- excellent
- Equally accurate for all demographics
Mathematical detail
P_sub = 1 - CV(group_scores), where CV is the coefficient of variation (std / mean) of per-group P_dist.
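A minimal sketch of the formula; it assumes population standard deviation, since the spec does not say whether sample or population std is used:

```python
import statistics

def p_sub(group_scores):
    """P_sub = 1 - CV of per-group P_dist, where CV = std / mean.

    Uses population std (statistics.pstdev); this is an assumption.
    """
    mean = statistics.fmean(group_scores)
    return 1 - statistics.pstdev(group_scores) / mean

print(p_sub([0.8, 0.8, 0.8, 0.8]))  # 1.0 — perfectly even accuracy
print(p_sub([0.9, 0.7]))            # lower — uneven groups are penalized
```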
P_refuse
Refusal Calibration: Whether the model declines to answer at rates matching real human refusal patterns.
- poor
- Refusal rates completely off (answers everything or refuses everything)
- fair
- Gets the direction right but magnitudes are off
- good
- Close calibration on most questions
- excellent
- Matches human refusal patterns precisely
Mathematical detail
P_refuse = 1 - mean(|R_provider - R_human|) across all question-demographic pairs.
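The calibration score is a one-liner over (provider, human) refusal-rate pairs (names illustrative):

```python
def p_refuse(refusal_pairs):
    """P_refuse = 1 - mean |R_provider - R_human| over question-demo pairs."""
    return 1 - sum(abs(rp - rh) for rp, rh in refusal_pairs) / len(refusal_pairs)

# Exact on one pair, a 0.25 gap on the other -> mean absolute gap 0.125.
print(p_refuse([(0.05, 0.05), (0.50, 0.25)]))  # 0.875
```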
P_theme
Thematic Parity (Phase 2+): For open-ended responses, whether the model's themes and reasoning align with human qualitative patterns.
- poor
- Off-topic or generic reasoning
- fair
- Hits some themes but misses key ones
- good
- Covers most human themes with reasonable proportions
- excellent
- Themes and reasoning indistinguishable from human responses
Mathematical detail
LLM-as-judge evaluation: theme relevance, theme distribution accuracy, and reasoning quality scored on a rubric.
Baselines: What to Compare Against
A model that scores 0.70 SPS sounds good — until you realize the majority-class baseline scores 0.45. Baselines give meaning to raw scores by anchoring the scale.
| Baseline | SPS | Role |
|---|---|---|
| Random Baseline | ~0.31 | floor |
| Majority-Class | ~0.45 | low |
| Population-Average | ~0.52 | mid |
| Unconditioned LLM | ~0.58 | high |
| Human Ceiling | ~0.99 | ceiling |
| Temporal Drift Floor | N/A | context |
Normalized position ("Range %")
Raw SPS values for strong models compress into a narrow band (~0.82–0.85) that is visually hard to separate on a 0–1 bar. The leaderboard also reports a normalized position — where a row sits inside the meaningful evaluation range bounded below by the raw-LLM baseline and above by the Human Ceiling.
Range % = (SPS − P_unconditioned) / (P_ceiling − P_unconditioned)
P_unconditioned is the SPS of the raw-LLM baseline for the same underlying model on the same dataset (the "just prompt the model" reference). Raw-LLM rows therefore resolve to 0% — they are the reference. Product rows (e.g. conditioned SynthPanel variants) show their lift above the corresponding raw model.
P_ceiling is the dataset's aggregate Human Ceiling (see below). Rows without a resolvable raw-LLM baseline (statistical baselines, multi-model ensembles) display "—" for Range % and should be compared via raw SPS instead.
Raw SPS remains the primary column for academic comparability. Range % is a supplementary display that amplifies product lift — a 0.01 SPS gain on a narrow range becomes a double-digit percentage of headroom closed.
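The normalization above reduces to a few lines (function and argument names are illustrative):

```python
def range_pct(sps, p_unconditioned, p_ceiling):
    """Position inside the band [raw-LLM baseline, human ceiling], as a percent.

    Returns None when no raw-LLM baseline resolves for the row
    (displayed as "—" on the leaderboard).
    """
    if p_unconditioned is None:
        return None
    return 100 * (sps - p_unconditioned) / (p_ceiling - p_unconditioned)

# Raw-LLM rows are their own reference, so they land at 0%:
print(range_pct(0.83, 0.83, 0.9995))            # 0.0
# A +0.01 SPS product lift over that baseline:
print(round(range_pct(0.84, 0.83, 0.9995), 1))  # 5.9
```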
OpinionsQA: subgroup ceiling distribution
median 0.9955
Per-(wave, attribute, group) ceilings across 1,046 subgroups from 15 Pew ATP waves. The wave-aggregate ceiling (0.9995, used for P_dist) is a different quantity and does not bound P_sub.
Lowest-ceiling subgroups
- ATP W36 · POLPARTY_RACE · Democrat × Refused 0.8184
- ATP W82 · RACE · Other 0.8234
- ATP W36 · POLPARTY_RACE · Refused × White 0.8326
- ATP W29 · POLPARTY_RACE · Refused × White 0.8340
- ATP W27 · POLPARTY_RACE · Republican × Black 0.8375
Quality flags (Cochran 1977): 585 high · 93 medium · 368 low — low-n subgroups are retained but flagged so low reliability is not conflated with poor model fit.
Used by the leaderboard when scoring P_sub; compare P_dist against the wave-aggregate ceiling instead.
Random Baseline
SPS ~0.31. Uniform random distribution over all non-refusal options. For a question with k options, each gets P = 1/k.
What it tests: The absolute floor. Any provider scoring at or below this is adding negative value.
Majority-Class
SPS ~0.45. Always picks the most popular human answer. Assigns P = 1.0 to the mode, 0.0 to everything else.
What it tests: Whether a model does better than just echoing the most common response. Scores well on consensus questions, poorly on divisive ones.
Population-Average
SPS ~0.52. Uses the overall population distribution (ignoring demographics) for every group. The same answer regardless of who is being simulated.
What it tests: Isolates the value of demographic conditioning. The gap between this baseline and a conditioned provider is the conditioning premium.
Unconditioned LLM
SPS ~0.58. Raw model output with no persona or demographic conditioning. This is the "just prompt ChatGPT" approach many researchers currently use.
What it tests: The competitive baseline. Every dedicated synthetic respondent product must beat this to justify its existence.
Human Ceiling
SPS ~0.99. Agreement between independent halves of the human survey panel, measured via multinomial-bootstrap split-half reliability. The theoretical maximum score a model can achieve.
What it tests: A provider scoring above this is overfitting or exploiting artifacts. Sets the upper bound for meaningful evaluation.
Methodology details (for researchers)
What it is
Split-half reliability via multinomial bootstrap. Given observed counts c = [c_1, ..., c_k] with total n per question × subpop, we treat p̂ = c/n as the multinomial MLE and draw two independent half-samples of size ⌊n/2⌋ from Multinomial(n/2, p̂). We compute JSD between the two empirical distributions, repeat B times, and report Ceiling = 1 − mean(JSD). Published values use B=1000, the full bootstrap budget; a vectorized multinomial-and-JSD path keeps this feasible at publish time. Reproducible with bootstrap seed 42.
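A pure-Python sketch of this bootstrap (the production path is vectorized; this illustrative version trades speed for readability and shrinks B for the demo):

```python
import math
import random
from collections import Counter

def jsd(p, q):
    """Jensen-Shannon divergence, log base 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda x, y: sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def split_half_ceiling(counts, B=1000, seed=42):
    """Ceiling = 1 - mean JSD between two bootstrap half-samples.

    counts: observed answer counts for one question x subpopulation.
    Each replicate draws two independent Multinomial(n//2, p_hat)
    samples and compares their empirical distributions.
    """
    rng = random.Random(seed)
    n, k = sum(counts), len(counts)
    p_hat = [c / n for c in counts]
    half = n // 2
    total = 0.0
    for _ in range(B):
        halves = []
        for _ in range(2):
            draw = Counter(rng.choices(range(k), weights=p_hat, k=half))
            halves.append([draw.get(i, 0) / half for i in range(k)])
        total += jsd(*halves)
    return 1 - total / B

# Large n -> little sampling noise -> ceiling near 1; shrink n and it drops.
print(split_half_ceiling([2200, 1500, 800], B=200))
```

This also makes the Spearman-Brown point below concrete: rerun with counts like `[22, 15, 8]` and the ceiling visibly drops.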
Why it matters
It is the theoretical maximum for any evaluation built on this human data. Humans disagree with each other; no model can be more consistent with the survey than the survey is with itself. Per-subpopulation ceilings genuinely differ — Spearman-Brown says reliability scales with √n — so small subpops have materially lower ceilings than large ones.
Per-dataset protocol
OpinionsQA: computed within-wave (never cross-wave — that measures drift, not reliability), aggregated weighted by n_questions per wave. SubPOP: per-subpop (22 subpopulations) + weighted aggregate. GlobalOpinionQA: per-country + regional aggregates, weighted by actual (country, question) coverage to avoid over-weighting US data.
Two granularities: aggregate vs subgroup
P_dist is measured at wave granularity (~4.5k respondents/wave), so it compares against the wave-aggregate ceiling (~0.9995). P_sub is measured at (wave × attribute × group) granularity — typical n is 50–500, not 4,500 — so the aggregate ceiling materially overstates achievable headroom. OpinionsQA now emits a per-subgroup ceiling distribution and uses the median as the reference for P_sub; the wave-aggregate is kept for P_dist. A model at P_sub = 0.88 is not sitting 0.12 below a true ceiling of 0.9995 — its real headroom is the gap to the subgroup median, which is meaningfully smaller and shrinks further on low-n groups where the ceiling itself drops to ~0.95.
Measured values (v0.1)
OpinionsQA: 0.9995 (raw counts from 15 Pew ATP waves, n≈4.5k/wave). SubPOP: 0.9954 (22 subpops, inferred n=500 per subpop, quality flag 'medium'). GlobalOpinionQA: 0.9972 (per-country, inferred n=1000, quality flag 'medium'). All values are reproducible with bootstrap seed 42. The high ceilings reflect large within-wave sample sizes; the realistic gap between models and humans is most visible on small, contentious subpops.
Sample-size quality flags (Cochran 1977)
High: raw subpop n ≥ 400 — use directly. Medium: 200 ≤ n < 400 — report with CI caveat. Low: n < 200 — report with warning; do not use as a gating threshold. SubPOP and GlobalOpinionQA ceilings are flagged 'medium' because sample sizes are inferred from published probabilities, not shipped as raw counts.
Survey-weight caveat
Ceiling is computed from raw category counts. Pew applies survey weights; ignoring them could shift the ceiling by 1–3% on demographically skewed subgroups. The raw-count approach is conservative — a weighted ceiling would be slightly tighter, meaning our published ceilings are a mild upper bound on achievable reliability.
Citations
Methodology: Santurkar et al. 2023 (arxiv:2303.17548); Durmus et al. 2023 (arxiv:2306.16388); Suh et al. 2025 — SubPOP (arxiv:2502.16761); Pew Research Center Methodology. Statistical foundations: Spearman (1910) & Brown (1910) — Spearman-Brown prophecy; Efron (1979) — bootstrap; Lin (1991) — Jensen-Shannon divergence; Cochran (1977) — Sampling Techniques, for the n=400 rule-of-thumb.
Temporal Drift Floor
SPS N/A. A baseline-adjacent metric separate from the Human Ceiling: JSD between human distributions for same-wording questions across different Pew ATP waves. Pew repeats ~15–20% of items across waves for trend tracking.
What it tests: Quantifies how much real-world opinions shift year-over-year on repeated items. Useful framing for P_refuse and longitudinal claims — drift is a property of reality, not of a model.
Methodology details (for researchers)
What it is
For every OpinionsQA question stem that appears in two or more waves, we compute JSD between its human distributions across waves. We report mean JSD overall, broken down by year-gap (1, 2, 3, 5 years), with bootstrap 95% CI over question pairs.
Why it is separate from the Human Ceiling
Split-half reliability must be within-wave: cross-wave JSD conflates sampling noise with real opinion shift. Temporal Drift Floor captures only the latter. A model that perfectly predicts 2022 opinions using 2017 training data would still score imperfectly against 2017 humans — the drift floor tells you how much of that gap is unavoidable.
Status
Computed for OpinionsQA only (requires same-wording repeated questions, which Pew ATP uniquely ships). Not applicable to SubPOP or GlobalOpinionQA snapshots. Values emitted to leaderboard.json under baselines.temporal_drift.
See also: drift-by-year-gap visualization on the Findings page.
Datasets
SynthBench validates models against real human survey data. Each dataset provides ground-truth response distributions from representative populations.
OpinionsQA
The primary ground-truth dataset. Multiple-choice questions covering political, social, and economic attitudes from nationally representative US surveys.
- Questions
- 1,498 (300 core)
- Source
- Pew American Trends Panel
- Scope
- US population
- Coverage
- 2017–2022 (15 survey waves)
- Demographics
- 56 groups across 11 attributes
SubPOP
Extended dataset focused on demographic subgroup variation. Tests whether models can differentiate between fine-grained population segments.
- Questions
- 3,362
- Source
- 22 US subpopulations
- Scope
- US demographic subgroups
- Coverage
- Cross-sectional
- Demographics
- 11 attributes: age, education, gender, ideology, party, income, religion, attendance, region, marital status, citizenship
GlobalOpinionQA
Phase 2. Cross-cultural attitudes dataset for the Phase 2 expansion. Tests whether models capture cultural differences in opinion beyond US demographics.
- Questions
- 2,556
- Source
- Pew Global Attitudes
- Scope
- 138 countries
- Coverage
- Cross-national
- Demographics
- Country-level cultural variation
Private holdout split
Every holdout-enabled dataset is partitioned deterministically into an 80% public subset and a 20% private subset. Per-question human distributions for the private subset are suppressed from every public artifact — the site, the run-detail JSON, and the question-explorer pages — so a submitter can't reverse-engineer the private answer key by reading public files.
Submissions must include results for every question, public and private. We (the SynthBench maintainers) score the private subset locally against the hidden distribution and publish the resulting sps_private alongside sps_public. Rows whose public/private SPS delta stays within a tolerance of 0.05 earn a ✓ verified badge; rows outside the tolerance are ⚠ flagged for review.
The partition is derived from a SHA-256 hash of dataset_name + ":" + question_key. That makes it stable across runs, machines, and Python versions without needing a seed file. We do not publish which questions land in the private subset — doing so would defeat the anti-fabrication property.
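The partition rule can be sketched as follows. The hash input matches the text above; the specific mapping from digest to the 20% bucket is an assumption here (the text does not specify it):

```python
import hashlib

def is_private(dataset_name: str, question_key: str, private_frac: float = 0.20) -> bool:
    """Deterministic holdout assignment from SHA-256(dataset_name + ":" + question_key).

    No RNG, no seed file: the same (dataset, question) pair lands in the
    same split on every machine and Python version. Treating the digest
    as a uniform fraction of 2**256 is an illustrative bucketing choice.
    """
    digest = hashlib.sha256(f"{dataset_name}:{question_key}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big") / 2**256 < private_frac

# Roughly private_frac of keys land in the private split:
keys = [f"q{i}" for i in range(10_000)]
frac = sum(is_private("opinions_qa", k) for k in keys) / len(keys)
print(frac)
```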
Why bother? Two independent motivations converge on the same mechanism. First, future LLMs may train on SynthBench itself; holding out 20% slows that recursion. Second, and more important now that per-question human distributions are publicly visible: a bad actor could fabricate a submission JSON that exactly matches the published distributions. The private subset is a cheap cheating detector — a fabricator has no signal on the hidden keys, so their public and private SPS diverge sharply.
How a Benchmark Run Works
Every provider goes through the same evaluation pipeline. Here is what happens when you run synthbench run.
- 1
Select dataset and question set
Choose a suite: Core (300 questions, ~1h) for iteration or Full (1,498 questions, ~6h) for publication-grade results.
- 2
Generate samples for each question
For each question-persona pair, generate N independent samples from the model. Default: 30 samples. Publication-grade: 100 samples.
- 3
Parse responses into distributions
Map each model response to a survey option. Compute P(option_i) = count(option_i) / N. Logprob-capable providers return distributions directly.
- 4
Compare against human ground truth
For each question-demographic pair, compute JSD between the model distribution and the human distribution from real survey data.
- 5
Compute per-question metrics
Calculate all five sub-metrics (P_dist, P_rank, P_cond, P_sub, P_refuse) at the question level.
- 6
Aggregate with confidence intervals
Average across questions. Bootstrap resampling produces 95% confidence intervals for each metric and the composite SPS.
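Step 6 might look like the following percentile-bootstrap sketch (the document does not state which bootstrap variant is used; questions, not individual samples, are the resampling unit here):

```python
import random
import statistics

def aggregate_with_ci(per_question_scores, B=2000, alpha=0.05, seed=0):
    """Mean over questions plus a percentile-bootstrap 95% CI.

    Resamples questions with replacement, treating the question set as
    the unit of uncertainty.
    """
    rng = random.Random(seed)
    n = len(per_question_scores)
    means = sorted(
        statistics.fmean(rng.choices(per_question_scores, k=n)) for _ in range(B)
    )
    lo = means[int(B * (alpha / 2))]
    hi = means[int(B * (1 - alpha / 2)) - 1]
    return statistics.fmean(per_question_scores), (lo, hi)

point, (lo, hi) = aggregate_with_ci([0.8, 0.9, 0.7, 0.85, 0.75] * 60)
print(lo <= point <= hi)  # True
```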
Sampling
- • Minimum 30 samples per question-persona pair
- • 100 samples recommended for publication
- • Wilson score intervals quantify estimation uncertainty
- • Parse failures are logged and excluded from counts
Replication
- • Multiple independent runs recommended
- • Convergence data validates stability
- • Run-to-run variance reported alongside point estimates
Fairness
- • Temperature fixed at provider default (not optimized)
- • Each provider tested via its native interface
- • Same question sets and evaluation pipeline for all providers
Run Validity Filtering
Silent API failures — budget exhaustion, missed model aliases, provider fallbacks — can return responses that parse cleanly but carry no real signal, producing a run whose per-question distributions are all perfectly uniform. Existing parse-failure tracking does not catch this case because the response did parse. We filter these runs at publish time so they never reach the leaderboard.
A run is excluded when all three of the following hold:
- • Uniform fraction. At least 80% of per-question model_distribution entries are within 0.01 of perfectly uniform (e.g. {0.25, 0.25, 0.25, 0.25} for a 4-option question).
- • Negligible refusals. Mean per-question model_refusal_rate ≤ 0.05. A model that genuinely refuses most questions is a legitimate safety pattern, not a failed API response.
- • Enough questions. At least 10 questions in the run. Shorter runs don't give enough signal to distinguish real flat distributions from API-failure garbage.
Excluded runs are listed in leaderboard.json#excluded_runs for transparency, and operators can run synthbench scan-invalid locally against any results directory. The rule is deliberately strict: we would rather miss a borderline case than false-flag a legitimate run whose distributions happen to be flat.
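The three-condition rule translates directly into code (field names mirror the run JSON described above but are illustrative):

```python
def is_invalid_run(questions, uniform_tol=0.01, uniform_frac_min=0.80,
                   refusal_max=0.05, min_questions=10):
    """True when a run matches the silent-API-failure signature.

    questions: per-question dicts with 'model_distribution' (probability
    list) and 'model_refusal_rate'. All three conditions must hold.
    """
    if len(questions) < min_questions:
        return False  # too short to judge reliably

    def near_uniform(dist):
        u = 1.0 / len(dist)
        return all(abs(p - u) <= uniform_tol for p in dist)

    uniform_frac = sum(near_uniform(q["model_distribution"]) for q in questions) / len(questions)
    mean_refusal = sum(q["model_refusal_rate"] for q in questions) / len(questions)
    return uniform_frac >= uniform_frac_min and mean_refusal <= refusal_max

flat = [{"model_distribution": [0.25] * 4, "model_refusal_rate": 0.0}] * 12
real = [{"model_distribution": [0.6, 0.2, 0.1, 0.1], "model_refusal_rate": 0.02}] * 12
print(is_invalid_run(flat), is_invalid_run(real))  # True False
```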
Cost computation
Cost is a derived quantity on SynthBench: providers return measured token counts at runtime, and the publish pipeline multiplies those by a dated pricing snapshot to produce the $/100Q numbers you see on the leaderboard.
What we measure
Every benchmark run records per-call token usage from the provider API — input tokens, output tokens, and (for Anthropic) cache-creation and cache-read tokens. Tokens are a measured quantity returned by the provider, not an estimate. The runner aggregates per-question usage into a run-level total stored alongside scores in the raw run JSON.
Runs on providers that do not return usage metadata (e.g. Ollama local inference) record no token counts, and downstream cost is reported as null rather than guessed.
What we derive
At publish time, we multiply recorded tokens by per-model pricing from synthpanel to derive four fields per leaderboard row: cost_usd (total spend for the run), cost_per_100q (spend normalized to 100 questions), cost_per_sps_point (spend per SPS point achieved), and is_cost_estimated (true if any pricing dimension was inferred rather than measured).
Derivation happens in synthbench/publish.py. Tokens stay immutable in raw run JSON; cost is recomputed every time we republish, so price-list updates flow through without rerunning any benchmark.
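A sketch of the derivation step (key and field names follow the fields listed above, but the rates shown are hypothetical, not a real price list):

```python
def derive_cost_fields(tokens, rates_per_mtok, n_questions, sps):
    """Derive leaderboard cost fields from measured tokens and a pricing snapshot.

    tokens: measured counts per dimension ('input', 'output', ...), or
    None for providers that return no usage metadata. rates_per_mtok:
    dollars per million tokens per dimension.
    """
    if tokens is None:
        return {"cost_usd": None, "cost_per_100q": None, "cost_per_sps_point": None}
    cost = sum(tokens[d] * rates_per_mtok[d] / 1e6 for d in tokens)
    return {
        "cost_usd": cost,
        "cost_per_100q": cost * 100 / n_questions,
        "cost_per_sps_point": cost / sps,
    }

# Hypothetical rates: $3/MTok input, $15/MTok output.
row = derive_cost_fields(
    {"input": 1_000_000, "output": 500_000},
    {"input": 3.0, "output": 15.0},
    n_questions=300, sps=0.84,
)
print(row["cost_usd"], row["cost_per_100q"])  # 10.5 3.5
```

Because tokens are immutable and rates live in the snapshot, republishing with updated rates recomputes every cost without touching the raw runs.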
Pricing snapshot
The exact pricing table used to derive costs is serialized into leaderboard.json under a top-level pricing_snapshot block. It includes a snapshot_date and per-provider rates (input, output, cache-creation, and cache-read cost per million tokens). Readers can reproduce every displayed cost from the snapshot and the raw token counts.
Rates track the public price list documented on provider pricing pages (e.g. https://www.anthropic.com/pricing). The snapshot is updated when synthpanel's cost.py constants are bumped.
Self-hosted and unknown models
For self-hosted models (Ollama, local inference) and any provider whose pricing is not tracked in synthpanel, cost_usd is reported as null. We deliberately do not impute a cost — hardware, electricity, and amortization vary too widely to produce a number readers can fairly compare against API-priced rows.
Null cost rows sort last in the leaderboard's $/100Q column and are excluded from the Cost-vs-SPS Pareto chart on the Findings page.
Ensemble cost
An ensemble run's cost is the sum of its constituent runs' costs. If every constituent has a tracked cost, the ensemble's cost_usd is the sum; if any constituent is null (self-hosted or unknown), the ensemble's cost_usd is null rather than a partial total.
This mirrors how ensembles aggregate elsewhere on the site: a composite number is only emitted when every input to that number is known.
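The null-propagation rule is a two-liner (sketch, names illustrative):

```python
def ensemble_cost(constituent_costs):
    """Sum constituent run costs; any null constituent nulls the total."""
    if any(c is None for c in constituent_costs):
        return None
    return sum(constituent_costs)

print(ensemble_cost([1.25, 0.75]))  # 2.0
print(ensemble_cost([1.25, None])) # None
```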
Pre-tracking runs
Runs produced before per-call token capture landed in the runner have no usage metadata and therefore cost_usd is null. We do not backfill an estimate; the dash in the $/100Q column tells readers honestly that the number was never measured for that row.
A run's cost can only be recovered by re-running it on the current runner. Raw run JSONs are append-only.
Roadmap
SynthBench expands in phases, each adding new ground-truth datasets and metric dimensions.
Phase 1: US Opinion Parity
Current. Establish the core benchmark with US survey data. Validate metrics against human baselines and publish initial leaderboard.
Phase 2: Global Cultural Parity
Extend evaluation to 138 countries. Test whether models capture cultural differences in opinion beyond US demographics.
Phase 3: Open-Ended Qualitative
Move beyond multiple choice. Evaluate free-text responses for thematic accuracy using automated judge models.
Phase 4: Temporal Stability
Track how model accuracy changes over time. Measure whether models keep pace as public opinion shifts.
Citation
If you use SynthBench in your research, please cite:
@software{synthbench2026,
title = {SynthBench: Open Benchmark for Synthetic Survey Respondents},
author = {DataViking-Tech},
year = {2026},
url = {https://github.com/DataViking-Tech/synthbench},
version = {0.1.0}
}