Methodology

How SynthBench measures synthetic survey respondent quality — from the metrics we use to the datasets we validate against.

Diagnosis without measurement

Why a benchmark pinned to real human response distributions is the missing companion piece to the cross-provider homogenization critique.

Romasanta, Thomas, and Levina (HBR, March 2026) ran 15,000 strategic-advice queries across seven frontier LLMs and found that responses converged on the same trend-aligned recommendations regardless of scenario context — trendslop, a shared attractor across providers rather than an idiosyncrasy of any one model. The critique lands. But it is a diagnosis without a measurement: cross-model consensus is only evidence of failure if you can point to the distribution the models should have matched, and on open-ended strategic questions no such ground truth exists.

SynthBench is the measurement. Every question we evaluate carries a human response distribution — from Pew American Trends Panel, SubPOP subpopulations, and GlobalOpinionQA cross-country data — and we score how far each provider's output lands from that distribution, overall and per demographic slice.

Two orthogonal diagnostics

Ground-truth fidelity

Model vs. human distribution

Mean Jensen-Shannon divergence against real survey responses, rolled up into the SynthBench Parity Score (SPS). This is the primary SynthBench metric — measured end-to-end in the sections below.

Cross-provider concordance

Model vs. model distribution

Pairwise JSD between raw-LLM providers on the same items. Low cross-provider JSD with high human-alignment means real population variance is being tracked; low cross-provider JSD with low human-alignment is the trendslop signature. See the Cross-Provider JSD Matrix →

The rest of this page documents the metrics, datasets, baselines, scoring protocol, cost model, and related literature behind those two diagnostics.

How We Score

The SynthBench Parity Score (SPS) is the average of five sub-metrics (six in later phases). Each measures a distinct dimension of how well a model reproduces human survey behavior on a 0–1 scale.

SPS = 0.2 × P_dist + 0.2 × P_rank + 0.2 × P_cond + 0.2 × P_sub + 0.2 × P_refuse

P_dist

Distributional Parity

How closely the model's answer percentages match real human survey responses.

  • Poor: Completely different distributions
  • Fair: Some overlap but systematic divergence
  • Good: Close match on most questions
  • Excellent: Nearly identical to human distributions
Mathematical detail

P_dist = 1 - mean(JSD) across all question-demographic pairs, where JSD is Jensen-Shannon divergence.
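The formula can be sketched directly, assuming each distribution is a vector of option probabilities. This is a minimal illustration, not the benchmark's actual implementation; the base-2 log keeps JSD in [0, 1]:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence with base-2 logs, so values fall in [0, 1]."""
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    m = 0.5 * (p + q)
    def kl(a, b):
        a, b = np.clip(a, eps, None), np.clip(b, eps, None)
        return float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def p_dist(pairs):
    """P_dist = 1 - mean(JSD) over (model, human) question-demographic pairs."""
    return 1.0 - float(np.mean([jsd(model, human) for model, human in pairs]))
```

Identical distributions score 1.0; disjoint distributions score 0.0.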

P_rank

Rank-Order Parity

Whether the model ranks response options in the same order as humans, even if exact percentages differ.

  • Poor: Reversed or random ordering
  • Fair: Gets the top option right but scrambles the rest
  • Good: Mostly correct ordering with minor swaps
  • Excellent: Perfect rank agreement with humans
Mathematical detail

P_rank = (1 + mean(tau_b)) / 2, where tau_b is Kendall's tau-b on probability rankings, normalized to [0, 1].
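A pure-Python sketch of tau-b and the P_rank roll-up. In practice one would likely reach for `scipy.stats.kendalltau`; the hand-rolled helper below is illustrative and handles ties via the standard tau-b denominator:

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b: concordance with a tie correction in the denominator."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both rankings: excluded from every term
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

def p_rank(pairs):
    """P_rank = (1 + mean tau_b) / 2, normalized to [0, 1]."""
    taus = [kendall_tau_b(m, h) for m, h in pairs]
    return (1.0 + sum(taus) / len(taus)) / 2.0
```

Perfect rank agreement gives tau_b = 1 and P_rank = 1.0; a fully reversed ordering gives tau_b = -1 and P_rank = 0.0.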

P_cond

Conditioning Fidelity

When told "respond as a 65-year-old conservative," does the model actually shift its answers to match that demographic?

  • Poor: Personas have no effect on output
  • Fair: Some demographic sensitivity but inconsistent
  • Good: Meaningful shifts that track real demographic differences
  • Excellent: Conditioning precisely reproduces demographic patterns
Mathematical detail

P_cond = mean(max(0, align_conditioned(G) - align_default(G))) across all demographic groups G.

P_sub

Subgroup Consistency

Whether accuracy is even across all demographic groups, or if some populations are systematically underserved.

  • Poor: Wildly uneven across groups
  • Fair: Accurate for majorities, poor for minorities
  • Good: Modest variation across groups
  • Excellent: Equally accurate for all demographics
Mathematical detail

P_sub = 1 - CV(group_scores), where CV is the coefficient of variation (std / mean) of per-group P_dist.

P_refuse

Refusal Calibration

Whether the model declines to answer at rates matching real human refusal patterns.

  • Poor: Refusal rates completely off (answers everything or refuses everything)
  • Fair: Gets the direction right but magnitudes are off
  • Good: Close calibration on most questions
  • Excellent: Matches human refusal patterns precisely
Mathematical detail

P_refuse = 1 - mean(|R_provider - R_human|) across all question-demographic pairs.
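The three formulas above (P_cond, P_sub, P_refuse) are straightforward aggregations. A minimal sketch, assuming per-group alignment scores and refusal rates have already been computed; the use of the population standard deviation in the CV is an assumption:

```python
import statistics

def p_cond(align_conditioned, align_default):
    """Mean positive lift of conditioned over default alignment per group G."""
    lifts = [max(0.0, align_conditioned[g] - align_default[g])
             for g in align_conditioned]
    return sum(lifts) / len(lifts)

def p_sub(group_scores):
    """1 - coefficient of variation (std / mean) of per-group P_dist.
    Population std is an assumption; the benchmark may use the sample std."""
    return 1.0 - statistics.pstdev(group_scores) / statistics.fmean(group_scores)

def p_refuse(provider_rates, human_rates):
    """1 - mean absolute refusal-rate gap across question-demographic pairs."""
    gaps = [abs(r - h) for r, h in zip(provider_rates, human_rates)]
    return 1.0 - sum(gaps) / len(gaps)
```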

P_theme

Thematic Parity (Phase 2+)

For open-ended responses, whether the model's themes and reasoning align with human qualitative patterns.

  • Poor: Off-topic or generic reasoning
  • Fair: Hits some themes but misses key ones
  • Good: Covers most human themes with reasonable proportions
  • Excellent: Themes and reasoning indistinguishable from human responses
Mathematical detail

LLM-as-judge evaluation: theme relevance, theme distribution accuracy, and reasoning quality scored on a rubric.

Baselines: What to Compare Against

A model that scores 0.70 SPS sounds good — until you realize the majority-class baseline scores 0.45. Baselines give meaning to raw scores by anchoring the scale.

Baseline               SPS     Role
Random Baseline        ~0.31   floor
Majority-Class         ~0.45   low
Population-Average     ~0.52   mid
Unconditioned LLM      ~0.58   high
Human Ceiling          ~0.99   ceiling
Temporal Drift Floor   N/A     ceiling

Meaningful evaluation range: Unconditioned LLM (~0.58) to Human Ceiling (~0.99)

Normalized position ("Range %")

Raw SPS values for strong models compress into a narrow band (~0.82-0.85) that is visually hard to separate on a 0-1 bar. The leaderboard also reports a normalized position — where a row sits inside the meaningful evaluation range bounded below by the raw-LLM baseline and above by the Human Ceiling.

Range % = (SPS − P_unconditioned) / (P_ceiling − P_unconditioned)

P_unconditioned is the SPS of the raw-LLM baseline for the same underlying model on the same dataset (the "just prompt the model" reference). Raw-LLM rows therefore resolve to 0% — they are the reference. Product rows (e.g. conditioned SynthPanel variants) show their lift above the corresponding raw model.

P_ceiling is the dataset's aggregate Human Ceiling (see below). Rows without a resolvable raw-LLM baseline (statistical baselines, multi-model ensembles) display "—" for Range % and should be compared via raw SPS instead.

Raw SPS remains the primary column for academic comparability. Range % is a supplementary display that amplifies product lift — a 0.01 SPS gain on a narrow range becomes a double-digit percentage of headroom closed.
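The normalization reduces to a few lines, including the null case for rows without a resolvable raw-LLM baseline. A sketch with illustrative names:

```python
def range_pct(sps, sps_unconditioned, sps_ceiling):
    """Position of a row inside [raw-LLM baseline, Human Ceiling], as a fraction.
    Returns None (displayed as a dash) when no raw-LLM baseline resolves."""
    if sps_unconditioned is None or sps_ceiling is None:
        return None
    return (sps - sps_unconditioned) / (sps_ceiling - sps_unconditioned)
```

A raw-LLM row scores 0.0 by construction; a product row halfway between its raw baseline and the ceiling scores 0.5.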

OpinionsQA: subgroup ceiling distribution

Per-(wave, attribute, group) ceilings across 1,046 subgroups from 15 Pew ATP waves. The wave-aggregate ceiling (0.9995, used for P_dist) is a different quantity and does not bound P_sub.

min 0.8184 · p25 0.9776 · median 0.9955 · p75 0.9981 · max 0.9997

Lowest-ceiling subgroups

  • ATP W36 · POLPARTY_RACE · Democrat × Refused 0.8184
  • ATP W82 · RACE · Other 0.8234
  • ATP W36 · POLPARTY_RACE · Refused × White 0.8326
  • ATP W29 · POLPARTY_RACE · Refused × White 0.8340
  • ATP W27 · POLPARTY_RACE · Republican × Black 0.8375

Quality flags (Cochran 1977): 585 high · 93 medium · 368 low — low-n subgroups are retained but flagged so low reliability is not conflated with poor model fit.

Used by the leaderboard when scoring P_sub; compare P_dist against the wave-aggregate ceiling instead.

Random Baseline

SPS ~0.31

Uniform random distribution over all non-refusal options. For a question with k options, each gets P = 1/k.

What it tests: The absolute floor. Any provider scoring at or below this is adding negative value.

Majority-Class

SPS ~0.45

Always picks the most popular human answer. Assigns P = 1.0 to the mode, 0.0 to everything else.

What it tests: Whether a model does better than just echoing the most common response. Scores well on consensus questions, poorly on divisive ones.

Population-Average

SPS ~0.52

Uses the overall population distribution (ignoring demographics) for every group. The same answer regardless of who is being simulated.

What it tests: Isolates the value of demographic conditioning. The gap between this baseline and a conditioned provider is the conditioning premium.
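The two simplest statistical baselines can be constructed in a few lines (the population-average baseline additionally needs the pooled human distribution, so it is omitted here); a sketch, not the benchmark's actual code:

```python
def random_baseline(k):
    """Uniform distribution over k non-refusal options: P = 1/k each."""
    return [1.0 / k] * k

def majority_class_baseline(human_dist):
    """All mass on the human modal answer: P = 1.0 for the mode, 0.0 elsewhere."""
    mode = max(range(len(human_dist)), key=human_dist.__getitem__)
    return [1.0 if i == mode else 0.0 for i in range(len(human_dist))]
```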

Unconditioned LLM

SPS ~0.58

Raw model output with no persona or demographic conditioning. This is the "just prompt ChatGPT" approach many researchers currently use.

What it tests: The competitive baseline. Every dedicated synthetic respondent product must beat this to justify its existence.

Human Ceiling

SPS ~0.99

Agreement between independent halves of the human survey panel, measured via multinomial-bootstrap split-half reliability. The theoretical maximum score a model can achieve.

What it tests: A provider scoring above this is overfitting or exploiting artifacts. Sets the upper bound for meaningful evaluation.

Methodology details (for researchers)

What it is

Split-half reliability via multinomial bootstrap. Given observed counts c = [c_1, ..., c_k] with total n per question × subpop, we treat p̂ = c/n as the multinomial MLE and draw two independent half-samples of size ⌊n/2⌋ from Multinomial(n/2, p̂). We compute JSD between the two empirical distributions, repeat B times, and report Ceiling = 1 − mean(JSD). Published values use B=1000, the full bootstrap budget; a vectorized multinomial-and-JSD path keeps this feasible at publish time. Reproducible with bootstrap seed 42.
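The procedure above can be sketched as a runnable loop (a compact JSD helper is inlined for self-containment; the seed and B follow the published protocol, but this is an illustration rather than the benchmark's vectorized implementation):

```python
import numpy as np

def _jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence with base-2 logs
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(np.clip(a, eps, None)
                            * np.log2(np.clip(a, eps, None) / np.clip(b, eps, None))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def split_half_ceiling(counts, B=1000, seed=42):
    """Ceiling = 1 - mean JSD between two multinomial half-samples drawn from p-hat."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, float)
    n = int(counts.sum())
    p_hat = counts / n               # multinomial MLE from observed counts
    half = n // 2                    # floor(n/2) per half-sample
    jsds = [_jsd(rng.multinomial(half, p_hat) / half,
                 rng.multinomial(half, p_hat) / half) for _ in range(B)]
    return 1.0 - float(np.mean(jsds))
```

With wave-sized n, the half-samples agree almost perfectly, which is why the published ceilings sit near 0.999; shrink the counts and the ceiling drops, exactly as the Spearman-Brown argument predicts.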

Why it matters

It is the theoretical maximum for any evaluation built on this human data. Humans disagree with each other; no model can be more consistent with the survey than the survey is with itself. Per-subpopulation ceilings genuinely differ — Spearman-Brown says reliability scales with √n — so small subpops have materially lower ceilings than large ones.

Per-dataset protocol

OpinionsQA: computed within-wave (never cross-wave — that measures drift, not reliability), aggregated weighted by n_questions per wave. SubPOP: per-subpop (22 subpopulations) + weighted aggregate. GlobalOpinionQA: per-country + regional aggregates, weighted by actual (country, question) coverage to avoid over-weighting US data.

Two granularities: aggregate vs subgroup

P_dist is measured at wave granularity (~4.5k respondents/wave), so it compares against the wave-aggregate ceiling (~0.9995). P_sub is measured at (wave × attribute × group) granularity — typical n is 50-500, not 4,500 — so the aggregate ceiling materially overstates achievable headroom. OpinionsQA now emits a per-subgroup ceiling distribution and uses the median as the reference for P_sub; the wave-aggregate is kept for P_dist. A model at P_sub = 0.88 is not sitting 0.12 below a true ceiling of 0.9995 — its real headroom is the gap to the subgroup median, which is meaningfully smaller and shrinks further on low-n groups where the ceiling itself drops to ~0.95.

Measured values (v0.1)

OpinionsQA: 0.9995 (raw counts from 15 Pew ATP waves, n≈56k/wave). SubPOP: 0.9954 (22 subpops, inferred n=500 per subpop, quality flag 'medium'). GlobalOpinionQA: 0.9972 (per-country, inferred n=1000, quality flag 'medium'). All values are reproducible with bootstrap seed 42. The high ceilings reflect large within-wave sample sizes; the realistic gap between models and humans is most visible on small, contentious subpops.

Sample-size quality flags (Cochran 1977)

High: raw subpop n ≥ 400 — use directly. Medium: 200 ≤ n < 400 — report with CI caveat. Low: n < 200 — report with warning; do not use as a gating threshold. SubPOP and GlobalOpinionQA ceilings are flagged 'medium' because sample sizes are inferred from published probabilities, not shipped as raw counts.

Survey-weight caveat

Ceiling is computed from raw category counts. Pew applies survey weights; ignoring them could shift the ceiling by 1-3% on demographically skewed subgroups. The raw-count approach is conservative — a weighted ceiling would be slightly tighter, meaning our published ceilings are a mild upper-bound on achievable reliability.

Citations

Methodology: Santurkar et al. 2023 (arxiv:2303.17548); Durmus et al. 2023 (arxiv:2306.16388); Suh et al. 2025 — SubPOP (arxiv:2502.16761); Pew Research Center Methodology. Statistical foundations: Spearman (1910) & Brown (1910) — Spearman-Brown prophecy; Efron (1979) — bootstrap; Lin (1991) — Jensen-Shannon divergence; Cochran (1977) — Sampling Techniques, for the n=400 rule-of-thumb.

Temporal Drift Floor

SPS N/A

A baseline-adjacent metric separate from the Human Ceiling: JSD between human distributions for the same-wording question across different Pew ATP waves. Pew repeats ~15-20% of items across waves for trend tracking.

What it tests: Quantifies how much real-world opinions shift year-over-year on repeated items. Useful framing for P_refuse and longitudinal claims — drift is a property of reality, not of a model.

Methodology details (for researchers)

What it is

For every OpinionsQA question stem that appears in two or more waves, we compute JSD between its human distributions across waves. We report mean JSD overall, broken down by year-gap (1, 2, 3, 5 years), with bootstrap 95% CI over question pairs.
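The cross-wave computation can be sketched as follows, assuming a hypothetical input shape of `{question_stem: {year: distribution}}` for repeated items (the bootstrap CI step is omitted for brevity):

```python
import numpy as np
from itertools import combinations
from collections import defaultdict

def _jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence with base-2 logs
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(np.clip(a, eps, None)
                            * np.log2(np.clip(a, eps, None) / np.clip(b, eps, None))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def drift_by_year_gap(wave_dists):
    """Mean cross-wave JSD for every repeated question, keyed by year gap."""
    gaps = defaultdict(list)
    for dists in wave_dists.values():
        # every pair of waves in which this question stem appears
        for (y1, p), (y2, q) in combinations(sorted(dists.items()), 2):
            gaps[y2 - y1].append(_jsd(p, q))
    return {gap: float(np.mean(v)) for gap, v in sorted(gaps.items())}
```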

Why it is separate from the Human Ceiling

Split-half reliability must be within-wave: cross-wave JSD conflates sampling noise with real opinion shift. Temporal Drift Floor captures only the latter. A model that perfectly predicts 2022 opinions using 2017 training data would still score imperfectly against 2017 humans — the drift floor tells you how much of that gap is unavoidable.

Status

Computed for OpinionsQA only (requires same-wording repeated questions, which Pew ATP uniquely ships). Not applicable to SubPOP or GlobalOpinionQA snapshots. Values emitted to leaderboard.json under baselines.temporal_drift.

See also: drift-by-year-gap visualization on the Findings page

Datasets

SynthBench validates models against real human survey data. Each dataset provides ground-truth response distributions from representative populations.

OpinionsQA

The primary ground-truth dataset. Multiple-choice questions covering political, social, and economic attitudes from nationally representative US surveys.

Questions: 1,498 (300 core)
Source: Pew American Trends Panel
Scope: US population
Coverage: 2017–2022 (15 survey waves)
Demographics: 56 groups across 11 attributes

SubPOP

Extended dataset focused on demographic subgroup variation. Tests whether models can differentiate between fine-grained population segments.

Questions: 3,362
Source: 22 US subpopulations
Scope: US demographic subgroups
Coverage: Cross-sectional
Demographics: 11 attributes (age, education, gender, ideology, party, income, religion, attendance, region, marital status, citizenship)

GlobalOpinionQA (Phase 2)

Cross-cultural attitudes dataset for Phase 2 expansion. Tests whether models capture cultural differences in opinion beyond US demographics.

Questions: 2,556
Source: Pew Global Attitudes
Scope: 138 countries
Coverage: Cross-national
Demographics: Country-level cultural variation

Private holdout split

Every holdout-enabled dataset is partitioned deterministically into an 80% public subset and a 20% private subset. Per-question human distributions for the private subset are suppressed from every public artifact — the site, the run-detail JSON, and the question-explorer pages — so a submitter can't reverse-engineer the private answer key by reading public files.

Submissions must include results for every question, public and private. We (the SynthBench maintainers) score the private subset locally against the hidden distribution and publish the resulting sps_private alongside sps_public. Rows whose public/private SPS delta stays within a tolerance of 0.05 earn a ✓ verified badge; rows outside the tolerance are ⚠ flagged for review.

The partition is derived from a SHA-256 hash of dataset_name + ":" + question_key. That makes it stable across runs, machines, and Python versions without needing a seed file. We do not publish which questions land in the private subset — doing so would defeat the anti-fabrication property.
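A sketch of the deterministic partition: the hash input matches the documented rule, but the exact mapping of the digest to an 80/20 split is an assumption:

```python
import hashlib

def is_private(dataset_name, question_key, private_frac=0.20):
    """Deterministic holdout membership from SHA-256(dataset_name + ":" + question_key).
    Stable across runs, machines, and Python versions: no seed file needed."""
    digest = hashlib.sha256(f"{dataset_name}:{question_key}".encode()).digest()
    # Map the first 8 digest bytes to [0, 1) and threshold at private_frac.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < private_frac
```

Because the digest is uniform, roughly 20% of question keys land in the private subset, and membership never changes between runs.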

Why bother. Two independent motivations converge on the same mechanism. First, future LLMs may train on SynthBench itself; holding out 20% slows that recursion. Second (and more important after the public flip): once per-question human distributions are visible, a bad actor could fabricate a submission JSON that exactly matches them. The private subset is a cheap cheating detector — a fabricator has no signal on the hidden keys, so their public and private SPS diverge sharply.

How a Benchmark Run Works

Every provider goes through the same evaluation pipeline. Here is what happens when you run synthbench run.

  1. Select dataset and question set
     Choose a suite: Core (300 questions, ~1h) for iteration or Full (1,498 questions, ~6h) for publication-grade results.

  2. Generate samples for each question
     For each question-persona pair, generate N independent samples from the model. Default: 30 samples. Publication-grade: 100 samples.

  3. Parse responses into distributions
     Map each model response to a survey option. Compute P(option_i) = count(option_i) / N. Logprob-capable providers return distributions directly.

  4. Compare against human ground truth
     For each question-demographic pair, compute JSD between the model distribution and the human distribution from real survey data.

  5. Compute per-question metrics
     Calculate all five sub-metrics (P_dist, P_rank, P_cond, P_sub, P_refuse) at the question level.

  6. Aggregate with confidence intervals
     Average across questions. Bootstrap resampling produces 95% confidence intervals for each metric and the composite SPS.
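Step 3 of the pipeline is a simple counting operation. A minimal sketch, where option indices and upstream failure handling are illustrative:

```python
from collections import Counter

def responses_to_distribution(parsed, k):
    """Map N parsed responses (option indices 0..k-1) to P(option_i) = count_i / N.
    Parse failures are assumed to have been logged and dropped upstream."""
    counts = Counter(parsed)
    n = len(parsed)
    return [counts.get(i, 0) / n for i in range(k)]
```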

Sampling

  • Minimum 30 samples per question-persona pair
  • 100 samples recommended for publication
  • Wilson score intervals quantify estimation uncertainty
  • Parse failures are logged and excluded from counts
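The Wilson score interval mentioned above quantifies how uncertain a sampled proportion is at a given N; a standard closed form, shown here at the default 95% level:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a sampled option proportion (95% by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)
```

At 30 samples the interval around a 50% option is roughly ±0.17, which is why 100 samples are recommended for publication-grade runs.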

Replication

  • Multiple independent runs recommended
  • Convergence data validates stability
  • Run-to-run variance reported alongside point estimates

Fairness

  • Temperature fixed at provider default (not optimized)
  • Each provider tested via its native interface
  • Same question sets and evaluation pipeline for all providers

Run Validity Filtering

Silent API failures — budget exhaustion, missed model aliases, provider fallbacks — can return responses that parse cleanly but carry no real signal, producing a run whose per-question distributions are all perfectly uniform. Existing parse-failure tracking does not catch this case because the response did parse. We filter these runs at publish time so they never reach the leaderboard.

A run is excluded when all three of the following hold:

Excluded runs are listed in leaderboard.json#excluded_runs for transparency, and operators can run synthbench scan-invalid locally against any results directory. The rule is deliberately strict: we would rather miss a borderline case than false-flag a legitimate run whose distributions happen to be flat.

Cost computation

Cost is a derived quantity on SynthBench: providers return measured token counts at runtime, and the publish pipeline multiplies those by a dated pricing snapshot to produce the $/100Q numbers you see on the leaderboard.

What we measure

Every benchmark run records per-call token usage from the provider API — input tokens, output tokens, and (for Anthropic) cache-creation and cache-read tokens. Tokens are a measured quantity returned by the provider, not an estimate. The runner aggregates per-question usage into a run-level total stored alongside scores in the raw run JSON.

Runs on providers that do not return usage metadata (e.g. Ollama local inference) record no token counts, and downstream cost is reported as null rather than guessed.

What we derive

At publish time, we multiply recorded tokens by per-model pricing from synthpanel to derive four fields per leaderboard row: cost_usd (total spend for the run), cost_per_100q (spend normalized to 100 questions), cost_per_sps_point (spend per SPS point achieved), and is_cost_estimated (true if any pricing dimension was inferred rather than measured).

Derivation happens in synthbench/publish.py. Tokens stay immutable in raw run JSON; cost is recomputed every time we republish, so price-list updates flow through without rerunning any benchmark.
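The token-to-cost derivation is a per-dimension multiply-and-sum. A sketch with illustrative field names (the actual schema in `synthbench/publish.py` may differ):

```python
PRICE_KEYS = ("input", "output", "cache_creation", "cache_read")

def derive_cost(tokens, rates_per_mtok):
    """cost_usd = measured tokens x snapshot rates ($ per million tokens).
    Returns None when usage metadata was never recorded (e.g. local inference)."""
    if tokens is None or rates_per_mtok is None:
        return None
    return sum(tokens.get(k, 0) * rates_per_mtok.get(k, 0.0) / 1_000_000
               for k in PRICE_KEYS)
```

Because tokens stay immutable and cost is recomputed at publish time, a pricing bump changes only `rates_per_mtok`, never the raw run JSON.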

Pricing snapshot

The exact pricing table used to derive costs is serialized into leaderboard.json under a top-level pricing_snapshot block. It includes a snapshot_date and per-provider rates (input, output, cache-creation, and cache-read cost per million tokens). Readers can reproduce every displayed cost from the snapshot and the raw token counts.

Rates track the public price list documented on provider pricing pages (e.g. https://www.anthropic.com/pricing). The snapshot is updated when synthpanel's cost.py constants are bumped.

Self-hosted and unknown models

For self-hosted models (Ollama, local inference) and any provider whose pricing is not tracked in synthpanel, cost_usd is reported as null. We deliberately do not impute a cost — hardware, electricity, and amortization vary too widely to produce a number readers can fairly compare against API-priced rows.

Null cost rows sort last in the leaderboard's $/100Q column and are excluded from the Cost-vs-SPS Pareto chart on the Findings page.

Ensemble cost

An ensemble run's cost is the sum of its constituent runs' costs. If every constituent has a tracked cost, the ensemble's cost_usd is the sum; if any constituent is null (self-hosted or unknown), the ensemble's cost_usd is null rather than a partial total.

This mirrors how ensembles aggregate elsewhere on the site: a composite number is only emitted when every input to that number is known.
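The null-propagation rule reduces to a few lines (illustrative):

```python
def ensemble_cost(constituent_costs):
    """Sum of constituent run costs; any unknown (None) constituent makes the total None."""
    if any(c is None for c in constituent_costs):
        return None
    return sum(constituent_costs)
```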

Pre-tracking runs

Runs produced before per-call token capture landed in the runner have no usage metadata and therefore cost_usd is null. We do not backfill an estimate; the dash in the $/100Q column tells readers honestly that the number was never measured for that row.

A run's cost can only be recovered by re-running it on the current runner. Raw run JSONs are append-only.

Roadmap

SynthBench expands in phases, each adding new ground-truth datasets and metric dimensions.

Phase 1: US Opinion Parity (Current)

Establish the core benchmark with US survey data. Validate metrics against human baselines and publish initial leaderboard.

Datasets: OpinionsQA, SubPOP
Metrics: P_dist, P_rank, P_cond, P_sub, P_refuse

Phase 2: Global Cultural Parity

Extend evaluation to 138 countries. Test whether models capture cultural differences in opinion beyond US demographics.

Datasets: GlobalOpinionQA expansion
Metrics: + cross-cultural P_dist

Phase 3: Open-Ended Qualitative

Move beyond multiple choice. Evaluate free-text responses for thematic accuracy using automated judge models.

Datasets: Custom open-ended question sets
Metrics: + P_theme (LLM-as-judge)

Phase 4: Temporal Stability

Track how model accuracy changes over time. Measure whether models keep pace as public opinion shifts.

Datasets: Longitudinal tracking panels
Metrics: + temporal drift metrics

Citation

If you use SynthBench in your research, please cite:

@software{synthbench2026,
  title   = {SynthBench: Open Benchmark for Synthetic Survey Respondents},
  author  = {DataViking-Tech},
  year    = {2026},
  url     = {https://github.com/DataViking-Tech/synthbench},
  version = {0.1.0}
}