Research Findings

Six experiments that reveal what drives synthetic survey quality — from the levers that matter most to the biases hiding in default model behavior.

What Matters Most

Effect size of each optimization lever, sorted by maximum impact. Ensemble blending delivers +5-7 SPS points at zero additional cost.

Optimization Lever Effect Sizes

Horizontal bar chart of effect sizes for each optimization lever (ensemble, temperature, conditioning, and more), sorted by maximum impact.

The Biggest Lever: Ensemble Blending

For each dataset, ensemble blending of 3 models outperforms the best single model by +5.2 to +7.0 SPS points. Zero additional API cost — just arithmetic.

Ensemble Gain over Best Single Model

Bar chart comparing the SPS of the best single model against a 3-model ensemble for each dataset. Ensemble gains range from +5.2 to +7.0 points.

Cost vs Performance Pareto

Each point is one configuration. Pareto-optimal configurations (no other run is both cheaper and higher-SPS) are highlighted; dominated runs are muted. The lower-left frontier marks the cost/quality trades no other configuration beats.

Cost data not yet available. Re-run publish after the cost-tracking pipeline lands.

Cross-Provider JSD Matrix

Pairwise Jensen-Shannon divergence between each raw-LLM pair, averaged across shared questions per dataset. Low values (cooler) mean providers agree with each other — a red flag under HBR's trendslop hypothesis, where cross-model consensus forms around shared errors. High values (warmer) mean providers genuinely diverge. Ground-truth concordance (model vs. human) is a separate axis — see mean_human_jsd below the chart.

Cross-Provider JSD Matrix

Heatmap of pairwise Jensen-Shannon divergence between raw-LLM providers, per dataset. Lower off-diagonal values indicate cross-model agreement; higher values indicate divergence.
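The pairwise divergence behind the heatmap can be sketched as follows: a minimal Jensen-Shannon implementation with base-2 logs, so values are bounded in [0, 1]. Function names are illustrative, not the benchmark's actual code.

```python
import numpy as np

def jsd(p, q, base=2):
    """Jensen-Shannon divergence between two discrete distributions:
    JSD(P, Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), where M = (P + Q) / 2.
    With base-2 logs, the result lies in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                     # 0 * log(0) terms contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions score 0; fully disjoint ones score 1. A per-dataset matrix cell would be this value averaged over the questions both providers answered.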


How the Ensemble Is Built

The biggest lever in the benchmark is also the simplest. Six things worth knowing about how blending turns three good models into one better one.

What we blend

Per-question response distributions from three models — Claude Haiku 4.5, Gemini Flash Lite, and GPT-4o-mini — run through the SynthPanel conditioning framework on the same inputs.

How the blend works

Equal-weight arithmetic average of the three distributions per question. For each option on each question, we take the mean of the three models' probability mass. No learned weights, no router, no post-hoc calibration.
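A minimal sketch of the equal-weight blend described above, assuming each model's answer is a probability vector over the same options. The numeric distributions here are made up for illustration.

```python
import numpy as np

def blend(distributions):
    """Equal-weight arithmetic average of per-question answer
    distributions. Each input is a probability vector over the same
    answer options; the mean of valid distributions is itself a
    valid distribution (non-negative, sums to 1)."""
    return np.stack(distributions).mean(axis=0)

# Three hypothetical per-question distributions over a 4-option question.
claude = np.array([0.50, 0.30, 0.15, 0.05])
gemini = np.array([0.40, 0.35, 0.15, 0.10])
gpt    = np.array([0.45, 0.25, 0.20, 0.10])

ensemble = blend([claude, gemini, gpt])  # still sums to 1
```

No learned weights and no calibration step, exactly as described: the entire ensemble is one `mean` over arrays already on disk.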

Why it works

The three models make uncorrelated errors on different questions. Blending cancels noise that is specific to any one model while preserving signal they agree on — 72–81% of questions improve under blending across datasets.

What we tested against it

Equal-weight, score-proportional, and inverse-JSD weighting — scores are indistinguishable to three decimals. The blending weights don't matter; the ensemble itself does.
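The alternative weightings can be sketched as convex combinations. The weights below are hypothetical stand-ins for real SPS and JSD values, chosen only to show why near-uniform weights leave the blends nearly identical.

```python
import numpy as np

def weighted_blend(distributions, weights):
    """Convex combination of per-question distributions:
    normalize the weights, then take the weighted average."""
    w = np.asarray(weights, float)
    w = w / w.sum()
    return np.tensordot(w, np.stack(distributions), axes=1)

dists = [np.array([0.50, 0.30, 0.20]),
         np.array([0.40, 0.40, 0.20]),
         np.array([0.45, 0.35, 0.20])]

equal       = weighted_blend(dists, [1, 1, 1])
score_prop  = weighted_blend(dists, [0.71, 0.69, 0.70])            # hypothetical SPS-proportional weights
inverse_jsd = weighted_blend(dists, [1 / 0.10, 1 / 0.11, 1 / 0.10])  # hypothetical inverse mean-JSD weights
```

Because model scores cluster tightly, every scheme normalizes to near-uniform weights, so the three blends differ only in the third decimal and beyond.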

What we didn't beat

Per-question oracle selection (the best single model picked retroactively) adds negligible headroom over equal-weight blending. There is little left to extract from this set of models without adding new ones.
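Per-question oracle selection can be sketched as follows, using total variation distance as an illustrative stand-in for the benchmark's actual scoring. All data here is made up.

```python
import numpy as np

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

def oracle_select(model_dists, human_dist, distance=tvd):
    """Per-question oracle: retroactively pick the single model whose
    distribution is closest to the human ground truth. An upper bound
    on single-model performance, not a deployable strategy."""
    errs = [distance(d, human_dist) for d in model_dists]
    return model_dists[int(np.argmin(errs))]

human  = np.array([0.60, 0.30, 0.10])
models = [np.array([0.30, 0.40, 0.30]),
          np.array([0.55, 0.35, 0.10]),   # closest to human
          np.array([0.40, 0.30, 0.30])]

best = oracle_select(models, human)
```

Even this hindsight-cheating selector barely beats the equal-weight blend in the benchmark, which is what makes the headroom claim credible.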

What it costs

Zero additional API calls. The blend is pure offline arithmetic on results we already have. The only cost is running the three base models once — which is what any serious evaluation does anyway.

Temperature: It Depends on the Model

SPS across temperature settings for each model. Error bars show ±1 std across replicates. Gemini Flash Lite gains +4.5 SPS points from 0.3 to 2.0 (well outside its ±0.003 std), while Claude Haiku 4.5 moves under its own noise band.

SPS vs Temperature — per-model sensitivity

Line chart of SPS across temperature settings for each model. Error bars show ±1 standard deviation across replicates.

Demographic Conditioning Reveals Model Bias

How much conditioning on a demographic group shifts model responses (p_cond). Republican conditioning produces a 2.2× larger shift than Democrat conditioning — the unconditioned model already leans progressive by default.

Dimensions: POLPARTY, INCOME, EDUCATION
Demographic Conditioning Gap

Chart showing how conditioning on different demographic groups shifts model responses (p_cond). Republican conditioning produces a larger shift than Democrat — evidence of a default progressive lean.

SPS by Topic

Best-per-model performance across the ten topic categories. Baselines and temperature variants hidden; scroll the legend to toggle models.

SPS by Topic — per-model breakdown

Grouped bar chart of best-per-model SPS across ten topic categories. Each group is a topic; each bar is a model.

SPS Convergence

How SPS stabilizes as replicate count increases, for the ensemble and the three single models it blends.

SPS Convergence by replicate count

Line chart showing how Survey Parity Score stabilizes as the number of replicates increases. One line per model plus the ensemble.

Temporal Drift Floor

Mean Jensen-Shannon divergence between the same Pew ATP question stems across different waves, grouped by year gap. A separate baseline from the Human Ceiling — drift measures how much real-world opinions shift, not sampling noise.

Temporal Drift Floor — mean JSD by year gap

Bar chart of mean Jensen-Shannon divergence between the same Pew ATP question stems across different waves, grouped by year gap. Shaded band indicates bootstrap 95% CI on the overall mean.

A model frozen at time T cannot improve beyond the temporal drift floor when evaluated against later waves. Shaded band = bootstrap 95% CI on the overall mean (0.00019–0.00235, n_pairs=5, n_stems=5). See the Temporal Drift Floor definition on the methodology page.
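The shaded band's construction can be sketched as a percentile bootstrap on the per-pair mean JSDs. The five values below are hypothetical small numbers in the chart's range, not the actual Pew ATP results.

```python
import numpy as np

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI on the mean: resample with replacement,
    recompute the mean per resample, then take the alpha/2 and
    1 - alpha/2 quantiles of the bootstrap means."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, float)
    means = rng.choice(values, size=(n_boot, len(values)), replace=True).mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Hypothetical per-pair mean JSDs (n_pairs=5).
pair_jsds = [0.0002, 0.0008, 0.0012, 0.0019, 0.0023]
lo, hi = bootstrap_ci(pair_jsds)
```

With only five pairs the interval is wide relative to the mean, which matches the broad shaded band and the small-n caveat in the chart note.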