Research Findings
Six experiments that reveal what drives synthetic survey quality — from the levers that matter most to the biases hiding in default model behavior.
What Matters Most
Effect size of each optimization lever, sorted by maximum impact. Ensemble blending delivers +5-7 SPS points at zero additional cost.
Horizontal bar chart of effect sizes for each optimization lever (ensemble, temperature, conditioning, and more), sorted by maximum impact.
The Biggest Lever: Ensemble Blending
For each dataset, ensemble blending of 3 models outperforms the best single model by +5.2 to +7.0 SPS points. Zero additional API cost — just arithmetic.
Bar chart comparing the SPS of the best single model against a 3-model ensemble for each dataset. Ensemble gains range from +5.2 to +7.0 points.
Cost vs Performance Pareto
Each point is one configuration. Pareto-optimal configurations (no other run is cheaper AND higher-SPS) are highlighted; dominated runs are muted. Lower-left of the frontier is the cost/quality trade you cannot beat.
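The dominance rule in the caption — a run is Pareto-optimal when no other run is both cheaper and higher-SPS — can be sketched in a few lines. This is an illustrative implementation, not the benchmark's code; the field names (`cost_usd`, `sps`) and the example runs are made up.

```python
# Sketch: flag Pareto-optimal (cost, SPS) configurations.
# A run is dominated if some other run is strictly cheaper AND strictly
# higher-SPS. Field names and data are illustrative, not the benchmark's.

def pareto_optimal(runs):
    """Return the runs no other run strictly dominates."""
    return [
        r for r in runs
        if not any(
            o["cost_usd"] < r["cost_usd"] and o["sps"] > r["sps"]
            for o in runs
        )
    ]

runs = [
    {"name": "A", "cost_usd": 1.0, "sps": 70.0},
    {"name": "B", "cost_usd": 2.0, "sps": 65.0},  # dominated by A
    {"name": "C", "cost_usd": 3.0, "sps": 80.0},
]
print([r["name"] for r in pareto_optimal(runs)])  # → ['A', 'C']
```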
Cross-Provider JSD Matrix
Pairwise Jensen-Shannon divergence between each raw-LLM pair, averaged across shared questions per dataset. Low values (cooler) mean providers agree with each other — a red flag under HBR's trendslop hypothesis, which predicts cross-model consensus on shared errors. High values (warmer) mean providers genuinely diverge. Ground-truth concordance (model vs. human) stays a separate axis — see mean_human_jsd below the chart.
Heatmap of pairwise Jensen-Shannon divergence between raw-LLM providers, per dataset. Lower off-diagonal values indicate cross-model agreement; higher values indicate divergence.
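For reference, one off-diagonal cell of the matrix is a Jensen-Shannon divergence between two providers' answer distributions for the same question. A minimal standard-library sketch (the distributions below are illustrative, not benchmark data):

```python
# Sketch: Jensen-Shannon divergence between two option distributions.
# Base-2 logs, so the value is bounded in [0, 1]. Illustrative data only.
import math

def kl(p, q):
    """Kullback-Leibler divergence, skipping zero-mass options."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Symmetric Jensen-Shannon divergence via the midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

provider_a = [0.60, 0.30, 0.10]  # P(option) for one shared question
provider_b = [0.50, 0.35, 0.15]
print(round(jsd(provider_a, provider_b), 4))
```

Averaging this quantity over all shared questions gives one cell of the heatmap; identical distributions give 0, disjoint ones give 1.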
How the Ensemble Is Built
The biggest lever in the benchmark is also the simplest. Six things worth knowing about how blending turns three good models into one better one.
What we blend
Per-question response distributions from three models — Claude Haiku 4.5, Gemini Flash Lite, and GPT-4o-mini — run through the SynthPanel conditioning framework on the same inputs.
How the blend works
Equal-weight arithmetic average of the three distributions per question. For each option on each question, we take the mean of the three models' probability mass. No learned weights, no router, no post-hoc calibration.
Why it works
The three models make uncorrelated errors on different questions. Blending cancels noise that is specific to any one model while preserving signal they agree on — 72–81% of questions improve under blending across datasets.
What we tested against it
Equal-weight, score-proportional, and inverse-JSD weighting — scores are indistinguishable to three decimals. The blending weights don't matter; the ensemble itself does.
What we didn't beat
Per-question oracle selection (the best single model picked retroactively) adds negligible headroom over equal-weight blending. There is little left to extract from this set of models without adding new ones.
What it costs
Zero additional API calls. The blend is pure offline arithmetic on results we already have. The only cost is running the three base models once — which is what any serious evaluation does anyway.
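The blend described above is small enough to write out in full. This sketch follows the text's recipe — an equal-weight arithmetic mean of the three models' per-question option distributions — but the distributions themselves are illustrative, not benchmark data:

```python
# Sketch of the equal-weight blend: for each option on each question,
# take the mean of the three models' probability mass. No learned
# weights, no router. Distributions below are illustrative.

def blend(distributions):
    """Equal-weight arithmetic mean of per-question option distributions."""
    n = len(distributions)
    return [sum(opts) / n for opts in zip(*distributions)]

haiku = [0.50, 0.30, 0.20]  # Claude Haiku 4.5, one question
flash = [0.40, 0.40, 0.20]  # Gemini Flash Lite
gpt4o = [0.45, 0.35, 0.20]  # GPT-4o-mini
print([round(p, 4) for p in blend([haiku, flash, gpt4o])])  # → [0.45, 0.35, 0.2]
```

Because the inputs are valid distributions and the weights sum to one, the blend is itself a valid distribution — which is why no post-hoc calibration is needed.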
Temperature: It Depends on the Model
SPS across temperature settings for each model. Error bars show ±1 std across replicates. Gemini Flash Lite gains +4.5 SPS points from 0.3 to 2.0 (well outside its ±0.003 std), while Claude Haiku 4.5's variation stays within its own noise band.
Line chart of SPS across temperature settings for each model. Error bars show ±1 standard deviation across replicates.
Demographic Conditioning Reveals Model Bias
How much conditioning on a demographic group shifts model responses (p_cond). Republican conditioning is 2.2x stronger than Democrat — the model already leans progressive by default.
Chart showing how conditioning on different demographic groups shifts model responses (p_cond). Republican conditioning produces a larger shift than Democrat — evidence of a default progressive lean.
SPS by Topic
Best-per-model performance across the ten topic categories. Baselines and temperature variants hidden; scroll the legend to toggle models.
Grouped bar chart of best-per-model SPS across ten topic categories. Each group is a topic; each bar is a model.
SPS Convergence
How SPS stabilizes as replicate count increases. Shown: the ensemble and the three single models it blends.
Line chart showing how Survey Parity Score stabilizes as the number of replicates increases. One line per model plus the ensemble.
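A convergence curve of this kind is just the running mean of a per-replicate score series. A minimal sketch, with made-up SPS values (not benchmark results):

```python
# Sketch: running mean after k replicates — the quantity plotted on the
# convergence chart. Scores below are illustrative, not benchmark data.

def running_mean(scores):
    """Cumulative mean of the first k scores, for k = 1..len(scores)."""
    out, total = [], 0.0
    for k, s in enumerate(scores, start=1):
        total += s
        out.append(total / k)
    return out

replicate_sps = [71.0, 73.0, 72.0, 72.4]  # hypothetical per-replicate SPS
print([round(m, 2) for m in running_mean(replicate_sps)])  # → [71.0, 72.0, 72.0, 72.1]
```

When consecutive running means change by less than the score's own noise band, adding replicates no longer buys precision.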
Temporal Drift Floor
Mean Jensen-Shannon divergence between the same Pew ATP question stems across different waves, grouped by year gap. A separate baseline from the Human Ceiling — drift measures how much real-world opinions shift, not sampling noise.
Bar chart of mean Jensen-Shannon divergence between the same Pew ATP question stems across different waves, grouped by year gap. Shaded band indicates bootstrap 95% CI on the overall mean.
A model frozen at time T cannot improve beyond the temporal drift floor when evaluated against later waves. Shaded band = bootstrap 95% CI on the overall mean (0.00019–0.00235, n_pairs=5, n_stems=5). See the Temporal Drift Floor definition on the methodology page.
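The shaded band's bootstrap 95% CI on a mean can be sketched with the standard library. This is a generic percentile bootstrap under assumed inputs — the drift values below are illustrative stand-ins, not the actual n_pairs=5 data:

```python
# Sketch: percentile bootstrap 95% CI on a mean, the kind of interval
# shown as the shaded band. Drift values are illustrative only.
import random

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Resample with replacement; return the (alpha/2, 1-alpha/2) quantiles."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

drift = [0.0002, 0.0009, 0.0011, 0.0015, 0.0023]  # one JSD per wave pair
print(bootstrap_ci(drift))
```

With only five pairs, the interval is wide relative to the mean — consistent with the broad CI reported above.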