SynthBench
Not all synthetic surveys are built equal
A single LLM will almost always validate your idea—even when it shouldn't. Model selection, conditioning, and sampling determine whether synthetic respondents actually represent real human opinions.
v0.1.0 · Generated 4/15/2026
Best Model vs Random Baseline
Survey Parity Score (SPS) — higher is better. 1.0 = perfect match to human survey distributions.
[Figure: bar chart comparing the top-performing model's SPS against the random-pick baseline.]
Key Findings
The most surprising results from our benchmark runs
Ensemble Advantage
+6-7 SPS points
Blending 3 models beats any single model. Zero additional API cost—just arithmetic on existing responses.
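The "just arithmetic" step can be pictured with a minimal Python sketch. It assumes each model's answers to a question have already been tallied into a probability distribution over the answer options, and that blending means a weighted average of those distributions. The function name `blend_distributions`, the equal default weights, and the toy numbers are illustrative assumptions, not SynthPanel's actual API or benchmark data.

```python
from typing import Dict, List, Optional

Distribution = Dict[str, float]  # answer option -> probability


def blend_distributions(
    dists: List[Distribution],
    weights: Optional[List[float]] = None,
) -> Distribution:
    """Weighted average of per-model answer distributions (hypothetical helper)."""
    if not dists:
        raise ValueError("need at least one distribution")
    if weights is None:
        # Assumption: equal weighting across models.
        weights = [1.0] * len(dists)
    if len(weights) != len(dists):
        raise ValueError("one weight per distribution is required")

    # Union of answer options; a model that never picked an option counts as 0.
    options = sorted(set().union(*dists))
    blended = {
        opt: sum(w * d.get(opt, 0.0) for w, d in zip(weights, dists))
        for opt in options
    }

    # Renormalize so the blended probabilities sum to 1.
    total = sum(blended.values())
    if total == 0:
        raise ValueError("blended distribution has zero total mass")
    return {opt: p / total for opt, p in blended.items()}


# Toy inputs for illustration only (not benchmark results).
model_a = {"agree": 0.6, "disagree": 0.4}
model_b = {"agree": 0.3, "disagree": 0.7}
model_c = {"agree": 0.6, "disagree": 0.4}
ensemble = blend_distributions([model_a, model_b, model_c])
```

Because the inputs are responses that were already collected, a blend like this needs no further model calls.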
Conditioning Asymmetry
2.2× gap
Republican conditioning shifts responses 2.2× more than Democratic conditioning, revealing the model’s progressive default lean.
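One way to quantify an asymmetry like this is to measure how far each conditioned answer distribution moves from the unconditioned one and take the ratio. The sketch below uses total variation distance as the shift measure. That choice, the function names, and the toy numbers are assumptions for illustration; the page does not state which metric the benchmark uses.

```python
from typing import Dict

Distribution = Dict[str, float]  # answer option -> probability


def total_variation(p: Distribution, q: Distribution) -> float:
    """Total variation distance between two answer distributions."""
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)


def conditioning_asymmetry(
    baseline: Distribution,
    cond_a: Distribution,
    cond_b: Distribution,
) -> float:
    """Ratio of shift under condition A to shift under condition B.

    Hypothetical helper: a value above 1 means condition A moves the
    model further from its unconditioned answers than condition B does.
    """
    shift_a = total_variation(baseline, cond_a)
    shift_b = total_variation(baseline, cond_b)
    if shift_b == 0:
        raise ValueError("condition B produced no shift; ratio is undefined")
    return shift_a / shift_b


# Toy inputs for illustration only (not benchmark results).
unconditioned = {"agree": 0.6, "disagree": 0.4}
persona_a = {"agree": 0.3, "disagree": 0.7}
persona_b = {"agree": 0.7, "disagree": 0.3}
ratio = conditioning_asymmetry(unconditioned, persona_a, persona_b)
```

If the unconditioned answers already sit closer to one persona, that persona's conditioning has less distance to cover, which is the sense in which a large ratio points to a default lean.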
Temperature Matters (Sometimes)
+4.5% for Gemini, ±0.7% for Haiku
Temperature sensitivity is model-specific, not universal. One size does not fit all.
Leaderboard Summary
Top 3 models per dataset by Survey Parity Score
| # | Model | Dataset | SPS | Range % | p_dist | p_rank |
|---|---|---|---|---|---|---|
| 1 | SynthPanel (GPT-4o-mini) [conditioned] | globalopinionqa | 0.786 | — | 0.689 | 0.694 |
| 2 | Gemini 2.5 Flash | globalopinionqa | 0.770 | — | 0.687 | 0.645 |
| 3 | Llama 3.3 70B | globalopinionqa | 0.762 | — | 0.635 | 0.672 |
| 1 | SynthPanel Ensemble (3-model) [ensemble] | opinionsqa | 0.835 | — | 0.833 | 0.837 |
| 2 | Gemini 2.5 Flash | opinionsqa | 0.829 | — | 0.738 | 0.761 |
| 3 | SynthPanel (Sonnet 4) [conditioned] | opinionsqa | 0.829 | — | 0.726 | 0.793 |
| 1 | SynthPanel Ensemble (3-model) [ensemble] | subpop | 0.833 | — | 0.871 | 0.795 |
| 2 | SynthPanel (Gemini Flash Lite) [conditioned] | subpop | 0.821 | — | 0.707 | 0.780 |
| 3 | SynthPanel (Haiku 4.5) [conditioned] | subpop | 0.809 | — | 0.712 | 0.757 |
Try SynthPanel
Run synthetic surveys with built-in best practices. pip install or clone from GitHub.
Get SynthPanel
Explore Methodology
How we score models, what SPS measures, and why distribution fidelity matters.
Read methodology
Submit Your Model
Run the benchmark on your model and submit results. Open to any provider or framework.
Submit results