Research Findings
Six experiments that reveal what drives synthetic survey quality — from the levers that matter most to the biases hiding in default model behavior.
What Matters Most
Effect size of each optimization lever, sorted by maximum impact. Ensemble blending delivers +5-7 SPS points at zero additional cost.
Horizontal bar chart of effect sizes for each optimization lever (ensemble, temperature, conditioning, and more), sorted by maximum impact.
The Biggest Lever: Ensemble Blending
For each dataset, ensemble blending of 3 models outperforms the best single model by +5.2 to +7.0 SPS points. Zero additional API cost — just arithmetic.
Bar chart comparing the SPS of the best single model against a 3-model ensemble for each dataset. Ensemble gains range from +5.2 to +7.0 points.
Cost vs Performance Pareto
Each point is one configuration. Pareto-optimal configurations (no other run is cheaper AND higher-SPS) are highlighted; dominated runs are muted. Lower-left of the frontier is the cost/quality trade you cannot beat.
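The dominance rule in the caption — a run is Pareto-optimal when no other run is both cheaper and higher-SPS — can be sketched in a few lines. This is an illustrative implementation, not the benchmark's code; the field names (`cost_usd`, `sps`) and the example runs are made up.

```python
# Sketch: flag Pareto-optimal (cost, SPS) configurations.
# A run is dominated if some other run is strictly cheaper AND strictly
# higher-SPS. Field names and data are illustrative, not the benchmark's.

def pareto_optimal(runs):
    """Return the runs no other run strictly dominates."""
    return [
        r for r in runs
        if not any(
            o["cost_usd"] < r["cost_usd"] and o["sps"] > r["sps"]
            for o in runs
        )
    ]

runs = [
    {"name": "A", "cost_usd": 1.0, "sps": 70.0},
    {"name": "B", "cost_usd": 2.0, "sps": 65.0},  # dominated by A
    {"name": "C", "cost_usd": 3.0, "sps": 80.0},
]
print([r["name"] for r in pareto_optimal(runs)])  # → ['A', 'C']
```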
Cross-Provider JSD Matrix
Pairwise Jensen-Shannon divergence between each raw-LLM pair, averaged across shared questions per dataset. Low values (cooler) mean providers agree with each other — a red flag under HBR's trendslop hypothesis, which predicts cross-model consensus on shared errors. High values (warmer) mean providers genuinely diverge. Ground-truth concordance (model vs. human) stays a separate axis — see mean_human_jsd below the chart.
Heatmap of pairwise Jensen-Shannon divergence between raw-LLM providers, per dataset. Lower off-diagonal values indicate cross-model agreement; higher values indicate divergence.
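For reference, one off-diagonal cell of the matrix is a Jensen-Shannon divergence between two providers' answer distributions for the same question. A minimal standard-library sketch (the distributions below are illustrative, not benchmark data):

```python
# Sketch: Jensen-Shannon divergence between two option distributions.
# Base-2 logs, so the value is bounded in [0, 1]. Illustrative data only.
import math

def kl(p, q):
    """Kullback-Leibler divergence, skipping zero-mass options."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Symmetric Jensen-Shannon divergence via the midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

provider_a = [0.60, 0.30, 0.10]  # P(option) for one shared question
provider_b = [0.50, 0.35, 0.15]
print(round(jsd(provider_a, provider_b), 4))
```

Averaging this quantity over all shared questions gives one cell of the heatmap; identical distributions give 0, disjoint ones give 1.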
How the Ensemble Is Built
The biggest lever in the benchmark is also the simplest. Six things worth knowing about how blending turns three good models into one better one.
What we blend
Per-question response distributions from three models — Claude Haiku 4.5, Gemini Flash Lite, and GPT-4o-mini — run through the SynthPanel conditioning framework on the same inputs.
How the blend works
Equal-weight arithmetic average of the three distributions per question. For each option on each question, we take the mean of the three models' probability mass. No learned weights, no router, no post-hoc calibration.
Why it works
The three models make uncorrelated errors on different questions. Blending cancels noise that is specific to any one model while preserving signal they agree on — 72–81% of questions improve under blending across datasets.
What we tested against it
Equal-weight, score-proportional, and inverse-JSD weighting — scores are indistinguishable to three decimals. The blending weights don't matter; the ensemble itself does.
What we didn't beat
Per-question oracle selection (the best single model picked retroactively) adds negligible headroom over equal-weight blending. There is little left to extract from this set of models without adding new ones.
What it costs
Zero additional API calls. The blend is pure offline arithmetic on results we already have. The only cost is running the three base models once — which is what any serious evaluation does anyway.
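The blend described above is small enough to write out in full. This sketch follows the text's recipe — an equal-weight arithmetic mean of the three models' per-question option distributions — but the distributions themselves are illustrative, not benchmark data:

```python
# Sketch of the equal-weight blend: for each option on each question,
# take the mean of the three models' probability mass. No learned
# weights, no router. Distributions below are illustrative.

def blend(distributions):
    """Equal-weight arithmetic mean of per-question option distributions."""
    n = len(distributions)
    return [sum(opts) / n for opts in zip(*distributions)]

haiku = [0.50, 0.30, 0.20]  # Claude Haiku 4.5, one question
flash = [0.40, 0.40, 0.20]  # Gemini Flash Lite
gpt4o = [0.45, 0.35, 0.20]  # GPT-4o-mini
print([round(p, 4) for p in blend([haiku, flash, gpt4o])])  # → [0.45, 0.35, 0.2]
```

Because the inputs are valid distributions and the weights sum to one, the blend is itself a valid distribution — which is why no post-hoc calibration is needed.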
Temperature: It Depends on the Model
SPS across temperature settings for each model. Error bars show ±1 std across replicates. Gemini Flash Lite gains +4.5 SPS points from 0.3 to 2.0 (well outside its ±0.003 std), while Claude Haiku 4.5's variation stays within its own noise band.
Line chart of SPS across temperature settings for each model. Error bars show ±1 standard deviation across replicates.
Demographic Conditioning Reveals Model Bias
How much conditioning on a demographic group shifts model responses (p_cond). Republican conditioning is 2.2x stronger than Democrat — the model already leans progressive by default.
Chart showing how conditioning on different demographic groups shifts model responses (p_cond). Republican conditioning produces a larger shift than Democrat — evidence of a default progressive lean.
SPS by Topic
Best-per-model performance across the ten topic categories. Baselines and temperature variants hidden; scroll the legend to toggle models.
Grouped bar chart of best-per-model SPS across ten topic categories. Each group is a topic; each bar is a model.
SPS Convergence
How SPS stabilizes as replicate count increases. Shown: the ensemble and the three single models it blends.
Line chart showing how Survey Parity Score stabilizes as the number of replicates increases. One line per model plus the ensemble.
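A convergence curve of this kind is just the running mean of a per-replicate score series. A minimal sketch, with made-up SPS values (not benchmark results):

```python
# Sketch: running mean after k replicates — the quantity plotted on the
# convergence chart. Scores below are illustrative, not benchmark data.

def running_mean(scores):
    """Cumulative mean of the first k scores, for k = 1..len(scores)."""
    out, total = [], 0.0
    for k, s in enumerate(scores, start=1):
        total += s
        out.append(total / k)
    return out

replicate_sps = [71.0, 73.0, 72.0, 72.4]  # hypothetical per-replicate SPS
print([round(m, 2) for m in running_mean(replicate_sps)])  # → [71.0, 72.0, 72.0, 72.1]
```

When consecutive running means change by less than the score's own noise band, adding replicates no longer buys precision.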
Temporal Drift Floor
Mean Jensen-Shannon divergence between the same Pew ATP question stems across different waves, grouped by year gap. A separate baseline from the Human Ceiling — drift measures how much real-world opinions shift, not sampling noise.
Bar chart of mean Jensen-Shannon divergence between the same Pew ATP question stems across different waves, grouped by year gap. Shaded band indicates bootstrap 95% CI on the overall mean.
A model frozen at time T cannot improve beyond the temporal drift floor when evaluated against later waves. Shaded band = bootstrap 95% CI on the overall mean (0.00019–0.00235, n_pairs=5, n_stems=5). See the Temporal Drift Floor definition on the methodology page.
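The shaded band's bootstrap 95% CI on a mean can be sketched with the standard library. This is a generic percentile bootstrap under assumed inputs — the drift values below are illustrative stand-ins, not the actual n_pairs=5 data:

```python
# Sketch: percentile bootstrap 95% CI on a mean, the kind of interval
# shown as the shaded band. Drift values are illustrative only.
import random

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Resample with replacement; return the (alpha/2, 1-alpha/2) quantiles."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

drift = [0.0002, 0.0009, 0.0011, 0.0015, 0.0023]  # one JSD per wave pair
print(bootstrap_ci(drift))
```

With only five pairs, the interval is wide relative to the mean — consistent with the broad CI reported above.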