Leaderboard

Full rankings across all datasets and models.

Default view hides configs with <3 runs on <2 datasets. Toggle “All variants” to see every run.

Select a column header to sort. Activate a row (Enter) to open its configuration, or use the chevron button to expand details inline.

Sub-Metric Radar

Radar plot comparing the top 3 models on the SPS sub-metrics: distribution accuracy (p_dist), rank correlation (p_rank), and refusal match (p_refuse). A larger polygon is better.

Demographic Parity Heatmap

Models × demographic groups, colored by p_dist (distribution similarity — higher = closer match to the conditioned subpopulation). Use the selector to drill into a specific attribute.

Coverage flag derived from n_questions: high (≥100), medium (50–99), low (<50).
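The coverage thresholds above can be written as a small lookup. This is a minimal sketch: the function name `coverage_flag` is hypothetical and only the three cut-offs come from the legend.

```python
def coverage_flag(n_questions: int) -> str:
    """Map a question count to a coverage flag.

    Thresholds follow the heatmap legend: high (>=100),
    medium (50-99), low (<50). The function name is an
    illustrative assumption, not part of the dashboard.
    """
    if n_questions < 0:
        raise ValueError("n_questions must be non-negative")
    if n_questions >= 100:
        return "high"
    if n_questions >= 50:
        return "medium"
    return "low"
```

For example, a group with 99 questions is flagged medium, and one with 100 is flagged high.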

SPS by Model

Dot plot of Survey Parity Score (SPS) per model; horizontal whiskers are 95% confidence intervals. Higher is better.

Per-Metric Breakdown

Grouped dot plot of SPS and its component metrics (p_dist, p_rank, and p_refuse) side by side for every model. For all metrics, higher is better. The legend below identifies each metric.

SPS: Survey Parity Score (higher is better)
p_dist: Distribution similarity (higher is better)
p_rank: Rank preservation (higher is better)
p_refuse: Non-refusal rate (higher is better)

Confidence Intervals

95% confidence interval for each model's SPS. Each row shows a model's point estimate (center dot) and CI bounds (whiskers).
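The page does not state how the intervals are computed. As a rough illustration only, a percentile bootstrap over per-run scores is one common way to obtain a point estimate with a 95% interval. The function `bootstrap_ci`, its parameters, and the assumption that SPS is averaged over runs are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) via a percentile bootstrap.

    Assumes `scores` holds one score per run and that the reported
    value is their mean; the dashboard's actual method is not stated.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    # Pick the alpha/2 and 1 - alpha/2 percentiles, clamped to valid indices.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return mean(scores), means[lo_idx], means[hi_idx]
```

The returned triple corresponds to the dot and the two whisker ends in each row.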