Submit Your Model
Benchmark your synthetic survey approach against real human data.
Quick Start
Three commands to benchmark your model and submit results.
pip install synthbench
synthbench run --dataset opinionsqa --provider your-provider
# Copy result JSON to leaderboard-results/ and open a PR
pip install synthbench
Install the SynthBench CLI and evaluation framework from PyPI.
synthbench run --dataset opinionsqa --provider your-provider
Run the benchmark against your model. Replace your-provider with your adapter (e.g. openai, anthropic, openrouter).
Copy result JSON to leaderboard-results/ and open a PR
Fork the repo, add your result file, and submit a pull request. CI validates everything automatically.
Submission Flow
Step-by-step process from installation to leaderboard listing.
1. Install synthbench
Install the package from PyPI with pip install synthbench. Requires Python 3.10+.
2. Run the benchmark
Execute synthbench run against your provider and model. The CLI handles sampling, parsing, and metric computation.
3. Review result JSON
The run produces a result JSON file with per-question response distributions and aggregate scores.
4. Fork the repo and add your results
Fork DataViking-Tech/synthbench on GitHub and copy your result JSON into the leaderboard-results/ directory.
5. Open a pull request
Submit a PR with your result file. CI validates the JSON schema and recomputes all aggregate scores from your per-question data. You cannot fake scores.
6. Leaderboard updates on merge
Once the PR is merged, GitHub Pages rebuilds automatically and your model appears on the leaderboard.
Submission Requirements
What a valid submission must include, and what is optional.
Required
- Valid JSON matching the synthbench result schema
- Per-question human_distribution must match ground truth (hash-verified)
- Per-question model_distribution must sum to 1.0
- Aggregate scores must be recomputable from per-question data
- Minimum 100 questions evaluated
- benchmark field must equal "synthbench"
Optional
- Temperature and template metadata (encouraged for reproducibility)
- Demographic breakdown (available only for the SubPOP dataset)
- Multiple runs for confidence intervals (recommended: 3+)
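The structural checks above can be sketched in Python. This is an illustrative validator, not the actual CI code: the top-level "questions" field name and the floating-point tolerance are assumptions for the sketch, and the real CI additionally hash-verifies human_distribution against ground truth and recomputes aggregate scores.

```python
import math


def validate_result(result: dict) -> list[str]:
    """Illustrative structural checks for a SynthBench result file.

    Not the real CI validator; the "questions" field name is assumed.
    """
    errors = []
    if result.get("benchmark") != "synthbench":
        errors.append('benchmark field must equal "synthbench"')
    questions = result.get("questions", [])
    if len(questions) < 100:
        errors.append("minimum 100 questions evaluated")
    for i, q in enumerate(questions):
        # Each per-question model_distribution must sum to 1.0
        total = sum(q.get("model_distribution", {}).values())
        if not math.isclose(total, 1.0, abs_tol=1e-6):
            errors.append(f"question {i}: model_distribution sums to {total}, not 1.0")
    return errors
```

A submission along these lines passes when the returned error list is empty.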
Adding a New Provider
Benchmarking a provider or model that isn't supported yet? Write a thin adapter — the harness handles sampling, parsing, and metrics.
1. Subclass Provider
Create a new file in src/synthbench/providers/ and subclass Provider from providers/base.py. Implement async respond() and the name property.
2. Wire distribution output (optional)
If your provider returns logprobs or native probabilities, override get_distribution() and set supports_distribution to True. Otherwise the base class samples respond() to build an empirical distribution.
3. Register in the PROVIDERS dict
Add a name -> dotted path entry in src/synthbench/providers/__init__.py so the CLI can load it via --provider <your-name>.
4. Test end-to-end
Run synthbench run --provider <your-name> --dataset opinionsqa against a small question slice, then submit the result JSON as usual.
from synthbench.providers.base import Provider, Response, PersonaSpec


class MyProvider(Provider):
    @property
    def name(self) -> str:
        return "my-provider"

    async def respond(
        self,
        question: str,
        options: list[str],
        *,
        persona: PersonaSpec | None = None,
    ) -> Response:
        # Call your API, pick an option from `options`
        choice = await call_my_api(question, options, persona)
        return Response(selected_option=choice)

See ollama.py for a compact example (HTTP-backed, sampling-only), or raw_openai.py for a logprob-based adapter that overrides get_distribution().
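If your API exposes per-option logprobs, the core of a get_distribution() override is renormalizing those logprobs into probabilities over the answer options. A minimal sketch of that step, as a standalone helper (the exact get_distribution() signature lives in providers/base.py and is not reproduced here):

```python
import math


def logprobs_to_distribution(
    logprobs: dict[str, float], options: list[str]
) -> dict[str, float]:
    """Renormalize per-option logprobs into a probability distribution.

    Exponentiates each option's logprob, then divides by the total so the
    returned probabilities sum to 1.0, as the result schema requires.
    """
    weights = {opt: math.exp(logprobs[opt]) for opt in options}
    total = sum(weights.values())
    return {opt: w / total for opt, w in weights.items()}
```

An adapter's get_distribution() would call the API for logprobs and return a mapping like this, with supports_distribution set to True so the harness skips empirical sampling.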
FAQ
Common questions about running benchmarks and submitting results.
How long does a benchmark run take?
Core suite (200 questions, 30 samples each) takes about 1 hour. Full suite (684+ questions) takes about 6 hours. Cost depends on your provider.
Can I benchmark a closed-source model?
Yes. SynthBench works with any model that accepts text prompts. Use the provider adapter interface or the OpenRouter integration.
How are scores validated?
CI recomputes all aggregate scores from your per-question data. If reported scores don't match recomputed scores, the PR is rejected.
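As an illustration of why reported scores cannot be faked: given per-question human and model distributions, an aggregate score is a deterministic function of that data, so CI can always rederive it. The metric below (one minus total variation distance, averaged over questions) is an assumption chosen for the sketch; SynthBench's actual scoring may differ.

```python
def recompute_aggregate(questions: list[dict]) -> float:
    """Average (1 - total variation distance) between model and human
    distributions. Illustrative metric only, not SynthBench's own scoring."""
    scores = []
    for q in questions:
        human = q["human_distribution"]
        model = q["model_distribution"]
        options = set(human) | set(model)
        # Total variation distance: half the L1 distance between distributions
        tv = 0.5 * sum(abs(human.get(o, 0.0) - model.get(o, 0.0)) for o in options)
        scores.append(1.0 - tv)
    return sum(scores) / len(scores)
```

If the reported aggregate in a submission differs from the value recomputed this way, the discrepancy is mechanical to detect.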
Can I submit results from my own conditioning or persona approach?
Yes. Include your prompt template and any conditioning metadata. The leaderboard shows provider and configuration.
What datasets can I benchmark against?
OpinionsQA (684 questions, US), SubPOP (3,362 questions, 22 US subpopulations), GlobalOpinionQA (2,556 questions, 138 countries).
Ready to benchmark?
Fork the repository, run the benchmark, and submit your results.
Fork on GitHub