Submit Your Model

Benchmark your synthetic survey approach against real human data.

Quick Start

Three commands to benchmark your model and submit results.

pip install synthbench
synthbench run --dataset opinionsqa --provider your-provider
# Copy result JSON to leaderboard-results/ and open a PR
  1. pip install synthbench

     Install the SynthBench CLI and evaluation framework from PyPI.

  2. synthbench run --dataset opinionsqa --provider your-provider

     Run the benchmark against your model. Replace your-provider with your adapter (e.g. openai, anthropic, openrouter).

  3. Copy result JSON to leaderboard-results/ and open a PR

     Fork the repo, add your result file, and submit a pull request. CI validates everything automatically.

Submission Flow

Step-by-step process from installation to leaderboard listing.

  1. Install synthbench

    Install the package from PyPI with pip install synthbench. Requires Python 3.10+.

  2. Run the benchmark

    Execute synthbench run against your provider and model. The CLI handles sampling, parsing, and metric computation.

  3. Review result JSON

    The run produces a result JSON file with per-question response distributions and aggregate scores.

  4. Fork the repo and add your results

    Fork DataViking-Tech/synthbench on GitHub and copy your result JSON into the leaderboard-results/ directory.

  5. Open a pull request

    Submit a PR with your result file. CI validates the JSON schema and recomputes all aggregate scores from your per-question data. You cannot fake scores.

  6. Leaderboard updates on merge

    Once the PR is merged, GitHub Pages rebuilds automatically and your model appears on the leaderboard.

Submission Requirements

What a valid submission must include, and what is optional.

Required

  • Valid JSON matching the synthbench result schema
  • Per-question human_distribution must match ground truth (hash-verified)
  • Per-question model_distribution must sum to 1.0
  • Aggregate scores must be recomputable from per-question data
  • Minimum 100 questions evaluated
  • benchmark field must equal "synthbench"

Optional

  • Temperature and template metadata (encouraged for reproducibility)
  • Demographic breakdown (only available for SubPOP dataset)
  • Multiple runs for confidence intervals (recommended: 3+)
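The requirements above can be mirrored in a quick local check before opening a PR. This is an illustrative sketch only: the field names (`benchmark`, `questions`, `model_distribution`, and the `dataset` metadata key) are inferred from the requirements list, and the authoritative schema lives in the synthbench repo.

```python
# Hypothetical pre-submission check mirroring the CI rules listed above.
# Field names are assumptions based on this page, not the official schema.

example_result = {
    "benchmark": "synthbench",
    "dataset": "opinionsqa",  # assumed metadata field
    "questions": [
        {
            "id": "q1",
            "human_distribution": {"A": 0.7, "B": 0.3},
            "model_distribution": {"A": 0.6, "B": 0.4},
        },
        # ...a real submission needs at least 100 entries...
    ],
}

def validate_result(result: dict) -> list[str]:
    """Return human-readable validation errors (empty list if valid)."""
    errors = []
    if result.get("benchmark") != "synthbench":
        errors.append('benchmark field must equal "synthbench"')
    if len(result.get("questions", [])) < 100:
        errors.append("minimum 100 questions required")
    for q in result.get("questions", []):
        total = sum(q["model_distribution"].values())
        if abs(total - 1.0) > 1e-6:
            errors.append(f"{q['id']}: model_distribution sums to {total:g}")
    return errors
```

Running `validate_result(example_result)` flags only the question count, since the single distribution sums to 1.0; CI additionally hash-verifies `human_distribution` against ground truth, which a local sketch cannot do.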

Adding a New Provider

Benchmarking a new provider or model? Write a thin adapter; the harness handles sampling, parsing, and metrics.

  1. Subclass Provider

    Create a new file in src/synthbench/providers/ and subclass Provider from providers/base.py. Implement async respond() and the name property.

  2. Wire distribution output (optional)

    If your provider returns logprobs or native probabilities, override get_distribution() and set supports_distribution to True. Otherwise the base class samples respond() to build an empirical distribution.

  3. Register in the PROVIDERS dict

    Add a name -> dotted path entry in src/synthbench/providers/__init__.py so the CLI can load it via --provider <your-name>.

  4. Test end-to-end

    Run synthbench run --provider <your-name> --dataset opinionsqa against a small question slice, then submit the result JSON as usual.

src/synthbench/providers/my_provider.py
from synthbench.providers.base import Provider, Response, PersonaSpec

class MyProvider(Provider):
    @property
    def name(self) -> str:
        return "my-provider"

    async def respond(
        self,
        question: str,
        options: list[str],
        *,
        persona: PersonaSpec | None = None,
    ) -> Response:
        # Call your API, pick an option from `options`
        choice = await call_my_api(question, options, persona)
        return Response(selected_option=choice)

See ollama.py for a compact example (HTTP-backed, sampling-only), or raw_openai.py for a logprob-based adapter that overrides get_distribution().
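Step 3's registration is a name-to-dotted-path mapping. A minimal sketch, assuming a plain dict in `src/synthbench/providers/__init__.py` (the real file's shape may differ; `load_provider` below is an illustrative loader, not necessarily the CLI's actual resolution code):

```python
# Sketch of the registration step -- dict shape and loader are assumptions;
# check the real src/synthbench/providers/__init__.py before editing.
import importlib

PROVIDERS = {
    # name used with --provider  ->  dotted path to the Provider subclass
    "my-provider": "synthbench.providers.my_provider.MyProvider",
}

def load_provider(name: str):
    """Resolve a registered name to its class via dynamic import."""
    module_path, _, class_name = PROVIDERS[name].rpartition(".")
    return getattr(importlib.import_module(module_path), class_name)
```

Keeping the registry as dotted paths means adapters are only imported when selected, so one provider's missing optional dependency does not break the others.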

FAQ

Common questions about running benchmarks and submitting results.

How long does a benchmark run take?

Core suite (200 questions, 30 samples each) takes about 1 hour. Full suite (684+ questions) takes about 6 hours. Cost depends on your provider.

Can I benchmark a closed-source model?

Yes. SynthBench works with any model that accepts text prompts. Use the provider adapter interface or the OpenRouter integration.

How are scores validated?

CI recomputes all aggregate scores from your per-question data. If reported scores don't match recomputed scores, the PR is rejected.
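As a concrete illustration of that check: SynthBench's actual aggregate metric is not specified on this page, so total variation distance stands in as a hypothetical stand-in below. The point is the mechanism, not the metric: the aggregate is a pure function of the per-question data, so CI can recompute and compare it.

```python
# Illustrative recomputation check. The real metric may differ; total
# variation distance is used here only as a stand-in example.

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """0.5 * sum of |p_i - q_i| over the union of answer options."""
    options = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)

def recompute_aggregate(questions: list[dict]) -> float:
    """Mean per-question distance between model and human distributions."""
    return sum(
        total_variation(q["model_distribution"], q["human_distribution"])
        for q in questions
    ) / len(questions)

def scores_match(reported: float, questions: list[dict], tol: float = 1e-6) -> bool:
    """True if the reported aggregate matches the recomputed one."""
    return abs(reported - recompute_aggregate(questions)) < tol
```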

Can I submit results from my own conditioning or persona approach?

Yes. Include your prompt template and any conditioning metadata. The leaderboard shows provider and configuration.

What datasets can I benchmark against?

OpinionsQA (684 questions, US), SubPOP (3,362 questions, 22 US subpopulations), GlobalOpinionQA (2,556 questions, 138 countries).

Ready to benchmark?

Fork the repository, run the benchmark, and submit your results.

Fork on GitHub