Submit Your Model
Benchmark your synthetic survey approach against real human data.
Quick Start
Three commands to benchmark your model and submit results.
pip install synthbench
synthbench run --dataset opinionsqa --provider your-provider
# Copy result JSON to leaderboard-results/ and open a PR
pip install synthbench
Install the SynthBench CLI and evaluation framework from PyPI.
synthbench run --dataset opinionsqa --provider your-provider
Run the benchmark against your model. Replace your-provider with your adapter (e.g. openai, anthropic, openrouter).
Copy result JSON to leaderboard-results/ and open a PR
Fork the repo, add your result file, and submit a pull request. CI validates everything automatically.
Submission Flow
Step-by-step process from installation to leaderboard listing.
1. Install synthbench
Install the package from PyPI with pip install synthbench. Requires Python 3.10+.
2. Run the benchmark
Execute synthbench run against your provider and model. The CLI handles sampling, parsing, and metric computation.
3. Review result JSON
The run produces a result JSON file with per-question response distributions and aggregate scores.
4. Fork the repo and add your results
Fork DataViking-Tech/synthbench on GitHub and copy your result JSON into the leaderboard-results/ directory.
5. Open a pull request
Submit a PR with your result file. CI validates the JSON schema and recomputes all aggregate scores from your per-question data. You cannot fake scores.
6. Leaderboard updates on merge
Once the PR is merged, GitHub Pages rebuilds automatically and your model appears on the leaderboard.
Submission Requirements
What a valid submission must include, and what is optional.
Required
- Valid JSON matching the synthbench result schema
- Per-question human_distribution must match ground truth (hash-verified)
- Per-question model_distribution must sum to 1.0
- Aggregate scores must be recomputable from per-question data
- Minimum 100 questions evaluated
- benchmark field must equal "synthbench"
Optional
- Temperature and template metadata (encouraged for reproducibility)
- Demographic breakdown (available only for the SubPOP dataset)
- Multiple runs for confidence intervals (recommended: 3+)
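The structural checks above can be sketched in Python. This is an illustrative validator, not the actual CI code: the top-level "questions" field name and the floating-point tolerance are assumptions for the sketch, and the real CI additionally hash-verifies human_distribution against ground truth and recomputes aggregate scores.

```python
import math


def validate_result(result: dict) -> list[str]:
    """Illustrative structural checks for a SynthBench result file.

    Not the real CI validator; the "questions" field name is assumed.
    """
    errors = []
    if result.get("benchmark") != "synthbench":
        errors.append('benchmark field must equal "synthbench"')
    questions = result.get("questions", [])
    if len(questions) < 100:
        errors.append("minimum 100 questions evaluated")
    for i, q in enumerate(questions):
        # Each per-question model_distribution must sum to 1.0
        total = sum(q.get("model_distribution", {}).values())
        if not math.isclose(total, 1.0, abs_tol=1e-6):
            errors.append(f"question {i}: model_distribution sums to {total}, not 1.0")
    return errors
```

A submission along these lines passes when the returned error list is empty.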
Adding a New Provider
Benchmarking a provider or model that isn't supported yet? Write a thin adapter — the harness handles sampling, parsing, and metrics.
1. Subclass Provider
Create a new file in src/synthbench/providers/ and subclass Provider from providers/base.py. Implement async respond() and the name property.
2. Wire distribution output (optional)
If your provider returns logprobs or native probabilities, override get_distribution() and set supports_distribution to True. Otherwise the base class samples respond() to build an empirical distribution.
3. Register in the PROVIDERS dict
Add a name -> dotted path entry in src/synthbench/providers/__init__.py so the CLI can load it via --provider <your-name>.
4. Test end-to-end
Run synthbench run --provider <your-name> --dataset opinionsqa against a small question slice, then submit the result JSON as usual.
from synthbench.providers.base import Provider, Response, PersonaSpec


class MyProvider(Provider):
    @property
    def name(self) -> str:
        return "my-provider"

    async def respond(
        self,
        question: str,
        options: list[str],
        *,
        persona: PersonaSpec | None = None,
    ) -> Response:
        # Call your API, pick an option from `options`
        choice = await call_my_api(question, options, persona)
        return Response(selected_option=choice)

See ollama.py for a compact example (HTTP-backed, sampling-only), or raw_openai.py for a logprob-based adapter that overrides get_distribution().
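If your API exposes per-option logprobs, the core of a get_distribution() override is renormalizing those logprobs into probabilities over the answer options. A minimal sketch of that step, as a standalone helper (the exact get_distribution() signature lives in providers/base.py and is not reproduced here):

```python
import math


def logprobs_to_distribution(
    logprobs: dict[str, float], options: list[str]
) -> dict[str, float]:
    """Renormalize per-option logprobs into a probability distribution.

    Exponentiates each option's logprob, then divides by the total so the
    returned probabilities sum to 1.0, as the result schema requires.
    """
    weights = {opt: math.exp(logprobs[opt]) for opt in options}
    total = sum(weights.values())
    return {opt: w / total for opt, w in weights.items()}
```

An adapter's get_distribution() would call the API for logprobs and return a mapping like this, with supports_distribution set to True so the harness skips empirical sampling.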
FAQ
Common questions about running benchmarks and submitting results.
How long does a benchmark run take?
Core suite (200 questions, 30 samples each) takes about 1 hour. Full suite (684+ questions) takes about 6 hours. Cost depends on your provider.
Can I benchmark a closed-source model?
Yes. SynthBench works with any model that accepts text prompts. Use the provider adapter interface or the OpenRouter integration.
How are scores validated?
CI recomputes all aggregate scores from your per-question data. If reported scores don't match recomputed scores, the PR is rejected.
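As an illustration of why reported scores cannot be faked: given per-question human and model distributions, an aggregate score is a deterministic function of that data, so CI can always rederive it. The metric below (one minus total variation distance, averaged over questions) is an assumption chosen for the sketch; SynthBench's actual scoring may differ.

```python
def recompute_aggregate(questions: list[dict]) -> float:
    """Average (1 - total variation distance) between model and human
    distributions. Illustrative metric only, not SynthBench's own scoring."""
    scores = []
    for q in questions:
        human = q["human_distribution"]
        model = q["model_distribution"]
        options = set(human) | set(model)
        # Total variation distance: half the L1 distance between distributions
        tv = 0.5 * sum(abs(human.get(o, 0.0) - model.get(o, 0.0)) for o in options)
        scores.append(1.0 - tv)
    return sum(scores) / len(scores)
```

If the reported aggregate in a submission differs from the value recomputed this way, the discrepancy is mechanical to detect.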
Can I submit results from my own conditioning or persona approach?
Yes. Include your prompt template and any conditioning metadata. The leaderboard shows provider and configuration.
What datasets can I benchmark against?
OpinionsQA (684 questions, US), SubPOP (3,362 questions, 22 US subpopulations), GlobalOpinionQA (2,556 questions, 138 countries).
Ready to benchmark?
Fork the repository, run the benchmark, and submit your results.
Fork on GitHub