Verifiable ground truth
Every task has an objective answer: MCQ letter match, numeric tolerance, or executed code compared to reference implementations.
FinanceBenchmark v1 evaluates models on 45 tasks using a Python harness and publishes results to this leaderboard.
Quant tasks use seeded parameters so answers are reproducible but not easily memorized from public training data.
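One way seeded parameters could work is sketched below. This is an illustration only, not the harness's actual generator: the function name `make_task_params`, the parameter names, and the value ranges are all hypothetical.

```python
import random


def make_task_params(task_id: str, seed: int) -> dict:
    """Hypothetical helper: derive option parameters from a task id and seed.

    The same (task_id, seed) pair always yields the same parameters, so the
    reference answer is reproducible, while a different seed yields a fresh
    variant of the task.
    """
    # A string seed gives a deterministic stream that does not depend on
    # Python's per-process hash randomization.
    rng = random.Random(f"{task_id}:{seed}")
    return {
        "spot": round(rng.uniform(80.0, 120.0), 2),     # assumed range
        "strike": round(rng.uniform(80.0, 120.0), 2),   # assumed range
        "rate": round(rng.uniform(0.01, 0.06), 4),      # assumed range
        "vol": round(rng.uniform(0.10, 0.40), 4),       # assumed range
        "maturity": round(rng.uniform(0.25, 2.0), 2),   # years, assumed range
    }


if __name__ == "__main__":
    print(make_task_params("bs_call", seed=1))
```

Because the numbers are drawn at generation time rather than fixed in a published problem statement, a model cannot simply recall a memorized answer; it has to carry out the calculation.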
Evaluations use temperature 0 and three runs per task, with a versioned harness (0.1.0), a versioned task set (v1), and pinned prompts.
The task set includes Greeks precision and multi-step pricing tasks, where frontier models are known to perform worse than on conceptual finance questions.
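To give a sense of what a Greeks precision check involves, here is a minimal sketch using the standard Black-Scholes formula for a European call. It is not taken from the benchmark's task set; the function names and the sample inputs are illustrative assumptions.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call_price(s: float, k: float, r: float, sigma: float, t: float) -> float:
    """Black-Scholes price of a European call (no dividends)."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)


def bs_call_delta(s: float, k: float, r: float, sigma: float, t: float) -> float:
    """Analytic delta of the same call: N(d1)."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    return norm_cdf(d1)


def fd_call_delta(s: float, k: float, r: float, sigma: float, t: float,
                  h: float = 1e-4) -> float:
    """Central finite-difference delta, used here as an independent check."""
    up = bs_call_price(s + h, k, r, sigma, t)
    down = bs_call_price(s - h, k, r, sigma, t)
    return (up - down) / (2.0 * h)


if __name__ == "__main__":
    args = (100.0, 100.0, 0.05, 0.2, 1.0)  # illustrative inputs
    print(bs_call_price(*args), bs_call_delta(*args), fd_call_delta(*args))
```

A tight tolerance on a quantity like delta leaves little room for rounding intermediate values such as d1, which is one plausible reason these tasks are harder than conceptual questions.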
| Category | Metric | Tolerance |
|---|---|---|
| Knowledge | Letter accuracy | Exact match |
| Analysis | Numeric accuracy | 1% relative |
| Quant | Code execution vs reference | 0.1% (0.1–1% for Monte Carlo) |
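The scoring rules in the table can be sketched roughly as follows. This is an assumption about how such checks might be written, not the harness's real code: `score_letter`, `score_numeric`, and `REL_TOL` are hypothetical names, and treating letter matching as case- and whitespace-insensitive is a guess.

```python
# Relative tolerances by category, taken from the table above.
# Monte Carlo tasks may use a looser bound (0.1% to 1%).
REL_TOL = {"Analysis": 0.01, "Quant": 0.001}


def score_letter(predicted: str, expected: str) -> bool:
    """Exact letter match for MCQ tasks.

    Assumption: surrounding whitespace and letter case are ignored.
    """
    return predicted.strip().upper() == expected.strip().upper()


def score_numeric(predicted: float, expected: float, rel_tol: float) -> bool:
    """True if predicted is within rel_tol of the reference value.

    The error is measured relative to the reference, so a zero reference
    only accepts an exact match.
    """
    return abs(predicted - expected) <= rel_tol * abs(expected)


if __name__ == "__main__":
    print(score_letter(" b ", "B"))
    print(score_numeric(100.9, 100.0, REL_TOL["Analysis"]))
    print(score_numeric(100.9, 100.0, REL_TOL["Quant"]))
```

For Quant tasks the value passed as `predicted` would be the output of the model's executed code, compared against the output of the reference implementation.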
    pip install -e .
    finbench run --model anthropic/claude-opus-4-20250514 --tasks all --runs 3
    finbench publish results/<model>_<timestamp>.json