MATH 500
A sample of 500 diverse problems from the MATH benchmark, spanning topics such as probability, algebra, trigonometry, and geometry. The questions test a model's ability to apply mathematical principles, execute complex calculations, and communicate solutions clearly. Models are prompted to present final answers in boxed LaTeX format and are evaluated using the parsing logic from the PRM800K grader. Most models are evaluated at temperature 0; the exceptions are reasoning models that require specific sampling settings.
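To make the grading step concrete, the sketch below shows one way the boxed-answer protocol can be checked. It is a hypothetical simplification, not the actual PRM800K grader: it extracts the contents of the last \boxed{...} expression from a response (balancing nested braces, which a plain regex cannot do) and compares it to the reference answer after whitespace stripping. The real grader applies further normalization and equivalence checks that this sketch omits.

```python
def extract_boxed_answer(response: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a response.

    Scans characters to balance nested braces, which a plain regex
    cannot handle (e.g. \\boxed{\\frac{1}{2}}). Returns None if no
    boxed answer is found or the braces never close.
    """
    start = response.rfind(r"\boxed{")
    if start == -1:
        return None
    depth, chars = 1, []
    for c in response[start + len(r"\boxed{"):]:
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(c)
    return None  # unbalanced braces: treat as no answer


def is_correct(response: str, gold_answer: str) -> bool:
    """Naive pass/fail check: exact match after whitespace stripping.

    The actual grader is more lenient, e.g. treating 1/2 and 0.5 as
    equivalent; this sketch deliberately skips that logic.
    """
    predicted = extract_boxed_answer(response)
    return predicted is not None and predicted.strip() == gold_answer.strip()
```

For example, `is_correct(r"So the answer is \boxed{\frac{1}{2}}.", r"\frac{1}{2}")` returns `True`.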
MATH 500 Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | DeepSeek-R1 | DeepSeek | 97.3 | 671B (37B activated) | 2025-01-20 | Text |
2 | o1-mini | OpenAI | 90.0 | Undisclosed | 2024-09-12 | Text |
3 | o1-preview | OpenAI | 85.5 | Undisclosed | 2024-09-12 | Text |
About MATH 500
Methodology
MATH 500 scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring methodology, please refer to the original paper.
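Assuming the score is plain accuracy over the 500 problems, which is consistent with the 0 to 100 scale above but is our reading rather than something the page states, the aggregation reduces to a one-liner (the function name is ours):

```python
def math500_score(graded: list[bool]) -> float:
    """Percentage of problems graded correct, on a 0-100 scale.

    Assumes the benchmark score is simple accuracy over all 500
    pass/fail grading results.
    """
    return 100.0 * sum(graded) / len(graded)


# Example: 450 of 500 problems correct -> score of 90.0
assert math500_score([True] * 450 + [False] * 50) == 90.0
```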
Publication
This benchmark was published in 2025. Read the full paper.