MATH 500

mathematics · Pending Verification

A sample of 500 diverse problems from the MATH benchmark, spanning topics like probability, algebra, trigonometry, and geometry. The questions test a model's ability to apply mathematical principles, execute complex calculations, and communicate solutions clearly. Models are prompted to present final answers in boxed LaTeX format, and evaluated using parsing logic from the PRM800K dataset grader. Most models are evaluated with temperature set to 0, except for reasoning models that require specific temperature settings.
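As a rough illustration of the boxed-answer convention described above, the following Python sketch pulls the last \boxed{...} expression out of a model response and compares it to a reference answer. It is not the PRM800K grader: the function names extract_boxed and naive_match are hypothetical, and the comparison here is a plain string match after whitespace removal, whereas the real grader applies more thorough normalization.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def naive_match(response: str, reference: str) -> bool:
    """Hypothetical stand-in for a grader: exact match ignoring whitespace."""
    answer = extract_boxed(response)
    if answer is None:
        return False
    return "".join(answer.split()) == "".join(reference.split())
```

Tracking brace depth matters because answers such as \frac{3}{4} contain nested braces that a simple regular expression would cut short.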

Published: 2025
Score Range: 0-100
Top Score: 97.3

MATH 500 Leaderboard

Rank  Model        Provider  Score  Parameters            Released    Type
1     DeepSeek-R1  DeepSeek  97.3   671B (37B activated)  2025-01-20  Text
2     o1-mini      OpenAI    90     -                     2024-09-12  Text
3     o1-preview   OpenAI    85.5   -                     2024-09-12  Text

About MATH 500

Methodology

MATH 500 evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
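The page does not spell out how the 0 to 100 score is computed. A minimal sketch, assuming the score is simply the percentage of problems graded correct (an assumption, not something stated here), might look like this; the function name percent_correct is hypothetical.

```python
def percent_correct(results: list[bool]) -> float:
    """Assumed scoring rule: share of correct answers, scaled to 0-100."""
    if not results:
        raise ValueError("no graded results to score")
    # True counts as 1 and False as 0 when summed.
    return round(100 * sum(results) / len(results), 1)
```

Under this assumption, a model that answers 3 of 4 problems correctly would receive a score of 75.0.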

Publication

This benchmark was published in 2025.

Related Benchmarks