MATH 500
A sample of 500 diverse problems from the MATH benchmark, spanning topics such as probability, algebra, trigonometry, and geometry. The questions test a model's ability to apply mathematical principles, execute complex calculations, and communicate solutions clearly. Models are prompted to present final answers in boxed LaTeX format and are evaluated using the parsing logic from the PRM800K grader. Most models are evaluated at temperature 0; the exceptions are reasoning models that require specific sampling settings.
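To make the grading step concrete, the sketch below shows one way the boxed-answer protocol can be checked. It is a hypothetical simplification, not the actual PRM800K grader: it extracts the contents of the last \boxed{...} expression from a response (balancing nested braces, which a plain regex cannot do) and compares it to the reference answer after whitespace stripping. The real grader applies further normalization and equivalence checks that this sketch omits.

```python
def extract_boxed_answer(response: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a response.

    Scans characters to balance nested braces, which a plain regex
    cannot handle (e.g. \\boxed{\\frac{1}{2}}). Returns None if no
    boxed answer is found or the braces never close.
    """
    start = response.rfind(r"\boxed{")
    if start == -1:
        return None
    depth, chars = 1, []
    for c in response[start + len(r"\boxed{"):]:
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(c)
    return None  # unbalanced braces: treat as no answer


def is_correct(response: str, gold_answer: str) -> bool:
    """Naive pass/fail check: exact match after whitespace stripping.

    The actual grader is more lenient, e.g. treating 1/2 and 0.5 as
    equivalent; this sketch deliberately skips that logic.
    """
    predicted = extract_boxed_answer(response)
    return predicted is not None and predicted.strip() == gold_answer.strip()
```

For example, `is_correct(r"So the answer is \boxed{\frac{1}{2}}.", r"\frac{1}{2}")` returns `True`.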
MATH 500 Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | DeepSeek-R1 | DeepSeek | 97.3 | 671B (37B activated) | 2025-01-20 | Text |
2 | o1-mini | OpenAI | 90.0 | Undisclosed | 2024-09-12 | Text |
3 | o1-preview | OpenAI | 85.5 | Undisclosed | 2024-09-12 | Text |
About MATH 500
Methodology
MATH 500 scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring methodology, please refer to the original paper.
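Assuming the score is plain accuracy over the 500 problems, which is consistent with the 0 to 100 scale above but is our reading rather than something the page states, the aggregation reduces to a one-liner (the function name is ours):

```python
def math500_score(graded: list[bool]) -> float:
    """Percentage of problems graded correct, on a 0-100 scale.

    Assumes the benchmark score is simple accuracy over all 500
    pass/fail grading results.
    """
    return 100.0 * sum(graded) / len(graded)


# Example: 450 of 500 problems correct -> score of 90.0
assert math500_score([True] * 450 + [False] * 50) == 90.0
```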
Publication
This benchmark was published in 2025. Read the full paper.