MATH
Category: Mathematics · Status: Pending Verification
A dataset of 12,500 challenging competition mathematics problems requiring multi-step reasoning.
Published: 2021
Score Range: 0-100
Top Score: 97.4
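As a concrete illustration of the dataset described above, the sketch below loads MATH and inspects one record. It assumes the `datasets` library and that the benchmark is mirrored on the Hugging Face hub under the id `hendrycks/competition_math` (the id and its availability are assumptions; the original release is also distributed as a tarball alongside the paper's code). The field names come from the original release, whose 12,500 problems are split 7,500/5,000 into train and test.

```python
# Minimal sketch: load MATH and inspect one problem.
# The hub id below is an assumption; availability may vary.
from datasets import load_dataset

math_test = load_dataset("hendrycks/competition_math", split="test")

sample = math_test[0]
print(sample["problem"])   # LaTeX problem statement
print(sample["level"])     # difficulty, "Level 1" through "Level 5"
print(sample["type"])      # subject, e.g. "Algebra" or "Number Theory"
print(sample["solution"])  # worked solution; final answer in \boxed{...}
```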
MATH Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Kimi K2 | Moonshot AI | 97.4 | 1T total, 32B activated | 2025-07-11 | Text |
| 2 | Claude 3.7 Sonnet | Anthropic | 96.2 | Unknown | 2025-02-24 | Multimodal |
| 3 | o1 | OpenAI | 94.8 | Unknown | 2024-09-12 | Multimodal |
| 4 | Gemini 2.0 Pro | Google | 91.8 | Unknown | 2025-02-05 | Multimodal |
| 5 | Gemini 2.0 Flash | Google | 90.9 | Unknown | 2025-02-25 | Multimodal |
| 6 | DeepSeek-V3 | DeepSeek | 90.2 | 671B total, 37B activated | 2024-12-26 | Text |
| 7 | Gemini 2.0 Flash-Lite | Google | 86.8 | Unknown | 2025-02-25 | Multimodal |
| 8 | GPT-4o | OpenAI | 76.6 | Unknown | 2024-05-13 | Multimodal |
| 9 | Grok-2 | xAI | 76.1 | Unknown | 2024-08-13 | Multimodal |
| 10 | Grok-2 mini | xAI | 73.0 | Unknown | 2024-08-13 | Multimodal |
About MATH
Methodology
MATH reports the percentage of problems for which a model's final answer matches the reference answer, so scores range from 0 to 100 and higher is better. For the details of answer normalization and grading, refer to the original paper.
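As a minimal sketch of this scoring scheme, the snippet below computes an exact-match accuracy over final answers and reports it on the 0 to 100 scale. The helper names (`extract_boxed`, `math_score`) are hypothetical, and the one-line regex is a simplified stand-in for the paper's answer-normalization logic (for instance, it does not handle nested braces such as `\boxed{\frac{1}{2}}`).

```python
# Minimal sketch of MATH-style scoring: exact-match accuracy of final
# answers over the test set, reported on a 0-100 scale. The extraction
# regex is simplified and assumes answers are wrapped in \boxed{...}.
import re

def extract_boxed(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a LaTeX solution."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def math_score(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy of final answers, scaled to 0-100."""
    correct = sum(
        1
        for pred, ref in zip(predictions, references)
        if extract_boxed(pred) is not None
        and extract_boxed(pred) == extract_boxed(ref)
    )
    return 100.0 * correct / len(references)

# Example: two of three final answers match, so the score is 66.7.
preds = [r"... \boxed{42}", r"... \boxed{1/2}", r"... \boxed{7}"]
refs  = [r"... \boxed{42}", r"... \boxed{1/2}", r"... \boxed{8}"]
print(f"{math_score(preds, refs):.1f}")  # 66.7
```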
Publication
This benchmark was published in 2021. Technical paper: Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset" (arXiv:2103.03874).