MMLU
Category: Knowledge
Status: Pending Verification
Massive Multitask Language Understanding (MMLU) is a multiple-choice benchmark that tests knowledge across 57 subjects, including mathematics, history, law, and medicine; each question offers four answer options (see the prompt sketch below).
Published: 2020
Score Range: 0-100
Top Score: 92.3
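To make the four-option format concrete, here is a minimal sketch of the A/B/C/D prompt style commonly used for MMLU-style questions. The template, the helper name `format_mmlu_prompt`, and the sample question are illustrative assumptions; real evaluation harnesses differ in wording, few-shot examples, and answer extraction.

```python
# Illustrative sketch of a zero-shot MMLU-style prompt. The template and the
# example question are placeholders, not the exact format used by any
# particular evaluation harness.

def format_mmlu_prompt(subject: str, question: str, choices: list[str]) -> str:
    """Render one four-option multiple-choice question as an A/B/C/D prompt."""
    lines = [
        f"The following is a multiple choice question about {subject}.",
        "",
        question,
    ]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_mmlu_prompt(
        "high school mathematics",
        "What is the derivative of x^2 with respect to x?",
        ["x", "2x", "x^2 / 2", "2"],
    ))
```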
MMLU Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | o1 | OpenAI | 92.3 | | 2024-09-12 | Multimodal |
| 2 | DeepSeek-R1 | DeepSeek | 90.8 | 671B total, 37B activated | 2025-01-20 | Text |
| 3 | o1-preview | OpenAI | 90.8 | | 2024-09-12 | Text |
| 4 | GPT-4.1 | OpenAI | 90.2 | | 2025-04-14 | Multimodal |
| 5 | GPT-OSS-120B | OpenAI | 90.0 | 117B total, 5.1B active per token | 2025-08-05 | Text |
| 6 | Kimi K2 | Moonshot AI | 89.5 | 1T total, 32B activated | 2025-07-11 | Text |
| 7 | Claude 3.5 Sonnet | Anthropic | 88.7 | | 2024-06-20 | Multimodal |
| 8 | GPT-4o | OpenAI | 88.7 | | 2024-05-13 | Multimodal |
| 9 | DeepSeek-V3 | DeepSeek | 88.5 | 671B total, 37B activated | 2024-12-26 | Text |
| 10 | Gemini 2.5 Flash | Google | 88.4 | | 2025-05-20 | Multimodal |
About MMLU
Methodology
MMLU evaluates models on four-option multiple-choice questions drawn from its 57 subjects. Scores are reported on a 0-100 scale as the percentage of questions answered correctly, so higher scores indicate better performance and random guessing lands around 25. For details of the task construction and evaluation protocol, refer to the original paper.
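As a rough illustration of how a 0-100 score comes about, the sketch below computes per-subject accuracy and averages it across subjects. This is a toy under stated assumptions, not the official evaluation code: some reports macro-average over the 57 subjects as in the original paper, others micro-average over all questions, and the tiny results list here is a made-up illustration.

```python
# Toy sketch (not the official MMLU evaluation code): accuracy per subject,
# macro-averaged across subjects, reported on a 0-100 scale. Some leaderboards
# instead micro-average over all questions.
from collections import defaultdict

def mmlu_score(results: list[dict]) -> float:
    """results: one dict per question with keys 'subject',
    'predicted' (model's chosen letter), and 'answer' (correct letter)."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [num_correct, num_total]
    for r in results:
        tally = per_subject[r["subject"]]
        tally[0] += int(r["predicted"] == r["answer"])
        tally[1] += 1
    accuracies = [correct / total for correct, total in per_subject.values()]
    return 100.0 * sum(accuracies) / len(accuracies)

# Made-up example: 50% accuracy on law, 100% on history -> macro average 75.0.
example = [
    {"subject": "law", "predicted": "A", "answer": "A"},
    {"subject": "law", "predicted": "C", "answer": "B"},
    {"subject": "history", "predicted": "D", "answer": "D"},
    {"subject": "history", "predicted": "B", "answer": "B"},
]
print(mmlu_score(example))  # 75.0
```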
Publication
This benchmark was published in 2020. Technical paper: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020).