Humanity's Last Exam
A challenging benchmark of novel problems designed to test the limits of AI capabilities.
Published: 2023
Score Range: 0-100
Top Score: 38.6
Humanity's Last Exam Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | Grok 4 | xAI | 38.6 | Unknown | 2025-07-09 | Multimodal |
2 | GPT-OSS-120B | OpenAI | 19.0 | 117B total (5.1B active per token) | 2025-08-05 | Text |
3 | Gemini 2.5 Pro | Google | 17.8 | Unknown | 2025-05-06 | Multimodal |
4 | o4-mini | OpenAI | 17.7 | Unknown | 2025-04-16 | Multimodal |
5 | GPT-OSS-20B | OpenAI | 17.3 | 21B total (3.6B active per token) | 2025-08-05 | Text |
6 | Gemini 2.5 Flash | Google | 11.0 | Unknown | 2025-05-20 | Multimodal |
7 | Gemini 2.5 Flash-Lite | Google | 6.9 | Unknown | 2025-06-17 | Multimodal |
About Humanity's Last Exam
Methodology
Humanity's Last Exam evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, refer to the original paper.
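As a rough illustration of how a 0-100 score like the ones above could be derived, here is a minimal sketch that treats the score as the percentage of correctly answered questions. This is an assumption for illustration only; the benchmark's actual grading pipeline may differ (e.g. model-assisted grading of free-form answers).

```python
def benchmark_score(predictions, answers):
    """Return a 0-100 score: the percentage of predictions that
    exactly match the reference answers (hypothetical grading rule)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# Example: 3 of 4 questions answered correctly -> 75.0
print(benchmark_score(["A", "B", "C", "D"], ["A", "B", "C", "X"]))
```

Under this reading, a top score of 38.6 would mean the leading model answers roughly 4 in 10 questions correctly, which is consistent with the benchmark's stated goal of probing the limits of current AI capabilities.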
Publication
This benchmark was published in 2023.
Technical Paper