Humanity's Last Exam


A challenging benchmark of novel problems designed to test the limits of AI capabilities.

Published: 2023
Score Range: 0-100
Top Score: 44.9

Humanity's Last Exam Leaderboard

| Rank | Model | Provider | Score | Parameters | Released | Type |
|------|-------|----------|-------|------------|----------|------|
| 1 | Kimi K2 | Moonshot AI | 44.9 | 1T total, 32B activated | 2025-07-11 | Text |
| 2 | Claude Opus 4.6 | Anthropic | 40 | Unreleased | 2026-02-05 | Multimodal |
| 3 | Grok 4 | xAI | 38.6 | Unknown | 2025-07-09 | Multimodal |
| 4 | Gemini 3 Pro | Google | 37.5 | Proprietary | 2025-11-18 | Multimodal |
| 5 | GPT-OSS-120B | OpenAI | 19 | 117B total, 5.1B active per token | 2025-08-05 | Text |
| 6 | Gemini 2.5 Pro | Google | 17.8 | | 2025-05-06 | Multimodal |
| 7 | o4-mini | OpenAI | 17.7 | | 2025-04-16 | Multimodal |
| 8 | GPT-OSS-20B | OpenAI | 17.3 | 21B total, 3.6B active per token | 2025-08-05 | Text |
| 9 | Nemotron 3 Nano | NVIDIA | 15.5 | 31.6B total, ~3.2B active | 2025-12-15 | Text |
| 10 | Gemini 2.5 Flash | Google | 11 | | 2025-05-20 | Multimodal |

About Humanity's Last Exam

Methodology

Humanity's Last Exam evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
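As a rough illustration of how a 0-100 score of this kind can be computed, the sketch below treats the score as the percentage of benchmark questions a model answers correctly. The function name and the exact-match grading are assumptions for illustration only; the benchmark's actual grading procedure is described in the original paper and may differ (for example, by using judged rather than exact-match answers).

```python
def percent_correct_score(predictions: list[str], answers: list[str]) -> float:
    """Illustrative 0-100 score: percentage of exact-match correct answers.

    Hypothetical helper, not the benchmark's official grader. Assumes
    predictions[i] corresponds to answers[i]; exact string match stands in
    for whatever grading the benchmark actually uses.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty answer set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Example: 3 of 4 answers match, so the score is 75.0 on the 0-100 scale.
print(percent_correct_score(["A", "B", "C", "D"], ["A", "B", "C", "A"]))
```

Under this reading, a top score of 44.9 would correspond to answering roughly 45% of the questions correctly.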

Publication

This benchmark was published in 2023. See the technical paper for full details.

Related Benchmarks