A challenging benchmark of novel problems designed to test the limits of AI capabilities.

Humanitys-Last-Exam

Kimi K2 is Moonshot's latest Mixture-of-Experts model, with 32 billion activated parameters and 1 trillion total parameters. It achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models, and is optimized for agentic tasks that involve tool use and multi-turn interactions.

Kimi K2

Claude Opus 4.6 is Anthropic's most intelligent model, upgrading its predecessor's coding, reasoning, and agentic capabilities. It plans more carefully, sustains agentic tasks for longer, operates more reliably in larger codebases, and has superior code review and debugging skills. In a first for Opus-class models, it features a 1M token context window (beta) and 128K max output tokens. Opus 4.6 achieves state-of-the-art results on Terminal-Bench 2.0 (65.4%), leads on Humanity's Last Exam, BrowseComp, and GDPval-AA, and scores 76% on the 8-needle 1M variant of MRCR v2 (vs. 18.5% for Sonnet 4.5), representing a qualitative leap in long-context performance. It introduces adaptive thinking and four effort levels (low, medium, high, max) for fine-grained control over intelligence, speed, and cost.

Claude Opus 4.6

Grok 4 is xAI's latest-generation model, with enhanced mathematical reasoning and significant improvements on competition-level mathematics benchmarks. It offers 2x faster end-to-end latency, supports 5 different voices, and records 10x the daily user seconds of previous models.

Grok 4

Gemini 3 Pro is Google's flagship multimodal foundation model, released in November 2025. Built on a sparse Mixture-of-Experts (MoE) Transformer architecture, it features a 1 million token context window and native understanding of text, images, audio, and video. The model introduces 'Deep Think' capabilities for enhanced reasoning, controlled via a 'thinking_level' parameter, and is optimized for agentic workflows and 'vibe coding' (generating full applications from natural language). It supports advanced function calling and tool use, making it suitable for complex software engineering and long-context analysis tasks.

Gemini 3 Pro

GPT-OSS-120B is a state-of-the-art open-weight language model that delivers strong real-world performance at low cost. This 120 billion parameter mixture-of-experts model achieves near-parity with OpenAI o4-mini on core reasoning benchmarks while running efficiently on a single 80 GB GPU. It was trained using reinforcement learning and techniques informed by OpenAI's most advanced internal models, including o3 and other frontier systems.

GPT-OSS-120B

Gemini 2.5 Pro reasons through its thoughts before responding, which improves performance and accuracy. It features Deep Think, an enhanced reasoning mode, and native audio output that captures subtle nuances of speech.

Gemini 2.5 Pro

o4-mini is a smaller model optimized for fast, cost-efficient reasoning. It achieves strong performance for its size and cost, particularly in math, coding, and visual tasks. It is the best-performing benchmarked model on AIME 2024 and 2025, and it supports significantly higher usage limits than o3.

o4-mini

GPT-OSS-20B is a medium-sized open-weight language model that delivers similar results to OpenAI o3-mini on common benchmarks and can run on edge devices with just 16 GB of memory. This makes it ideal for on-device use cases, local inference, or rapid iteration without costly infrastructure. Despite its smaller size, it demonstrates strong performance on reasoning tasks, tool use, and competition mathematics.

GPT-OSS-20B

Nemotron 3 Nano is a 31.6B parameter hybrid large language model developed by NVIDIA, released on December 15, 2025. It employs a novel Hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture, activating only ~3.2B parameters per token for efficient inference. Optimized for agentic AI, reasoning, and coding, it supports a 1 million token context window and features a 'Reasoning ON/OFF' toggle with a configurable thinking budget. NVIDIA has released the model with open weights, training data, and training recipes under the NVIDIA Open Model License.

Nemotron 3 Nano

Gemini 2.5 Flash improves on key benchmarks for reasoning, multimodality, code, and long context while becoming even more efficient. It is best suited to fast performance on complex tasks.

Gemini 2.5 Flash

Rank	Model	Provider	Score	Parameters	Released	Type
1	Kimi K2	Moonshot AI	44.9	1T total, 32B activated	2025-07-11	Text
2	Claude Opus 4.6	Anthropic	40	Unreleased	2026-02-05	Multimodal
3	Grok 4	xAI	38.6	Unknown	2025-07-09	Multimodal
4	Gemini 3 Pro	Google	37.5	Proprietary	2025-11-18	Multimodal
5	GPT-OSS-120B	OpenAI	19	117B total (5.1B active per token)	2025-08-05	Text
6	Gemini 2.5 Pro	Google	17.8		2025-05-06	Multimodal
7	o4-mini	OpenAI	17.7		2025-04-16	Multimodal
8	GPT-OSS-20B	OpenAI	17.3	21B total (3.6B active per token)	2025-08-05	Text
9	Nemotron 3 Nano	NVIDIA	15.5	31.6B (Total), ~3.2B (Active)	2025-12-15	Text
10	Gemini 2.5 Flash	Google	11		2025-05-20	Multimodal
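
Several rows in the Parameters column list both a total and an active (per-token) parameter count for Mixture-of-Experts models. As a rough illustration of what those figures imply, the sketch below computes the share of parameters active per token. The numbers are copied from the table, 1T is assumed to mean 1000B, and the names `MOE_MODELS`, `active_share`, and `active_shares` are hypothetical helpers introduced only for this example.

```python
# Active-parameter share for the MoE rows of the leaderboard table.
# Values are (total, active) in billions, copied from the Parameters column.
# Assumption: "1T total" is treated as 1000B.
MOE_MODELS = {
    "Kimi K2": (1000.0, 32.0),
    "GPT-OSS-120B": (117.0, 5.1),
    "GPT-OSS-20B": (21.0, 3.6),
    "Nemotron 3 Nano": (31.6, 3.2),
}


def active_share(total_b: float, active_b: float) -> float:
    """Return the percentage of parameters activated per token."""
    if total_b <= 0:
        raise ValueError("total parameter count must be positive")
    return 100.0 * active_b / total_b


def active_shares(models: dict) -> dict:
    """Map each model name to its active share, rounded to one decimal."""
    return {
        name: round(active_share(total, active), 1)
        for name, (total, active) in models.items()
    }


if __name__ == "__main__":
    for name, pct in active_shares(MOE_MODELS).items():
        print(f"{name}: {pct}% of parameters active per token")
```

The ratio only describes how sparse each model is per token. It says nothing about benchmark score, and rows without published parameter counts are left out rather than estimated.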
