MMLU
Category: Knowledge
Status: Pending Verification
Massive Multitask Language Understanding (MMLU) is a multiple-choice benchmark that tests knowledge across 57 subjects, including mathematics, history, law, and medicine; each question offers four answer options (see the prompt sketch below).
Published: 2020
Score Range: 0-100
Top Score: 92.3
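To make the four-option format concrete, here is a minimal sketch of the A/B/C/D prompt style commonly used for MMLU-style questions. The template, the helper name `format_mmlu_prompt`, and the sample question are illustrative assumptions; real evaluation harnesses differ in wording, few-shot examples, and answer extraction.

```python
# Illustrative sketch of a zero-shot MMLU-style prompt. The template and the
# example question are placeholders, not the exact format used by any
# particular evaluation harness.

def format_mmlu_prompt(subject: str, question: str, choices: list[str]) -> str:
    """Render one four-option multiple-choice question as an A/B/C/D prompt."""
    lines = [
        f"The following is a multiple choice question about {subject}.",
        "",
        question,
    ]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_mmlu_prompt(
        "high school mathematics",
        "What is the derivative of x^2 with respect to x?",
        ["x", "2x", "x^2 / 2", "2"],
    ))
```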
MMLU Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | o1 | OpenAI | 92.3 | | 2024-09-12 | Multimodal |
| 2 | DeepSeek-R1 | DeepSeek | 90.8 | 671B total, 37B activated | 2025-01-20 | Text |
| 3 | o1-preview | OpenAI | 90.8 | | 2024-09-12 | Text |
| 4 | GPT-4.1 | OpenAI | 90.2 | | 2025-04-14 | Multimodal |
| 5 | GPT-OSS-120B | OpenAI | 90.0 | 117B total, 5.1B active per token | 2025-08-05 | Text |
| 6 | Kimi K2 | Moonshot AI | 89.5 | 1T total, 32B activated | 2025-07-11 | Text |
| 7 | Claude 3.5 Sonnet | Anthropic | 88.7 | | 2024-06-20 | Multimodal |
| 8 | GPT-4o | OpenAI | 88.7 | | 2024-05-13 | Multimodal |
| 9 | DeepSeek-V3 | DeepSeek | 88.5 | 671B total, 37B activated | 2024-12-26 | Text |
| 10 | Gemini 2.5 Flash | Google | 88.4 | | 2025-05-20 | Multimodal |
About MMLU
Methodology
MMLU evaluates models on four-option multiple-choice questions drawn from its 57 subjects. Scores are reported on a 0-100 scale as the percentage of questions answered correctly, so higher scores indicate better performance and random guessing lands around 25. For details of the task construction and evaluation protocol, refer to the original paper.
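As a rough illustration of how a 0-100 score comes about, the sketch below computes per-subject accuracy and averages it across subjects. This is a toy under stated assumptions, not the official evaluation code: some reports macro-average over the 57 subjects as in the original paper, others micro-average over all questions, and the tiny results list here is a made-up illustration.

```python
# Toy sketch (not the official MMLU evaluation code): accuracy per subject,
# macro-averaged across subjects, reported on a 0-100 scale. Some leaderboards
# instead micro-average over all questions.
from collections import defaultdict

def mmlu_score(results: list[dict]) -> float:
    """results: one dict per question with keys 'subject',
    'predicted' (model's chosen letter), and 'answer' (correct letter)."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [num_correct, num_total]
    for r in results:
        tally = per_subject[r["subject"]]
        tally[0] += int(r["predicted"] == r["answer"])
        tally[1] += 1
    accuracies = [correct / total for correct, total in per_subject.values()]
    return 100.0 * sum(accuracies) / len(accuracies)

# Made-up example: 50% accuracy on law, 100% on history -> macro average 75.0.
example = [
    {"subject": "law", "predicted": "A", "answer": "A"},
    {"subject": "law", "predicted": "C", "answer": "B"},
    {"subject": "history", "predicted": "D", "answer": "D"},
    {"subject": "history", "predicted": "B", "answer": "B"},
]
print(mmlu_score(example))  # 75.0
```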
Publication
This benchmark was published in 2020. Technical paper: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020).