MMLU-Pro is an enhanced benchmark with over 12,000 challenging questions across 14 domains: Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and Others. It features 10 answer choices per question (vs. 4 in MMLU) and focuses on complex reasoning tasks.

MMLU-Pro
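
One practical consequence of moving from 4 to 10 answer choices is a lower random-guessing baseline. The sketch below works through that arithmetic; it assumes every question has exactly the stated number of options and a single correct answer, and the helper name `chance_accuracy` is illustrative only, not part of any benchmark tooling.

```python
from fractions import Fraction


def chance_accuracy(num_choices: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a
    single-answer multiple-choice question."""
    if num_choices < 1:
        raise ValueError("num_choices must be positive")
    return Fraction(1, num_choices)


# Assumption: option counts are taken from the description above.
mmlu_chance = chance_accuracy(4)       # 4 options per question in MMLU
mmlu_pro_chance = chance_accuracy(10)  # 10 options per question in MMLU-Pro

print(f"MMLU chance level:     {float(mmlu_chance):.0%}")
print(f"MMLU-Pro chance level: {float(mmlu_pro_chance):.0%}")
```

Under these assumptions a random guesser scores 25% on MMLU but only 10% on MMLU-Pro, so a given score on MMLU-Pro sits further above chance than the same score on MMLU.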

Gemini 3 Pro is Google's flagship multimodal foundation model, released in November 2025. Built on a sparse Mixture-of-Experts (MoE) Transformer architecture, it features a 1 million token context window and native understanding of text, images, audio, and video. The model introduces 'Deep Think' capabilities for enhanced reasoning, controlled via a 'thinking_level' parameter, and is optimized for 'agentic' workflows and 'vibe coding', the generation of full applications from natural language. It supports advanced function calling and tool use, making it suitable for complex software engineering and long-context analysis tasks.

Gemini 3 Pro

Advanced coding-focused language model with significant improvements in multilingual agentic coding, terminal-based tasks, UI generation, tool use, and complex reasoning. Features Interleaved Thinking, Preserved Thinking, and Turn-level Thinking capabilities for enhanced stability in complex multi-turn tasks. Achieves substantial gains over GLM-4.6 across coding benchmarks, web browsing, and mathematical reasoning.

GLM-4.7

Moonshot's latest Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters. Achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. Meticulously optimized for agentic tasks with sophisticated tool-use capabilities and multi-turn interactions.

Kimi K2

DeepSeek-R1 is a first-generation reasoning model trained via large-scale reinforcement learning. Built on DeepSeek-V3-Base, it incorporates cold-start data before RL to address challenges like endless repetition and poor readability found in DeepSeek-R1-Zero. Achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks through advanced chain-of-thought reasoning capabilities.

DeepSeek-R1

xAI's most advanced model yet, blending superior reasoning with extensive pretraining knowledge. Trained on the Colossus supercluster with 10x the compute of previous state-of-the-art models. Features test-time compute and reasoning capabilities through reinforcement learning, allowing it to think for seconds to minutes while correcting errors and exploring alternatives. Achieved an Elo score of 1402 in the Chatbot Arena.

Grok 3

Google's best model yet for coding performance and complex prompts, with better understanding of, and reasoning about, world knowledge than any previous release. Features a 2 million token context window and the ability to call tools like Google Search and code execution.

Gemini 2.0 Pro

A smaller, more efficient version of Grok 3 from xAI. Represents a new frontier in cost-efficient reasoning, particularly strong on STEM tasks that don't require as much world knowledge. Also features test-time compute and reasoning capabilities through the Grok 3 mini (Think) variant, achieving impressive performance on mathematical and coding benchmarks while being more resource-efficient than the full Grok 3 model.

Grok 3 Mini

Nemotron 3 Nano is a 31.6B parameter hybrid large language model developed by NVIDIA, released on December 15, 2025. It employs a novel Hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture, activating only ~3.2B parameters per token for efficient inference. Optimized for agentic AI, reasoning, and coding, it supports a 1 million token context window and features a 'Reasoning ON/OFF' toggle with a configurable thinking budget. NVIDIA has released the model with open weights, training data, and training recipes under the NVIDIA Open Model License.

Nemotron 3 Nano

Part of the Gemini 2.0 generation, which was released in three versions (Flash, Flash-Lite, Pro). A multimodal model from Google, likely incorporating a large-scale Mixture-of-Experts architecture and aimed at both text and image understanding.

Gemini 2.0 Flash

DeepSeek-V3 is a powerful Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. It features Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, with auxiliary-loss-free load balancing and a multi-token prediction training objective. Pre-trained on 14.8T high-quality tokens; the full training run required only 2.788M H800 GPU hours.

DeepSeek-V3
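
The figures quoted for this model lend themselves to two quick back-of-the-envelope ratios: the share of parameters active per token, and the average number of training tokens processed per GPU hour. The sketch below is only that arithmetic; it treats the 2.788M GPU-hour figure as covering the 14.8T-token run, which is a simplifying assumption, and the variable names are illustrative.

```python
# Figures copied from the description above.
TOTAL_PARAMS = 671e9       # total parameters
ACTIVE_PARAMS = 37e9       # parameters activated per token
TRAINING_TOKENS = 14.8e12  # pre-training tokens
GPU_HOURS = 2.788e6        # H800 GPU hours (assumed to cover the whole run)

# Share of the model that participates in each forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Rough average throughput implied by the two training figures.
tokens_per_gpu_hour = TRAINING_TOKENS / GPU_HOURS

print(f"Active parameter share: {active_fraction:.1%}")
print(f"Tokens per GPU hour:    {tokens_per_gpu_hour / 1e6:.2f}M")
```

This gives roughly 5.5% of parameters active per token and about 5.31 million tokens per GPU hour, which is the sense in which a sparse MoE model is cheaper per token than a dense model of the same total size.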

Rank	Model	Provider	Score	Parameters	Released	Type
1	Gemini 3 Pro	Google	90.1	Proprietary	2025-11-18	Multimodal
2	GLM-4.7	Z.ai	87.5	Unreleased	2025-12-22	Text
3	Kimi K2	Moonshot AI	84.6	1T total, 32B activated	2025-07-11	Text
4	DeepSeek-R1	DeepSeek	84.0	671B total, 37B activated	2025-01-20	Text
5	Grok 3	xAI	79.9	Unknown (multi-trillion estimated)	2025-02-19	Multimodal
6	Gemini 2.0 Pro	Google	79.1	Not listed	2025-02-05	Multimodal
7	Grok 3 Mini	xAI	78.9	Unknown	2025-02-19	Multimodal
8	Nemotron 3 Nano	NVIDIA	78.3	31.6B total, ~3.2B activated	2025-12-15	Text
9	Gemini 2.0 Flash	Google	77.6	Not listed	2025-02-25	Multimodal
10	DeepSeek-V3	DeepSeek	75.9	671B total, 37B activated	2024-12-26	Text
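
For readers who want to work with these rankings programmatically, the following is a minimal sketch that parses the rank, model, provider, and score columns as tab-separated text, then derives each model's gap to the leader and the best score per provider. The data is copied from the table above; the variable names and the choice of derived metrics are assumptions made for illustration.

```python
import csv
import io

# Rank, model, provider, and score columns copied from the table above.
LEADERBOARD_TSV = (
    "Rank\tModel\tProvider\tScore\n"
    "1\tGemini 3 Pro\tGoogle\t90.1\n"
    "2\tGLM-4.7\tZ.ai\t87.5\n"
    "3\tKimi K2\tMoonshot AI\t84.6\n"
    "4\tDeepSeek-R1\tDeepSeek\t84.0\n"
    "5\tGrok 3\txAI\t79.9\n"
    "6\tGemini 2.0 Pro\tGoogle\t79.1\n"
    "7\tGrok 3 Mini\txAI\t78.9\n"
    "8\tNemotron 3 Nano\tNVIDIA\t78.3\n"
    "9\tGemini 2.0 Flash\tGoogle\t77.6\n"
    "10\tDeepSeek-V3\tDeepSeek\t75.9\n"
)

rows = list(csv.DictReader(io.StringIO(LEADERBOARD_TSV), delimiter="\t"))
top_score = max(float(row["Score"]) for row in rows)

# Gap between each model and the top-ranked model, in score points.
gap_to_leader = {
    row["Model"]: round(top_score - float(row["Score"]), 1) for row in rows
}

# Highest score achieved by any model from each provider.
best_by_provider = {}
for row in rows:
    provider, score = row["Provider"], float(row["Score"])
    best_by_provider[provider] = max(score, best_by_provider.get(provider, score))

for model, gap in gap_to_leader.items():
    print(f"{model:<18} {gap:>5.1f} points behind the leader")
```

On this data the spread between first and tenth place is 14.2 points, and the top three entries come from three different providers.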

Related Benchmarks

Global-MMLU-Lite

Global-MMLU

MMLU