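GPQA (Graduate-Level Google-Proof Q&A) evaluates advanced reasoning on expert-written, graduate-level multiple-choice questions in biology, physics, and chemistry that are designed to be hard to answer even with unrestricted web search.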

GPQA

Gemini 3 Pro is Google's flagship multimodal foundation model, released in November 2025. Built on a sparse Mixture-of-Experts (MoE) Transformer architecture, it features a 1 million token context window and native understanding of text, images, audio, and video. The model introduces 'Deep Think' capabilities for enhanced reasoning, controlled via a 'thinking_level' parameter, and is optimized for 'agentic' workflows and 'vibe coding' (the generation of full applications from natural language). It supports advanced function calling and tool use, making it suitable for complex software engineering and long-context analysis tasks.

Gemini 3 Pro

Claude Opus 4.6 is Anthropic's most intelligent model, upgrading its predecessor's coding, reasoning, and agentic capabilities. It plans more carefully, sustains agentic tasks for longer, operates more reliably in larger codebases, and has superior code review and debugging skills. In a first for Opus-class models, it features a 1M token context window (beta) and 128K max output tokens. Opus 4.6 achieves state-of-the-art results on Terminal-Bench 2.0 (65.4%), leads on Humanity's Last Exam, BrowseComp, and GDPval-AA, and scores 76% on the 8-needle 1M variant of MRCR v2 (vs. 18.5% for Sonnet 4.5), representing a qualitative leap in long-context performance. It introduces adaptive thinking and four effort levels (low, medium, high, max) for fine-grained control over intelligence, speed, and cost.

Claude Opus 4.6

xAI's latest generation model with enhanced mathematical reasoning capabilities, showing significant improvements in competition-level mathematics benchmarks. Features 2x faster end-to-end latency, supports 5 different voices, and achieves 10x daily user seconds compared to previous models.

Grok 4

At its launch, Anthropic's most intelligent model to date and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. API users have fine-grained control over how long the model can think (up to 128K tokens). Priced at $3 per million input tokens and $15 per million output tokens at launch, with thinking tokens included in that pricing.

Claude 3.7 Sonnet
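The launch pricing quoted for Claude 3.7 Sonnet can be turned into a rough per-request cost estimate. The sketch below is illustrative only: the function name and the example token counts are hypothetical, and it assumes thinking tokens are billed at the output-token rate, which is one reading of "including thinking tokens".

```python
# Launch prices quoted in the description above (USD per million tokens).
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def request_cost_usd(input_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimate the cost of one request in USD.

    Assumption: thinking tokens are charged at the output rate.
    """
    billed_output = output_tokens + thinking_tokens
    total = input_tokens * INPUT_USD_PER_MTOK + billed_output * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical request: 2,000 input tokens, 500 visible output tokens,
# and 8,000 thinking tokens.
cost = request_cost_usd(2_000, 500, thinking_tokens=8_000)
print(f"Estimated cost: ${cost:.4f}")
```

Under these assumptions the thinking tokens dominate the bill, which is why a cap on thinking length matters for cost control.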

A smaller, more efficient version of Grok 3 from xAI. Represents a new frontier in cost-efficient reasoning, particularly strong on STEM tasks that don't require as much world knowledge. Also features test-time compute and reasoning capabilities through the Grok 3 mini (Think) variant, achieving impressive performance on mathematical and coding benchmarks while being more resource-efficient than the full Grok 3 model.

Grok 3 Mini

OpenAI's most powerful reasoning model that pushes the frontier across coding, math, science, visual perception, and more. Sets new state-of-the-art on benchmarks including Codeforces, SWE-bench, and MMMU. Features full tool access and can agentically use and combine every tool within ChatGPT.

o3

Gemini 2.5 Pro is capable of reasoning through its thoughts before responding, resulting in enhanced performance and improved accuracy. Features Deep Think, an enhanced reasoning mode, and native audio outputs that capture subtle nuances of speech.

Gemini 2.5 Pro

Improved across key benchmarks for reasoning, multimodality, code and long context while getting even more efficient. Best for fast performance on complex tasks.

Gemini 2.5 Flash

A smaller model optimized for fast, cost-efficient reasoning. Achieves remarkable performance for its size and cost, particularly in math, coding, and visual tasks. It is the best-performing benchmarked model on AIME 2024 and 2025, with significantly higher usage limits than o3.

o4-mini

GPT-OSS-120B is a state-of-the-art open-weight language model that delivers strong real-world performance at low cost. This mixture-of-experts model, with 117B total parameters and 5.1B active per token, achieves near-parity with OpenAI o4-mini on core reasoning benchmarks while running efficiently on a single 80 GB GPU. It was trained using reinforcement learning and techniques informed by OpenAI's most advanced internal models, including o3 and other frontier systems.

Rank	Model	Provider	Score	Parameters	Released	Type
1	Gemini 3 Pro	Google	91.9	Proprietary	2025-11-18	Multimodal
2	Claude Opus 4.6	Anthropic	91.3	Unreleased	2026-02-05	Multimodal
3	Grok 4	xAI	87.5	Unknown	2025-07-09	Multimodal
4	Claude 3.7 Sonnet	Anthropic	84.8		2025-02-24	Multimodal
5	Grok 3 Mini	xAI	84	Unknown	2025-02-19	Multimodal
6	o3	OpenAI	83.3		2025-04-16	Multimodal
7	Gemini 2.5 Pro	Google	83		2025-05-06	Multimodal
8	Gemini 2.5 Flash	Google	82.8		2025-05-20	Multimodal
9	o4-mini	OpenAI	81.4		2025-04-16	Multimodal
10	GPT-OSS-120B	OpenAI	80.1	117B total (5.1B active per token)	2025-08-05	Text
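For readers who want to work with the leaderboard programmatically, here is a minimal sketch that parses a tab-separated copy of the table and summarizes it. Only the Rank, Model, Provider, and Score columns are reproduced, and the helper names (`load_scores`, `best_per_provider`) are hypothetical, not part of any published tooling.

```python
import csv
import io

# Four columns copied from the leaderboard table above, tab-separated.
ROWS = [
    "Rank\tModel\tProvider\tScore",
    "1\tGemini 3 Pro\tGoogle\t91.9",
    "2\tClaude Opus 4.6\tAnthropic\t91.3",
    "3\tGrok 4\txAI\t87.5",
    "4\tClaude 3.7 Sonnet\tAnthropic\t84.8",
    "5\tGrok 3 Mini\txAI\t84",
    "6\to3\tOpenAI\t83.3",
    "7\tGemini 2.5 Pro\tGoogle\t83",
    "8\tGemini 2.5 Flash\tGoogle\t82.8",
    "9\to4-mini\tOpenAI\t81.4",
    "10\tGPT-OSS-120B\tOpenAI\t80.1",
]


def load_scores(lines):
    """Parse tab-separated rows into (model, provider, score) tuples."""
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    return [(row["Model"], row["Provider"], float(row["Score"])) for row in reader]


def best_per_provider(entries):
    """Return {provider: (model, score)} keeping each provider's top score."""
    best = {}
    for model, provider, score in entries:
        if provider not in best or score > best[provider][1]:
            best[provider] = (model, score)
    return best


entries = load_scores(ROWS)
best = best_per_provider(entries)
# Gap between the first and last listed scores, in points.
spread = round(entries[0][2] - entries[-1][2], 1)

for provider, (model, score) in best.items():
    print(f"{provider}: {model} ({score})")
print(f"Spread between rank 1 and rank 10: {spread} points")
```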

Related Benchmarks

AGIEval (English)

ARC

DROP