SWE-bench
Software Engineering Benchmark (SWE-bench) evaluates models on real-world software engineering tasks: resolving actual GitHub issues from popular open-source Python repositories.
Published: 2023
Score Range: 0-100
Top Score: 76.2
SWE-bench Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | Google | 76.2 | Proprietary | 2025-11-18 | Multimodal |
| 2 | Claude Sonnet 4 | Anthropic | 72.7 | Proprietary | 2025-05-22 | Multimodal |
| 3 | Claude Opus 4 | Anthropic | 72.5 | Proprietary | 2025-05-22 | Multimodal |
| 4 | Claude 3.7 Sonnet | Anthropic | 70.3 | Proprietary | 2025-02-24 | Multimodal |
| 5 | o3 | OpenAI | 69.1 | Proprietary | 2025-04-16 | Multimodal |
| 6 | o4-mini | OpenAI | 68.1 | Proprietary | 2025-04-16 | Multimodal |
| 7 | Gemini 2.5 Pro | Google | 63.2 | Proprietary | 2025-05-06 | Multimodal |
| 8 | Gemini 2.5 Flash | Google | 60.4 | Proprietary | 2025-05-20 | Multimodal |
| 9 | DeepSeek-R1 | DeepSeek | 49.2 | 671B (37B activated) | 2025-01-20 | Text |
| 10 | Gemini 2.5 Flash-Lite | Google | 44.9 | Proprietary | 2025-06-17 | Multimodal |
About SWE-bench
Methodology
SWE-bench reports the percentage of tasks a model resolves, on a scale of 0 to 100, where higher scores indicate better performance. Each task pairs a real GitHub issue with the repository snapshot in which it was filed; the model must generate a patch, and the task counts as resolved only if the patch applies and the repository's tests pass afterward, including the tests that originally reproduced the issue. For detailed information about the scoring system and methodology, refer to the original paper.
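To make the scoring rule concrete, here is a minimal Python sketch of how a percent-resolved score could be computed from per-task outcomes. It is illustrative only: the `TaskResult` fields and the instance IDs are assumptions, not the official harness API, and the real evaluation additionally builds each repository's environment and executes its test suite to produce these outcomes.

```python
"""Minimal sketch of SWE-bench-style scoring (illustrative, not the official harness)."""
from dataclasses import dataclass

@dataclass
class TaskResult:
    instance_id: str                # e.g. "django__django-2" (hypothetical ID)
    patch_applied: bool             # did the model's diff apply cleanly?
    tests_passed: dict[str, bool]   # outcomes of the required tests

def is_resolved(r: TaskResult) -> bool:
    """A task is resolved only if the patch applied and every required test passed."""
    return r.patch_applied and all(r.tests_passed.values())

def score(results: list[TaskResult]) -> float:
    """Leaderboard score: percentage of tasks resolved, on a 0-100 scale."""
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)

# Example: 2 of 3 tasks resolved -> 66.7
demo = [
    TaskResult("astropy__astropy-1", True, {"test_a": True, "test_b": True}),
    TaskResult("django__django-2", True, {"test_c": False}),
    TaskResult("flask__flask-3", True, {"test_d": True}),
]
print(f"{score(demo):.1f}")  # 66.7
```

Under this rule a partially correct patch earns no credit: a single failing required test marks the whole task unresolved, which is why leaderboard scores remain well below 100 even for top models.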
Publication
This benchmark was published in 2023. Technical paper: "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (Jimenez et al., 2023).