HumanEval
Coding · Pending Verification
Evaluates code generation capabilities by asking models to complete Python functions based on docstrings and function signatures.
Published: 2021
Score Range: 0-100
Top Score: 92.4
HumanEval Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type
---|---|---|---|---|---|---
1 | o1-mini | OpenAI | 92.4 | — | 2024-09-12 | Text
2 | o1-preview | OpenAI | 92.4 | — | 2024-09-12 | Text
3 | Claude 3.5 Sonnet | Anthropic | 92.0 | — | 2024-06-20 | Multimodal
4 | GPT-4o | OpenAI | 90.2 | — | 2024-05-13 | Multimodal
5 | Gemini Diffusion | Google DeepMind | 89.6 | — | 2025-05-20 | Text
6 | Grok-2 | xAI | 88.4 | Unknown | 2024-08-13 | Multimodal
7 | Claude 3.5 Haiku | Anthropic | 88.1 | — | 2024-10-22 | Multimodal
8 | Kimi K2 | Moonshot AI | 85.7 | 1T total, 32B activated | 2025-07-11 | Text
9 | Grok-2 mini | xAI | 85.7 | Unknown | 2024-08-13 | Multimodal
10 | Claude 3 Opus | Anthropic | 84.9 | — | 2024-03-04 | Multimodal
About HumanEval
Methodology
HumanEval consists of 164 hand-written Python programming problems. Each problem supplies a function signature and a docstring, and the model must generate the function body; a completion counts as correct only if it passes the problem's unit tests. Results are reported as pass@k (leaderboards typically report pass@1), expressed as a percentage from 0 to 100, where higher scores indicate better performance. For full details of the sampling and scoring procedure, refer to the original paper.
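As an illustration, the minimal sketch below shows a hypothetical HumanEval-style problem (function signature plus docstring), a functional-correctness check that runs candidate completions against unit tests, and the unbiased pass@k estimator described in the original paper. The prompt, candidate completions, tests, and helper names are invented for illustration and are not taken from the official benchmark or harness; a real harness would also sandbox the execution of model-generated code.

```python
import math

# --- A hypothetical HumanEval-style prompt: signature + docstring ---
PROMPT = '''
def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# Hypothetical model completions (function bodies) for the prompt above.
CANDIDATES = [
    "    return a + b",  # correct
    "    return a - b",  # incorrect
    "    return a + b",  # correct
]

# Illustrative unit tests used to judge functional correctness.
TESTS = "assert add(1, 2) == 3\nassert add(-1, 1) == 0"


def is_correct(prompt: str, completion: str, tests: str) -> bool:
    """Run the completed function against the unit tests in a fresh namespace."""
    program = prompt + completion + "\n" + tests
    namespace: dict = {}
    try:
        exec(program, namespace)  # NOTE: real harnesses sandbox this step
        return True
    except Exception:
        return False


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), given n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    n = len(CANDIDATES)
    c = sum(is_correct(PROMPT, body, TESTS) for body in CANDIDATES)
    print(f"{c}/{n} samples passed; pass@1 ≈ {pass_at_k(n, c, 1):.3f}")
```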
Publication
This benchmark was published in 2021. Technical Paper: "Evaluating Large Language Models Trained on Code" (Chen et al., 2021).