LLM Benchmarks

Compare model performance across standardized benchmarks that test different capabilities.

Common LLM Benchmarks

ARC

reasoning

AI2 Reasoning Challenge (ARC) tests reasoning through grade-school science questions.

Scale: 0-100 | Published: 2018

GPQA

reasoning

Graduate-Level Google-Proof Q&A (GPQA) evaluates advanced reasoning on graduate-level multiple-choice questions in biology, physics, and chemistry.

Scale: 0-100 | Published: 2023

HellaSwag

common sense

HellaSwag tests commonsense natural language inference by asking models to pick the most plausible continuation of a described scenario.

Scale: 0-100 | Published: 2019

HumanEval

coding

Evaluates code generation capabilities by asking models to complete Python functions based on docstrings and function signatures.

Scale: 0-100 | Published: 2021
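HumanEval results are commonly reported as pass@k: the probability that at least one of k sampled completions passes the unit tests for a problem. The sketch below implements the unbiased estimator from the original HumanEval paper and scales the mean to 0-100. It assumes the score on this page is a pass@k percentage, which the page itself does not state; the function names `pass_at_k` and `benchmark_score` are illustrative only.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that passed the tests,
    k: sampling budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: Iterable[Tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over (n, c) pairs, scaled to 0-100 (assumed scale)."""
    values = [pass_at_k(n, c, k) for n, c in results]
    if not values:
        raise ValueError("no results given")
    return 100.0 * sum(values) / len(values)
```

For example, `benchmark_score([(10, 3), (10, 10)], k=1)` averages a problem solved by 3 of 10 samples with one solved by all 10.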

MATH

mathematics

A dataset of 12,500 challenging competition mathematics problems requiring multi-step reasoning.

Scale: 0-100 | Published: 2021

MMLU

knowledge

Massive Multitask Language Understanding (MMLU) tests knowledge across 57 subjects including mathematics, history, law, and more.

Scale: 0-100 | Published: 2020
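For a multi-subject, multiple-choice benchmark such as MMLU, a 0-100 score is an accuracy percentage, and it can be averaged over all questions (micro) or over subjects (macro). The page does not say which averaging its scores use, so the sketch below computes both. The record layout and the name `multiple_choice_scores` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def multiple_choice_scores(
    records: Iterable[Tuple[str, str, str]],
) -> Tuple[Dict[str, float], float, float]:
    """Return (per-subject accuracy, micro average, macro average), all 0-100.

    Each record is (subject, predicted_choice, correct_choice).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for subject, predicted, answer in records:
        total[subject] += 1
        correct[subject] += int(predicted == answer)
    if not total:
        raise ValueError("no records given")
    per_subject = {s: 100.0 * correct[s] / total[s] for s in total}
    # Micro: every question counts equally.
    micro = 100.0 * sum(correct.values()) / sum(total.values())
    # Macro: every subject counts equally, regardless of its size.
    macro = sum(per_subject.values()) / len(per_subject)
    return per_subject, micro, macro
```

The two averages diverge when subjects differ in size, which is one reason scores for the same model can vary slightly between reports.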

Aider-Polyglot Leaderboard

coding
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o3 | OpenAI | 81.3 | - | 2025-04-16
2 | Gemini 2.5 Pro | Google | 76.5 | - | 2025-05-06
3 | OpenAI o4-mini | OpenAI | 68.9 | - | 2025-04-16
4 | Gemini 2.5 Flash | Google | 61.9 | - | 2025-05-20
5 | Qwen-3 | Alibaba | 61.8 | 235B (22B active) | 2025-04-29
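Every leaderboard on this page lists the same six fields: rank, model, provider, score, parameters, and release date. The sketch below shows one way such rows could be represented and ranked, using the Aider-Polyglot rows above as sample data. The `Entry` class and the helpers `parse_score` and `rank` are hypothetical names, not part of any published tooling, and ranking by descending score is an assumption about how the page orders rows.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    model: str
    provider: str
    score: float
    released: str                      # ISO date string as shown on the page
    parameters: Optional[str] = None   # None when the page lists no size


def parse_score(text: str) -> float:
    """Parse a score cell, tolerating thousands separators such as '2,719'."""
    return float(text.replace(",", ""))


def rank(entries: List[Entry]) -> List[Tuple[int, Entry]]:
    """Order entries by descending score and number them from 1."""
    ordered = sorted(entries, key=lambda e: e.score, reverse=True)
    return list(enumerate(ordered, start=1))


AIDER_POLYGLOT = [
    Entry("Qwen-3", "Alibaba", parse_score("61.8"), "2025-04-29", "235B (22B active)"),
    Entry("OpenAI o3", "OpenAI", parse_score("81.3"), "2025-04-16"),
    Entry("Gemini 2.5 Flash", "Google", parse_score("61.9"), "2025-05-20"),
    Entry("Gemini 2.5 Pro", "Google", parse_score("76.5"), "2025-05-06"),
    Entry("OpenAI o4-mini", "OpenAI", parse_score("68.9"), "2025-04-16"),
]

for position, entry in rank(AIDER_POLYGLOT):
    print(position, entry.model, entry.provider, entry.score)
```

Scores are only comparable within a single leaderboard: most use a 0-100 scale, but a few (Codeforces, SWE-Lancer) report values in the thousands.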

AIME-2024 Leaderboard

mathematics
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o4-mini | OpenAI | 93.4 | - | 2025-04-16
2 | Qwen-3 | Alibaba | 85.7 | 235B (22B active) | 2025-04-29
3 | Claude 3.7 Sonnet | Anthropic | 80 | - | 2025-02-24

AIME-2025 Leaderboard

mathematics
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o4-mini | OpenAI | 92.7 | - | 2025-04-16
2 | Gemini 2.5 Pro | Google | 83 | - | 2025-05-06
3 | Qwen-3 | Alibaba | 81.5 | 235B (22B active) | 2025-04-29
4 | Gemini 2.5 Flash | Google | 72 | - | 2025-05-20
5 | Gemini Diffusion | Google | 23.3 | - | 2025-05-20

AIME Leaderboard

mathematics
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o3 | OpenAI | 91.6 | - | 2025-04-16
2 | Claude Opus 4 | Anthropic | 33.9 | - | 2025-05-22
3 | Claude Sonnet 4 | Anthropic | 33.1 | - | 2025-05-22

ARC Leaderboard

reasoning
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3 Opus | Anthropic | 96.4 | - | 2024-03-04
2 | Claude 3 Sonnet | Anthropic | 93.2 | - | 2024-03-04
3 | Claude 3 Haiku | Anthropic | 89.2 | - | 2024-03-04

Berkeley Function-Calling Leaderboard

coding
Rank | Model | Provider | Score | Parameters | Released
1 | Qwen-3 | Alibaba | 70.8 | 235B (22B active) | 2025-04-29

BIG-bench Leaderboard

diverse
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3.5 Sonnet | Anthropic | 93.1 | - | 2024-06-20
2 | Claude 3 Opus | Anthropic | 86.8 | - | 2024-03-04
3 | Claude 3 Sonnet | Anthropic | 82.9 | - | 2024-03-04
4 | Claude 3 Haiku | Anthropic | 73.7 | - | 2024-03-04
5 | Gemini Diffusion | Google | 15 | - | 2025-05-20

BIRD-SQL Leaderboard

database
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.0 Pro | Google | 59.3 | - | 2025-02-05
2 | Gemini 2.0 Flash | Google | 58.7 | - | 2025-02-25
3 | Gemini 2.0 Flash-Lite | Google | 57.4 | - | 2025-02-25

BrowseComp Leaderboard

tool-use
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o3 | OpenAI | 49.7 | - | 2025-04-16
2 | OpenAI o4-mini | OpenAI | 28.3 | - | 2025-04-16

CharXiv-Reasoning Leaderboard

scientific
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o3 | OpenAI | 78.6 | - | 2025-04-16
2 | OpenAI o4-mini | OpenAI | 72 | - | 2025-04-16

Codeforces Leaderboard

coding
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o4-mini | OpenAI | 2,719 | - | 2025-04-16
2 | OpenAI o3 | OpenAI | 2,706 | - | 2025-04-16
3 | Qwen-3 | Alibaba | 2,056 | 235B (22B active) | 2025-04-29

CoVoST 2 Leaderboard

translation
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.0 Pro | Google | 40.6 | - | 2025-02-05
2 | Gemini 2.0 Flash | Google | 39 | - | 2025-02-25
3 | Gemini 2.0 Flash-Lite | Google | 38.4 | - | 2025-02-25

DROP Leaderboard

reasoning
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3.5 Sonnet | Anthropic | 87.1 | - | 2024-06-20
2 | Claude 3.5 Haiku | Anthropic | 83.1 | - | 2024-10-22
3 | Claude 3 Opus | Anthropic | 83.1 | - | 2024-03-04
4 | Claude 3 Sonnet | Anthropic | 78.9 | - | 2024-03-04
5 | Claude 3 Haiku | Anthropic | 78.4 | - | 2024-03-04

EgoSchema Leaderboard

multimodal
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.0 Pro | Google | 71.9 | - | 2025-02-05
2 | Gemini 2.0 Flash | Google | 71.1 | - | 2025-02-25
3 | Gemini 2.0 Flash-Lite | Google | 67.2 | - | 2025-02-25

FACTS-Grounding Leaderboard

factuality
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.5 Flash | Google | 85.3 | - | 2025-05-20
2 | Gemini 2.0 Flash | Google | 84.6 | - | 2025-02-25
3 | Gemini 2.0 Flash-Lite | Google | 83.6 | - | 2025-02-25
4 | Gemini 2.0 Pro | Google | 82.8 | - | 2025-02-05

Global-MMLU-Lite Leaderboard

knowledge
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.0 Pro | Google | 86.5 | - | 2025-02-05
2 | Gemini 2.0 Flash | Google | 83.4 | - | 2025-02-25
3 | Gemini 2.0 Flash-Lite | Google | 78.2 | - | 2025-02-25

Global-MMLU Leaderboard

knowledge
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.5 Pro | Google | 88.6 | - | 2025-05-06

GPQA Leaderboard

reasoning
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3.7 Sonnet | Anthropic | 84.8 | - | 2025-02-24
2 | OpenAI o3 | OpenAI | 83.3 | - | 2025-04-16
3 | Gemini 2.5 Pro | Google | 83 | - | 2025-05-06
4 | Gemini 2.5 Flash | Google | 82.8 | - | 2025-05-20
5 | OpenAI o4-mini | OpenAI | 81.4 | - | 2025-04-16

GSM8K Leaderboard

mathematics
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3.5 Sonnet | Anthropic | 96.4 | - | 2024-06-20
2 | Claude 3 Opus | Anthropic | 95 | - | 2024-03-04
3 | Claude 3 Sonnet | Anthropic | 92.3 | - | 2024-03-04
4 | Qwen-2 | Alibaba | 89.5 | 72B | 2024-06-11
5 | Claude 3 Haiku | Anthropic | 88.9 | - | 2024-03-04

HellaSwag Leaderboard

common sense
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3 Opus | Anthropic | 95.4 | - | 2024-03-04
2 | Claude 3 Sonnet | Anthropic | 89 | - | 2024-03-04
3 | Claude 3 Haiku | Anthropic | 85.9 | - | 2024-03-04

HiddenMath Leaderboard

mathematics
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.0 Pro | Google | 65.2 | - | 2025-02-05
2 | Gemini 2.0 Flash | Google | 63.5 | - | 2025-02-25
3 | Gemini 2.0 Flash-Lite | Google | 55.3 | - | 2025-02-25

HumanEval Leaderboard

coding
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3.5 Sonnet | Anthropic | 92 | - | 2024-06-20
2 | Gemini Diffusion | Google | 89.6 | - | 2025-05-20
3 | Claude 3.5 Haiku | Anthropic | 88.1 | - | 2024-10-22
4 | Claude 3 Opus | Anthropic | 84.9 | - | 2024-03-04
5 | Claude 3 Haiku | Anthropic | 75.9 | - | 2024-03-04

Humanity's Last Exam Leaderboard

diverse
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.5 Pro | Google | 17.8 | - | 2025-05-06
2 | OpenAI o4-mini | OpenAI | 17.7 | - | 2025-04-16
3 | Gemini 2.5 Flash | Google | 11 | - | 2025-05-20

LiveBench Leaderboard

multi-domain
Rank | Model | Provider | Score | Parameters | Released
1 | Qwen-3 | Alibaba | 77.1 | 235B (22B active) | 2025-04-29

LiveCodeBench-v5 Leaderboard

coding
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.5 Pro | Google | 75.6 | - | 2025-05-06
2 | Qwen-3 | Alibaba | 70.7 | 235B (22B active) | 2025-04-29
3 | Gemini 2.5 Flash | Google | 63.9 | - | 2025-05-20
4 | Gemini 2.0 Pro | Google | 36 | - | 2025-02-05
5 | Gemini 2.0 Flash | Google | 34.5 | - | 2025-02-25

MATH Leaderboard

mathematics
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3.7 Sonnet | Anthropic | 96.2 | - | 2025-02-24
2 | Gemini 2.0 Pro | Google | 91.8 | - | 2025-02-05
3 | Gemini 2.0 Flash | Google | 90.9 | - | 2025-02-25
4 | Gemini 2.0 Flash-Lite | Google | 86.8 | - | 2025-02-25
5 | Claude 3.5 Sonnet | Anthropic | 71.1 | - | 2024-06-20

MathVista Leaderboard

multimodal
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o3 | OpenAI | 86.8 | - | 2025-04-16
2 | OpenAI o4-mini | OpenAI | 84.3 | - | 2025-04-16

MGSM Leaderboard

mathematics
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3.5 Sonnet | Anthropic | 91.6 | - | 2024-06-20
2 | Claude 3 Opus | Anthropic | 90.7 | - | 2024-03-04
3 | Claude 3.5 Haiku | Anthropic | 85.6 | - | 2024-10-22
4 | Claude 3 Sonnet | Anthropic | 83.5 | - | 2024-03-04
5 | Claude 3 Haiku | Anthropic | 75.1 | - | 2024-03-04

MMLU-Pro Leaderboard

knowledge
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.0 Pro | Google | 79.1 | - | 2025-02-05
2 | Gemini 2.0 Flash | Google | 77.6 | - | 2025-02-25
3 | Gemini 2.0 Flash-Lite | Google | 71.6 | - | 2025-02-25
4 | Qwen-2 | Alibaba | 55.6 | 72B | 2024-06-11

MMLU Leaderboard

knowledge
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3.5 Sonnet | Anthropic | 88.7 | - | 2024-06-20
2 | Gemini 2.5 Flash | Google | 88.4 | - | 2025-05-20
3 | Claude Opus 4 | Anthropic | 87.4 | - | 2025-05-22
4 | Claude 3 Opus | Anthropic | 86.8 | - | 2024-03-04
5 | Claude 3.7 Sonnet | Anthropic | 86.1 | - | 2025-02-24

MMMU Leaderboard

multimodal
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o3 | OpenAI | 82.9 | - | 2025-04-16
2 | OpenAI o4-mini | OpenAI | 81.6 | - | 2025-04-16
3 | Gemini 2.5 Flash | Google | 79.7 | - | 2025-05-20
4 | Gemini 2.5 Pro | Google | 79.6 | - | 2025-05-06
5 | Claude 3.7 Sonnet | Anthropic | 75 | - | 2025-02-24

Michelangelo Long-Context Reasoning (128k) Leaderboard

reasoning
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.5 Flash | Google | 74 | - | 2025-05-20

Michelangelo Long-Context Reasoning (1M) Leaderboard

reasoning
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.5 Pro | Google | 93 | - | 2025-05-06
2 | Gemini 2.0 Pro | Google | 74.7 | - | 2025-02-05
3 | Gemini 2.0 Flash | Google | 70.5 | - | 2025-02-25
4 | Gemini 2.0 Flash-Lite | Google | 58 | - | 2025-02-25
5 | Gemini 2.5 Flash | Google | 32 | - | 2025-05-20

Multi-IF Leaderboard

instruction
Rank | Model | Provider | Score | Parameters | Released
1 | Qwen-3 | Alibaba | 71.9 | 235B (22B active) | 2025-04-29

Scale-MultiChallenge Leaderboard

diverse
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o3 | OpenAI | 56.51 | - | 2025-04-16
2 | OpenAI o4-mini | OpenAI | 42.99 | - | 2025-04-16

SimpleQA Leaderboard

knowledge
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.5 Pro | Google | 50.8 | - | 2025-05-06
2 | Gemini 2.0 Pro | Google | 44.3 | - | 2025-02-05
3 | Gemini 2.0 Flash | Google | 29.9 | - | 2025-02-25
4 | Gemini 2.5 Flash | Google | 26.9 | - | 2025-05-20
5 | Gemini 2.0 Flash-Lite | Google | 21.7 | - | 2025-02-25

SWE-bench Leaderboard

coding
Rank | Model | Provider | Score | Parameters | Released
1 | Claude Sonnet 4 | Anthropic | 72.7 | - | 2025-05-22
2 | Claude Opus 4 | Anthropic | 72.5 | - | 2025-05-22
3 | Claude 3.7 Sonnet | Anthropic | 70.3 | - | 2025-02-24
4 | OpenAI o3 | OpenAI | 69.1 | - | 2025-04-16
5 | OpenAI o4-mini | OpenAI | 68.1 | - | 2025-04-16

SWE-Lancer Leaderboard

coding
Rank | Model | Provider | Score | Parameters | Released
1 | OpenAI o3 | OpenAI | 66,250 | - | 2025-04-16
2 | OpenAI o4-mini | OpenAI | 56,375 | - | 2025-04-16

TAU-bench Leaderboard

tool-use
Rank | Model | Provider | Score | Parameters | Released
1 | Claude 3.7 Sonnet | Anthropic | 81.2 | - | 2025-02-24
2 | OpenAI o3 | OpenAI | 73.9 | - | 2025-04-16
3 | OpenAI o4-mini | OpenAI | 71.8 | - | 2025-04-16
4 | Claude 3.5 Haiku | Anthropic | 51 | - | 2024-10-22

Terminal-bench Leaderboard

system
Rank | Model | Provider | Score | Parameters | Released
1 | Claude Opus 4 | Anthropic | 43.2 | - | 2025-05-22

Vibe-Eval Leaderboard

style
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.5 Pro | Google | 65.6 | - | 2025-05-06
2 | Gemini 2.5 Flash | Google | 65.4 | - | 2025-05-20

Video-MME Leaderboard

multimodal
Rank | Model | Provider | Score | Parameters | Released
1 | Gemini 2.5 Pro | Google | 84.8 | - | 2025-05-06

WMT24 Leaderboard

translation
Rank | Model | Provider | Score | Parameters | Released
1 | Gemma 3n | Google | 50.1 | 4B | 2025-05-20