LLM Benchmarks
Compare model performance across standardized benchmarks that test different capabilities.
Common LLM Benchmarks
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | o3 | OpenAI | 81.3 | | 2025-04-16 |
2 | Gemini 2.5 Pro | Google | 76.5 | | 2025-05-06 |
3 | o4-mini | OpenAI | 68.9 | | 2025-04-16 |
4 | Gemini 2.5 Flash | Google | 61.9 | | 2025-05-20 |
5 | Qwen-3 | Alibaba | 61.8 | 235B (22B active) | 2025-04-29 |
AIME-2024 Leaderboard
Category: mathematics

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Grok 3 Mini | xAI | 95.8 | Unknown | 2025-02-19 |
2 | o4-mini | OpenAI | 93.4 | | 2025-04-16 |
3 | Qwen-3 | Alibaba | 85.7 | 235B (22B active) | 2025-04-29 |
4 | o1 | OpenAI | 83.3 | | 2024-09-12 |
5 | Claude 3.7 Sonnet | Anthropic | 80 | | 2025-02-24 |
AIME-2025 Leaderboard
Category: mathematics

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Grok 3 | xAI | 93.3 | Unknown (multi-trillion estimated) | 2025-02-19 |
2 | o4-mini | OpenAI | 92.7 | | 2025-04-16 |
3 | Grok 4 | xAI | 91.7 | Unknown | 2025-07-09 |
4 | Grok 3 Mini | xAI | 90.8 | Unknown | 2025-02-19 |
5 | Gemini 2.5 Pro | Google | 83 | | 2025-05-06 |
AIME Leaderboard
Category: mathematics

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | o3 | OpenAI | 91.6 | | 2025-04-16 |
2 | o1-mini | OpenAI | 70 | | 2024-09-12 |
3 | o1-preview | OpenAI | 44.6 | | 2024-09-12 |
4 | Claude Opus 4 | Anthropic | 33.9 | | 2025-05-22 |
5 | Claude Sonnet 4 | Anthropic | 33.1 | | 2025-05-22 |
ARC Leaderboard
Category: reasoning

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Claude 3 Opus | Anthropic | 96.4 | | 2024-03-04 |
2 | Claude 3 Sonnet | Anthropic | 93.2 | | 2024-03-04 |
3 | Claude 3 Haiku | Anthropic | 89.2 | | 2024-03-04 |
4 | Mixtral 8×22B | Mistral AI | 70 | 141B (39B active) | 2024-04-17 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Qwen-3 | Alibaba | 70.8 | 235B (22B active) | 2025-04-29 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Claude 3.5 Sonnet | Anthropic | 93.1 | | 2024-06-20 |
2 | Claude 3 Opus | Anthropic | 86.8 | | 2024-03-04 |
3 | Claude 3 Sonnet | Anthropic | 82.9 | | 2024-03-04 |
4 | Claude 3 Haiku | Anthropic | 73.7 | | 2024-03-04 |
5 | Gemini Diffusion | Google | 15 | | 2025-05-20 |
BIRD-SQL Leaderboard
Category: database

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.0 Pro | Google | 59.3 | | 2025-02-05 |
2 | Gemini 2.0 Flash | Google | 58.7 | | 2025-02-25 |
3 | Gemini 2.0 Flash-Lite | Google | 57.4 | | 2025-02-25 |
BrowseComp Leaderboard
Category: retrieval

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | o4-mini | OpenAI | 2,719 | | 2025-04-16 |
2 | o3 | OpenAI | 2,706 | | 2025-04-16 |
3 | Qwen-3 | Alibaba | 2,056 | 235B (22B active) | 2025-04-29 |
4 | DeepSeek-R1 | DeepSeek | 2,029 | 671B (37B activated) | 2025-01-20 |
5 | o1 | OpenAI | 1,673 | | 2024-09-12 |
CoVoST 2 Leaderboard
Category: translation

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.0 Pro | Google | 40.6 | | 2025-02-05 |
2 | Gemini 2.0 Flash | Google | 39 | | 2025-02-25 |
3 | Gemini 2.0 Flash-Lite | Google | 38.4 | | 2025-02-25 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | o1-preview | OpenAI | 43 | | 2024-09-12 |
2 | o1-mini | OpenAI | 28.7 | | 2024-09-12 |
DROP Leaderboard
Category: reasoning

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | DeepSeek-R1 | DeepSeek | 92.2 | 671B (37B activated) | 2025-01-20 |
2 | DeepSeek-V3 | DeepSeek | 91.6 | 671B total, 37B activated | 2024-12-26 |
3 | Claude 3.5 Sonnet | Anthropic | 87.1 | | 2024-06-20 |
4 | GPT-4o | OpenAI | 83.4 | | 2024-05-13 |
5 | Claude 3.5 Haiku | Anthropic | 83.1 | | 2024-10-22 |
EgoSchema Leaderboard
Category: multimodal

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Grok 3 | xAI | 74.5 | Unknown (multi-trillion estimated) | 2025-02-19 |
2 | Grok 3 Mini | xAI | 74.3 | Unknown | 2025-02-19 |
3 | Gemini 2.0 Pro | Google | 71.9 | | 2025-02-05 |
4 | Gemini 2.0 Flash | Google | 71.1 | | 2025-02-25 |
5 | Gemini 2.0 Flash-Lite | Google | 67.2 | | 2025-02-25 |
FACTS Grounding Leaderboard
Category: factuality

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.5 Pro | Google | 87.8 | | 2025-05-06 |
2 | Gemini 2.5 Flash | Google | 85.8 | | 2025-05-20 |
3 | Gemini 2.0 Flash | Google | 85.6 | | 2025-02-25 |
4 | Gemini 2.5 Flash-Lite | Google | 83.8 | | 2025-06-17 |
5 | Claude 3.5 Sonnet | Anthropic | 83.3 | | 2024-06-20 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.0 Pro | Google | 86.5 | | 2025-02-05 |
2 | Gemini 2.5 Flash-Lite | Google | 84.5 | | 2025-06-17 |
3 | Gemini 2.0 Flash | Google | 83.4 | | 2025-02-25 |
4 | Gemini 2.0 Flash-Lite | Google | 78.2 | | 2025-02-25 |
Global-MMLU Leaderboard
Category: knowledge

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.5 Pro | Google | 88.6 | | 2025-05-06 |
2 | Gemma 3 | Google | 75.4 | 1B, 4B, 12B, 27B | 2025-03-12 |
GPQA Leaderboard
Category: reasoning

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Grok 4 | xAI | 87.5 | Unknown | 2025-07-09 |
2 | Claude 3.7 Sonnet | Anthropic | 84.8 | | 2025-02-24 |
3 | Grok 3 Mini | xAI | 84 | Unknown | 2025-02-19 |
4 | o3 | OpenAI | 83.3 | | 2025-04-16 |
5 | Gemini 2.5 Pro | Google | 83 | | 2025-05-06 |
GSM8K Leaderboard
Category: mathematics

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Claude 3.5 Sonnet | Anthropic | 96.4 | | 2024-06-20 |
2 | Kimi K2 | Moonshot AI | 95 | 1T total, 32B activated | 2025-07-11 |
3 | Claude 3 Opus | Anthropic | 95 | | 2024-03-04 |
4 | Claude 3 Sonnet | Anthropic | 92.3 | | 2024-03-04 |
5 | Qwen-2 | Alibaba | 89.5 | 72B | 2024-06-11 |
HellaSwag Leaderboard
Category: common sense

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Claude 3 Opus | Anthropic | 95.4 | | 2024-03-04 |
2 | Claude 3 Sonnet | Anthropic | 89 | | 2024-03-04 |
3 | DeepSeek-V3 | DeepSeek | 88.9 | 671B total, 37B activated | 2024-12-26 |
4 | Mixtral 8×22B | Mistral AI | 88 | 141B (39B active) | 2024-04-17 |
5 | Claude 3 Haiku | Anthropic | 85.9 | | 2024-03-04 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.0 Pro | Google | 65.2 | | 2025-02-05 |
2 | Gemini 2.0 Flash | Google | 63.5 | | 2025-02-25 |
3 | Gemini 2.0 Flash-Lite | Google | 55.3 | | 2025-02-25 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | o1-mini | OpenAI | 92.4 | | 2024-09-12 |
2 | o1-preview | OpenAI | 92.4 | | 2024-09-12 |
3 | Claude 3.5 Sonnet | Anthropic | 92 | | 2024-06-20 |
4 | GPT-4o | OpenAI | 90.2 | | 2024-05-13 |
5 | Gemini Diffusion | Google | 89.6 | | 2025-05-20 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Grok 4 | xAI | 38.6 | Unknown | 2025-07-09 |
2 | Gemini 2.5 Pro | Google | 17.8 | | 2025-05-06 |
3 | o4-mini | OpenAI | 17.7 | | 2025-04-16 |
4 | Gemini 2.5 Flash | Google | 11 | | 2025-05-20 |
5 | Gemini 2.5 Flash-Lite | Google | 6.9 | | 2025-06-17 |
LiveBench Leaderboard
Category: multi-domain

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Grok 4 | xAI | 79.4 | Unknown | 2025-07-09 |
2 | Gemini 2.5 Pro | Google | 75.6 | | 2025-05-06 |
3 | Qwen-3 | Alibaba | 70.7 | 235B (22B active) | 2025-04-29 |
4 | Gemini 2.5 Flash | Google | 63.9 | | 2025-05-20 |
5 | Grok 3 | xAI | 57 | Unknown (multi-trillion estimated) | 2025-02-19 |
LOFT (128k) Leaderboard
Category: long-context

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Grok 3 | xAI | 83.3 | Unknown (multi-trillion estimated) | 2025-02-19 |
2 | Grok 3 Mini | xAI | 83.1 | Unknown | 2025-02-19 |
MATH 500 Leaderboard
Category: mathematics

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | DeepSeek-R1 | DeepSeek | 97.3 | 671B (37B activated) | 2025-01-20 |
2 | o1-mini | OpenAI | 90 | | 2024-09-12 |
3 | o1-preview | OpenAI | 85.5 | | 2024-09-12 |
MATH Leaderboard
Category: mathematics

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Kimi K2 | Moonshot AI | 97.4 | 1T total, 32B activated | 2025-07-11 |
2 | Claude 3.7 Sonnet | Anthropic | 96.2 | | 2025-02-24 |
3 | o1 | OpenAI | 94.8 | | 2024-09-12 |
4 | Gemini 2.0 Pro | Google | 91.8 | | 2025-02-05 |
5 | Gemini 2.0 Flash | Google | 90.9 | | 2025-02-25 |
MathVista Leaderboard
Category: multimodal
MGSM Leaderboard
Category: mathematics

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Claude 3.5 Sonnet | Anthropic | 91.6 | | 2024-06-20 |
2 | Claude 3 Opus | Anthropic | 90.7 | | 2024-03-04 |
3 | GPT-4o | OpenAI | 90.5 | | 2024-05-13 |
4 | Claude 3.5 Haiku | Anthropic | 85.6 | | 2024-10-22 |
5 | Claude 3 Sonnet | Anthropic | 83.5 | | 2024-03-04 |
MMLU-Pro Leaderboard
Category: knowledge

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | DeepSeek-R1 | DeepSeek | 84 | 671B (37B activated) | 2025-01-20 |
2 | Kimi K2 | Moonshot AI | 81.1 | 1T total, 32B activated | 2025-07-11 |
3 | Grok 3 | xAI | 79.9 | Unknown (multi-trillion estimated) | 2025-02-19 |
4 | Gemini 2.0 Pro | Google | 79.1 | | 2025-02-05 |
5 | Grok 3 Mini | xAI | 78.9 | Unknown | 2025-02-19 |
MMLU Leaderboard
Category: knowledge

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | o1 | OpenAI | 92.3 | | 2024-09-12 |
2 | DeepSeek-R1 | DeepSeek | 90.8 | 671B (37B activated) | 2025-01-20 |
3 | o1-preview | OpenAI | 90.8 | | 2024-09-12 |
4 | GPT-4.1 | OpenAI | 90.2 | | 2025-04-14 |
5 | Kimi K2 | Moonshot AI | 89.5 | 1T total, 32B activated | 2025-07-11 |
MMMU Leaderboard
Category: multimodal

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | o3 | OpenAI | 82.9 | | 2025-04-16 |
2 | o4-mini | OpenAI | 81.6 | | 2025-04-16 |
3 | Gemini 2.5 Flash | Google | 79.7 | | 2025-05-20 |
4 | Gemini 2.5 Pro | Google | 79.6 | | 2025-05-06 |
5 | o1 | OpenAI | 78.2 | | 2024-09-12 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.5 Flash | Google | 74 | | 2025-05-20 |
2 | Gemini 2.5 Flash-Lite | Google | 30.6 | | 2025-06-17 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.5 Pro | Google | 93 | | 2025-05-06 |
2 | Gemini 2.0 Pro | Google | 74.7 | | 2025-02-05 |
3 | Gemini 2.0 Flash | Google | 70.5 | | 2025-02-25 |
4 | Gemini 2.0 Flash-Lite | Google | 58 | | 2025-02-25 |
5 | Gemini 2.5 Flash | Google | 32 | | 2025-05-20 |
Multi-IF Leaderboard
Category: instruction
SimpleQA Leaderboard
Category: knowledge

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.5 Pro | Google | 50.8 | | 2025-05-06 |
2 | Gemini 2.0 Pro | Google | 44.3 | | 2025-02-05 |
3 | Grok 3 | xAI | 43.6 | Unknown (multi-trillion estimated) | 2025-02-19 |
4 | Kimi K2 | Moonshot AI | 31 | 1T total, 32B activated | 2025-07-11 |
5 | DeepSeek-R1 | DeepSeek | 30.1 | 671B (37B activated) | 2025-01-20 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Claude Sonnet 4 | Anthropic | 72.7 | | 2025-05-22 |
2 | Claude Opus 4 | Anthropic | 72.5 | | 2025-05-22 |
3 | Claude 3.7 Sonnet | Anthropic | 70.3 | | 2025-02-24 |
4 | o3 | OpenAI | 69.1 | | 2025-04-16 |
5 | o4-mini | OpenAI | 68.1 | | 2025-04-16 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Claude 3.7 Sonnet | Anthropic | 81.2 | | 2025-02-24 |
2 | o3 | OpenAI | 73.9 | | 2025-04-16 |
3 | o4-mini | OpenAI | 71.8 | | 2025-04-16 |
4 | Claude 3.5 Haiku | Anthropic | 51 | | 2024-10-22 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Claude Opus 4 | Anthropic | 43.2 | | 2025-05-22 |
TruthfulQA Leaderboard
Category: truthfulness

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemma 3 | Google | 68.7 | 1B, 4B, 12B, 27B | 2025-03-12 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Grok 4 | xAI | 4,694.15 | Unknown | 2025-07-09 |
2 | Claude 3.5 Sonnet | Anthropic | 2,217.93 | | 2024-06-20 |
3 | Claude Opus 4 | Anthropic | 2,077.41 | | 2025-05-22 |
4 | o3 | OpenAI | 1,843.11 | | 2025-04-16 |
5 | Claude 3.7 Sonnet | Anthropic | 1,567.9 | | 2025-02-24 |
Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.5 Pro | Google | 65.6 | | 2025-05-06 |
2 | Gemini 2.5 Flash | Google | 65.4 | | 2025-05-20 |
3 | Gemini 2.5 Flash-Lite | Google | 57.5 | | 2025-06-17 |
Video-MME Leaderboard
Category: multimodal

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemini 2.5 Pro | Google | 84.8 | | 2025-05-06 |
WMT24 Leaderboard
Category: translation

Rank | Model | Provider | Score | Parameters | Released |
---|---|---|---|---|---|
1 | Gemma 3n | Google | 50.1 | 4B | 2025-05-20 |
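
To compare models programmatically, the leaderboard rows can be read into simple records. The following is a minimal sketch, not part of the original page: it assumes each row carries all six cells in header order (Rank, Model, Provider, Score, Parameters, Released) with an empty cell where a value is unpublished, and the names `Entry`, `parse_row`, and `best_per_provider` are hypothetical helpers introduced only for illustration. The sample rows are taken from the AIME-2024 leaderboard.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    rank: int
    model: str
    provider: str
    score: float
    parameters: str  # empty string when the size is not published
    released: str    # ISO date string, e.g. "2025-04-16"


def parse_row(line: str) -> Entry:
    """Parse one pipe-delimited leaderboard row (hypothetical helper)."""
    cells = [cell.strip() for cell in line.split("|")]
    # A trailing pipe produces one extra empty cell, so keep the first six.
    rank, model, provider, score, parameters, released = cells[:6]
    # Some leaderboards print scores with thousands separators ("2,719").
    return Entry(int(rank), model, provider,
                 float(score.replace(",", "")), parameters, released)


def best_per_provider(entries: list[Entry]) -> dict[str, Entry]:
    """Return the highest-scoring entry for each provider."""
    best: dict[str, Entry] = {}
    for entry in entries:
        current = best.get(entry.provider)
        if current is None or entry.score > current.score:
            best[entry.provider] = entry
    return best


# Sample rows from the AIME-2024 leaderboard.
ROWS = [
    "1 | Grok 3 Mini | xAI | 95.8 | Unknown | 2025-02-19 |",
    "2 | o4-mini | OpenAI | 93.4 | | 2025-04-16 |",
    "3 | Qwen-3 | Alibaba | 85.7 | 235B (22B active) | 2025-04-29 |",
    "4 | o1 | OpenAI | 83.3 | | 2024-09-12 |",
]

entries = [parse_row(row) for row in ROWS]

if __name__ == "__main__":
    for provider, entry in best_per_provider(entries).items():
        print(f"{provider}: {entry.model} ({entry.score})")
```

Scores are only comparable within a single leaderboard: most tables report percentages, while a few report rating-style values in the thousands, so a helper like this should be applied one table at a time.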