LLM Benchmarks
Compare model performance across standardized benchmarks that test different capabilities.
Common LLM Benchmarks
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | o3 | OpenAI | 81.3 | | 2025-04-16 |
| 2 | Gemini 2.5 Pro | | 76.5 | | 2025-05-06 |
| 3 | o4-mini | OpenAI | 68.9 | | 2025-04-16 |
| 4 | Gemini 2.5 Flash | | 61.9 | | 2025-05-20 |
| 5 | Qwen-3 | Alibaba | 61.8 | 235B (22B active) | 2025-04-29 |
AIME-2024 Leaderboard
mathematics
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | GPT-OSS-120B | OpenAI | 96.6 | 117B total (5.1B active per token) | 2025-08-05 |
| 2 | GPT-OSS-20B | OpenAI | 96 | 21B total (3.6B active per token) | 2025-08-05 |
| 3 | Grok 3 Mini | xAI | 95.8 | Unknown | 2025-02-19 |
| 4 | o4-mini | OpenAI | 93.4 | | 2025-04-16 |
| 5 | Qwen-3 | Alibaba | 85.7 | 235B (22B active) | 2025-04-29 |
AIME-2025 Leaderboard
mathematics
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Nemotron 3 Nano | NVIDIA | 99.2 | 31.6B (Total), ~3.2B (Active) | 2025-12-15 |
| 2 | GPT-OSS-20B | OpenAI | 98.7 | 21B total (3.6B active per token) | 2025-08-05 |
| 3 | GPT-OSS-120B | OpenAI | 97.9 | 117B total (5.1B active per token) | 2025-08-05 |
| 4 | GLM-4.7 | Z.ai | 95.7 | Unreleased | 2025-12-22 |
| 5 | Gemini 3 Pro | | 95 | Proprietary | 2025-11-18 |
AIME Leaderboard
mathematics
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | o3 | OpenAI | 91.6 | | 2025-04-16 |
| 2 | o1-mini | OpenAI | 70 | | 2024-09-12 |
| 3 | o1-preview | OpenAI | 44.6 | | 2024-09-12 |
| 4 | Claude Opus 4 | Anthropic | 33.9 | | 2025-05-22 |
| 5 | Claude Sonnet 4 | Anthropic | 33.1 | | 2025-05-22 |
ARC Leaderboard
reasoning
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Claude 3 Opus | Anthropic | 96.4 | | 2024-03-04 |
| 2 | Claude 3 Sonnet | Anthropic | 93.2 | | 2024-03-04 |
| 3 | Nemotron 3 Nano | NVIDIA | 91.89 | 31.6B (Total), ~3.2B (Active) | 2025-12-15 |
| 4 | Claude 3 Haiku | Anthropic | 89.2 | | 2024-03-04 |
| 5 | Mixtral 8×22B | Mistral AI | 70 | 141B (39B active) | 2024-04-17 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Qwen-3 | Alibaba | 70.8 | 235B (22B active) | 2025-04-29 |
| 2 | Nemotron 3 Nano | NVIDIA | 53.8 | 31.6B (Total), ~3.2B (Active) | 2025-12-15 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet | Anthropic | 93.1 | | 2024-06-20 |
| 2 | Claude 3 Opus | Anthropic | 86.8 | | 2024-03-04 |
| 3 | Claude 3 Sonnet | Anthropic | 82.9 | | 2024-03-04 |
| 4 | Claude 3 Haiku | Anthropic | 73.7 | | 2024-03-04 |
| 5 | Gemini Diffusion | | 15 | | 2025-05-20 |
BIRD-SQL Leaderboard
database
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.0 Pro | | 59.3 | | 2025-02-05 |
| 2 | Gemini 2.0 Flash | | 58.7 | | 2025-02-25 |
| 3 | Gemini 2.0 Flash-Lite | | 57.4 | | 2025-02-25 |
BrowseComp Leaderboard
retrieval
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Kimi K2 | Moonshot AI | 60.2 | 1T total, 32B activated | 2025-07-11 |
| 2 | Gemini 3 Pro | | 59.2 | Proprietary | 2025-11-18 |
| 3 | GLM-4.7 | Z.ai | 52 | Unreleased | 2025-12-22 |
| 4 | o3 | OpenAI | 49.7 | | 2025-04-16 |
| 5 | o4-mini | OpenAI | 28.3 | | 2025-04-16 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | o4-mini | OpenAI | 2,719 | | 2025-04-16 |
| 2 | o3 | OpenAI | 2,706 | | 2025-04-16 |
| 3 | GPT-OSS-120B | OpenAI | 2,622 | 117B total (5.1B active per token) | 2025-08-05 |
| 4 | GPT-OSS-20B | OpenAI | 2,516 | 21B total (3.6B active per token) | 2025-08-05 |
| 5 | Qwen-3 | Alibaba | 2,056 | 235B (22B active) | 2025-04-29 |
CoVoST 2 Leaderboard
translation
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.0 Pro | | 40.6 | | 2025-02-05 |
| 2 | Gemini 2.0 Flash | | 39 | | 2025-02-25 |
| 3 | Gemini 2.0 Flash-Lite | | 38.4 | | 2025-02-25 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | o1-preview | OpenAI | 43 | | 2024-09-12 |
| 2 | o1-mini | OpenAI | 28.7 | | 2024-09-12 |
DROP Leaderboard
reasoning
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | DeepSeek-R1 | DeepSeek | 92.2 | 671B (37B activated) | 2025-01-20 |
| 2 | DeepSeek-V3 | DeepSeek | 91.6 | 671B total, 37B activated | 2024-12-26 |
| 3 | Claude 3.5 Sonnet | Anthropic | 87.1 | | 2024-06-20 |
| 4 | GPT-4o | OpenAI | 83.4 | | 2024-05-13 |
| 5 | Claude 3.5 Haiku | Anthropic | 83.1 | | 2024-10-22 |
EgoSchema Leaderboard
multimodal
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Grok 3 | xAI | 74.5 | Unknown (multi-trillion estimated) | 2025-02-19 |
| 2 | Grok 3 Mini | xAI | 74.3 | Unknown | 2025-02-19 |
| 3 | Gemini 2.0 Pro | | 71.9 | | 2025-02-05 |
| 4 | Gemini 2.0 Flash | | 71.1 | | 2025-02-25 |
| 5 | Gemini 2.0 Flash-Lite | | 67.2 | | 2025-02-25 |
FACTS Grounding Leaderboard
factuality
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | | 87.8 | | 2025-05-06 |
| 2 | Gemini 2.5 Flash | | 85.8 | | 2025-05-20 |
| 3 | Gemini 2.0 Flash | | 85.6 | | 2025-02-25 |
| 4 | Gemini 2.5 Flash-Lite | | 83.8 | | 2025-06-17 |
| 5 | Claude 3.5 Sonnet | Anthropic | 83.3 | | 2024-06-20 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.0 Pro | | 86.5 | | 2025-02-05 |
| 2 | Gemini 2.5 Flash-Lite | | 84.5 | | 2025-06-17 |
| 3 | Gemini 2.0 Flash | | 83.4 | | 2025-02-25 |
| 4 | Gemini 2.0 Flash-Lite | | 78.2 | | 2025-02-25 |
| 5 | Nemotron 3 Nano | NVIDIA | 74.47 | 31.6B (Total), ~3.2B (Active) | 2025-12-15 |
Global-MMLU Leaderboard
knowledge
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | | 88.6 | | 2025-05-06 |
| 2 | Gemma 3 | | 75.4 | 1B, 4B, 12B, 27B | 2025-03-12 |
GPQA Leaderboard
reasoning
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | | 91.9 | Proprietary | 2025-11-18 |
| 2 | Grok 4 | xAI | 87.5 | Unknown | 2025-07-09 |
| 3 | Claude 3.7 Sonnet | Anthropic | 84.8 | | 2025-02-24 |
| 4 | Grok 3 Mini | xAI | 84 | Unknown | 2025-02-19 |
| 5 | o3 | OpenAI | 83.3 | | 2025-04-16 |
GSM8K Leaderboard
mathematics
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet | Anthropic | 96.4 | | 2024-06-20 |
| 2 | Kimi K2 | Moonshot AI | 95 | 1T total, 32B activated | 2025-07-11 |
| 3 | Claude 3 Opus | Anthropic | 95 | | 2024-03-04 |
| 4 | Nemotron 3 Nano | NVIDIA | 92.34 | 31.6B (Total), ~3.2B (Active) | 2025-12-15 |
| 5 | Claude 3 Sonnet | Anthropic | 92.3 | | 2024-03-04 |
HellaSwag Leaderboard
common sense
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Claude 3 Opus | Anthropic | 95.4 | | 2024-03-04 |
| 2 | Claude 3 Sonnet | Anthropic | 89 | | 2024-03-04 |
| 3 | DeepSeek-V3 | DeepSeek | 88.9 | 671B total, 37B activated | 2024-12-26 |
| 4 | Mixtral 8×22B | Mistral AI | 88 | 141B (39B active) | 2024-04-17 |
| 5 | Claude 3 Haiku | Anthropic | 85.9 | | 2024-03-04 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.0 Pro | | 65.2 | | 2025-02-05 |
| 2 | Gemini 2.0 Flash | | 63.5 | | 2025-02-25 |
| 3 | Gemini 2.0 Flash-Lite | | 55.3 | | 2025-02-25 |
HMMT Leaderboard
mathematics
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | | 95.4 | Proprietary | 2025-11-18 |
| 2 | Kimi K2 | Moonshot AI | 89.3 | 1T total, 32B activated | 2025-07-11 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | o1-mini | OpenAI | 92.4 | | 2024-09-12 |
| 2 | o1-preview | OpenAI | 92.4 | | 2024-09-12 |
| 3 | Claude 3.5 Sonnet | Anthropic | 92 | | 2024-06-20 |
| 4 | GPT-4o | OpenAI | 90.2 | | 2024-05-13 |
| 5 | Gemini Diffusion | | 89.6 | | 2025-05-20 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Kimi K2 | Moonshot AI | 44.9 | 1T total, 32B activated | 2025-07-11 |
| 2 | Grok 4 | xAI | 38.6 | Unknown | 2025-07-09 |
| 3 | Gemini 3 Pro | | 37.5 | Proprietary | 2025-11-18 |
| 4 | GPT-OSS-120B | OpenAI | 19 | 117B total (5.1B active per token) | 2025-08-05 |
| 5 | Gemini 2.5 Pro | | 17.8 | | 2025-05-06 |
IMO-AnswerBench Leaderboard
mathematics
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | | 83.3 | Proprietary | 2025-11-18 |
| 2 | Kimi K2 | Moonshot AI | 78.6 | 1T total, 32B activated | 2025-07-11 |
LiveBench Leaderboard
multi-domain
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Grok 4 | xAI | 79.4 | Unknown | 2025-07-09 |
| 2 | Gemini 2.5 Pro | | 75.6 | | 2025-05-06 |
| 3 | Qwen-3 | Alibaba | 70.7 | 235B (22B active) | 2025-04-29 |
| 4 | Gemini 2.5 Flash | | 63.9 | | 2025-05-20 |
| 5 | Grok 3 | xAI | 57 | Unknown (multi-trillion estimated) | 2025-02-19 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | | 90.7 | Proprietary | 2025-11-18 |
| 2 | Kimi K2 | Moonshot AI | 83.1 | 1T total, 32B activated | 2025-07-11 |
| 3 | DeepSeek-V3 | DeepSeek | 46.9 | 671B total, 37B activated | 2024-12-26 |
| 4 | Claude Opus 4 | Anthropic | 44.7 | | 2025-05-22 |
| 5 | Gemini 2.5 Flash | | 44.7 | | 2025-05-20 |
LOFT (128k) Leaderboard
long-context
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Grok 3 | xAI | 83.3 | Unknown (multi-trillion estimated) | 2025-02-19 |
| 2 | Grok 3 Mini | xAI | 83.1 | Unknown | 2025-02-19 |
MATH 500 Leaderboard
mathematics
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | DeepSeek-R1 | DeepSeek | 97.3 | 671B (37B activated) | 2025-01-20 |
| 2 | o1-mini | OpenAI | 90 | | 2024-09-12 |
| 3 | o1-preview | OpenAI | 85.5 | | 2024-09-12 |
| 4 | Nemotron 3 Nano | NVIDIA | 78.63 | 31.6B (Total), ~3.2B (Active) | 2025-12-15 |
MATH Leaderboard
mathematics
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Kimi K2 | Moonshot AI | 97.4 | 1T total, 32B activated | 2025-07-11 |
| 2 | Claude 3.7 Sonnet | Anthropic | 96.2 | | 2025-02-24 |
| 3 | o1 | OpenAI | 94.8 | | 2024-09-12 |
| 4 | Gemini 2.0 Pro | | 91.8 | | 2025-02-05 |
| 5 | Gemini 2.0 Flash | | 90.9 | | 2025-02-25 |
MathVista Leaderboard
multimodal
MGSM Leaderboard
mathematics
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet | Anthropic | 91.6 | | 2024-06-20 |
| 2 | Claude 3 Opus | Anthropic | 90.7 | | 2024-03-04 |
| 3 | GPT-4o | OpenAI | 90.5 | | 2024-05-13 |
| 4 | Claude 3.5 Haiku | Anthropic | 85.6 | | 2024-10-22 |
| 5 | Claude 3 Sonnet | Anthropic | 83.5 | | 2024-03-04 |
MMLU-Pro Leaderboard
knowledge
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | | 90.1 | Proprietary | 2025-11-18 |
| 2 | GLM-4.7 | Z.ai | 87.5 | Unreleased | 2025-12-22 |
| 3 | Kimi K2 | Moonshot AI | 84.6 | 1T total, 32B activated | 2025-07-11 |
| 4 | DeepSeek-R1 | DeepSeek | 84 | 671B (37B activated) | 2025-01-20 |
| 5 | Grok 3 | xAI | 79.9 | Unknown (multi-trillion estimated) | 2025-02-19 |
MMLU Leaderboard
knowledge
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | o1 | OpenAI | 92.3 | | 2024-09-12 |
| 2 | DeepSeek-R1 | DeepSeek | 90.8 | 671B (37B activated) | 2025-01-20 |
| 3 | o1-preview | OpenAI | 90.8 | | 2024-09-12 |
| 4 | GPT-4.1 | OpenAI | 90.2 | | 2025-04-14 |
| 5 | GPT-OSS-120B | OpenAI | 90 | 117B total (5.1B active per token) | 2025-08-05 |
MMMU Leaderboard
multimodal
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | o3 | OpenAI | 82.9 | | 2025-04-16 |
| 2 | o4-mini | OpenAI | 81.6 | | 2025-04-16 |
| 3 | Gemini 2.5 Flash | | 79.7 | | 2025-05-20 |
| 4 | Gemini 2.5 Pro | | 79.6 | | 2025-05-06 |
| 5 | o1 | OpenAI | 78.2 | | 2024-09-12 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Flash | | 74 | | 2025-05-20 |
| 2 | Gemini 2.5 Flash-Lite | | 30.6 | | 2025-06-17 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | | 93 | | 2025-05-06 |
| 2 | Gemini 2.0 Pro | | 74.7 | | 2025-02-05 |
| 3 | Gemini 2.0 Flash | | 70.5 | | 2025-02-25 |
| 4 | Gemini 2.0 Flash-Lite | | 58 | | 2025-02-25 |
| 5 | Gemini 2.5 Flash | | 32 | | 2025-05-20 |
Multi-IF Leaderboard
instruction
PIQA Leaderboard
reasoning
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | DeepSeek-V3 | DeepSeek | 84.7 | 671B total, 37B activated | 2024-12-26 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | o3 | OpenAI | 56.51 | | 2025-04-16 |
| 2 | o4-mini | OpenAI | 42.99 | | 2025-04-16 |
| 3 | Nemotron 3 Nano | NVIDIA | 38.5 | 31.6B (Total), ~3.2B (Active) | 2025-12-15 |
SimpleQA Leaderboard
knowledge
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | | 50.8 | | 2025-05-06 |
| 2 | Gemini 2.0 Pro | | 44.3 | | 2025-02-05 |
| 3 | Grok 3 | xAI | 43.6 | Unknown (multi-trillion estimated) | 2025-02-19 |
| 4 | Kimi K2 | Moonshot AI | 31 | 1T total, 32B activated | 2025-07-11 |
| 5 | DeepSeek-R1 | DeepSeek | 30.1 | 671B (37B activated) | 2025-01-20 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | | 76.2 | Proprietary | 2025-11-18 |
| 2 | Claude Sonnet 4 | Anthropic | 72.7 | | 2025-05-22 |
| 3 | Claude Opus 4 | Anthropic | 72.5 | | 2025-05-22 |
| 4 | Claude 3.7 Sonnet | Anthropic | 70.3 | | 2025-02-24 |
| 5 | o3 | OpenAI | 69.1 | | 2025-04-16 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | | 90.7 | Proprietary | 2025-11-18 |
| 2 | Claude 3.7 Sonnet | Anthropic | 81.2 | | 2025-02-24 |
| 3 | Kimi K2 | Moonshot AI | 74.3 | 1T total, 32B activated | 2025-07-11 |
| 4 | o3 | OpenAI | 73.9 | | 2025-04-16 |
| 5 | o4-mini | OpenAI | 71.8 | | 2025-04-16 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | | 46.6 | Proprietary | 2025-11-18 |
| 2 | Claude Opus 4 | Anthropic | 43.2 | | 2025-05-22 |
| 3 | Kimi K2 | Moonshot AI | 33.15 | 1T total, 32B activated | 2025-07-11 |
| 4 | Nemotron 3 Nano | NVIDIA | 8.5 | 31.6B (Total), ~3.2B (Active) | 2025-12-15 |
TruthfulQA Leaderboard
truthfulness
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemma 3 | | 68.7 | 1B, 4B, 12B, 27B | 2025-03-12 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Grok 4 | xAI | 4,694.15 | Unknown | 2025-07-09 |
| 2 | Claude 3.5 Sonnet | Anthropic | 2,217.93 | | 2024-06-20 |
| 3 | Claude Opus 4 | Anthropic | 2,077.41 | | 2025-05-22 |
| 4 | o3 | OpenAI | 1,843.11 | | 2025-04-16 |
| 5 | Claude 3.7 Sonnet | Anthropic | 1,567.9 | | 2025-02-24 |
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | | 65.6 | | 2025-05-06 |
| 2 | Gemini 2.5 Flash | | 65.4 | | 2025-05-20 |
| 3 | Gemini 2.5 Flash-Lite | | 57.5 | | 2025-06-17 |
Video-MME Leaderboard
multimodal
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | | 84.8 | | 2025-05-06 |
WinoGrande Leaderboard
reasoning
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | DeepSeek-V3 | DeepSeek | 84.9 | 671B total, 37B activated | 2024-12-26 |
WMT24 Leaderboard
translation
| Rank | Model | Provider | Score | Parameters | Released |
|---|---|---|---|---|---|
| 1 | Gemma 3n | | 50.1 | 4B | 2025-05-20 |