AIME-2024
Category: Mathematics
Status: Pending Verification
American Invitational Mathematics Examination (AIME) 2024 problems.
Published: 2024
Score Range: 0-100
Top Score: 96.6
AIME-2024 Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type
---|---|---|---|---|---|---
1 | GPT-OSS-120B | OpenAI | 96.6 | 117B total (5.1B active per token) | 2025-08-05 | Text
2 | GPT-OSS-20B | OpenAI | 96 | 21B total (3.6B active per token) | 2025-08-05 | Text
3 | Grok 3 Mini | xAI | 95.8 | Unknown | 2025-02-19 | Multimodal
4 | o4-mini | OpenAI | 93.4 | Unknown | 2025-04-16 | Multimodal
5 | Qwen-3 | Alibaba | 85.7 | 235B (22B active) | 2025-04-29 | Text
6 | o1 | OpenAI | 83.3 | Unknown | 2024-09-12 | Multimodal
7 | Claude 3.7 Sonnet | Anthropic | 80 | Unknown | 2025-02-24 | Multimodal
8 | DeepSeek-R1 | DeepSeek | 79.8 | 671B (37B activated) | 2025-01-20 | Text
9 | Kimi K2 | Moonshot AI | 69.6 | 1T total (32B activated) | 2025-07-11 | Text
10 | Grok 3 | xAI | 52.2 | Unknown (multi-trillion estimated) | 2025-02-19 | Multimodal
About AIME-2024
Methodology
AIME-2024 evaluates model performance on the problems from the 2024 American Invitational Mathematics Examination. Every AIME answer is an integer from 000 to 999, so responses can be graded by exact match against the official answer key. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, refer to the original paper.
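As a minimal sketch of this kind of scoring, the following assumes exact-match grading over the integer answers and a score equal to the percentage of problems solved; the function name and the equal-weighting assumption are illustrative, not taken from the benchmark's paper.

```python
def aime_score(predictions, answers):
    """Return a 0-100 score: the percentage of exact integer matches.

    Assumes each AIME answer is an integer in [0, 999] and every
    problem is weighted equally (an assumption for illustration).
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(int(p) == int(a) for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# Example: 12 of 15 problems answered correctly -> 80.0
print(aime_score([1] * 12 + [0] * 3, [1] * 12 + [999] * 3))
```

Under these assumptions a model that solves 12 of the 15 problems on one exam scores 80.0.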
Publication
This benchmark was published in 2024.