AIME-2025
Category: Mathematics
American Invitational Mathematics Examination (AIME) 2025 problems.
Published: 2025
Score Range: 0-100
Top Score: 99.2
AIME-2025 Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Nemotron 3 Nano | NVIDIA | 99.2 | 31.6B total, ~3.2B active | 2025-12-15 | Text |
| 2 | GPT-OSS-20B | OpenAI | 98.7 | 21B total, 3.6B active per token | 2025-08-05 | Text |
| 3 | GPT-OSS-120B | OpenAI | 97.9 | 117B total, 5.1B active per token | 2025-08-05 | Text |
| 4 | GLM-4.7 | Z.ai | 95.7 | Unreleased | 2025-12-22 | Text |
| 5 | Gemini 3 Pro | Google | 95.0 | Proprietary | 2025-11-18 | Multimodal |
| 6 | Kimi K2 | Moonshot AI | 94.5 | 1T total, 32B active | 2025-07-11 | Text |
| 7 | Grok 3 | xAI | 93.3 | Unknown (multi-trillion estimated) | 2025-02-19 | Multimodal |
| 8 | o4-mini | OpenAI | 92.7 | Undisclosed | 2025-04-16 | Multimodal |
| 9 | Grok 4 | xAI | 91.7 | Unknown | 2025-07-09 | Multimodal |
| 10 | Grok 3 Mini | xAI | 90.8 | Unknown | 2025-02-19 | Multimodal |
About AIME-2025
Methodology
AIME-2025 evaluates models on the problems from the 2025 American Invitational Mathematics Examination: two exams (AIME I and AIME II) of 15 problems each, where every answer is an integer from 0 to 999. Scores are reported on a 0-100 scale corresponding to the percentage of problems answered correctly; higher scores indicate better performance. For full details of the scoring methodology, refer to the original paper.
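Since the page does not publish its grading script, the following is a minimal sketch of how a 0-100 score of this kind could be computed, assuming exact-match grading of AIME's integer answers. The function name `aime_score` and the normalization step are illustrative assumptions, not the benchmark's actual harness.

```python
# Hypothetical sketch of a 0-100 AIME scorer, assuming exact-match
# grading of integer answers (AIME answers are integers from 0 to 999).
# Names and normalization are illustrative, not the benchmark's code.

def aime_score(predictions: list[str], answers: list[str]) -> float:
    """Return the percentage of problems answered correctly (0-100)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(
        # Normalize both sides to integers so "042" and "42" compare equal;
        # non-numeric predictions are skipped and thus count as incorrect.
        int(pred) == int(ans)
        for pred, ans in zip(predictions, answers)
        if pred.strip().isdigit()
    )
    return 100.0 * correct / len(answers)

# Example: 14 of 15 problems correct -> 93.3, matching the scale
# of the leaderboard scores above.
print(aime_score(["042"] * 15, ["42"] * 14 + ["7"]))
```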
Publication
This benchmark was published in 2025.