A dataset of 12,500 challenging competition mathematics problems requiring multi-step reasoning.

MATH

Moonshot's latest Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters. Achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. Meticulously optimized for agentic tasks with sophisticated tool-use capabilities and multi-turn interactions.

Kimi K2

Anthropic's most intelligent model to date and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. API users have fine-grained control over how long the model can think for (up to 128K tokens). Priced at $3 per million input tokens and $15 per million output tokens at launch, including thinking tokens.

Claude 3.7 Sonnet
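
The launch pricing quoted above reduces to simple arithmetic. The following is a minimal sketch, assuming thinking tokens are billed at the output rate (as the "including thinking tokens" wording suggests); the function name `estimate_cost_usd` and its parameters are hypothetical and are not part of any official SDK.

```python
# Launch prices quoted in the description above, in US dollars per million tokens.
INPUT_USD_PER_MTOK = 3.0
OUTPUT_USD_PER_MTOK = 15.0


def estimate_cost_usd(input_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimate the cost of one request in US dollars.

    Assumption: thinking tokens are charged at the output-token rate.
    """
    if min(input_tokens, output_tokens, thinking_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    billed_output = output_tokens + thinking_tokens
    total = input_tokens * INPUT_USD_PER_MTOK + billed_output * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # 2,000 prompt tokens, 500 visible answer tokens, 1,500 thinking tokens.
    cost = estimate_cost_usd(2_000, 500, 1_500)
    print(f"${cost:.3f}")  # prints "$0.036"
```

Under this assumption, extended thinking dominates the cost of short prompts, since each thinking token costs five times as much as an input token.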

The o1 model series is trained with large-scale reinforcement learning to reason using chain-of-thought. OpenAI o1 is the next model in this series (previously OpenAI o1-preview), with advanced reasoning capabilities that provide new avenues for improving safety and robustness. The model can reason about safety policies in context when responding to potentially unsafe prompts, through deliberative alignment.

o1

Google's best model yet for coding performance and complex prompts, with better understanding of, and reasoning about, world knowledge than any previous release. Features a 2 million token context window and the ability to call tools like Google Search and code execution.

Gemini 2.0 Pro

Next iteration of Gemini, released in three versions (Flash, Flash-Lite, Pro). Represents Google's state-of-the-art multimodal model, likely incorporating Mixture-of-Experts at unprecedented scale and targeting dominance in both text and image understanding.

Gemini 2.0 Flash

A powerful Mixture-of-Experts (MoE) language model with 671B total parameters and 37B activated for each token. Features Multi-head Latent Attention (MLA) and DeepSeekMoE architectures with innovative auxiliary-loss-free load balancing and multi-token prediction training. Pre-trained on 14.8T high-quality tokens with only 2.788M H800 GPU hours.

DeepSeek-V3

Best for cost-efficient performance. Better quality than 1.5 Flash, at the same speed and cost. Features a 1 million token context window and multimodal input.

Gemini 2.0 Flash-Lite

Nemotron 3 Nano is a 31.6B parameter hybrid large language model developed by NVIDIA, released on December 15, 2025. It employs a novel Hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture, activating only ~3.2B parameters per token for efficient inference. Optimized for agentic AI, reasoning, and coding, it supports a 1 million token context window and features a 'Reasoning ON/OFF' toggle with a configurable thinking budget. NVIDIA has released the model with open weights, training data, and training recipes under the NVIDIA Open Model License.

Nemotron 3 Nano

GPT-4o ('o' for 'omni') is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.

GPT-4o

xAI's Grok-2 is a significant step forward from Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning. It outperformed Claude 3.5 Sonnet and GPT-4-Turbo on the LMSYS leaderboard and shows significant improvements in reasoning with retrieved content and tool use capabilities.

Grok-2

Rank	Model	Provider	Score	Parameters	Released	Type
1	Kimi K2	Moonshot AI	97.4	1T total, 32B activated	2025-07-11	Text
2	Claude 3.7 Sonnet	Anthropic	96.2		2025-02-24	Multimodal
3	o1	OpenAI	94.8		2024-09-12	Multimodal
4	Gemini 2.0 Pro	Google	91.8		2025-02-05	Multimodal
5	Gemini 2.0 Flash	Google	90.9		2025-02-25	Multimodal
6	DeepSeek-V3	DeepSeek	90.2	671B total, 37B activated	2024-12-26	Text
7	Gemini 2.0 Flash-Lite	Google	86.8		2025-02-25	Multimodal
8	Nemotron 3 Nano	NVIDIA	82.88	31.6B total, ~3.2B activated	2025-12-15	Text
9	GPT-4o	OpenAI	76.6		2024-05-13	Multimodal
10	Grok-2	xAI	76.1	Unknown	2024-08-13	Multimodal

Related Benchmarks

AIME-2024

AIME-2025

AIME