A benchmark of simple but precise questions to test factual knowledge and reasoning.

SimpleQA

Gemini 2.5 Pro reasons through its thoughts before responding, which improves performance and accuracy. It features Deep Think, an enhanced reasoning mode, and native audio output that captures subtle nuances of speech.

Gemini 2.5 Pro

Google's best model yet for coding performance and complex prompts, with stronger understanding of, and reasoning about, world knowledge than any previous release. It features a 2-million-token context window and can call tools such as Google Search and code execution.

Gemini 2.0 Pro

xAI's most advanced model yet, blending superior reasoning with extensive pretraining knowledge. Trained on the Colossus supercluster with 10x the compute of previous state-of-the-art models. Features test-time compute and reasoning capabilities through reinforcement learning, allowing it to think for seconds to minutes while correcting errors and exploring alternatives. Achieved an Elo score of 1402 in the Chatbot Arena.

Grok 3

Moonshot's latest Mixture-of-Experts model with 32 billion activated parameters and 1 trillion total parameters. Achieves state-of-the-art performance in frontier knowledge, math, and coding among non-thinking models. Meticulously optimized for agentic tasks with sophisticated tool-use capabilities and multi-turn interactions.

Kimi K2

DeepSeek-R1 is a first-generation reasoning model trained via large-scale reinforcement learning. Built on DeepSeek-V3-Base, it incorporates cold-start data before RL to address challenges like endless repetition and poor readability found in DeepSeek-R1-Zero. Achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks through advanced chain-of-thought reasoning capabilities.

DeepSeek-R1

Next iteration of Gemini, released in three versions (Flash, Flash-Lite, Pro). Represents Google's state-of-the-art multimodal model, likely incorporating Mixture-of-Experts at unprecedented scale and targeting dominance in both text and image understanding.

Gemini 2.0 Flash

Improved across key benchmarks for reasoning, multimodality, code and long context while getting even more efficient. Best for fast performance on complex tasks.

Gemini 2.5 Flash

A powerful Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. It uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, with an auxiliary-loss-free load-balancing strategy and a multi-token prediction training objective. Pre-trained on 14.8T high-quality tokens; the full training run required only 2.788M H800 GPU hours.

DeepSeek-V3

Best for cost-efficient performance. Better quality than 1.5 Flash, at the same speed and cost. Features a 1 million token context window and multimodal input.

Gemini 2.0 Flash-Lite

A smaller, more efficient version of Grok 3 from xAI. Represents a new frontier in cost-efficient reasoning, particularly strong on STEM tasks that don't require as much world knowledge. Also features test-time compute and reasoning capabilities through the Grok 3 mini (Think) variant, achieving impressive performance on mathematical and coding benchmarks while being more resource-efficient than the full Grok 3 model.

Grok 3 Mini

Rank	Model	Provider	Score	Parameters	Released	Type
1	Gemini 2.5 Pro	Google	50.8		2025-05-06	Multimodal
2	Gemini 2.0 Pro	Google	44.3		2025-02-05	Multimodal
3	Grok 3	xAI	43.6	Unknown (multi-trillion estimated)	2025-02-19	Multimodal
4	Kimi K2	Moonshot AI	31	1T total, 32B activated	2025-07-11	Text
5	DeepSeek-R1	DeepSeek	30.1	671B (37B activated)	2025-01-20	Text
6	Gemini 2.0 Flash	Google	29.9		2025-02-25	Multimodal
7	Gemini 2.5 Flash	Google	26.9		2025-05-20	Multimodal
8	DeepSeek-V3	DeepSeek	24.9	671B total, 37B activated	2024-12-26	Text
9	Gemini 2.0 Flash-Lite	Google	21.7		2025-02-25	Multimodal
10	Grok 3 Mini	xAI	21.7	Unknown	2025-02-19	Multimodal
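
The Parameters column lists both total and activated counts for the Mixture-of-Experts entries, so the share of weights active per token can be worked out directly. The following is a minimal sketch of that arithmetic, not part of the leaderboard itself: the figures are copied from the table, "1T" is assumed to mean 1000 billion, and the names `MOE_MODELS`, `activated_fraction`, and `summarize` are hypothetical helpers introduced only for illustration.

```python
# Parameter counts in billions, copied from the table's Parameters column.
# Assumption: "1T total" is treated as 1000 billion.
MOE_MODELS = {
    "Kimi K2": (1000.0, 32.0),      # (total, activated per token)
    "DeepSeek-R1": (671.0, 37.0),
    "DeepSeek-V3": (671.0, 37.0),
}


def activated_fraction(total_b: float, activated_b: float) -> float:
    """Return the fraction of parameters activated for each token."""
    if total_b <= 0 or not 0 <= activated_b <= total_b:
        raise ValueError("expected 0 <= activated <= total and total > 0")
    return activated_b / total_b


def summarize(models: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Map each model name to its activated share, as a percentage (1 decimal)."""
    return {
        name: round(100 * activated_fraction(total, active), 1)
        for name, (total, active) in models.items()
    }


if __name__ == "__main__":
    for name, pct in summarize(MOE_MODELS).items():
        print(f"{name}: {pct}% of parameters active per token")
```

Under these assumptions, Kimi K2 activates about 3.2% of its parameters per token and the two DeepSeek models about 5.5%. The Google and xAI rows are left out because the table gives no parameter counts for them.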

Related Benchmarks

Global-MMLU-Lite

Global-MMLU

MMLU-Pro