Discrete Reasoning Over Paragraphs (DROP) requires models to resolve references in a passage and perform discrete operations.

DROP

DeepSeek-R1 is a first-generation reasoning model trained via large-scale reinforcement learning. Built on DeepSeek-V3-Base, it incorporates cold-start data before RL to address challenges like endless repetition and poor readability found in DeepSeek-R1-Zero. Achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks through advanced chain-of-thought reasoning capabilities.

DeepSeek-R1

A powerful Mixture-of-Experts (MoE) language model with 671B total parameters and 37B activated for each token. Features Multi-head Latent Attention (MLA) and DeepSeekMoE architectures with innovative auxiliary-loss-free load balancing and multi-token prediction training. Pre-trained on 14.8T high-quality tokens with only 2.788M H800 GPU hours.

DeepSeek-V3

Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations, with the speed and cost of a mid-tier model. It operates at twice the speed of Claude 3 Opus.

Claude 3.5 Sonnet

GPT-4o ('o' for 'omni') is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.

GPT-4o

Anthropics fastest model at time of launch, delivering intelligence at blazing speeds. Claude 3.5 Haiku is optimized for quick responses while maintaining high quality.

Claude 3.5 Haiku

Anthropic's most intelligent model at release, with best-in-market performance on highly complex tasks. It can navigate open-ended prompts with remarkable fluency and human-like understanding. Pricing at launch: $15 per million input tokens, $75 per million output tokens.

Claude 3 Opus

Anthropic's balanced model that strikes an ideal balance between intelligence and speed for enterprise workloads. Delivers strong performance at a lower cost compared to its peers. Pricing at launch: $3 per million input tokens, $15 per million output tokens.

Claude 3 Sonnet

Anthropic's fastest and most compact model at the time of launch with near-instant responsiveness. Answers simple queries and requests with unmatched speed while maintaining good quality. Pricing at launch: $0.25 per million input tokens, $1.25 per million output tokens.

Claude 3 Haiku

Gemma 3 is a collection of lightweight, state-of-the-art open models built from the same research and technology that powers Gemini 2.0 models. The most capable model you can run on a single GPU or TPU, delivering state-of-the-art performance for its size and outperforming Llama3-405B, DeepSeek-V3 and o3-mini in preliminary human preference evaluations on LMArena's leaderboard. Features advanced text and visual reasoning capabilities, 128K token context window, function calling, and support for over 140 languages.

Rank	Model	Provider	Score	Parameters	Released	Type
1	DeepSeek-R1	DeepSeek	92.2	671B (37B activated)	2025-01-20	Text
2	DeepSeek-V3	DeepSeek	91.6	671B total, 37B activated	2024-12-26	Text
3	Claude 3.5 Sonnet	Anthropic	87.1		2024-06-20	Multimodal
4	GPT-4o	OpenAI	83.4		2024-05-13	Multimodal
5	Claude 3.5 Haiku	Anthropic	83.1		2024-10-22	Multimodal
6	Claude 3 Opus	Anthropic	83.1		2024-03-04	Multimodal
7	Claude 3 Sonnet	Anthropic	78.9		2024-03-04	Multimodal
8	Claude 3 Haiku	Anthropic	78.4		2024-03-04	Multimodal
9	Gemma 3	Google	71.2	1B, 4B, 12B, 27B	2025-03-12	Multimodal

DROP

DROP Leaderboard

About DROP

Methodology

Publication

Related Benchmarks

AGIEval (English)

ARC

GPQA