HellaSwag
Tests commonsense natural language inference through the completion of everyday scenarios.
Published: 2019
Score Range: 0-100
Top Score: 95.4
HellaSwag Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type
---|---|---|---|---|---|---
1 | Claude 3 Opus | Anthropic | 95.4 | | 2024-03-04 | Multimodal
2 | Claude 3 Sonnet | Anthropic | 89.0 | | 2024-03-04 | Multimodal
3 | DeepSeek-V3 | DeepSeek | 88.9 | 671B total, 37B activated | 2024-12-26 | Text
4 | Mixtral 8×22B | Mistral AI | 88.0 | 141B (39B active) | 2024-04-17 | Text
5 | Claude 3 Haiku | Anthropic | 85.9 | | 2024-03-04 | Multimodal
6 | DeepSeek LLM | DeepSeek | 84.0 | 67B | 2023-11-01 | Text
7 | Gemma 3 | Google | 82.1 | 1B, 4B, 12B, 27B | 2025-03-12 | Multimodal
About HellaSwag
Methodology
HellaSwag is a multiple-choice benchmark: each item presents a context (drawn from ActivityNet captions and WikiHow articles) and four candidate endings, exactly one of which is correct. The incorrect endings are machine-generated and adversarially filtered so that they are hard for models to reject but easy for humans. Scores are reported as accuracy on a scale of 0 to 100, where higher scores indicate better performance. For full details of the dataset construction and scoring, refer to the original paper.
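As an illustration, here is a minimal sketch of this multiple-choice accuracy computation. The `hellaswag_accuracy` function, the `score_fn` callable, and the toy data are assumptions for demonstration only, not the official evaluation harness:

```python
# Sketch of HellaSwag-style scoring: the model assigns a plausibility
# score to each (context, ending) pair, and accuracy is the fraction of
# items where its highest-scoring ending matches the labeled answer.
from typing import Callable, Sequence


def hellaswag_accuracy(
    examples: Sequence[dict],
    score_fn: Callable[[str, str], float],
) -> float:
    """Return accuracy on a 0-100 scale."""
    correct = 0
    for ex in examples:
        # Each item: one context, four candidate endings, one correct label.
        scores = [score_fn(ex["context"], ending) for ending in ex["endings"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == ex["label"])
    return 100.0 * correct / len(examples)


if __name__ == "__main__":
    data = [
        {
            "context": "A man pours pancake batter into a pan. He",
            "endings": [
                "flips the pancake when bubbles form.",
                "throws the pan out the window.",
                "reads a newspaper underwater.",
                "paints the pancake blue.",
            ],
            "label": 0,
        },
    ]
    # Toy scorer for illustration only; real evaluations typically use
    # the model's (length-normalized) log-likelihood of each ending.
    toy_scorer = lambda ctx, ending: -abs(len(ending) - 40)
    print(f"Accuracy: {hellaswag_accuracy(data, toy_scorer):.1f}")
```

In practice, the plausibility score is usually the language model's log-probability of the ending given the context, often normalized by ending length so longer completions are not penalized.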
Publication
This benchmark was published in 2019. Technical paper: "HellaSwag: Can a Machine Really Finish Your Sentence?" (Zellers et al., ACL 2019).