HellaSwag

common sense

Tests common sense natural language inference through completion of scenarios.

Published: 2019
Scale: 0-100
Top Score: 95.4

HellaSwag Leaderboard

Rank  Model            Provider   Score  Parameters                 Released    Type
1     Claude 3 Opus    Anthropic  95.4   -                          2024-03-04  Multimodal
2     Claude 3 Sonnet  Anthropic  89     -                          2024-03-04  Multimodal
3     DeepSeek-V3      DeepSeek   88.9   671B total, 37B activated  2024-12-26  Text
4     Claude 3 Haiku   Anthropic  85.9   -                          2024-03-04  Multimodal
5     DeepSeek LLM     DeepSeek   84     67B                        2023-11-01  Text
6     Gemma 3          Google     82.1   1B, 4B, 12B, 27B           2025-03-12  Multimodal

About HellaSwag

Description

Tests commonsense natural language inference by having models pick the most plausible completion of a scenario.

Methodology

HellaSwag scores are reported on a scale of 0 to 100, where higher is better. The score is accuracy: the percentage of scenarios for which the model selects the correct ending from four candidates. For the full methodology, refer to the original paper.
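
As a rough illustration of how a 0 to 100 score of this kind can be computed, the sketch below assumes a multiple-choice setup in which each item has several candidate endings with one gold label, and the model assigns a plausibility score to each ending. The function names `pick_ending` and `accuracy_0_100` and the toy numbers are hypothetical and are not taken from the benchmark's own tooling. The original paper defines the actual evaluation protocol.

```python
from typing import Sequence


def pick_ending(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate ending."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy_0_100(
    all_scores: Sequence[Sequence[float]],
    gold_labels: Sequence[int],
) -> float:
    """Percentage of items where the top-scoring ending matches the gold label."""
    if len(all_scores) != len(gold_labels):
        raise ValueError("need exactly one gold label per item")
    if not gold_labels:
        raise ValueError("need at least one item")
    correct = sum(
        pick_ending(scores) == gold
        for scores, gold in zip(all_scores, gold_labels)
    )
    return 100.0 * correct / len(gold_labels)


# Toy data (assumed for illustration): per-ending scores such as
# log-likelihoods, four candidate endings per item.
toy_scores = [
    [-4.1, -2.3, -5.0, -3.7],  # model picks ending 1
    [-1.2, -3.3, -2.8, -4.0],  # model picks ending 0
    [-6.0, -5.5, -2.1, -2.9],  # model picks ending 2
    [-3.0, -3.5, -4.2, -1.8],  # model picks ending 3
]
toy_gold = [1, 0, 3, 3]  # the third item is answered incorrectly

print(accuracy_0_100(toy_scores, toy_gold))  # 75.0
```

With four endings per item, random guessing would land near 25 under this scheme, which gives a baseline for reading the leaderboard numbers.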

Publication

This benchmark was published in 2019.