HellaSwag
Category: common sense
Tests commonsense natural language inference by asking models to pick the most plausible completion of a given scenario.
Published: 2019
Scale: 0-100
Top Score: 95.4
HellaSwag Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Claude 3 Opus | Anthropic | 95.4 | — | 2024-03-04 | Multimodal |
| 2 | Claude 3 Sonnet | Anthropic | 89.0 | — | 2024-03-04 | Multimodal |
| 3 | DeepSeek-V3 | DeepSeek | 88.9 | 671B total, 37B activated | 2024-12-26 | Text |
| 4 | Claude 3 Haiku | Anthropic | 85.9 | — | 2024-03-04 | Multimodal |
| 5 | DeepSeek LLM | DeepSeek | 84.0 | 67B | 2023-11-01 | Text |
| 6 | Gemma 3 | Google | 82.1 | 1B, 4B, 12B, 27B | 2025-03-12 | Multimodal |
About HellaSwag
Description
Tests commonsense natural language inference by asking models to pick the most plausible completion of a given scenario.
Methodology
HellaSwag scores models on a scale of 0 to 100 (accuracy as a percentage); higher scores indicate better performance. Each item pairs a short scenario with four candidate endings, and the model must select the human-written one; the three distractors are machine-generated and adversarially filtered so that they fool models while remaining easy for humans to reject. For detailed information about the methodology, please refer to the original paper.
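For concreteness, here is a minimal sketch of how a HellaSwag-style score can be computed. It assumes a hypothetical `log_likelihood(context, ending)` callable supplied by the model under test and ranks endings by length-normalized log-likelihood, one common zero-shot protocol; this is illustrative, not the official evaluation harness.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class HellaSwagItem:
    context: str        # scenario prefix, e.g. a caption or how-to step
    endings: list[str]  # four candidate continuations
    label: int          # index of the human-written ending

def pick_ending(item: HellaSwagItem,
                log_likelihood: Callable[[str, str], float]) -> int:
    # Score each candidate ending by the model's length-normalized
    # log-likelihood and take the argmax.
    scores = [log_likelihood(item.context, ending) / max(len(ending.split()), 1)
              for ending in item.endings]
    return max(range(len(scores)), key=scores.__getitem__)

def hellaswag_score(items: Iterable[HellaSwagItem],
                    log_likelihood: Callable[[str, str], float]) -> float:
    # The reported benchmark score is accuracy on a 0-100 scale.
    items = list(items)
    correct = sum(pick_ending(it, log_likelihood) == it.label for it in items)
    return 100.0 * correct / len(items)
```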
Publication
This benchmark was published in 2019 (Zellers et al., "HellaSwag: Can a Machine Really Finish Your Sentence?", ACL 2019). Read the full paper.