HellaSwag

Category: Common sense (pending verification)

Tests commonsense natural language inference: given a short scenario, the model must pick the most plausible continuation from four candidate endings (an illustrative item is sketched below).

Published: 2019
Score Range: 0-100
Top Score: 95.4
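
For orientation, each HellaSwag item pairs a context with four candidate endings, exactly one of which is correct. The sketch below is illustrative only: the example text is invented, and the field names (ctx, endings, label) mirror the commonly distributed release of the dataset but are not quoted from it.

```python
# Illustrative only: a made-up item in the shape HellaSwag uses.
# Field names (ctx, endings, label) follow the commonly distributed
# release of the dataset; the text itself is invented for this sketch.
item = {
    "ctx": "A man pours batter into a waffle iron and closes the lid. He",
    "endings": [
        "waits a moment, then lifts the lid to check the waffle.",
        "throws the waffle iron into the sink and walks away.",
        "begins juggling the waffle iron with both hands.",
        "paints the waffle iron bright blue before eating it.",
    ],
    "label": 0,  # index of the correct (most plausible) ending
}

# A model is scored on whether its chosen ending index matches `label`.
predicted_index = 0
print("correct" if predicted_index == item["label"] else "incorrect")
```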

HellaSwag Leaderboard

| Rank | Model | Provider | Score | Parameters | Released | Type |
|------|-------|----------|-------|------------|----------|------|
| 1 | Claude 3 Opus | Anthropic | 95.4 | N/A | 2024-03-04 | Multimodal |
| 2 | Claude 3 Sonnet | Anthropic | 89 | N/A | 2024-03-04 | Multimodal |
| 3 | DeepSeek-V3 | DeepSeek | 88.9 | 671B total, 37B activated | 2024-12-26 | Text |
| 4 | Mixtral 8×22B | Mistral AI | 88 | 141B (39B active) | 2024-04-17 | Text |
| 5 | Claude 3 Haiku | Anthropic | 85.9 | N/A | 2024-03-04 | Multimodal |
| 6 | DeepSeek LLM | DeepSeek | 84 | 67B | 2023-11-01 | Text |
| 7 | Gemma 3 | Google | 82.1 | 1B, 4B, 12B, 27B | 2025-03-12 | Multimodal |

About HellaSwag

Methodology

HellaSwag is scored as multiple-choice accuracy: for each item the model must select the correct ending from four candidates, and the reported score is the percentage of items answered correctly, on a scale of 0 to 100 where higher is better. For full details of the dataset construction and evaluation protocol, refer to the original paper.
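
As a concrete illustration, accuracy on a multiple-choice benchmark like HellaSwag is simply the fraction of items where the model's chosen ending index matches the labeled one, scaled to 0-100. The helper below is a minimal sketch of that scoring rule, not the official evaluation harness; the function name and data layout are assumptions for this example.

```python
from typing import Sequence


def hellaswag_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return accuracy on a 0-100 scale for multiple-choice predictions.

    predictions[i] and labels[i] are ending indices (0-3) for item i.
    This is a minimal sketch of the scoring rule, not the official harness.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == l for p, l in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Example: 3 of 4 items answered correctly -> 75.0
print(hellaswag_accuracy([0, 2, 1, 3], [0, 2, 1, 0]))
```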

Publication

This benchmark was published in 2019. Technical paper: Zellers et al., "HellaSwag: Can a Machine Really Finish Your Sentence?", ACL 2019.