HumanEval
coding
Evaluates code generation capabilities by asking models to complete Python functions based on docstrings and function signatures.
Published: 2021
Scale: 0-100
Top Score: 92.4
HumanEval Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | o1-mini | OpenAI | 92.4 | | 2024-09-12 | Text |
2 | o1-preview | OpenAI | 92.4 | | 2024-09-12 | Text |
3 | Claude 3.5 Sonnet | Anthropic | 92 | | 2024-06-20 | Multimodal |
4 | GPT-4o | OpenAI | 90.2 | | 2024-05-13 | Multimodal |
5 | Gemini Diffusion | Google DeepMind | 89.6 | | 2025-05-20 | Text |
6 | Claude 3.5 Haiku | Anthropic | 88.1 | | 2024-10-22 | Multimodal |
7 | Claude 3 Opus | Anthropic | 84.9 | | 2024-03-04 | Multimodal |
8 | DeepSeek-V3 | DeepSeek | 82.6 | 671B total, 37B activated | 2024-12-26 | Text |
9 | Claude 3 Haiku | Anthropic | 75.9 | | 2024-03-04 | Multimodal |
10 | Claude 3 Sonnet | Anthropic | 73 | | 2024-03-04 | Multimodal |
About HumanEval
Description
Evaluates code generation capabilities by asking models to complete Python functions based on docstrings and function signatures.
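Each problem is a Python function stub: the model sees the signature and docstring, generates the body, and the result is executed against hidden unit tests. The sketch below illustrates the format; it mirrors the style of a HumanEval task, but the test harness shown is a simplified stand-in, not the benchmark's actual execution code.

```python
from typing import List

# Prompt as the model sees it: a signature plus a docstring with examples.
# The body below is one completion a model might produce.
def below_zero(operations: List[int]) -> bool:
    """Given a list of deposits (positive) and withdrawals (negative) applied
    to an account starting at zero, return True if the balance ever falls
    below zero, otherwise False.
    >>> below_zero([1, 2, -4, 5])
    True
    >>> below_zero([1, 2, 3])
    False
    """
    balance = 0
    for op in operations:
        balance += op
        if balance < 0:
            return True
    return False


# Simplified stand-in for the hidden unit tests: the completion counts as a
# pass only if every assertion succeeds when the code is run.
def check(candidate) -> None:
    assert candidate([1, 2, -4, 5]) is True
    assert candidate([1, 2, 3]) is False
    assert candidate([]) is False


check(below_zero)  # raises AssertionError if the completion is wrong
```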
Methodology
HumanEval consists of 164 hand-written Python programming problems, each providing a function signature, a docstring, and hidden unit tests. A model's completion counts as correct only if it passes all of the tests, and leaderboard scores report the percentage of problems solved on a 0 to 100 scale (typically pass@1). Higher scores indicate better performance. For detailed information about the methodology, please refer to the original paper.
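The original paper estimates pass@k by drawing n samples per problem, counting the c samples that pass the tests, and averaging the unbiased estimator 1 − C(n−c, k) / C(n, k) over problems. A minimal sketch of that calculation follows; the function name and the example numbers are illustrative and not taken from this leaderboard.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k),
    computed as a numerically stable running product."""
    if n - c < k:
        return 1.0  # too few failing samples to fill a draw of size k
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: 200 samples for one problem, 52 of which pass the tests.
print(pass_at_k(n=200, c=52, k=1))   # 0.26, i.e. c / n
print(pass_at_k(n=200, c=52, k=10))  # higher, since any of 10 samples may pass
```

A benchmark-level score is the mean of this estimate over all 164 problems; with a single sample per problem (n = 1, k = 1) it reduces to the plain fraction of problems solved.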
Publication
This benchmark was published in 2021 in "Evaluating Large Language Models Trained on Code" (Chen et al., 2021), the paper that introduced OpenAI's Codex. Read the full paper