HumanEval

Category: Coding

Evaluates code generation capabilities by asking models to complete Python functions based on docstrings and function signatures.

Published: 2021
Scale: 0-100
Top Score: 92.4

HumanEval Leaderboard

Rank | Model | Provider | Score | Parameters | Released | Type
---- | ----- | -------- | ----- | ---------- | -------- | ----
1 | o1-mini | OpenAI | 92.4 | - | 2024-09-12 | Text
2 | o1-preview | OpenAI | 92.4 | - | 2024-09-12 | Text
3 | Claude 3.5 Sonnet | Anthropic | 92.0 | - | 2024-06-20 | Multimodal
4 | GPT-4o | OpenAI | 90.2 | - | 2024-05-13 | Multimodal
5 | Gemini Diffusion | Google | 89.6 | - | 2025-05-20 | Text
6 | Claude 3.5 Haiku | Anthropic | 88.1 | - | 2024-10-22 | Multimodal
7 | Claude 3 Opus | Anthropic | 84.9 | - | 2024-03-04 | Multimodal
8 | DeepSeek-V3 | DeepSeek | 82.6 | 671B total, 37B activated | 2024-12-26 | Text
9 | Claude 3 Haiku | Anthropic | 75.9 | - | 2024-03-04 | Multimodal
10 | Claude 3 Sonnet | Anthropic | 73.0 | - | 2024-03-04 | Multimodal

About HumanEval

Description

Evaluates code generation capabilities by asking models to complete Python functions based on docstrings and function signatures.
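To make the task format concrete, here is a minimal sketch of what a HumanEval-style problem looks like: the model is shown a function signature and docstring and must generate the body, which is then executed against unit tests. The function, docstring, and tests below are illustrative inventions, not items taken from the actual dataset.

```python
# Illustrative HumanEval-style task (hypothetical example, not from the dataset).
# The model receives the signature and docstring; everything after the docstring
# is the kind of completion the model is expected to generate.

def sum_of_squares(numbers: list[int]) -> int:
    """Return the sum of the squares of the given integers.
    >>> sum_of_squares([1, 2, 3])
    14
    """
    # --- a model-generated completion would be inserted here ---
    return sum(n * n for n in numbers)


# Evaluation executes the completed function against hidden unit tests like these;
# a problem counts as solved only if all assertions pass.
def check(candidate):
    assert candidate([1, 2, 3]) == 14
    assert candidate([]) == 0
    assert candidate([-2]) == 4


check(sum_of_squares)
```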

Methodology

HumanEval scores range from 0 to 100 and represent the percentage of problems for which the model produces a functionally correct completion, as verified by the benchmark's unit tests (typically reported as pass@1). Higher scores indicate better performance. For detailed information about the methodology, please refer to the original paper.
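The original paper measures functional correctness with the pass@k metric: for each problem, n samples are generated, c of them pass the unit tests, and pass@k estimates the probability that at least one of k randomly chosen samples is correct. The sketch below reproduces that unbiased estimator; the function name and the example numbers are illustrative choices, not values from this page.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimator of pass@k, following the HumanEval paper.

    n: total samples generated for the problem
    c: samples that pass the unit tests
    k: evaluation budget
    """
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    # Equals 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Example: 10 samples per problem, 4 of which pass the tests.
print(round(pass_at_k(n=10, c=4, k=1), 3))  # 0.4 (pass@1 reduces to c/n)
print(round(pass_at_k(n=10, c=4, k=5), 3))  # higher, since any of 5 tries may succeed
```

The benchmark-wide score is the average of this quantity over all problems, expressed as a percentage.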

Publication

This benchmark was published in 2021. Read the full paper: "Evaluating Large Language Models Trained on Code" (Chen et al., 2021).

Related Benchmarks