HumanEval

Category: Coding · Status: Pending Verification

Evaluates code generation capabilities by asking models to complete Python functions based on docstrings and function signatures.

Published: 2021
Score Range: 0-100
Top Score: 92.4
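
As the description above notes, each HumanEval task supplies a Python function signature and docstring, and the model must generate the function body; a completion is judged correct only if it passes the task's unit tests. The sketch below illustrates that flow with an invented, HumanEval-style problem: the prompt, completion, and check function mirror the dataset's general structure, but none of them are taken from the dataset itself.

```python
# Illustrative, HumanEval-style task; the problem and tests are invented,
# not taken verbatim from the dataset.

# The model sees only the prompt: a function signature plus docstring.
PROMPT = '''
def count_vowels(s: str) -> int:
    """Return the number of vowels (a, e, i, o, u, case-insensitive) in s.

    >>> count_vowels("HumanEval")
    4
    """
'''

# A hypothetical model completion: just the function body.
COMPLETION = '''
    return sum(1 for ch in s.lower() if ch in "aeiou")
'''

def check(candidate):
    # Hidden unit tests; the completion counts as correct only if all pass.
    assert candidate("HumanEval") == 4
    assert candidate("xyz") == 0
    assert candidate("AEIOU") == 5

# Evaluation: execute prompt + completion, then run the tests on the result.
namespace = {}
exec(PROMPT + COMPLETION, namespace)
check(namespace["count_vowels"])
print("All unit tests passed; this completion would count as correct.")
```

In the benchmark, this per-problem pass/fail outcome is what gets aggregated into the reported score (see Methodology below).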

HumanEval Leaderboard

| Rank | Model             | Provider    | Score | Parameters               | Released   | Type       |
|------|-------------------|-------------|-------|--------------------------|------------|------------|
| 1    | o1-mini           | OpenAI      | 92.4  |                          | 2024-09-12 | Text       |
| 2    | o1-preview        | OpenAI      | 92.4  |                          | 2024-09-12 | Text       |
| 3    | Claude 3.5 Sonnet | Anthropic   | 92.0  |                          | 2024-06-20 | Multimodal |
| 4    | GPT-4o            | OpenAI      | 90.2  |                          | 2024-05-13 | Multimodal |
| 5    | Gemini Diffusion  | Google      | 89.6  |                          | 2025-05-20 | Text       |
| 6    | Grok-2            | xAI         | 88.4  | Unknown                  | 2024-08-13 | Multimodal |
| 7    | Claude 3.5 Haiku  | Anthropic   | 88.1  |                          | 2024-10-22 | Multimodal |
| 8    | Kimi K2           | Moonshot AI | 85.7  | 1T total, 32B activated  | 2025-07-11 | Text       |
| 9    | Grok-2 mini       | xAI         | 85.7  | Unknown                  | 2024-08-13 | Multimodal |
| 10   | Claude 3 Opus     | Anthropic   | 84.9  |                          | 2024-03-04 | Multimodal |

About HumanEval

Methodology

HumanEval consists of 164 hand-written Python programming problems. For each problem the model receives a function signature and docstring and must generate the function body; a completion is counted as correct only if it passes the problem's unit tests. Results are reported as pass@k (most commonly pass@1), expressed here on a 0 to 100 scale, where higher scores indicate better performance. For full details of the scoring methodology, please refer to the original paper.
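
The original paper estimates pass@k with an unbiased estimator: for each problem, generate n >= k samples, count the number c that pass the unit tests, and compute 1 - C(n-c, k) / C(n, k), then average over problems. Below is a minimal sketch of the numerically stable form of that estimator; the function name and the example numbers are illustrative, not taken from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate of pass@k.

    n: total samples generated for the problem.
    c: samples that passed the unit tests.
    Returns 1 - C(n-c, k) / C(n, k), computed as a stable running product.
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 200 samples for one problem, 42 of which pass.
print(pass_at_k(200, 42, 1))   # ~0.21
print(pass_at_k(200, 42, 10))  # ~0.91
```

The benchmark-level result is the mean of these per-problem estimates, multiplied by 100 when quoted on the 0 to 100 scale used in the leaderboard above.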

Publication

This benchmark was published in 2021 alongside its technical paper, "Evaluating Large Language Models Trained on Code" (Chen et al., 2021).

Related Benchmarks