SWE-bench
coding
Software Engineering Benchmark (SWE-bench) evaluates models on real-world software engineering tasks: resolving actual GitHub issues from popular open-source repositories.
Published: 2023
Scale: 0-100
Top Score: 72.7
SWE-bench Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4 | Anthropic | 72.7 | – | 2025-05-22 | Multimodal |
| 2 | Claude Opus 4 | Anthropic | 72.5 | – | 2025-05-22 | Multimodal |
| 3 | Claude 3.7 Sonnet | Anthropic | 70.3 | – | 2025-02-24 | Multimodal |
| 4 | o3 | OpenAI | 69.1 | – | 2025-04-16 | Multimodal |
| 5 | o4-mini | OpenAI | 68.1 | – | 2025-04-16 | Multimodal |
| 6 | Gemini 2.5 Pro | Google | 63.2 | – | 2025-05-06 | Multimodal |
| 7 | Gemini 2.5 Flash | Google | 60.4 | – | 2025-05-20 | Multimodal |
| 8 | DeepSeek-R1 | DeepSeek | 49.2 | 671B (37B activated) | 2025-01-20 | Text |
| 9 | Claude 3.5 Haiku | Anthropic | 40.6 | – | 2024-10-22 | Multimodal |
| 10 | GPT-4.5 | OpenAI | 38.0 | – | 2025-02-27 | Multimodal |
About SWE-bench
Description
Software Engineering Benchmark (SWE-bench) evaluates models on real-world software engineering tasks: resolving actual GitHub issues from popular open-source repositories.
Methodology
SWE-bench reports a score on a scale of 0 to 100: the percentage of task instances the model resolves. Higher scores indicate better performance. For detailed information about the methodology, please refer to the original paper.
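The headline number is just a resolved-rate expressed as a percentage. A minimal sketch of that aggregation (the function name and the boolean-per-instance representation are illustrative; in the actual harness, an instance counts as resolved when the model's patch makes the repository's failing tests pass):

```python
def swe_bench_score(resolved: list[bool]) -> float:
    """Percentage of task instances resolved, on a 0-100 scale.

    Each entry is True if the model's patch resolved that instance
    (i.e., the relevant repository tests pass after applying it).
    """
    if not resolved:
        return 0.0
    return 100.0 * sum(resolved) / len(resolved)

# Example: resolving 8 of 11 instances yields a score of ~72.7.
score = swe_bench_score([True] * 8 + [False] * 3)
print(round(score, 1))
```

This is why scores across leaderboards are only comparable when the same instance subset (e.g., SWE-bench Verified vs. the full set) and the same agent scaffolding are used.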
Publication
This benchmark was published in 2023. Read the full paper for details.