SWE-bench
Software Engineering Benchmark (SWE-bench) evaluates models on real-world software engineering tasks.
Published: 2023
Score Range: 0-100
Top Score: 72.7
SWE-bench Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4 | Anthropic | 72.7 | | 2025-05-22 | Multimodal |
| 2 | Claude Opus 4 | Anthropic | 72.5 | | 2025-05-22 | Multimodal |
| 3 | Claude 3.7 Sonnet | Anthropic | 70.3 | | 2025-02-24 | Multimodal |
| 4 | o3 | OpenAI | 69.1 | | 2025-04-16 | Multimodal |
| 5 | o4-mini | OpenAI | 68.1 | | 2025-04-16 | Multimodal |
| 6 | Gemini 2.5 Pro | Google | 63.2 | | 2025-05-06 | Multimodal |
| 7 | Gemini 2.5 Flash | Google | 60.4 | | 2025-05-20 | Multimodal |
| 8 | DeepSeek-R1 | DeepSeek | 49.2 | 671B (37B activated) | 2025-01-20 | Text |
| 9 | Gemini 2.5 Flash-Lite | Google | 44.9 | | 2025-06-17 | Multimodal |
| 10 | Claude 3.5 Haiku | Anthropic | 40.6 | | 2024-10-22 | Multimodal |
About SWE-bench
Methodology
SWE-bench presents models with real GitHub issues drawn from popular open-source repositories; a task instance counts as resolved when the model's generated patch applies cleanly and the repository's test suite passes. Scores are reported on a scale of 0 to 100, where higher scores indicate a larger share of resolved instances. For detailed information about the scoring system and methodology, please refer to the original paper.
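The scoring described above can be sketched in a few lines. This is a minimal illustration, assuming the reported score is simply the percentage of resolved task instances; the function name and the pass/fail list are hypothetical, not part of the official harness.

```python
def swe_bench_score(results):
    """Percentage of resolved instances (0-100).

    results: list of booleans, True if the model's patch
    resolved that task instance (tests passed).
    """
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# Hypothetical run: 727 of 1000 instances resolved
print(swe_bench_score([True] * 727 + [False] * 273))  # 72.7
```

Under this reading, a top score of 72.7 means roughly seven in ten benchmark issues were fixed end to end.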
Publication
This benchmark was published in 2023. Technical Paper