SWE-bench
Software Engineering Benchmark (SWE-bench) evaluates models on real-world software engineering tasks: resolving actual GitHub issues from popular open-source Python repositories.
Published: 2023
Score Range: 0-100
Top Score: 76.2
SWE-bench Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | Google | 76.2 | Proprietary | 2025-11-18 | Multimodal |
| 2 | Claude Sonnet 4 | Anthropic | 72.7 | Proprietary | 2025-05-22 | Multimodal |
| 3 | Claude Opus 4 | Anthropic | 72.5 | Proprietary | 2025-05-22 | Multimodal |
| 4 | Claude 3.7 Sonnet | Anthropic | 70.3 | Proprietary | 2025-02-24 | Multimodal |
| 5 | o3 | OpenAI | 69.1 | Proprietary | 2025-04-16 | Multimodal |
| 6 | o4-mini | OpenAI | 68.1 | Proprietary | 2025-04-16 | Multimodal |
| 7 | Gemini 2.5 Pro | Google | 63.2 | Proprietary | 2025-05-06 | Multimodal |
| 8 | Gemini 2.5 Flash | Google | 60.4 | Proprietary | 2025-05-20 | Multimodal |
| 9 | DeepSeek-R1 | DeepSeek | 49.2 | 671B (37B activated) | 2025-01-20 | Text |
| 10 | Gemini 2.5 Flash-Lite | Google | 44.9 | Proprietary | 2025-06-17 | Multimodal |
About SWE-bench
Methodology
SWE-bench reports the percentage of tasks a model resolves, on a scale of 0 to 100, where higher scores indicate better performance. Each task pairs a real GitHub issue with the repository snapshot in which it was filed; the model must generate a patch, and the task counts as resolved only if the patch applies and the repository's tests pass afterward, including the tests that originally reproduced the issue. For detailed information about the scoring system and methodology, refer to the original paper.
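To make the scoring rule concrete, here is a minimal Python sketch of how a percent-resolved score could be computed from per-task outcomes. It is illustrative only: the `TaskResult` fields and the instance IDs are assumptions, not the official harness API, and the real evaluation additionally builds each repository's environment and executes its test suite to produce these outcomes.

```python
"""Minimal sketch of SWE-bench-style scoring (illustrative, not the official harness)."""
from dataclasses import dataclass

@dataclass
class TaskResult:
    instance_id: str                # e.g. "django__django-2" (hypothetical ID)
    patch_applied: bool             # did the model's diff apply cleanly?
    tests_passed: dict[str, bool]   # outcomes of the required tests

def is_resolved(r: TaskResult) -> bool:
    """A task is resolved only if the patch applied and every required test passed."""
    return r.patch_applied and all(r.tests_passed.values())

def score(results: list[TaskResult]) -> float:
    """Leaderboard score: percentage of tasks resolved, on a 0-100 scale."""
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)

# Example: 2 of 3 tasks resolved -> 66.7
demo = [
    TaskResult("astropy__astropy-1", True, {"test_a": True, "test_b": True}),
    TaskResult("django__django-2", True, {"test_c": False}),
    TaskResult("flask__flask-3", True, {"test_d": True}),
]
print(f"{score(demo):.1f}")  # 66.7
```

Under this rule a partially correct patch earns no credit: a single failing required test marks the whole task unresolved, which is why leaderboard scores remain well below 100 even for top models.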
Publication
This benchmark was published in 2023. Technical paper: "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (Jimenez et al., 2023).