SWE-bench

Tags: Coding · Pending Verification

Software Engineering Benchmark (SWE-bench) evaluates models on real-world software engineering tasks.

Published: 2023
Score Range: 0-100
Top Score: 72.7

SWE-bench Leaderboard

| Rank | Model                 | Provider  | Score | Parameters           | Released   | Type       |
|------|-----------------------|-----------|-------|----------------------|------------|------------|
| 1    | Claude Sonnet 4       | Anthropic | 72.7  |                      | 2025-05-22 | Multimodal |
| 2    | Claude Opus 4         | Anthropic | 72.5  |                      | 2025-05-22 | Multimodal |
| 3    | Claude 3.7 Sonnet     | Anthropic | 70.3  |                      | 2025-02-24 | Multimodal |
| 4    | o3                    | OpenAI    | 69.1  |                      | 2025-04-16 | Multimodal |
| 5    | o4-mini               | OpenAI    | 68.1  |                      | 2025-04-16 | Multimodal |
| 6    | Gemini 2.5 Pro        | Google    | 63.2  |                      | 2025-05-06 | Multimodal |
| 7    | Gemini 2.5 Flash      | Google    | 60.4  |                      | 2025-05-20 | Multimodal |
| 8    | DeepSeek-R1           | DeepSeek  | 49.2  | 671B (37B activated) | 2025-01-20 | Text       |
| 9    | Gemini 2.5 Flash-Lite | Google    | 44.9  |                      | 2025-06-17 | Multimodal |
| 10   | Claude 3.5 Haiku      | Anthropic | 40.6  |                      | 2024-10-22 | Multimodal |

About SWE-bench

Methodology

SWE-bench draws its tasks from real GitHub issues in open-source Python repositories: given an issue description and the repository codebase, the model must produce a patch that resolves the issue. An instance counts as resolved only if the patch makes the issue's previously failing tests pass while keeping the repository's existing tests passing. The reported score is the percentage of instances resolved, on a scale of 0 to 100, where higher scores indicate better performance. For full details of the evaluation harness, please refer to the original paper.
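As a minimal sketch of this scoring rule, the snippet below computes a leaderboard-style percentage from per-instance test outcomes. The dictionary field names (`fail_to_pass`, `pass_to_pass`) are illustrative, not the harness's actual data format.

```python
def resolved(instance):
    """An instance counts as resolved only if every previously failing test
    now passes AND every previously passing test still passes."""
    return all(instance["fail_to_pass"]) and all(instance["pass_to_pass"])

def swe_bench_score(instances):
    """Score = percentage of resolved instances, on a 0-100 scale."""
    if not instances:
        return 0.0
    return 100.0 * sum(resolved(i) for i in instances) / len(instances)

# Example: 2 of 3 instances resolved -> score of 66.7
results = [
    {"fail_to_pass": [True, True], "pass_to_pass": [True]},
    {"fail_to_pass": [True], "pass_to_pass": [True, True]},
    {"fail_to_pass": [False], "pass_to_pass": [True]},  # patch did not fix the issue
]
print(round(swe_bench_score(results), 1))  # 66.7
```

Note that a patch which fixes the issue but breaks an unrelated existing test still scores zero for that instance; both conditions must hold.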

Publication

This benchmark was published in 2023.

Technical Paper
