SWE-Lancer
A benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. Evaluates frontier LLMs on both individual contributor tasks and software engineering management decisions.
SWE-Lancer Leaderboard
About SWE-Lancer
Description
SWE-Lancer is a comprehensive benchmark that evaluates language models on real-world freelance software engineering tasks sourced from Upwork. The benchmark consists of 1,488 tasks collectively worth $1,000,000 USD in actual payouts, making it the first benchmark to map model performance directly to monetary value.
Benchmark Composition
Individual Contributor (IC) SWE Tasks: 764 tasks worth $414,775 total. These range from $50 bug fixes to $32,000 feature implementations and require models to generate code patches that resolve real-world issues. Tasks are evaluated using end-to-end tests, created by professional software engineers, that simulate real user workflows.
SWE Management Tasks: 724 tasks worth $585,225 total. Models act as technical leads, selecting the best implementation proposal from multiple freelancer submissions. Performance is assessed against the choices made by the original engineering managers.
Key Features
Real-world payouts: All tasks represent actual payments to freelance engineers, providing a natural, market-derived difficulty gradient.
Advanced full-stack engineering: Tasks involve whole-codebase context, mobile and web development, API interactions, and complex issue reproduction.
End-to-end testing: Uses Playwright browser automation to verify application behavior, making grading more resistant to grader hacking than unit tests (a hypothetical sketch of such a test appears at the end of this section).
Domain diversity: 74% of tasks involve application logic and 17% UI/UX development; 88% are bug fixes and 12% new features.
Unbiased data collection: Tasks are a representative sample from Upwork rather than a set filtered for easily testable problems.
Evaluation Results
The best-performing model, Claude 3.5 Sonnet, achieves a 26.2% success rate on IC SWE tasks and a 44.9% success rate on SWE Management tasks, earning $208,050 of the $500,800 possible on the public Diamond set and over $400,000 of the $1,000,000 possible on the full dataset.
Public Release
SWE-Lancer Diamond is the public evaluation split containing $500,800 worth of tasks (237 IC SWE tasks worth $236,300 and 265 SWE Manager tasks worth $264,500). The remaining tasks are held private to prevent contamination. The benchmark includes a unified Docker image and evaluation harness for reproducible results.
Significance
SWE-Lancer addresses limitations of existing coding benchmarks by focusing on commercially valuable, full-stack engineering work rather than isolated, self-contained tasks. It provides insight into the potential economic impact of AI automation in software engineering while highlighting that frontier models still struggle with the majority of real-world engineering challenges.
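The benchmark's actual end-to-end tests are written by professional engineers against the target application and are not shown here. The following is only a minimal, hypothetical sketch of what a Playwright user-flow check of a patched app might look like; the local URL, selectors, and the "duplicate search results" behavior are invented for illustration.

```python
# Hypothetical sketch of an end-to-end user-flow check in the style the
# benchmark describes (Playwright browser automation). The URL, selectors,
# and expected behavior below are illustrative, not the real test suite.
from playwright.sync_api import sync_playwright

def run_user_flow_check() -> bool:
    """Drive the patched application through a realistic user workflow."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("http://localhost:8080")    # hypothetical local deployment of the patched app
        page.fill("#search-input", "Alice")   # hypothetical selectors
        page.click("#search-button")
        page.wait_for_selector(".result-row")
        # The (invented) bug being fixed: the same contact should no longer
        # appear twice in the search results.
        duplicate_free = page.locator(".result-row").count() == 1
        browser.close()
        return duplicate_free

if __name__ == "__main__":
    print("PASS" if run_user_flow_check() else "FAIL")
```

Because the check exercises the application through the browser rather than calling internal functions, a submission must actually fix the observable behavior to pass, which is what makes this style of grading harder to game than unit tests.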
Methodology
SWE-Lancer evaluates model performance using a standardized scoring methodology. Scores are reported as total payouts earned, on a scale of 0 to 1,000,000 USD (the combined value of all tasks), where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
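As a concrete illustration of that scale, the sketch below aggregates per-task results into total earnings, assuming each task carries its real-world payout and is scored all-or-nothing (the payout is earned only if the submission passes grading). The field names and example values are illustrative and are not the evaluation harness's actual schema.

```python
# Minimal sketch of payout-weighted scoring, under the assumption that a task's
# full payout is earned only when it is fully solved. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    price: float   # real-world payout in USD
    passed: bool   # True if all end-to-end tests passed (IC) or the manager's choice was matched (Manager)

def total_earnings(results: list[TaskResult]) -> float:
    """Sum the payouts of solved tasks; unsolved tasks earn nothing."""
    return sum(r.price for r in results if r.passed)

# Example: solving two of three hypothetical tasks earns $750 of $1,250 possible.
demo = [
    TaskResult("ic_001", 250.0, True),
    TaskResult("mgr_002", 500.0, True),
    TaskResult("ic_003", 500.0, False),
]
print(total_earnings(demo))  # 750.0
```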
Publication
This benchmark was published in 2025. Read the full paper.