SWE-Lancer
A benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. Evaluates frontier LLMs on both individual contributor tasks and software engineering management decisions.
SWE-Lancer Leaderboard
About SWE-Lancer
Description
SWE-Lancer is a comprehensive benchmark that evaluates language models on real-world freelance software engineering tasks sourced from Upwork. The benchmark consists of 1,488 tasks collectively worth $1,000,000 USD in actual payouts, making it the first benchmark to map model performance directly to monetary value.
Benchmark Composition
Individual Contributor (IC) SWE Tasks: 764 tasks worth $414,775 total. These range from $50 bug fixes to $32,000 feature implementations and require models to generate code patches that resolve real-world issues. Tasks are evaluated using end-to-end tests, created by professional software engineers, that simulate real user workflows.
SWE Management Tasks: 724 tasks worth $585,225 total. Models act as technical leads, selecting the best implementation proposal from multiple freelancer submissions. Performance is assessed against the choices made by the original engineering managers.
Key Features
Real-world payouts: All tasks represent actual payments to freelance engineers, providing a natural, market-derived difficulty gradient.
Advanced full-stack engineering: Tasks involve whole-codebase context, mobile and web development, API interactions, and complex issue reproduction.
End-to-end testing: Uses Playwright browser automation to verify application behavior, making grading more resistant to grader hacking than unit tests (a hypothetical sketch of such a test appears at the end of this section).
Domain diversity: 74% of tasks involve application logic and 17% UI/UX development; 88% are bug fixes and 12% new features.
Unbiased data collection: Tasks are a representative sample from Upwork rather than a set filtered for easily testable problems.
Evaluation Results
The best-performing model, Claude 3.5 Sonnet, achieves a 26.2% success rate on IC SWE tasks and a 44.9% success rate on SWE Management tasks, earning $208,050 of the $500,800 possible on the public Diamond set and over $400,000 of the $1,000,000 possible on the full dataset.
Public Release
SWE-Lancer Diamond is the public evaluation split containing $500,800 worth of tasks (237 IC SWE tasks worth $236,300 and 265 SWE Manager tasks worth $264,500). The remaining tasks are held private to prevent contamination. The benchmark includes a unified Docker image and evaluation harness for reproducible results.
Significance
SWE-Lancer addresses limitations of existing coding benchmarks by focusing on commercially valuable, full-stack engineering work rather than isolated, self-contained tasks. It provides insight into the potential economic impact of AI automation in software engineering while highlighting that frontier models still struggle with the majority of real-world engineering challenges.
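The benchmark's actual end-to-end tests are written by professional engineers against the target application and are not shown here. The following is only a minimal, hypothetical sketch of what a Playwright user-flow check of a patched app might look like; the local URL, selectors, and the "duplicate search results" behavior are invented for illustration.

```python
# Hypothetical sketch of an end-to-end user-flow check in the style the
# benchmark describes (Playwright browser automation). The URL, selectors,
# and expected behavior below are illustrative, not the real test suite.
from playwright.sync_api import sync_playwright

def run_user_flow_check() -> bool:
    """Drive the patched application through a realistic user workflow."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("http://localhost:8080")    # hypothetical local deployment of the patched app
        page.fill("#search-input", "Alice")   # hypothetical selectors
        page.click("#search-button")
        page.wait_for_selector(".result-row")
        # The (invented) bug being fixed: the same contact should no longer
        # appear twice in the search results.
        duplicate_free = page.locator(".result-row").count() == 1
        browser.close()
        return duplicate_free

if __name__ == "__main__":
    print("PASS" if run_user_flow_check() else "FAIL")
```

Because the check exercises the application through the browser rather than calling internal functions, a submission must actually fix the observable behavior to pass, which is what makes this style of grading harder to game than unit tests.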
Methodology
SWE-Lancer evaluates model performance using a standardized scoring methodology. Scores are reported as total payouts earned, on a scale of 0 to 1,000,000 USD (the combined value of all tasks), where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
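As a concrete illustration of that scale, the sketch below aggregates per-task results into total earnings, assuming each task carries its real-world payout and is scored all-or-nothing (the payout is earned only if the submission passes grading). The field names and example values are illustrative and are not the evaluation harness's actual schema.

```python
# Minimal sketch of payout-weighted scoring, under the assumption that a task's
# full payout is earned only when it is fully solved. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    price: float   # real-world payout in USD
    passed: bool   # True if all end-to-end tests passed (IC) or the manager's choice was matched (Manager)

def total_earnings(results: list[TaskResult]) -> float:
    """Sum the payouts of solved tasks; unsolved tasks earn nothing."""
    return sum(r.price for r in results if r.passed)

# Example: solving two of three hypothetical tasks earns $750 of $1,250 possible.
demo = [
    TaskResult("ic_001", 250.0, True),
    TaskResult("mgr_002", 500.0, True),
    TaskResult("ic_003", 500.0, False),
]
print(total_earnings(demo))  # 750.0
```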
Publication
This benchmark was published in 2025. Read the full paper.