CodeForces
Advanced competitive programming benchmark for evaluating large language models on algorithmic problem-solving and code generation.
CodeForces Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type
---|---|---|---|---|---|---
1 | o4-mini | OpenAI | 2,719 | Not disclosed | 2025-04-16 | Multimodal
2 | o3 | OpenAI | 2,706 | Not disclosed | 2025-04-16 | Multimodal
3 | GPT-OSS-120B | OpenAI | 2,622 | 117B total (5.1B active per token) | 2025-08-05 | Text
4 | GPT-OSS-20B | OpenAI | 2,516 | 21B total (3.6B active per token) | 2025-08-05 | Text
5 | Qwen-3 | Alibaba | 2,056 | 235B total (22B active) | 2025-04-29 | Text
6 | DeepSeek-R1 | DeepSeek | 2,029 | 671B total (37B active) | 2025-01-20 | Text
7 | o1 | OpenAI | 1,673 | Not disclosed | 2024-09-12 | Multimodal
8 | o1-mini | OpenAI | 1,650 | Not disclosed | 2024-09-12 | Text
9 | o1-preview | OpenAI | 1,258 | Not disclosed | 2024-09-12 | Text
10 | DeepSeek-V3 | DeepSeek | 51.6 | 671B total (37B active) | 2024-12-26 | Text
About CodeForces
Description
The CodeForces benchmark evaluates large language models on competitive programming and algorithmic problem-solving. It comprises 387 carefully curated CodeForces contest problems spanning divisions 1 through 4, testing a model's ability to understand complex problem statements, design efficient algorithms, and produce correct implementations. The problems cover dynamic programming, graph algorithms, trees, data structures, number theory, combinatorics, and geometry; each is tagged with its algorithm categories, enabling per-domain analysis across 35 algorithmic areas.

What sets this benchmark apart is its evaluation methodology: generated solutions are submitted directly to the CodeForces platform, which eliminates false positives and holds models to the same standards as human competitive programmers. The system supports special-judge problems with multiple valid outputs and keeps execution environments aligned across models for fair comparison. Results are reported as standardized Elo ratings that are directly comparable to human participants on CodeForces, offering meaningful insight into how models perform relative to human competitors.

The benchmark supports both C++ and Python, enabling analysis of code generation quality, algorithmic thinking, and problem-solving strategies across programming paradigms. Observed performance spans a wide range, from basic implementations to advanced reasoning models such as o1-mini, which reaches an Elo rating of 1578.
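The submission-based evaluation described above can be pictured as a generate-submit-poll loop per problem. The sketch below is illustrative only: `generate_solution`, `submit_solution`, and `poll_verdict` are hypothetical callables standing in for the model API and the harness's submission tooling (CodeForces does not document a public submission API, so that layer is infrastructure-specific), and the verdict strings are assumptions modeled on CodeForces-style verdict names.

```python
import time
from dataclasses import dataclass


@dataclass
class Verdict:
    problem_id: str
    passed: bool
    detail: str


def evaluate_on_problem(generate_solution, submit_solution, poll_verdict,
                        problem_id: str, statement: str,
                        language: str = "cpp") -> Verdict:
    """Generate one solution and check it against the online judge.

    All three callables are hypothetical stand-ins: `generate_solution`
    wraps the model under test, while `submit_solution` / `poll_verdict`
    wrap whatever submission tooling the harness uses.
    """
    source = generate_solution(statement, language)            # model output
    submission_id = submit_solution(problem_id, source, language)

    # Poll until the judge reaches a terminal verdict; "QUEUED" / "TESTING"
    # and "OK" mirror CodeForces-style verdict names (an assumption here).
    status = poll_verdict(submission_id)
    while status in ("QUEUED", "TESTING"):
        time.sleep(5)
        status = poll_verdict(submission_id)

    return Verdict(problem_id=problem_id, passed=(status == "OK"), detail=status)
```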
Methodology
CodeForces scores models with a standardized Elo-style rating on a scale of 0 to 3500, where higher scores indicate better performance. For details of the rating procedure, please refer to the original paper.
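For intuition, one common way to turn a contest result into an Elo-style rating is to find the rating whose expected rank against the human field matches the model's observed rank. The snippet below is a minimal sketch of that idea using the standard Elo win-probability formula; it is not necessarily the paper's exact procedure, and the function names and the synthetic example field are assumptions for illustration.

```python
from typing import List


def expected_rank(rating: float, field: List[float]) -> float:
    """Expected rank of a participant with `rating` against a field of human
    ratings, using the Elo win probability 1 / (1 + 10 ** ((r - r_i) / 400))."""
    return 1.0 + sum(1.0 / (1.0 + 10.0 ** ((rating - r_i) / 400.0)) for r_i in field)


def rating_from_rank(observed_rank: float, field: List[float],
                     lo: float = 0.0, hi: float = 3500.0) -> float:
    """Bisect for the rating whose expected rank matches the observed rank.

    `expected_rank` decreases monotonically as `rating` grows, so plain
    bisection converges; the bounds mirror the 0-3500 scale used here.
    """
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, field) > observed_rank:
            lo = mid   # mid's expected rank is worse than observed -> rate higher
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    import random
    random.seed(0)
    human_field = [random.gauss(1400, 400) for _ in range(1000)]  # synthetic ratings
    print(round(rating_from_rank(150, human_field)))  # e.g. a 150th-place finish
```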
Publication
This benchmark was published in 2025. See the accompanying technical paper for details.