CodeForces

Coding · Verified

Advanced competitive programming benchmark for evaluating large language models on algorithmic problem-solving and code generation.

Published: 2025
Score Range: 0-3500
Top Score: 2,719

CodeForces Leaderboard

Rank | Model        | Provider | Score | Parameters                         | Released   | Type
1    | o4-mini      | OpenAI   | 2,719 | N/A                                | 2025-04-16 | Multimodal
2    | o3           | OpenAI   | 2,706 | N/A                                | 2025-04-16 | Multimodal
3    | GPT-OSS-120B | OpenAI   | 2,622 | 117B total (5.1B active per token) | 2025-08-05 | Text
4    | GPT-OSS-20B  | OpenAI   | 2,516 | 21B total (3.6B active per token)  | 2025-08-05 | Text
5    | Qwen-3       | Alibaba  | 2,056 | 235B (22B active)                  | 2025-04-29 | Text
6    | DeepSeek-R1  | DeepSeek | 2,029 | 671B (37B activated)               | 2025-01-20 | Text
7    | o1           | OpenAI   | 1,673 | N/A                                | 2024-09-12 | Multimodal
8    | o1-mini      | OpenAI   | 1,650 | N/A                                | 2024-09-12 | Text
9    | o1-preview   | OpenAI   | 1,258 | N/A                                | 2024-09-12 | Text
10   | DeepSeek-V3  | DeepSeek | 51.6  | 671B total, 37B activated          | 2024-12-26 | Text

About CodeForces

Description

The CodeForces benchmark evaluates large language models on competitive programming and algorithmic problem-solving. It comprises 387 curated CodeForces contest problems spanning Divisions 1 through 4, testing a model's ability to understand complex problem statements, design efficient algorithms, and generate correct implementations. The problems cover a wide range of algorithmic concepts, including dynamic programming, graph algorithms, tree structures, data structures, number theory, combinatorics, and geometry. Each problem is tagged with its algorithm categories, enabling detailed analysis of model performance across 35 algorithmic domains.

What sets this benchmark apart is its evaluation methodology: solutions are submitted directly to the CodeForces platform, ensuring zero false positives and holding models to the same standards as human competitive programmers. The system supports special-judge problems with multiple valid outputs and maintains aligned execution environments for fair comparison across models.

The benchmark reports standardized Elo ratings that are directly comparable to those of human participants on CodeForces, giving a meaningful measure of how AI models perform relative to human competitors. Measured performance spans a wide range, from basic implementations to advanced reasoning models such as o1-mini, which reached a 1578 Elo rating in testing. Both C++ and Python are supported, enabling analysis of code generation quality, algorithmic thinking, and problem-solving strategy across programming paradigms.
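To illustrate how an evaluation loop of this kind might be structured, the sketch below runs a candidate solution against a problem's test cases and falls back to a special-judge checker when multiple outputs are valid. This is a minimal local sketch, not the benchmark's actual pipeline (which submits solutions directly to the CodeForces platform); the ProblemCase dataclass, the checker interface, and the function names are hypothetical.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ProblemCase:
    """One test case: input text plus either an exact expected output
    or a special-judge checker for problems with multiple valid answers."""
    input_text: str
    expected_output: Optional[str] = None
    checker: Optional[Callable[[str, str], bool]] = None  # (input, produced) -> accepted

def run_case(command: list[str], case: ProblemCase, time_limit: float = 2.0) -> bool:
    """Run a candidate solution (e.g. ["python3", "sol.py"]) on a single case."""
    try:
        result = subprocess.run(
            command, input=case.input_text, capture_output=True,
            text=True, timeout=time_limit,
        )
    except subprocess.TimeoutExpired:
        return False  # time limit exceeded
    if result.returncode != 0:
        return False  # runtime error
    produced = result.stdout.strip()
    if case.checker is not None:
        return case.checker(case.input_text, produced)  # special judge: many valid outputs
    return produced == (case.expected_output or "").strip()

def judge(command: list[str], cases: list[ProblemCase]) -> bool:
    """A submission is accepted only if every test case passes."""
    return all(run_case(command, c) for c in cases)
```

In the real benchmark, the platform's own judge plays the role of `judge` above, which is what guarantees zero false positives.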

Methodology

CodeForces evaluates models using a standardized scoring methodology. Scores are Elo-style ratings on the CodeForces scale, reported from 0 to 3500, where higher scores indicate stronger performance and are directly comparable to the ratings of human participants. For details of the rating calculation, refer to the original paper.
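For intuition on how an Elo-style rating can be derived from contest results, the sketch below estimates a performance rating as the rating at which the expected rank (computed from standard Elo win probabilities against the other contestants) matches the rank actually achieved. This is a simplified illustration under assumed inputs, not the exact CodeForces rating formula; `opponent_ratings` and `achieved_rank` are hypothetical.

```python
def expected_rank(rating: float, opponent_ratings: list[float]) -> float:
    """Expected rank = 1 + sum over opponents of the probability that the
    opponent beats a contestant with the given rating (Elo logistic model)."""
    return 1.0 + sum(
        1.0 / (1.0 + 10 ** ((rating - opp) / 400.0)) for opp in opponent_ratings
    )

def performance_rating(achieved_rank: float, opponent_ratings: list[float]) -> float:
    """Binary-search the rating whose expected rank equals the achieved rank
    (expected rank decreases monotonically as rating increases)."""
    lo, hi = 0.0, 3500.0  # matches the benchmark's reported score range
    for _ in range(60):  # enough iterations for sub-point precision
        mid = (lo + hi) / 2.0
        if expected_rank(mid, opponent_ratings) > achieved_rank:
            lo = mid  # expected rank is worse than achieved: raise the estimate
        else:
            hi = mid
    return (lo + hi) / 2.0

# Example: placing 120th in a hypothetical field of 500 opponents rated 1200-2697.
ratings = [1200 + 3 * i for i in range(500)]
print(round(performance_rating(120, ratings)))
```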

Publication

This benchmark was published in 2025.

Technical Paper
