CodeForces
Advanced competitive programming benchmark for evaluating large language models on algorithmic problem-solving and code generation.
CodeForces Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type
---|---|---|---|---|---|---
1 | o4-mini | OpenAI | 2,719 | Not disclosed | 2025-04-16 | Multimodal
2 | o3 | OpenAI | 2,706 | Not disclosed | 2025-04-16 | Multimodal
3 | GPT-OSS-120B | OpenAI | 2,622 | 117B total (5.1B active per token) | 2025-08-05 | Text
4 | GPT-OSS-20B | OpenAI | 2,516 | 21B total (3.6B active per token) | 2025-08-05 | Text
5 | Qwen-3 | Alibaba | 2,056 | 235B total (22B active) | 2025-04-29 | Text
6 | DeepSeek-R1 | DeepSeek | 2,029 | 671B total (37B active) | 2025-01-20 | Text
7 | o1 | OpenAI | 1,673 | Not disclosed | 2024-09-12 | Multimodal
8 | o1-mini | OpenAI | 1,650 | Not disclosed | 2024-09-12 | Text
9 | o1-preview | OpenAI | 1,258 | Not disclosed | 2024-09-12 | Text
10 | DeepSeek-V3 | DeepSeek | 51.6 | 671B total (37B active) | 2024-12-26 | Text
About CodeForces
Description
The CodeForces benchmark evaluates large language models on competitive programming and algorithmic problem-solving. It comprises 387 carefully curated CodeForces contest problems spanning divisions 1 through 4, testing a model's ability to understand complex problem statements, design efficient algorithms, and produce correct implementations. The problems cover dynamic programming, graph algorithms, trees, data structures, number theory, combinatorics, and geometry; each is tagged with its algorithm categories, enabling per-domain analysis across 35 algorithmic areas.

What sets this benchmark apart is its evaluation methodology: generated solutions are submitted directly to the CodeForces platform, which eliminates false positives and holds models to the same standards as human competitive programmers. The system supports special-judge problems with multiple valid outputs and keeps execution environments aligned across models for fair comparison. Results are reported as standardized Elo ratings that are directly comparable to human participants on CodeForces, offering meaningful insight into how models perform relative to human competitors.

The benchmark supports both C++ and Python, enabling analysis of code generation quality, algorithmic thinking, and problem-solving strategies across programming paradigms. Observed performance spans a wide range, from basic implementations to advanced reasoning models such as o1-mini, which reaches an Elo rating of 1578.
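The submission-based evaluation described above can be pictured as a generate-submit-poll loop per problem. The sketch below is illustrative only: `generate_solution`, `submit_solution`, and `poll_verdict` are hypothetical callables standing in for the model API and the harness's submission tooling (CodeForces does not document a public submission API, so that layer is infrastructure-specific), and the verdict strings are assumptions modeled on CodeForces-style verdict names.

```python
import time
from dataclasses import dataclass


@dataclass
class Verdict:
    problem_id: str
    passed: bool
    detail: str


def evaluate_on_problem(generate_solution, submit_solution, poll_verdict,
                        problem_id: str, statement: str,
                        language: str = "cpp") -> Verdict:
    """Generate one solution and check it against the online judge.

    All three callables are hypothetical stand-ins: `generate_solution`
    wraps the model under test, while `submit_solution` / `poll_verdict`
    wrap whatever submission tooling the harness uses.
    """
    source = generate_solution(statement, language)            # model output
    submission_id = submit_solution(problem_id, source, language)

    # Poll until the judge reaches a terminal verdict; "QUEUED" / "TESTING"
    # and "OK" mirror CodeForces-style verdict names (an assumption here).
    status = poll_verdict(submission_id)
    while status in ("QUEUED", "TESTING"):
        time.sleep(5)
        status = poll_verdict(submission_id)

    return Verdict(problem_id=problem_id, passed=(status == "OK"), detail=status)
```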
Methodology
CodeForces scores models with a standardized Elo-style rating on a scale of 0 to 3500, where higher scores indicate better performance. For details of the rating procedure, please refer to the original paper.
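For intuition, one common way to turn a contest result into an Elo-style rating is to find the rating whose expected rank against the human field matches the model's observed rank. The snippet below is a minimal sketch of that idea using the standard Elo win-probability formula; it is not necessarily the paper's exact procedure, and the function names and the synthetic example field are assumptions for illustration.

```python
from typing import List


def expected_rank(rating: float, field: List[float]) -> float:
    """Expected rank of a participant with `rating` against a field of human
    ratings, using the Elo win probability 1 / (1 + 10 ** ((r - r_i) / 400))."""
    return 1.0 + sum(1.0 / (1.0 + 10.0 ** ((rating - r_i) / 400.0)) for r_i in field)


def rating_from_rank(observed_rank: float, field: List[float],
                     lo: float = 0.0, hi: float = 3500.0) -> float:
    """Bisect for the rating whose expected rank matches the observed rank.

    `expected_rank` decreases monotonically as `rating` grows, so plain
    bisection converges; the bounds mirror the 0-3500 scale used here.
    """
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, field) > observed_rank:
            lo = mid   # mid's expected rank is worse than observed -> rate higher
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    import random
    random.seed(0)
    human_field = [random.gauss(1400, 400) for _ in range(1000)]  # synthetic ratings
    print(round(rating_from_rank(150, human_field)))  # e.g. a 150th-place finish
```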
Publication
This benchmark was published in 2025. See the accompanying technical paper for details.