LiveCodeBench v6
Category: coding
A benchmark for evaluating LLMs on code generation tasks drawn from recent programming contests.
Published: 2024
Score Range: 0-100
Top Score: 90.7
LiveCodeBench v6 Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | Google | 90.7 | Proprietary | 2025-11-18 | Multimodal |
| 2 | Kimi K2 | Moonshot AI | 83.1 | 1T total, 32B activated | 2025-07-11 | Text |
| 3 | DeepSeek-V3 | DeepSeek | 46.9 | 671B total, 37B activated | 2024-12-26 | Text |
| 4 | Claude Opus 4 | Anthropic | 44.7 | | 2025-05-22 | Multimodal |
| 5 | Gemini 2.5 Flash | Google | 44.7 | | 2025-05-20 | Multimodal |
| 6 | GPT-4.1 | OpenAI | 44.7 | | 2025-04-14 | Multimodal |
| 7 | Qwen-3 | Alibaba | 37 | 235B total, 22B activated | 2025-04-29 | Text |
About LiveCodeBench v6
Methodology
LiveCodeBench v6 evaluates models on code generation problems drawn from recent programming contests. Scores are reported on a scale of 0 to 100, corresponding to the percentage of problems solved, with higher scores indicating better performance. For details on problem selection and scoring, refer to the original paper.
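As a minimal sketch of how a percentage-of-problems-solved score of this kind can be computed, the snippet below illustrates pass@1-style scoring: each problem gets one generated solution, and the score is the share of problems whose solution passes all tests, scaled to 0-100. The function names and toy data are hypothetical and are not taken from the official LiveCodeBench harness.

```python
"""Illustrative pass@1-style scoring sketch; not the official LiveCodeBench harness."""
from typing import Callable, Sequence


def pass_at_1_score(problems: Sequence[dict],
                    generate: Callable[[dict], str],
                    passes_all_tests: Callable[[dict, str], bool]) -> float:
    """Return the percentage of problems (0-100) whose single generated
    solution passes every test case; higher is better."""
    solved = sum(1 for p in problems if passes_all_tests(p, generate(p)))
    return 100.0 * solved / len(problems)


if __name__ == "__main__":
    # Toy data: three problems, and a dummy "model" that only solves the first two.
    toy_problems = [{"id": i} for i in range(3)]
    toy_generate = lambda p: f"print('solution for problem {p['id']}')"
    toy_check = lambda p, code: p["id"] < 2  # stand-in for running hidden tests
    print(pass_at_1_score(toy_problems, toy_generate, toy_check))  # ~66.7
```

In a real harness, the test-checking step would execute each generated program against the problem's hidden test cases in a sandboxed environment rather than using a stand-in predicate.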
Publication
This benchmark was published in 2024.