Aider Polyglot

coding | Pending Human Review

A code editing benchmark built from Exercism coding exercises in six programming languages (C++, Go, Java, JavaScript, Python, and Rust). It uses the 225 most challenging exercises out of the 697 available, and was designed to recalibrate scoring so that top coding LLMs would score between 5% and 50%. It replaces the original Python-only benchmark, which was saturating with scores above 80%, and differentiates frontier models better through increased difficulty and language diversity.

Published: 2024

Score Range: 0-100

Top Score: 81.3

aider.chat GitHub Repo

Aider Polyglot Leaderboard

Rank	Model	Provider	Score	Parameters	Released	Type
1	o3	OpenAI	81.3		2025-04-16	Multimodal
2	Gemini 2.5 Pro	Google	76.5		2025-05-06	Multimodal
3	o4-mini	OpenAI	68.9		2025-04-16	Multimodal
4	Gemini 2.5 Flash	Google	61.9		2025-05-20	Multimodal
5	Qwen-3	Alibaba	61.8	235B (22B active)	2025-04-29	Text
6	Kimi K2	Moonshot AI	60	1T total, 32B activated	2025-07-11	Text
7	DeepSeek-R1	DeepSeek	53.3	671B (37B activated)	2025-01-20	Text
8	DeepSeek-V3	DeepSeek	49.6	671B total, 37B activated	2024-12-26	Text
9	Gemini 2.5 Flash-Lite	Google	27.1		2025-06-17	Multimodal

About Aider Polyglot

Methodology

Aider Polyglot scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For details of the scoring system and methodology, refer to the original publication.
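As a rough illustration of how a 0 to 100 score could relate to the 225-exercise set, the sketch below assumes the score is simply the percentage of exercises a model solves, rounded to one decimal place. That assumption, the function name `percent_solved`, and the example count of 183 are all hypothetical and are not taken from the benchmark's own tooling.

```python
# Hypothetical sketch: treats the score as "percent of exercises solved".
# This is an assumption for illustration, not the benchmark's official code.

TOTAL_EXERCISES = 225  # size of the exercise set described on this page


def percent_solved(solved: int, total: int = TOTAL_EXERCISES) -> float:
    """Return the share of solved exercises as a 0-100 score, one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    return round(100 * solved / total, 1)


if __name__ == "__main__":
    # Illustrative input only: 183 is a made-up count, not a reported result.
    print(percent_solved(183))
```

Under this assumption, each additional solved exercise moves the score by roughly 0.44 points, which gives a sense of the granularity of differences between adjacent leaderboard entries.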

Publication

This benchmark was published in 2024. Source: aider.chat

Related Benchmarks

Berkeley Function-Calling Leaderboard

coding

The first comprehensive evaluation of LLMs' function calling capabilities, testing various forms including parallel and multiple function calls across diverse programming languages. Evaluates models on execution accuracy and ability to withhold function selection when appropriate.

Published: 2024

Scale: 0-100

CodeForces

coding

Advanced competitive programming benchmark for evaluating large language models on algorithmic problem-solving and code generation.

Published: 2025

Scale: 0-3500

HumanEval

coding

Evaluates code generation capabilities by asking models to complete Python functions based on docstrings and function signatures.

Published: 2021

Scale: 0-100
