Aider Polyglot
A code editing benchmark built on Exercism coding exercises in six programming languages (C++, Go, Java, JavaScript, Python, and Rust). It uses the 225 most challenging of the 697 available exercises and was designed to recalibrate scoring so that top coding LLMs score between 5% and 50%. It replaces the original Python-only benchmark, which was saturating at scores above 80%, and relies on greater difficulty and language diversity to better differentiate frontier models.
Aider Polyglot Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type |
---|---|---|---|---|---|---|
1 | o3 | OpenAI | 81.3 | | 2025-04-16 | Multimodal |
2 | Gemini 2.5 Pro | Google | 76.5 | | 2025-05-06 | Multimodal |
3 | o4-mini | OpenAI | 68.9 | | 2025-04-16 | Multimodal |
4 | Gemini 2.5 Flash | Google | 61.9 | | 2025-05-20 | Multimodal |
5 | Qwen-3 | Alibaba | 61.8 | 235B total, 22B active | 2025-04-29 | Text |
6 | Kimi K2 | Moonshot AI | 60.0 | 1T total, 32B active | 2025-07-11 | Text |
7 | DeepSeek-R1 | DeepSeek | 53.3 | 671B total, 37B active | 2025-01-20 | Text |
8 | DeepSeek-V3 | DeepSeek | 49.6 | 671B total, 37B active | 2024-12-26 | Text |
9 | Gemini 2.5 Flash-Lite | Google | 27.1 | | 2025-06-17 | Multimodal |
About Aider Polyglot
Methodology
Aider Polyglot scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For details on the scoring system and methodology, see the original publication.
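As a rough illustration of how a score on the 0 to 100 scale could be derived, the sketch below assumes the score is simply the percentage of the 225 exercises a model solves. That reading is an assumption based on the percentage range quoted in the description. The function name `polyglot_score` and its parameters are hypothetical and are not part of the benchmark's own tooling.

```python
TOTAL_EXERCISES = 225  # benchmark size stated in the description above


def polyglot_score(solved: int, total: int = TOTAL_EXERCISES) -> float:
    """Return the share of solved exercises as a 0-100 score.

    Assumption: the score is a plain pass rate, rounded to one decimal
    place to match the precision shown in the leaderboard.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= solved <= total:
        raise ValueError("solved must be between 0 and total")
    return round(100.0 * solved / total, 1)


# Hypothetical example: a model that solves 90 of the 225 exercises.
example_score = polyglot_score(90)
```

Under this assumption, each exercise is worth about 0.44 points, so small gaps between adjacent leaderboard entries correspond to only a handful of exercises.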
Publication
This benchmark was published in 2024. Source: aider.chat