Aider Polyglot

Category: coding
Status: Pending Verification

A code editing benchmark built from Exercism coding exercises in six programming languages: C++, Go, Java, JavaScript, Python, and Rust. It uses the 225 most challenging exercises of the 697 available, chosen to recalibrate scoring so that top coding LLMs fall between 5% and 50%. It replaces the original Python-only benchmark, which was saturating with scores above 80%, and separates frontier models more clearly through greater difficulty and language diversity.
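The figures above can be checked with a little arithmetic. The sketch below assumes that a score is the percentage of the 225 selected exercises that a model solves; the function names are hypothetical and not part of the benchmark's tooling.

```python
# Figures taken from the description above.
TOTAL_EXERCISES = 697
SELECTED_EXERCISES = 225


def share_selected(selected: int, total: int) -> float:
    """Percentage of the available exercises kept in the benchmark."""
    return round(100 * selected / total, 1)


def exercises_for_score(score_percent: float, n: int = SELECTED_EXERCISES) -> float:
    """Number of solved exercises implied by a percentage score.

    Assumption: score = solved exercises / total exercises * 100.
    """
    return score_percent * n / 100


if __name__ == "__main__":
    print(share_selected(SELECTED_EXERCISES, TOTAL_EXERCISES))  # 32.3
    print(exercises_for_score(5))   # 11.25
    print(exercises_for_score(50))  # 112.5
```

Under that assumption, the intended 5% to 50% band corresponds to roughly 11 to 113 solved exercises, and the benchmark keeps about a third of the available exercise pool.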

Published: 2024
Score Range: 0-100
Top Score: 81.3

Aider Polyglot Leaderboard

Rank  Model                  Provider     Score  Parameters               Released    Type
1     o3                     OpenAI       81.3   -                        2025-04-16  Multimodal
2     Gemini 2.5 Pro         Google       76.5   -                        2025-05-06  Multimodal
3     o4-mini                OpenAI       68.9   -                        2025-04-16  Multimodal
4     Gemini 2.5 Flash       Google       61.9   -                        2025-05-20  Multimodal
5     Qwen-3                 Alibaba      61.8   235B total, 22B active   2025-04-29  Text
6     Kimi K2                Moonshot AI  60     1T total, 32B active     2025-07-11  Text
7     DeepSeek-R1            DeepSeek     53.3   671B total, 37B active   2025-01-20  Text
8     DeepSeek-V3            DeepSeek     49.6   671B total, 37B active   2024-12-26  Text
9     Gemini 2.5 Flash-Lite  Google       27.1   -                        2025-06-17  Multimodal

About Aider Polyglot

Methodology

Aider Polyglot reports scores on a scale of 0 to 100, where higher scores indicate better performance. For details of the scoring system and methodology, refer to the original publication at aider.chat.
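As a rough illustration of a 0 to 100 score, the sketch below assumes the score is the percentage of exercises whose tests pass. The function `pass_rate` and the sample results are hypothetical; the exact scoring rules are defined in the original publication.

```python
def pass_rate(results: list[bool]) -> float:
    """Return the percentage of passing exercises, rounded to one decimal.

    Assumption: each entry is True if the exercise's tests passed.
    """
    if not results:
        raise ValueError("results must contain at least one exercise")
    return round(100 * sum(results) / len(results), 1)


if __name__ == "__main__":
    # Hypothetical run: 90 of 225 exercises pass.
    sample = [True] * 90 + [False] * 135
    print(pass_rate(sample))  # 40.0
```

A run that passes every exercise would score 100 and one that passes none would score 0, which matches the stated score range.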

Publication

This benchmark was published in 2024. Source: aider.chat

Related Benchmarks