IMO-AnswerBench
A large-scale benchmark of 400 International Mathematical Olympiad-level problems with verifiable answers, spanning Algebra, Combinatorics, Geometry, and Number Theory across four difficulty levels.
IMO-AnswerBench Leaderboard
| Rank | Model | Provider | Score | Parameters | Released | Type |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro | Google DeepMind | 83.3 | Proprietary | 2025-11-18 | Multimodal |
| 2 | Kimi K2 | Moonshot AI | 78.6 | 1T total, 32B activated | 2025-07-11 | Text |
About IMO-AnswerBench
Description
IMO-AnswerBench is part of the IMO-Bench suite introduced at EMNLP 2025 by Google DeepMind. It consists of 400 carefully chosen problems from past Olympiad competitions, modified by IMO medalists and mathematicians to avoid memorization.

## Problem Distribution

The benchmark spans four IMO categories:

- **Algebra**: 100 problems (11 pre-IMO, 46 IMO-Easy, 32 IMO-Medium, 11 IMO-Hard)
- **Combinatorics**: 100 problems (4 pre-IMO, 19 IMO-Easy, 31 IMO-Medium, 46 IMO-Hard)
- **Geometry**: 100 problems (13 pre-IMO, 44 IMO-Easy, 32 IMO-Medium, 11 IMO-Hard)
- **Number Theory**: 100 problems (2 pre-IMO, 20 IMO-Easy, 31 IMO-Medium, 47 IMO-Hard)

## Difficulty Levels

- **Pre-IMO**: Middle school or pre-Math Olympiad problems
- **IMO-Easy**: Equivalent to Problem 1 or Problem 4 at the IMO
- **IMO-Medium**: Equivalent to Problem 2 or Problem 5 at the IMO
- **IMO-Hard**: Equivalent to Problem 3 or Problem 6 at the IMO, or post-Math Olympiad problems

## Key Features

- Vetted by a panel of IMO medalists and mathematicians (10 gold and 5 silver medals combined)
- Problems require rigorous multi-step reasoning and creativity beyond simple formula application
- Diverse representation of topics, ideas, and domain knowledge across all categories
- Designed to push the frontiers of mathematical reasoning in AI systems
- Part of the benchmark suite that contributed to achieving gold-medal standard at the IMO

IMO-AnswerBench focuses on getting the right final answer and serves as a comprehensive test of mathematical problem-solving ability at the International Mathematical Olympiad level.
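For reference, the breakdown above can be written down as a small table in code. The sketch below (plain Python; the variable names are illustrative, not part of any official release) simply records the published counts and checks that each category sums to 100 problems and the benchmark to 400.

```python
# Published IMO-AnswerBench problem counts per category and difficulty level.
# Counts are copied from the distribution listed above; names are illustrative.
DISTRIBUTION = {
    "Algebra":       {"pre-IMO": 11, "IMO-Easy": 46, "IMO-Medium": 32, "IMO-Hard": 11},
    "Combinatorics": {"pre-IMO": 4,  "IMO-Easy": 19, "IMO-Medium": 31, "IMO-Hard": 46},
    "Geometry":      {"pre-IMO": 13, "IMO-Easy": 44, "IMO-Medium": 32, "IMO-Hard": 11},
    "Number Theory": {"pre-IMO": 2,  "IMO-Easy": 20, "IMO-Medium": 31, "IMO-Hard": 47},
}

# Each category contributes exactly 100 problems, for 400 in total.
for category, levels in DISTRIBUTION.items():
    assert sum(levels.values()) == 100, category
assert sum(sum(levels.values()) for levels in DISTRIBUTION.values()) == 400
```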
Methodology
IMO-AnswerBench evaluates models with a standardized scoring methodology: each model receives a score on a 0–100 scale, with higher scores indicating better performance. For details of the scoring system and methodology, please refer to the original paper.
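The exact grading procedure is defined in the paper; purely as an illustration of the 0–100 scale, the sketch below assumes the score is the percentage of problems whose predicted final answer matches the reference answer after a crude string normalization. The function names, the normalization, and the matching rule are all assumptions for this sketch, not the benchmark's official grading pipeline.

```python
def normalize(answer: str) -> str:
    """Crude normalization of a final-answer string (illustrative only)."""
    return " ".join(answer.strip().lower().split())

def answerbench_score(predictions: dict[str, str], references: dict[str, str]) -> float:
    """Return a 0-100 score: the percentage of problems whose predicted final
    answer matches the reference answer under the crude normalization above."""
    correct = sum(
        normalize(predictions.get(problem_id, "")) == normalize(reference)
        for problem_id, reference in references.items()
    )
    return 100.0 * correct / len(references)

# Toy example: 2 of 3 problems answered correctly gives a score of about 66.7.
refs = {"p1": "2027", "p2": "n(n+1)/2", "p3": "f(x) = x"}
preds = {"p1": "2027", "p2": "n(n+1)/2", "p3": "f(x) = 2x"}
print(answerbench_score(preds, refs))  # 66.66...
```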
Publication
This benchmark was published in 2025.