Michelangelo Long-Context Reasoning (1M)
MRCR (Multi-Round Coreference Resolution) is part of the Michelangelo benchmark suite for evaluating long-context reasoning in LLMs. It tests a model's ability to track identities and references across adversarially constructed conversation histories of up to 1M tokens. Unlike retrieval-based benchmarks, MRCR requires synthesis, reasoning, and contextual understanding over the full extended context. The benchmark is synthetically generated to avoid pretraining contamination, and its construction yields unambiguous scoring. Because it stresses reasoning over the whole context rather than simple lookup, MRCR is recommended over traditional needle-in-a-haystack retrieval benchmarks for evaluating long-context capabilities.
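To make the task concrete, the sketch below builds a toy MRCR-style instance: a conversation containing several near-duplicate requests ("write a poem about penguins") mixed with filler turns, followed by a final question asking the model to reproduce a specific occurrence. The topic names, message format, and "reproduce poem number i" phrasing are illustrative assumptions, not the benchmark's actual generator.

```python
import random

# Filler topics are placeholders; the real benchmark uses its own
# synthetic generator with much longer conversations (up to 1M tokens).
FILLER_TOPICS = ["volcanoes", "sailboats", "glaciers"]

def make_mrcr_instance(n_fillers: int = 6, n_needles: int = 3, seed: int = 0) -> dict:
    """Build a toy multi-round conversation with near-duplicate turns.

    Several user requests share one surface form ("a poem about penguins");
    the final question asks for the i-th such response, so the model must
    track WHICH occurrence is which rather than just match a keyword.
    """
    rng = random.Random(seed)
    kinds = ["needle"] * n_needles + ["filler"] * n_fillers
    rng.shuffle(kinds)

    messages, needle_answers = [], []
    for t, kind in enumerate(kinds):
        topic = "penguins" if kind == "needle" else rng.choice(FILLER_TOPICS)
        messages.append({"role": "user", "content": f"Write a poem about {topic}."})
        body = f"(poem #{t} about {topic})"  # stand-in for generated text
        messages.append({"role": "assistant", "content": body})
        if kind == "needle":
            needle_answers.append(body)

    i = rng.randrange(n_needles)  # which occurrence to ask for
    messages.append({
        "role": "user",
        "content": f"Reproduce poem number {i + 1} about penguins, verbatim.",
    })
    return {"messages": messages, "answer": needle_answers[i]}
```

Because every needle turn looks identical on the surface, a retrieval heuristic that matches "poem about penguins" cannot distinguish the occurrences; the model has to maintain an ordering over the whole conversation.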
Michelangelo Long-Context Reasoning (1M) Leaderboard
| Rank | Model | Provider | Score | Released | Type |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | 93 | 2025-05-06 | Multimodal |
| 2 | Gemini 2.0 Pro | Google | 74.7 | 2025-02-05 | Multimodal |
| 3 | Gemini 2.0 Flash | Google | 70.5 | 2025-02-25 | Multimodal |
| 4 | Gemini 2.0 Flash-Lite | Google | 58 | 2025-02-25 | Multimodal |
| 5 | Gemini 2.5 Flash | Google | 32 | 2025-05-20 | Multimodal |
| 6 | Gemini 2.5 Flash-Lite | Google | 5.4 | 2025-06-17 | Multimodal |
About Michelangelo Long-Context Reasoning (1M)
Methodology
Michelangelo Long-Context Reasoning (1M) reports scores on a 0 to 100 scale, where higher scores indicate better performance. For details of the scoring system and task construction, refer to the original paper.
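One plausible way to grade a reproduce-the-text task on a 0 to 100 scale is string similarity between the model's response and the reference answer. The sketch below assumes that scheme (difflib's SequenceMatcher ratio scaled to 0 to 100); the benchmark's published grader may differ, so treat this as an illustrative stand-in rather than the official metric.

```python
from difflib import SequenceMatcher

def mrcr_style_score(response: str, reference: str) -> float:
    """Grade a response on a 0-100 scale via sequence similarity.

    An exact reproduction scores 100; partial matches score
    proportionally less. This is an assumed grading scheme, not
    the benchmark's documented one.
    """
    ratio = SequenceMatcher(None, response, reference).ratio()
    return 100.0 * ratio

# Usage: exact match scores 100, a near miss scores below it.
print(mrcr_style_score("(poem #4 about penguins)", "(poem #4 about penguins)"))  # 100.0
print(mrcr_style_score("(poem #2 about penguins)", "(poem #4 about penguins)"))  # < 100
```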
Publication
This benchmark was published in 2024.
Technical Paper