Michelangelo Long-Context Reasoning (1M)
MRCR (Multi-Round Coreference Resolution) is part of the Michelangelo benchmark suite for evaluating long-context reasoning in LLMs. It tests a model's ability to track identities and references across adversarially constructed conversation histories of up to 1M tokens. Unlike retrieval-based benchmarks, MRCR requires synthesis, reasoning, and contextual understanding over the full extended context. The benchmark is synthetically generated to avoid pretraining contamination, and its construction yields unambiguous scoring. Because it stresses reasoning over the whole context rather than simple lookup, MRCR is recommended over traditional needle-in-a-haystack retrieval benchmarks for evaluating long-context capabilities.
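To make the task concrete, the sketch below builds a toy MRCR-style instance: a conversation containing several near-duplicate requests ("write a poem about penguins") mixed with filler turns, followed by a final question asking the model to reproduce a specific occurrence. The topic names, message format, and "reproduce poem number i" phrasing are illustrative assumptions, not the benchmark's actual generator.

```python
import random

# Filler topics are placeholders; the real benchmark uses its own
# synthetic generator with much longer conversations (up to 1M tokens).
FILLER_TOPICS = ["volcanoes", "sailboats", "glaciers"]

def make_mrcr_instance(n_fillers: int = 6, n_needles: int = 3, seed: int = 0) -> dict:
    """Build a toy multi-round conversation with near-duplicate turns.

    Several user requests share one surface form ("a poem about penguins");
    the final question asks for the i-th such response, so the model must
    track WHICH occurrence is which rather than just match a keyword.
    """
    rng = random.Random(seed)
    kinds = ["needle"] * n_needles + ["filler"] * n_fillers
    rng.shuffle(kinds)

    messages, needle_answers = [], []
    for t, kind in enumerate(kinds):
        topic = "penguins" if kind == "needle" else rng.choice(FILLER_TOPICS)
        messages.append({"role": "user", "content": f"Write a poem about {topic}."})
        body = f"(poem #{t} about {topic})"  # stand-in for generated text
        messages.append({"role": "assistant", "content": body})
        if kind == "needle":
            needle_answers.append(body)

    i = rng.randrange(n_needles)  # which occurrence to ask for
    messages.append({
        "role": "user",
        "content": f"Reproduce poem number {i + 1} about penguins, verbatim.",
    })
    return {"messages": messages, "answer": needle_answers[i]}
```

Because every needle turn looks identical on the surface, a retrieval heuristic that matches "poem about penguins" cannot distinguish the occurrences; the model has to maintain an ordering over the whole conversation.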
Michelangelo Long-Context Reasoning (1M) Leaderboard
| Rank | Model | Provider | Score | Released | Type |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | Google | 93 | 2025-05-06 | Multimodal |
| 2 | Gemini 2.0 Pro | Google | 74.7 | 2025-02-05 | Multimodal |
| 3 | Gemini 2.0 Flash | Google | 70.5 | 2025-02-25 | Multimodal |
| 4 | Gemini 2.0 Flash-Lite | Google | 58 | 2025-02-25 | Multimodal |
| 5 | Gemini 2.5 Flash | Google | 32 | 2025-05-20 | Multimodal |
| 6 | Gemini 2.5 Flash-Lite | Google | 5.4 | 2025-06-17 | Multimodal |
About Michelangelo Long-Context Reasoning (1M)
Methodology
Michelangelo Long-Context Reasoning (1M) reports scores on a 0 to 100 scale, where higher scores indicate better performance. For details of the scoring system and task construction, refer to the original paper.
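One plausible way to grade a reproduce-the-text task on a 0 to 100 scale is string similarity between the model's response and the reference answer. The sketch below assumes that scheme (difflib's SequenceMatcher ratio scaled to 0 to 100); the benchmark's published grader may differ, so treat this as an illustrative stand-in rather than the official metric.

```python
from difflib import SequenceMatcher

def mrcr_style_score(response: str, reference: str) -> float:
    """Grade a response on a 0-100 scale via sequence similarity.

    An exact reproduction scores 100; partial matches score
    proportionally less. This is an assumed grading scheme, not
    the benchmark's documented one.
    """
    ratio = SequenceMatcher(None, response, reference).ratio()
    return 100.0 * ratio

# Usage: exact match scores 100, a near miss scores below it.
print(mrcr_style_score("(poem #4 about penguins)", "(poem #4 about penguins)"))  # 100.0
print(mrcr_style_score("(poem #2 about penguins)", "(poem #4 about penguins)"))  # < 100
```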
Publication
This benchmark was published in 2024.
Technical Paper