MMMU
A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, with 11.5K college-level questions across 6 disciplines and 30 subjects.
MMMU Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type
---|---|---|---|---|---|---
1 | o3 | OpenAI | 82.9 | — | 2025-04-16 | Multimodal
2 | o4-mini | OpenAI | 81.6 | — | 2025-04-16 | Multimodal
3 | Gemini 2.5 Flash | Google | 79.7 | — | 2025-05-20 | Multimodal
4 | Gemini 2.5 Pro | Google | 79.6 | — | 2025-05-06 | Multimodal
5 | o1 | OpenAI | 78.2 | — | 2024-09-12 | Multimodal
6 | Claude 3.7 Sonnet | Anthropic | 75.0 | — | 2025-02-24 | Multimodal
7 | GPT-4.1 | OpenAI | 74.8 | — | 2025-04-14 | Multimodal
8 | GPT-4.5 | OpenAI | 74.4 | — | 2025-02-27 | Multimodal
9 | Claude Opus 4 | Anthropic | 73.7 | — | 2025-05-22 | Multimodal
10 | Grok 3 | xAI | 73.2 | Unknown (multi-trillion estimated) | 2025-02-19 | Multimodal
About MMMU
Description
MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning) is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks demanding expert-level subject knowledge and deliberate reasoning. The benchmark includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks.

Key Features

Comprehensive Coverage:
- 6 Core Disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering
- 30 Subjects: spanning from Art to Computer Science to Clinical Medicine
- 183 Subfields: providing granular coverage across academic domains
- 30 Image Types: including diagrams, tables, charts, chemical structures, photographs, paintings, geometric shapes, music sheets, medical images, and more

Advanced Challenges:
1. Expert-level Visual Perception: requires understanding of domain-specific visual elements
2. Deep Subject Knowledge: demands college-level expertise across multiple disciplines
3. Complex Reasoning: tests logical, spatial, commonsense, and mathematical reasoning
4. Interleaved Text and Images: evaluates joint understanding of multimodal content

Dataset Statistics

- Total Questions: 11,550
- Split: 150 dev / 900 validation / 10,500 test
- Question Types: 94% multiple-choice, 6% open-ended
- Difficulty Distribution: 28% Easy, 45% Medium, 27% Hard
- Images per Question: 97.5% include images, 7.4% have multiple images
- Average Question Length: 59.33 words

Discipline Breakdown

- Tech & Engineering (26%): Agriculture, Architecture, Computer Science, Electronics, Energy & Power, Materials, Mechanical Engineering
- Science (23%): Biology, Chemistry, Geography, Math, Physics
- Health & Medicine (17%): Basic Medical Science, Clinical Medicine, Diagnostics, Pharmacy, Public Health
- Business (14%): Accounting, Economics, Finance, Management, Marketing
- Art & Design (11%): Art, Design, Music, Art Theory
- Humanities & Social Science (9%): History, Literature, Psychology, Sociology

Performance Benchmarks

Human Expert Performance:
- Best Human Expert: 88.6% (validation)
- Medium Human Expert: 82.6% (validation)
- Worst Human Expert: 76.2% (validation)

Model Performance (Test Set):
- GPT-4o: 69.1% (validation only)
- Gemini 1.5 Pro: 62.2% (validation only)
- Claude 3 Opus: 59.4% (validation only)
- GPT-4V(ision): 55.7%
- SenseChat-Vision: 50.3%
- InternVL-Chat-V1.2: 46.2%
- LLaVA-1.6-34B: 44.7%
- BLIP-2 FLAN-T5-XXL: 34.0%

Error Analysis

Analysis of 150 GPT-4V errors reveals:
- Perceptual Errors (35%): basic visual interpretation failures and domain-specific perception issues
- Lack of Knowledge (29%): missing specialized domain knowledge
- Reasoning Errors (26%): flawed logical and mathematical reasoning
- Other Errors (10%): textual understanding, answer extraction, and annotation issues

Evaluation Methodology

Zero-shot Evaluation: all models are evaluated in a zero-shot setting, without fine-tuning or few-shot demonstrations, to assess their inherent capabilities.

Scoring:
- Metric: micro-averaged accuracy
- Multiple-choice questions: systematic rule-based evaluation with robust regex patterns
- Open questions: direct answer matching, with fallback to random selection for invalid responses

Baselines:
- Random Choice: 23.9% (test set)
- Frequent Choice: 25.8% (test set)
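To make the scoring rules above concrete, here is a minimal Python sketch of the multiple-choice path: rule-based extraction of the predicted option letter from a free-form model response, with random selection as the fallback for unparseable responses. This is an illustration only, not the official evaluation code; the regex pattern and the record fields (`response`, `options`, `answer`) are assumptions for demonstration.

```python
import random
import re


def extract_choice(response: str, options: list[str]) -> str:
    """Rule-based extraction of a predicted option letter from a model response.

    Illustrative sketch only: the official harness uses its own, more robust
    regex patterns. Falls back to a random letter when nothing can be parsed,
    mirroring the fallback-to-random-selection rule described above.
    """
    letters = [chr(ord("A") + i) for i in range(len(options))]

    # Pattern 1: an explicit standalone letter, e.g. "(B)", "B.", or "Answer: B".
    match = re.search(r"\b([A-J])\b", response)
    if match and match.group(1) in letters:
        return match.group(1)

    # Pattern 2: the response repeats one option's text verbatim.
    for letter, option in zip(letters, options):
        if option.strip().lower() in response.lower():
            return letter

    # Invalid / unparseable response: random selection.
    return random.choice(letters)


# Hypothetical usage on one multiple-choice record:
record = {
    "question": "Which structure is highlighted in the image?",
    "options": ["Left atrium", "Right ventricle", "Aorta", "Pulmonary vein"],
    "answer": "C",
    "response": "The highlighted vessel is the aorta, so the answer is (C).",
}
predicted = extract_choice(record["response"], record["options"])
print(predicted == record["answer"])  # True for this example
```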
Key Findings

1. Significant Challenge: even advanced models like GPT-4V achieve only 55.7% accuracy, indicating substantial room for improvement
2. Performance Gap: large disparity between open-source models (~34%) and proprietary models (~56%)
3. OCR/Caption Limitations: text-only LLMs with OCR or caption augmentation show minimal improvement
4. Domain Variation: higher performance in Art & Design and Humanities; lower in Science, Medicine, and Engineering
5. Image Type Dependency: models perform better on common image types (photos, paintings) but struggle with specialized formats (geometric shapes, music sheets, chemical structures)

Significance for AGI Research

MMMU serves as a critical benchmark for measuring progress toward Expert AGI, as defined by Morris et al.'s taxonomy. Strong performance on MMMU should be considered necessary (though not sufficient) for any system claiming expert-level artificial general intelligence, as it tests:
- Breadth: coverage across multiple academic disciplines
- Depth: college-level expertise and reasoning
- Multimodal Integration: joint understanding of text and diverse visual content
- Expert-level Skills: domain-specific knowledge application

Data Collection and Quality

The benchmark was created through a rigorous process:
- Manual Curation: 50+ college students collected questions from textbooks and online resources
- Quality Control: multi-stage review process with duplicate detection and format standardization
- Difficulty Assessment: questions categorized into four difficulty levels, with ~10% excluded as too easy
- Copyright Compliance: strict adherence to licensing regulations
- Contamination Prevention: selection of questions with non-obvious answers to reduce data-leakage risks

MMMU represents a significant step forward in evaluating multimodal AI systems, providing a comprehensive and challenging benchmark that pushes the boundaries of what current models can achieve.
Methodology
MMMU evaluates model performance using a standardized scoring methodology: micro-averaged accuracy reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and evaluation protocol, refer to the original paper.
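As a sketch of how the headline number is produced, the snippet below pools per-question correctness across every subject (micro-averaging, consistent with the metric stated above) and expresses the result on the leaderboard's 0-100 scale; the `(subject, is_correct)` record format is an assumption for illustration, not the official harness.

```python
from collections import defaultdict


def leaderboard_score(results: list[tuple[str, bool]]) -> float:
    """Micro-averaged accuracy on a 0-100 scale.

    `results` holds one (subject, is_correct) pair per question. Micro-averaging
    pools all questions before dividing, so subjects with more questions
    (e.g. Tech & Engineering) weigh proportionally more than smaller ones.
    """
    correct = sum(1 for _, ok in results if ok)
    return 100.0 * correct / len(results)


def per_subject_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-subject accuracies, useful for the domain-variation analysis above."""
    total, hits = defaultdict(int), defaultdict(int)
    for subject, ok in results:
        total[subject] += 1
        hits[subject] += int(ok)
    return {s: 100.0 * hits[s] / total[s] for s in total}


# Hypothetical toy usage:
results = [("Math", True), ("Math", False), ("Art", True), ("Art", True)]
print(leaderboard_score(results))     # 75.0 on the 0-100 scale
print(per_subject_accuracy(results))  # {'Math': 50.0, 'Art': 100.0}
```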
Publication
This benchmark was published in 2023. Technical paper: "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI".