MMMU
A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, with 11.5K college-level questions across 6 disciplines and 30 subjects.
MMMU Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type
---|---|---|---|---|---|---
1 | o3 | OpenAI | 82.9 | — | 2025-04-16 | Multimodal
2 | o4-mini | OpenAI | 81.6 | — | 2025-04-16 | Multimodal
3 | Gemini 2.5 Flash | Google | 79.7 | — | 2025-05-20 | Multimodal
4 | Gemini 2.5 Pro | Google | 79.6 | — | 2025-05-06 | Multimodal
5 | o1 | OpenAI | 78.2 | — | 2024-09-12 | Multimodal
6 | Claude 3.7 Sonnet | Anthropic | 75.0 | — | 2025-02-24 | Multimodal
7 | GPT-4.1 | OpenAI | 74.8 | — | 2025-04-14 | Multimodal
8 | GPT-4.5 | OpenAI | 74.4 | — | 2025-02-27 | Multimodal
9 | Claude Opus 4 | Anthropic | 73.7 | — | 2025-05-22 | Multimodal
10 | Grok 3 | xAI | 73.2 | Unknown (multi-trillion estimated) | 2025-02-19 | Multimodal
About MMMU
Description
MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning) is a comprehensive benchmark designed to evaluate multimodal models on college-level tasks demanding expert-level subject knowledge and deliberate reasoning. The benchmark includes 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks.

Key Features

Comprehensive Coverage:
- 6 Core Disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering
- 30 Subjects: spanning from Art to Computer Science to Clinical Medicine
- 183 Subfields: providing granular coverage across academic domains
- 30 Image Types: including diagrams, tables, charts, chemical structures, photographs, paintings, geometric shapes, music sheets, medical images, and more

Advanced Challenges:
1. Expert-level Visual Perception: requires understanding of domain-specific visual elements
2. Deep Subject Knowledge: demands college-level expertise across multiple disciplines
3. Complex Reasoning: tests logical, spatial, commonsense, and mathematical reasoning
4. Interleaved Text and Images: evaluates joint understanding of multimodal content

Dataset Statistics

- Total Questions: 11,550
- Split: 150 dev / 900 validation / 10,500 test
- Question Types: 94% multiple-choice, 6% open-ended
- Difficulty Distribution: 28% Easy, 45% Medium, 27% Hard
- Images per Question: 97.5% include images, 7.4% have multiple images
- Average Question Length: 59.33 words

Discipline Breakdown

- Tech & Engineering (26%): Agriculture, Architecture, Computer Science, Electronics, Energy & Power, Materials, Mechanical Engineering
- Science (23%): Biology, Chemistry, Geography, Math, Physics
- Health & Medicine (17%): Basic Medical Science, Clinical Medicine, Diagnostics, Pharmacy, Public Health
- Business (14%): Accounting, Economics, Finance, Management, Marketing
- Art & Design (11%): Art, Design, Music, Art Theory
- Humanities & Social Science (9%): History, Literature, Psychology, Sociology

Performance Benchmarks

Human Expert Performance:
- Best Human Expert: 88.6% (validation)
- Medium Human Expert: 82.6% (validation)
- Worst Human Expert: 76.2% (validation)

Model Performance (Test Set):
- GPT-4o: 69.1% (validation only)
- Gemini 1.5 Pro: 62.2% (validation only)
- Claude 3 Opus: 59.4% (validation only)
- GPT-4V(ision): 55.7%
- SenseChat-Vision: 50.3%
- InternVL-Chat-V1.2: 46.2%
- LLaVA-1.6-34B: 44.7%
- BLIP-2 FLAN-T5-XXL: 34.0%

Error Analysis

Analysis of 150 GPT-4V errors reveals:
- Perceptual Errors (35%): basic visual interpretation failures and domain-specific perception issues
- Lack of Knowledge (29%): missing specialized domain knowledge
- Reasoning Errors (26%): flawed logical and mathematical reasoning
- Other Errors (10%): textual understanding, answer extraction, and annotation issues

Evaluation Methodology

Zero-shot Evaluation: all models are evaluated in a zero-shot setting, without fine-tuning or few-shot demonstrations, to assess their inherent capabilities.

Scoring:
- Metric: micro-averaged accuracy
- Multiple-choice questions: systematic rule-based evaluation with robust regex patterns
- Open questions: direct answer matching, with fallback to random selection for invalid responses

Baselines:
- Random Choice: 23.9% (test set)
- Frequent Choice: 25.8% (test set)
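To make the scoring rules above concrete, here is a minimal Python sketch of the multiple-choice path: rule-based extraction of the predicted option letter from a free-form model response, with random selection as the fallback for unparseable responses. This is an illustration only, not the official evaluation code; the regex pattern and the record fields (`response`, `options`, `answer`) are assumptions for demonstration.

```python
import random
import re


def extract_choice(response: str, options: list[str]) -> str:
    """Rule-based extraction of a predicted option letter from a model response.

    Illustrative sketch only: the official harness uses its own, more robust
    regex patterns. Falls back to a random letter when nothing can be parsed,
    mirroring the fallback-to-random-selection rule described above.
    """
    letters = [chr(ord("A") + i) for i in range(len(options))]

    # Pattern 1: an explicit standalone letter, e.g. "(B)", "B.", or "Answer: B".
    match = re.search(r"\b([A-J])\b", response)
    if match and match.group(1) in letters:
        return match.group(1)

    # Pattern 2: the response repeats one option's text verbatim.
    for letter, option in zip(letters, options):
        if option.strip().lower() in response.lower():
            return letter

    # Invalid / unparseable response: random selection.
    return random.choice(letters)


# Hypothetical usage on one multiple-choice record:
record = {
    "question": "Which structure is highlighted in the image?",
    "options": ["Left atrium", "Right ventricle", "Aorta", "Pulmonary vein"],
    "answer": "C",
    "response": "The highlighted vessel is the aorta, so the answer is (C).",
}
predicted = extract_choice(record["response"], record["options"])
print(predicted == record["answer"])  # True for this example
```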
Key Findings

1. Significant Challenge: even advanced models like GPT-4V achieve only 55.7% accuracy, indicating substantial room for improvement
2. Performance Gap: large disparity between open-source models (~34%) and proprietary models (~56%)
3. OCR/Caption Limitations: text-only LLMs with OCR or caption augmentation show minimal improvement
4. Domain Variation: higher performance in Art & Design and Humanities; lower in Science, Medicine, and Engineering
5. Image Type Dependency: models perform better on common image types (photos, paintings) but struggle with specialized formats (geometric shapes, music sheets, chemical structures)

Significance for AGI Research

MMMU serves as a critical benchmark for measuring progress toward Expert AGI, as defined by Morris et al.'s taxonomy. Strong performance on MMMU should be considered necessary (though not sufficient) for any system claiming expert-level artificial general intelligence, as it tests:
- Breadth: coverage across multiple academic disciplines
- Depth: college-level expertise and reasoning
- Multimodal Integration: joint understanding of text and diverse visual content
- Expert-level Skills: domain-specific knowledge application

Data Collection and Quality

The benchmark was created through a rigorous process:
- Manual Curation: 50+ college students collected questions from textbooks and online resources
- Quality Control: multi-stage review process with duplicate detection and format standardization
- Difficulty Assessment: questions categorized into four difficulty levels, with ~10% excluded as too easy
- Copyright Compliance: strict adherence to licensing regulations
- Contamination Prevention: selection of questions with non-obvious answers to reduce data-leakage risks

MMMU represents a significant step forward in evaluating multimodal AI systems, providing a comprehensive and challenging benchmark that pushes the boundaries of what current models can achieve.
Methodology
MMMU evaluates model performance using a standardized scoring methodology: micro-averaged accuracy reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and evaluation protocol, refer to the original paper.
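As a sketch of how the headline number is produced, the snippet below pools per-question correctness across every subject (micro-averaging, consistent with the metric stated above) and expresses the result on the leaderboard's 0-100 scale; the `(subject, is_correct)` record format is an assumption for illustration, not the official harness.

```python
from collections import defaultdict


def leaderboard_score(results: list[tuple[str, bool]]) -> float:
    """Micro-averaged accuracy on a 0-100 scale.

    `results` holds one (subject, is_correct) pair per question. Micro-averaging
    pools all questions before dividing, so subjects with more questions
    (e.g. Tech & Engineering) weigh proportionally more than smaller ones.
    """
    correct = sum(1 for _, ok in results if ok)
    return 100.0 * correct / len(results)


def per_subject_accuracy(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-subject accuracies, useful for the domain-variation analysis above."""
    total, hits = defaultdict(int), defaultdict(int)
    for subject, ok in results:
        total[subject] += 1
        hits[subject] += int(ok)
    return {s: 100.0 * hits[s] / total[s] for s in total}


# Hypothetical toy usage:
results = [("Math", True), ("Math", False), ("Art", True), ("Art", True)]
print(leaderboard_score(results))     # 75.0 on the 0-100 scale
print(per_subject_accuracy(results))  # {'Math': 50.0, 'Art': 100.0}
```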
Publication
This benchmark was published in 2023. Technical paper: "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI".