Video-MME
Video-MME Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type
---|---|---|---|---|---|---
1 | Gemini 2.5 Pro | Google | 84.8 | N/A | 2025-05-06 | Multimodal
About Video-MME
Description
Video-MME is the first comprehensive Multi-Modal Evaluation benchmark for assessing Multi-modal Large Language Models (MLLMs) in video analysis. It features four key distinctions: 1) Diversity, covering 6 primary visual domains with 30 subfields; 2) Duration, spanning short, medium, and long videos ranging from 11 seconds to 1 hour; 3) Breadth in data modalities, including video frames, subtitles, and audio; and 4) Quality, with all annotations produced manually by expert annotators. The benchmark comprises 900 manually selected videos totaling 254 hours, paired with 2,700 question-answer pairs. Evaluations of state-of-the-art MLLMs found Gemini 1.5 Pro to be the best-performing commercial model, significantly outperforming open-source alternatives, while highlighting the need for improvements in handling longer sequences and multi-modal data.
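The structure described above maps naturally onto a per-question record that pairs a video, its domain and duration split, and one annotated question. The sketch below is a minimal illustration of such a record; the field names (`video_id`, `duration_category`, `options`, and so on) are assumptions for this sketch and do not reflect the official release format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical layout of one Video-MME question-answer pair.
# Field names are illustrative, not the official schema.
@dataclass
class VideoMMEQuestion:
    video_id: str           # identifier of the source video
    domain: str             # one of the 6 primary visual domains
    subfield: str           # one of the 30 subfields
    duration_category: str  # "short", "medium", or "long"
    question: str           # question text about the video
    options: List[str]      # candidate answers, e.g. ["A. ...", "B. ...", ...]
    answer: str             # ground-truth option letter, e.g. "C"


def by_duration(questions: List[VideoMMEQuestion], category: str) -> List[VideoMMEQuestion]:
    """Select the subset of questions for one duration split (short/medium/long)."""
    return [q for q in questions if q.duration_category == category]
```

Splitting by `duration_category` mirrors how results are typically broken down across the short, medium, and long subsets in addition to the overall score.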
Methodology
Video-MME scores models by accuracy on its multiple-choice questions, reported on a 0 to 100 scale; higher scores indicate better performance. For detailed information about the methodology, please refer to the original paper.
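As a concrete illustration of that scale, the snippet below sketches how an accuracy-style score out of 100 could be computed from option-letter predictions. The function name `videomme_score` and its inputs are assumptions made for this sketch, not the official evaluation code.

```python
from typing import Dict


def videomme_score(predictions: Dict[str, str], ground_truth: Dict[str, str]) -> float:
    """Return accuracy on a 0-100 scale.

    Both arguments map a question id to an option letter (e.g. "A"-"D").
    Questions missing from `predictions` count as incorrect.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        1 for qid, gold in ground_truth.items()
        if predictions.get(qid, "").strip().upper() == gold.strip().upper()
    )
    return 100.0 * correct / len(ground_truth)


# Example: 2 of 3 answers match the gold labels -> 66.7
print(round(videomme_score(
    {"q1": "A", "q2": "C", "q3": "B"},
    {"q1": "A", "q2": "C", "q3": "D"},
), 1))
```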
Publication
This benchmark was published in 2024. Read the full paper for complete details.