Video-MME

Video-MME is the first comprehensive Multi-Modal Evaluation benchmark for assessing Multi-modal Large Language Models (MLLMs) in video analysis. It features four key distinctions: 1) Diversity across 6 primary visual domains with 30 subfields; 2) Duration spanning short-, medium-, and long-term videos ranging from 11 seconds to 1 hour; 3) Breadth in data modalities, including video frames, subtitles, and audio; and 4) Quality in annotations, all produced by expert annotators. The benchmark comprises 900 manually selected videos totaling 254 hours, with 2,700 question-answer pairs. Evaluations of state-of-the-art MLLMs revealed Gemini 1.5 Pro as the best-performing commercial model, significantly outperforming open-source alternatives, while highlighting the need for improvements in handling longer sequences and multi-modal data.

Published: 2024
Score Range: 0-100
Top Score: 84.8
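
To make the dataset's composition concrete, the sketch below models a single benchmark item as a plain Python record. The field names are illustrative assumptions, not the official schema; the actual release pairs each video with subtitle files and multiple-choice question-answer annotations.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VideoMMEItem:
    """One question-answer pair; field names are illustrative, not the official schema."""
    video_id: str         # one of the 900 manually selected videos
    domain: str           # one of the 6 primary visual domains
    subfield: str         # one of the 30 subfields
    duration: str         # "short", "medium", or "long" (11 seconds to 1 hour overall)
    question: str
    options: List[str]    # multiple-choice options, e.g. ["A. ...", "B. ...", "C. ...", "D. ..."]
    answer: str           # ground-truth option letter, e.g. "B"
    subtitles: str = ""   # subtitle track text, used in the "with subtitles" setting
```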

Video-MME Leaderboard

Rank | Model          | Provider | Score | Parameters | Released   | Type
1    | Gemini 2.5 Pro | Google   | 84.8  | -          | 2025-05-06 | Multimodal

About Video-MME

Methodology

Video-MME scores models by accuracy on its multiple-choice question-answer pairs. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For full details of the evaluation protocol, including the settings with and without subtitles, please refer to the original paper.
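
How such a score might be computed is sketched below. This is a minimal accuracy calculation under the assumption of single-letter multiple-choice answers keyed by question ID; it is not the official evaluation script, and the function name is hypothetical.

```python
from typing import Dict

def videomme_score(predictions: Dict[str, str], ground_truth: Dict[str, str]) -> float:
    """Return a 0-100 score as multiple-choice accuracy over all items.

    A minimal sketch: assumes one ground-truth option letter per question ID
    and counts a prediction as correct when its first letter matches.
    """
    correct = sum(
        1
        for qid, gold in ground_truth.items()
        if predictions.get(qid, "").strip().upper()[:1] == gold.strip().upper()[:1]
    )
    return 100.0 * correct / len(ground_truth)

# Example: 2 of 3 answers correct yields a score of 66.67.
preds = {"q1": "A", "q2": "C", "q3": "B"}
gold  = {"q1": "A", "q2": "D", "q3": "B"}
print(round(videomme_score(preds, gold), 2))  # 66.67
```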

Publication

This benchmark was published in 2024.
