Video-MME

Multimodal · Verified

Video-MME is the first comprehensive Multi-Modal Evaluation benchmark for assessing Multi-modal Large Language Models (MLLMs) in video analysis. It features four key distinctions: 1) Diversity across 6 primary visual domains with 30 subfields; 2) Duration spanning short-, medium-, and long-term videos ranging from 11 seconds to 1 hour; 3) Breadth in data modalities including video frames, subtitles, and audio; and 4) Quality annotations from expert annotators. The benchmark comprises 900 manually selected videos totaling 254 hours, with 2,700 question-answer pairs. Evaluations of state-of-the-art MLLMs revealed Gemini 1.5 Pro as the best-performing commercial model, significantly outperforming open-source alternatives, while highlighting the need for improvements in handling longer sequences and multi-modal data.

Published: 2024
Scale: 0-100
Top Score: 84.8
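
Each of the benchmark's question-answer pairs described above is a multiple-choice question grounded in a single video (optionally with its subtitles and audio). For concreteness, the sketch below shows one plausible way to represent such a record in Python; the field names and example values are illustrative assumptions for this sketch, not the official dataset schema.

```python
from dataclasses import dataclass
from typing import List

# Illustrative layout for one Video-MME question-answer pair.
# Field names and values are assumptions for this sketch, not the official schema.
@dataclass
class VideoMMEItem:
    video_id: str        # identifier of the source video
    domain: str          # one of the 6 primary visual domains
    subfield: str        # one of the 30 subfields
    duration_class: str  # "short", "medium", or "long"
    question: str        # multiple-choice question text
    options: List[str]   # candidate answers, e.g. ["A. ...", "B. ...", ...]
    answer: str          # ground-truth option letter, e.g. "B"
    subtitles: str = ""  # optional subtitle track, if the setting uses it

# Hypothetical example record
example = VideoMMEItem(
    video_id="demo_0001",
    domain="Knowledge",
    subfield="Humanity & History",
    duration_class="short",
    question="What landmark appears at the start of the video?",
    options=["A. Eiffel Tower", "B. Big Ben", "C. Colosseum", "D. Golden Gate Bridge"],
    answer="B",
)
```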

Video-MME Leaderboard

Rank | Model | Provider | Score | Parameters | Released | Type
1 | Gemini 2.5 Pro | Google | 84.8 | – | 2025-05-06 | Multimodal

About Video-MME

Methodology

Video-MME scores models on a 0 to 100 scale, where higher scores indicate better performance. Each of the 2,700 questions is multiple-choice, so the overall score corresponds to the percentage of questions answered correctly. For detailed information about the methodology, please refer to the original paper.
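
Below is a minimal sketch of that accuracy-style scoring. The `predictions` and `ground_truth` inputs are hypothetical, and the official Video-MME evaluation script may handle answer parsing and per-duration breakdowns differently.

```python
from typing import Dict

def videomme_score(predictions: Dict[str, str], ground_truth: Dict[str, str]) -> float:
    """Percentage of questions answered with the correct option letter (0-100).

    `predictions` and `ground_truth` map a question id to an option letter
    such as "A"-"D". This is a sketch of plain multiple-choice accuracy,
    not the official Video-MME evaluation script.
    """
    if not ground_truth:
        return 0.0
    correct = sum(
        1
        for qid, gold in ground_truth.items()
        if predictions.get(qid, "").strip().upper().startswith(gold.strip().upper())
    )
    return 100.0 * correct / len(ground_truth)

# Hypothetical example: 2 of 3 questions answered correctly -> 66.7
preds = {"q1": "A", "q2": "C", "q3": "B"}
gold = {"q1": "A", "q2": "D", "q3": "B"}
print(round(videomme_score(preds, gold), 1))  # 66.7
```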

Publication

This benchmark was published in 2024. Read the full paper for details.

Related Benchmarks