EgoSchema

multimodalVerified

EgoSchema is a very long-form video question-answering dataset and benchmark for evaluating long video understanding capabilities of vision and language systems. Derived from Ego4D, it consists of over 5000 human-curated multiple-choice question-answer pairs spanning more than 250 hours of real video data covering a broad range of natural human activity and behavior. Each question requires selecting the correct answer from five options based on a three-minute-long video clip. EgoSchema introduces temporal certificate sets to measure intrinsic temporal understanding length, showing 5.7x longer temporal lengths than the second closest dataset and 10-100x longer than other video understanding datasets. Current state-of-the-art video and language models achieve less than 33% accuracy (random is 20%) while humans achieve about 76%, highlighting significant gaps in long-term video understanding capabilities.

Published: 2023

Score Range: 0-100

Top Score: 74.5

Technical Paper

EgoSchema Leaderboard

Rank	Model	Provider	Score	Parameters	Released	Type
1	Grok 3	xAI	74.5	Unknown (multi-trillion estimated)	2025-02-19	Multimodal
2	Grok 3 Mini	xAI	74.3	Unknown	2025-02-19	Multimodal
3	Gemini 2.0 Pro	Google	71.9		2025-02-05	Multimodal
4	Gemini 2.0 Flash	Google	71.1		2025-02-25	Multimodal
5	Gemini 2.0 Flash-Lite	Google	67.2		2025-02-25	Multimodal

About EgoSchema

Methodology

EgoSchema evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.

Publication

This benchmark was published in 2023.Technical Paper

Related Benchmarks

MathVista

multimodal

Evaluates mathematical reasoning in visual contexts, combining vision and mathematical problem-solving.

Published2023

Scale0-100

Technical Paper View Details

MMMU

multimodal

A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI with 11.5K college-level questions across 6 disciplines and 30 subjects.

Published2023

Scale0-100

Technical Paper View Details

Video-MME

multimodal

Video-MME is the first comprehensive Multi-Modal Evaluation benchmark for assessing Multi-modal Large Language Models (MLLMs) in video analysis. It features four key distinctions: 1) Diversity across 6 primary visual domains with 30 subfields; 2) Duration spanning short-, medium-, and long-term videos ranging from 11 seconds to 1 hour; 3) Breadth in data modalities including video frames, subtitles, and audio; and 4) Quality annotations from expert annotators. The benchmark comprises 900 manually selected videos totaling 254 hours, with 2,700 question-answer pairs. Evaluations of state-of-the-art MLLMs revealed Gemini 1.5 Pro as the best-performing commercial model, significantly outperforming open-source alternatives, while highlighting the need for improvements in handling longer sequences and multi-modal data.

Published2024

Scale0-100

Technical Paper View Details