EgoSchema


EgoSchema is a very long-form video question-answering dataset and benchmark for evaluating long video understanding capabilities of vision and language systems. Derived from Ego4D, it consists of over 5000 human-curated multiple-choice question-answer pairs spanning more than 250 hours of real video data covering a broad range of natural human activity and behavior. Each question requires selecting the correct answer from five options based on a three-minute-long video clip. EgoSchema introduces temporal certificate sets to measure intrinsic temporal understanding length, showing 5.7x longer temporal lengths than the second closest dataset and 10-100x longer than other video understanding datasets. Current state-of-the-art video and language models achieve less than 33% accuracy (random is 20%) while humans achieve about 76%, highlighting significant gaps in long-term video understanding capabilities.
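The evaluation protocol described above (pick one of five options per three-minute clip) can be sketched as a minimal data record and correctness check. The field names below are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class EgoSchemaItem:
    # Illustrative fields; the actual dataset schema may differ.
    video_id: str        # identifier of a three-minute Ego4D-derived clip
    question: str        # human-curated question about the clip
    options: list[str]   # exactly five candidate answers
    answer_index: int    # index (0-4) of the correct option

def is_correct(item: EgoSchemaItem, predicted_index: int) -> bool:
    """Check a model's chosen option against the ground truth."""
    return predicted_index == item.answer_index

# Hypothetical example item.
item = EgoSchemaItem(
    video_id="clip_0001",
    question="What is the primary activity in the video?",
    options=["cooking", "cleaning", "gardening", "repairing", "shopping"],
    answer_index=0,
)
print(is_correct(item, 0))  # True
```

Because each item has exactly five options, uniformly random guessing yields the 20% chance baseline mentioned above.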

Published: 2023
Score Range: 0-100
Top Score: 74.5

EgoSchema Leaderboard

| Rank | Model | Provider | Score | Parameters | Released | Type |
|------|-------|----------|-------|------------|----------|------|
| 1 | Grok 3 | xAI | 74.5 | Unknown (multi-trillion estimated) | 2025-02-19 | Multimodal |
| 2 | Grok 3 Mini | xAI | 74.3 | Unknown | 2025-02-19 | Multimodal |
| 3 | Gemini 2.0 Pro | Google | 71.9 | — | 2025-02-05 | Multimodal |
| 4 | Gemini 2.0 Flash | Google | 71.1 | — | 2025-02-25 | Multimodal |
| 5 | Gemini 2.0 Flash-Lite | Google | 67.2 | — | 2025-02-25 | Multimodal |

About EgoSchema

Methodology

EgoSchema evaluates model performance using a standardized scoring methodology. Scores are reported on a scale of 0 to 100, where higher scores indicate better performance. For detailed information about the scoring system and methodology, please refer to the original paper.
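Given the multiple-choice format, the 0-100 score is naturally computed as accuracy scaled to percentage points. A minimal sketch of that computation, assuming predictions and ground-truth answers are given as option indices:

```python
def egoschema_score(predictions: list[int], answers: list[int]) -> float:
    """Accuracy on five-way multiple choice, scaled to a 0-100 score.

    `predictions` and `answers` are parallel lists of option indices (0-4).
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# Three of four correct -> a score of 75.0.
print(egoschema_score([0, 1, 2, 3], [0, 1, 2, 0]))  # 75.0
```

Under this scoring, random guessing over five options converges to a score of about 20, matching the chance baseline cited in the description.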

Publication

This benchmark was published in 2023. Technical Paper

Related Benchmarks