EgoSchema
EgoSchema is a very long-form video question-answering dataset and benchmark for evaluating the long-video understanding capabilities of vision-and-language systems. Derived from Ego4D, it consists of over 5,000 human-curated multiple-choice question-answer pairs spanning more than 250 hours of real video covering a broad range of natural human activity and behavior. Each question requires selecting the correct answer from five options based on a three-minute video clip. EgoSchema introduces temporal certificate sets, a measure of the intrinsic temporal length of video needed to answer a question; by this measure its questions are 5.7x longer than those of the second-closest dataset and 10-100x longer than those of other video understanding datasets. Current state-of-the-art video-language models achieve less than 33% accuracy (random chance is 20%), while humans reach about 76%, highlighting a significant gap in long-term video understanding.
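For concreteness, here is a minimal sketch of what evaluating a model on one EgoSchema item might look like. The field names (`clip_path`, `question`, `options`, `answer_idx`) and the `predict_answer` model call are hypothetical placeholders, not the official data schema or API.

```python
# Hypothetical sketch of 5-way multiple-choice evaluation on one item.
# Field names and the model interface are illustrative assumptions,
# not the official EgoSchema schema or evaluation harness.

from dataclasses import dataclass

@dataclass
class EgoSchemaItem:
    clip_path: str        # path to the ~3-minute video clip
    question: str         # human-curated question about the clip
    options: list[str]    # exactly 5 candidate answers
    answer_idx: int       # index of the correct option (held out at test time)

def evaluate_item(model, item: EgoSchemaItem) -> bool:
    """Ask the model to pick one of five options; return True if correct."""
    pred_idx = model.predict_answer(item.clip_path, item.question, item.options)
    return pred_idx == item.answer_idx
```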
EgoSchema Leaderboard
Rank | Model | Provider | Score | Parameters | Released | Type
---|---|---|---|---|---|---
1 | Gemini 2.0 Pro | Google | 71.9 | | 2025-02-05 | Multimodal
2 | Gemini 2.0 Flash | Google | 71.1 | | 2025-02-25 | Multimodal
3 | Gemini 2.0 Flash-Lite | Google | 67.2 | | 2025-02-25 | Multimodal
About EgoSchema
Methodology
EgoSchema scores models on a 0 to 100 scale, corresponding to the percentage of multiple-choice questions answered correctly; higher scores indicate better performance, and random chance is 20. For detailed information about the methodology, please refer to the original paper.
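As a minimal sketch, the reported score is plain multiple-choice accuracy rescaled to 0-100. The list-of-indices input format below is an assumption about how predictions and labels might be stored, not the official evaluation code.

```python
def egoschema_score(predictions: list[int], labels: list[int]) -> float:
    """Multiple-choice accuracy on a 0-100 scale (random chance is 20.0
    with five options). The input format is an illustrative assumption."""
    assert len(predictions) == len(labels), "one prediction per question"
    correct = sum(p == l for p, l in zip(predictions, labels))
    return 100.0 * correct / len(labels)

# Example: 3 of 4 questions answered correctly -> 75.0
print(egoschema_score([0, 3, 2, 1], [0, 3, 2, 4]))
```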
Publication
This benchmark was published in 2023. Read the full paper: "EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding" (Mangalam et al., NeurIPS 2023).