InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows

Kirolos Ataallah; Eslam Mohamed Bakr; Mahmoud Ahmed; Chenhui Gou; Khushbu Pahwa; Jian Ding; Mohamed Elhoseiny

doi:10.18653/v1/2025.emnlp-main.984

Abstract

Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a significant challenge for multi-modal models. Existing benchmarks often fail to test the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. Therefore, we introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) Over 1,000 hours of video content, with an average video length of 53 minutes. (2) The largest set of question-answer pairs for long video comprehension, totaling around 87.7K. (3) Eight diverse skills, both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context understanding, multi-event linking). (4) Rich annotation formats, including both multiple-choice and open-ended questions. We conducted an in-depth evaluation across both commercial models (GPT-4o, Gemini 2.0 Flash) and the most recent open-source vision-language models (such as Qwen2.5-VL and InternVL3.0). Results reveal that: (1) Models struggle across the board: even the best model, GPT-4o, achieves only 47.1% on grounding-based skills, with most models performing near or just above random chance. (2) Strong reliance on world knowledge: models achieve surprisingly high scores using only metadata (e.g., video titles), highlighting a tendency to rely on pre-trained knowledge rather than actual visual or temporal understanding. (3) Multi-modal importance: when provided with full video and subtitle context, however, models show substantial improvements, confirming the critical role of multimodal input in video understanding. Our findings underscore the inherent challenges in long-video comprehension and point to the need for substantial advancements in both grounding and reasoning capabilities in MLLMs.

Anthology ID: 2025.emnlp-main.984
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 19485–19512
URL: https://aclanthology.org/2025.emnlp-main.984/
DOI: 10.18653/v1/2025.emnlp-main.984
Cite (ACL): Kirolos Ataallah, Eslam Mohamed Bakr, Mahmoud Ahmed, Chenhui Gou, Khushbu Pahwa, Jian Ding, and Mohamed Elhoseiny. 2025. InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19485–19512, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows (Ataallah et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.984.pdf
Checklist: 2025.emnlp-main.984.checklist.pdf
