LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding.

ZhaoYang Han; Qihan Lin; Hao Liang; Bowen Chen; Zhou Liu; Wentao Zhang

Abstract

We introduce LongInsightBench, the first benchmark designed to assess models’ ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating visual, audio, and text modalities. Our benchmark excels in three key areas: a) Long-Duration, Human-Centric Videos: We carefully selected approximately 1,000 videos from the open-source dataset FineVideo based on duration limits and multi-modal information density, focusing on content such as lectures, interviews, and vlogs, which contain rich human-centric semantic and contextual attributes. b) Diverse and Challenging Task Scenarios: We designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. c) Rigorous and Comprehensive Quality Assurance Pipelines: We developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments, which show that omni-modal models (OLMs) still face challenges in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Surprisingly, extended experiments reveal information loss in the modal fusion of OLMs, which we call the Fusion Deficit Paradox.

Anthology ID: 2026.findings-acl.965
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 19332–19358
URL: https://aclanthology.org/2026.findings-acl.965/
Cite (ACL): ZhaoYang Han, Qihan Lin, Hao Liang, Bowen Chen, Zhou Liu, and Wentao Zhang. 2026. LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19332–19358, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding. (Han et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.965.pdf
Checklist: 2026.findings-acl.965.checklist.pdf
