Generative Frame Sampler for Long Video Understanding

Linli Yao; Haoning Wu; Kun Ouyang; Yuanxing Zhang; Caiming Xiong; Bei Chen; Xu Sun; Junnan Li

doi:10.18653/v1/2025.findings-acl.921

Abstract

Despite recent advances in Video Large Language Models (VideoLLMs), effectively understanding long-form videos remains a significant challenge. Perceiving lengthy videos containing thousands of frames poses a substantial computational burden. To mitigate this issue, this paper introduces the Generative Frame Sampler (GenS), a plug-and-play module integrated with VideoLLMs to facilitate efficient perception of lengthy videos. Built upon a lightweight VideoLLM, GenS leverages its inherent vision-language capabilities to identify question-relevant frames. To facilitate effective retrieval, we construct GenS-Video-150K, a large-scale video instruction dataset with dense frame relevance annotations. Extensive experiments demonstrate that GenS consistently boosts the performance of various VideoLLMs, including open-source models (Qwen2-VL-7B, Aria-25B, LLaVA-Video-7B/72B) and proprietary assistants (GPT-4o, Gemini). When equipped with GenS, open-source VideoLLMs achieve state-of-the-art results on long-form video benchmarks: LLaVA-Video-72B reaches 66.8 (+4.3) on LongVideoBench and 77.0 (+2.7) on MLVU, while Aria obtains 39.2 on HourVideo, surpassing Gemini-1.5-pro by 1.9 points.

Anthology ID: 2025.findings-acl.921
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 17900–17917
URL: https://aclanthology.org/2025.findings-acl.921/
DOI: 10.18653/v1/2025.findings-acl.921
Cite (ACL): Linli Yao, Haoning Wu, Kun Ouyang, Yuanxing Zhang, Caiming Xiong, Bei Chen, Xu Sun, and Junnan Li. 2025. Generative Frame Sampler for Long Video Understanding. In Findings of the Association for Computational Linguistics: ACL 2025, pages 17900–17917, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Generative Frame Sampler for Long Video Understanding (Yao et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.921.pdf
