@inproceedings{cagliero-etal-2025-detecting,
title = "Detecting and Mitigating Challenges in Zero-Shot Video Summarization with Video {LLM}s",
author = "Cagliero, Luca and
Vaiani, Lorenzo and
Pastor, Eliana and
Koudounas, Alkis and
Baralis, Elena and
Mazzia, Vittorio and
Pollastrini, Sandro and
Gueudre, Thomas and
Giollo, Manuel and
Amberti, Daniele and
Wu, Yue",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.16/",
doi = "10.18653/v1/2025.findings-acl.16",
pages = "286--301",
ISBN = "979-8-89176-256-5",
abstract = "Video summarization aims to generate a condensed textual version of an original video. Summaries may consist of either plain text or a shortlist of salient events, possibly including temporal or spatial references. Video Large Language Models (VLLMs) exhibit impressive zero-shot capabilities in video analysis. However, their performance varies significantly according to the LLM prompt, the characteristics of the video, and the properties of the training data and LLM architecture.In this work, we thoroughly evaluate the zero-shot summarization performance of four state-of-the-art open-source VLLMs specifically designed to address spatial and temporal reasoning. In light of the detected summarization issues, we propose different cost-effective mitigation strategies, based on Chain-of-Thought prompting, that involve the injection of knowledge extracted by external, lightweight models. To perform the VLLM evaluation, we design a new video summarization benchmark consisting of 100 videos with varying characteristics in terms of domain, duration, and spatio-temporal properties. Videos are manually annotated by three independent human experts with plain text, event-based, and spatio-temporal summaries. The experimental evaluation shows that VLLMs significantly benefit from prompting a list of recognized actions, whereas injecting automatically recognized objects and scene changes respectively improve spatially contextualized and event-based summaries in specific cases."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="cagliero-etal-2025-detecting">
<titleInfo>
<title>Detecting and Mitigating Challenges in Zero-Shot Video Summarization with Video LLMs</title>
</titleInfo>
<name type="personal">
<namePart type="given">Luca</namePart>
<namePart type="family">Cagliero</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lorenzo</namePart>
<namePart type="family">Vaiani</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Eliana</namePart>
<namePart type="family">Pastor</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alkis</namePart>
<namePart type="family">Koudounas</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Elena</namePart>
<namePart type="family">Baralis</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Vittorio</namePart>
<namePart type="family">Mazzia</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sandro</namePart>
<namePart type="family">Pollastrini</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Thomas</namePart>
<namePart type="family">Gueudre</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Manuel</namePart>
<namePart type="family">Giollo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Daniele</namePart>
<namePart type="family">Amberti</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yue</namePart>
<namePart type="family">Wu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>Video summarization aims to generate a condensed textual version of an original video. Summaries may consist of either plain text or a shortlist of salient events, possibly including temporal or spatial references. Video Large Language Models (VLLMs) exhibit impressive zero-shot capabilities in video analysis. However, their performance varies significantly according to the LLM prompt, the characteristics of the video, and the properties of the training data and LLM architecture. In this work, we thoroughly evaluate the zero-shot summarization performance of four state-of-the-art open-source VLLMs specifically designed to address spatial and temporal reasoning. In light of the detected summarization issues, we propose different cost-effective mitigation strategies, based on Chain-of-Thought prompting, that involve the injection of knowledge extracted by external, lightweight models. To perform the VLLM evaluation, we design a new video summarization benchmark consisting of 100 videos with varying characteristics in terms of domain, duration, and spatio-temporal properties. Videos are manually annotated by three independent human experts with plain text, event-based, and spatio-temporal summaries. The experimental evaluation shows that VLLMs significantly benefit from prompting a list of recognized actions, whereas injecting automatically recognized objects and scene changes improves spatially contextualized and event-based summaries, respectively, in specific cases.</abstract>
<identifier type="citekey">cagliero-etal-2025-detecting</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.16</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.16/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>286</start>
<end>301</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Detecting and Mitigating Challenges in Zero-Shot Video Summarization with Video LLMs
%A Cagliero, Luca
%A Vaiani, Lorenzo
%A Pastor, Eliana
%A Koudounas, Alkis
%A Baralis, Elena
%A Mazzia, Vittorio
%A Pollastrini, Sandro
%A Gueudre, Thomas
%A Giollo, Manuel
%A Amberti, Daniele
%A Wu, Yue
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F cagliero-etal-2025-detecting
%X Video summarization aims to generate a condensed textual version of an original video. Summaries may consist of either plain text or a shortlist of salient events, possibly including temporal or spatial references. Video Large Language Models (VLLMs) exhibit impressive zero-shot capabilities in video analysis. However, their performance varies significantly according to the LLM prompt, the characteristics of the video, and the properties of the training data and LLM architecture. In this work, we thoroughly evaluate the zero-shot summarization performance of four state-of-the-art open-source VLLMs specifically designed to address spatial and temporal reasoning. In light of the detected summarization issues, we propose different cost-effective mitigation strategies, based on Chain-of-Thought prompting, that involve the injection of knowledge extracted by external, lightweight models. To perform the VLLM evaluation, we design a new video summarization benchmark consisting of 100 videos with varying characteristics in terms of domain, duration, and spatio-temporal properties. Videos are manually annotated by three independent human experts with plain text, event-based, and spatio-temporal summaries. The experimental evaluation shows that VLLMs significantly benefit from prompting a list of recognized actions, whereas injecting automatically recognized objects and scene changes improves spatially contextualized and event-based summaries, respectively, in specific cases.
%R 10.18653/v1/2025.findings-acl.16
%U https://aclanthology.org/2025.findings-acl.16/
%U https://doi.org/10.18653/v1/2025.findings-acl.16
%P 286-301
Markdown (Informal)
[Detecting and Mitigating Challenges in Zero-Shot Video Summarization with Video LLMs](https://aclanthology.org/2025.findings-acl.16/) (Cagliero et al., Findings 2025)
ACL
Luca Cagliero, Lorenzo Vaiani, Eliana Pastor, Alkis Koudounas, Elena Baralis, Vittorio Mazzia, Sandro Pollastrini, Thomas Gueudre, Manuel Giollo, Daniele Amberti, and Yue Wu. 2025. Detecting and Mitigating Challenges in Zero-Shot Video Summarization with Video LLMs. In Findings of the Association for Computational Linguistics: ACL 2025, pages 286–301, Vienna, Austria. Association for Computational Linguistics.
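
The paper itself is the authoritative source for the exact prompts and models. As a rough illustration of the mitigation strategy the abstract describes, the minimal sketch below composes a Chain-of-Thought prompt that injects action labels recognized by an external lightweight model into a zero-shot summarization request. The prompt wording, the example action labels, and the `query_vllm` hook are hypothetical assumptions for illustration, not the authors' actual prompts or code.

```python
# Hypothetical sketch of the knowledge-injection idea from the abstract:
# prepend actions recognized by an external lightweight model to a
# Chain-of-Thought summarization prompt for a video LLM. The prompt text
# and the query_vllm() hook are illustrative assumptions, not the
# authors' implementation.
from typing import Callable, List


def build_cot_prompt(recognized_actions: List[str]) -> str:
    """Compose a zero-shot summarization prompt with injected action labels."""
    actions = "\n".join(f"- {a}" for a in recognized_actions)
    return (
        "You are given a video.\n"
        "An external action-recognition model detected these actions:\n"
        f"{actions}\n"
        "Think step by step: first relate the detected actions to what you "
        "observe in the video, then write a concise textual summary."
    )


def summarize(video_path: str,
              recognized_actions: List[str],
              query_vllm: Callable[[str, str], str]) -> str:
    """query_vllm(video_path, prompt) stands in for a model-specific call."""
    return query_vllm(video_path, build_cot_prompt(recognized_actions))


if __name__ == "__main__":
    # Stub VLLM call so the sketch runs end to end without a real model.
    echo = lambda video, prompt: f"[summary of {video}]\n{prompt}"
    print(summarize("demo.mp4", ["pouring coffee", "opening a laptop"], echo))
```

Per the abstract, injected action lists helped most consistently across the evaluated VLLMs, while injected object labels and scene changes helped spatially contextualized and event-based summaries only in specific cases.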