Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning

MinJu Jeon; Si-Woo Kim; Ye-Chan Kim; HyunGee Kim; Dong-Jin Kim

doi:10.18653/v1/2025.emnlp-main.1308

Abstract

Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose **Sali4Vid**, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning.
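The two components named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names `saliency_weights` and `segment_by_similarity`, the `sharpness` and `threshold` parameters, the product-of-two-sigmoids weight shape, and the use of cosine similarity between consecutive frame features. It should not be read as the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def saliency_weights(num_frames, events, sharpness=1.0):
    """Turn (start, end) frame-index annotations into soft per-frame weights.

    Assumed form: a rising sigmoid at the event start multiplied by a
    falling sigmoid at the event end, with the maximum taken over events.
    """
    t = np.arange(num_frames, dtype=float)
    weights = np.zeros(num_frames)
    for start, end in events:
        w = sigmoid(sharpness * (t - start)) * sigmoid(sharpness * (end - t))
        weights = np.maximum(weights, w)
    return weights


def segment_by_similarity(features, threshold=0.5):
    """Split a video wherever consecutive frame features look dissimilar.

    Assumed rule: cut between frames i and i+1 when their cosine
    similarity falls below `threshold`. Returns (start, end) index
    pairs with `end` exclusive.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)
    sims = np.sum(unit[:-1] * unit[1:], axis=1)
    cuts = np.flatnonzero(sims < threshold) + 1
    bounds = [0, *cuts.tolist(), len(features)]
    return list(zip(bounds[:-1], bounds[1:]))


# Hypothetical usage: one annotated event and four toy 2-D frame features.
weights = saliency_weights(10, [(3, 6)], sharpness=4.0)
frames = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
segments = segment_by_similarity(frames)
```

In this sketch, frames inside an annotated event receive weights near 1 and frames far outside receive weights near 0, while the segmentation yields variable-length chunks that follow feature changes rather than a fixed chunk size.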

Anthology ID: 2025.emnlp-main.1308
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 25777–25790
URL: https://aclanthology.org/2025.emnlp-main.1308/
DOI: 10.18653/v1/2025.emnlp-main.1308
Cite (ACL): MinJu Jeon, Si-Woo Kim, Ye-Chan Kim, HyunGee Kim, and Dong-Jin Kim. 2025. Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25777–25790, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning (Jeon et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1308.pdf
Checklist: 2025.emnlp-main.1308.checklist.pdf
