BibTeX
@inproceedings{pirhadi-etal-2025-cvt5,
    title = "{CVT}5: Using Compressed Video Encoder and {UMT}5 for Dense Video Captioning",
    author = "Pirhadi, Mohammad Javad and
      Mirzaei, Motahhare and
      Eetemadi, Sauleh",
    editor = "Zhang, Wei Emma and
      Dai, Xiang and
      Elliott, Desmond and
      Fang, Byron and
      Sim, Mongyuan and
      Zhuang, Haojie and
      Chen, Weitong",
    booktitle = "Proceedings of the First Workshop of Evaluation of Multi-Modal Generation",
    month = jan,
    year = "2025",
    address = "Abu Dhabi, UAE",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.evalmg-1.2/",
    pages = "10--23",
    abstract = "The dense video captioning task aims to detect all events occurring in a video and describe each event using natural language. Unlike most other video processing tasks, where it is typically assumed that videos contain only a single main event, this task deals with long, untrimmed videos. Consequently, the speed of processing videos in dense video captioning is a critical aspect of the system. To the best of our knowledge, all published work on this task uses RGB frames to encode input videos. In this work, we introduce the use of compressed videos for the first time in this task. Our experiments on the SoccerNet challenge demonstrate significant improvements in both processing speed and GPU memory footprint while achieving competitive results. Additionally, we leverage multilingual transcripts, which seem to be effective. The encoder in our proposed method achieves approximately 5.4{\texttimes} higher speed and 5.1{\texttimes} lower GPU memory usage during training, and 4.7{\texttimes} higher speed and 7.8{\texttimes} lower GPU memory usage during inference, compared to its RGB-based counterpart. The code is publicly available at https://github.com/mohammadjavadpirhadi/CVT5."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="pirhadi-etal-2025-cvt5">
<titleInfo>
<title>CVT5: Using Compressed Video Encoder and UMT5 for Dense Video Captioning</title>
</titleInfo>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Javad</namePart>
<namePart type="family">Pirhadi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Motahhare</namePart>
<namePart type="family">Mirzaei</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sauleh</namePart>
<namePart type="family">Eetemadi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-01</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the First Workshop of Evaluation of Multi-Modal Generation</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wei</namePart>
<namePart type="given">Emma</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xiang</namePart>
<namePart type="family">Dai</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Desmond</namePart>
<namePart type="family">Elliot</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Byron</namePart>
<namePart type="family">Fang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mongyuan</namePart>
<namePart type="family">Sim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Haojie</namePart>
<namePart type="family">Zhuang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Weitong</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Abu Dhabi, UAE</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>The dense video captioning task aims to detect all events occurring in a video and describe each event using natural language. Unlike most other video processing tasks, where it is typically assumed that videos contain only a single main event, this task deals with long, untrimmed videos. Consequently, the speed of processing videos in dense video captioning is a critical aspect of the system. To the best of our knowledge, all published work on this task uses RGB frames to encode input videos. In this work, we introduce the use of compressed videos for the first time in this task. Our experiments on the SoccerNet challenge demonstrate significant improvements in both processing speed and GPU memory footprint while achieving competitive results. Additionally, we leverage multilingual transcripts, which seems to be effective. The encoder in our proposed method achieves approximately 5.4× higher speed and 5.1× lower GPU memory usage during training, and 4.7× higher speed and 7.8× lower GPU memory usage during inference, compared to its RGB-based counterpart. The code is publicly available at https://github.com/mohammadjavadpirhadi/CVT5.</abstract>
<identifier type="citekey">pirhadi-etal-2025-cvt5</identifier>
<location>
<url>https://aclanthology.org/2025.evalmg-1.2/</url>
</location>
<part>
<date>2025-01</date>
<extent unit="page">
<start>10</start>
<end>23</end>
</extent>
</part>
</mods>
</modsCollection>
Endnote
%0 Conference Proceedings
%T CVT5: Using Compressed Video Encoder and UMT5 for Dense Video Captioning
%A Pirhadi, Mohammad Javad
%A Mirzaei, Motahhare
%A Eetemadi, Sauleh
%Y Zhang, Wei Emma
%Y Dai, Xiang
%Y Elliott, Desmond
%Y Fang, Byron
%Y Sim, Mongyuan
%Y Zhuang, Haojie
%Y Chen, Weitong
%S Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
%D 2025
%8 January
%I Association for Computational Linguistics
%C Abu Dhabi, UAE
%F pirhadi-etal-2025-cvt5
%X The dense video captioning task aims to detect all events occurring in a video and describe each event using natural language. Unlike most other video processing tasks, where it is typically assumed that videos contain only a single main event, this task deals with long, untrimmed videos. Consequently, the speed of processing videos in dense video captioning is a critical aspect of the system. To the best of our knowledge, all published work on this task uses RGB frames to encode input videos. In this work, we introduce the use of compressed videos for the first time in this task. Our experiments on the SoccerNet challenge demonstrate significant improvements in both processing speed and GPU memory footprint while achieving competitive results. Additionally, we leverage multilingual transcripts, which seem to be effective. The encoder in our proposed method achieves approximately 5.4× higher speed and 5.1× lower GPU memory usage during training, and 4.7× higher speed and 7.8× lower GPU memory usage during inference, compared to its RGB-based counterpart. The code is publicly available at https://github.com/mohammadjavadpirhadi/CVT5.
%U https://aclanthology.org/2025.evalmg-1.2/
%P 10-23
Markdown (Informal)
[CVT5: Using Compressed Video Encoder and UMT5 for Dense Video Captioning](https://aclanthology.org/2025.evalmg-1.2/) (Pirhadi et al., EvalMG 2025)
ACL
Mohammad Javad Pirhadi, Motahhare Mirzaei, and Sauleh Eetemadi. 2025. CVT5: Using Compressed Video Encoder and UMT5 for Dense Video Captioning. In Proceedings of the First Workshop of Evaluation of Multi-Modal Generation, pages 10–23, Abu Dhabi, UAE. Association for Computational Linguistics.