Abstractive Multi-Video Captioning: Benchmark Dataset Construction and Extensive Evaluation

Rikito Takahashi; Hirokazu Kiyomaru; Chenhui Chu; Sadao Kurohashi

Abstract

This paper introduces a new task, abstractive multi-video captioning, which focuses on abstracting multiple videos with natural language. Unlike conventional video captioning tasks, which generate a specific caption for a single video, our task generates an abstract caption describing the content shared by a group of videos. To address this task, models must understand each video in detail and have strong abstraction abilities to find commonalities among videos. We construct a benchmark dataset for abstractive multi-video captioning named AbstrActs, which contains 13.5k video groups and corresponding abstract captions. As abstractive multi-video captioning models, we explore two approaches: end-to-end and cascade. For evaluation, we propose a new metric, CocoA, which evaluates model performance based on the abstractness of the generated captions. In experiments, we report the impact of the method for combining multiple video features, the overall model architecture, and the number of input videos.

Anthology ID: 2024.lrec-main.5
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 57–69
URL: https://aclanthology.org/2024.lrec-main.5/
Cite (ACL): Rikito Takahashi, Hirokazu Kiyomaru, Chenhui Chu, and Sadao Kurohashi. 2024. Abstractive Multi-Video Captioning: Benchmark Dataset Construction and Extensive Evaluation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 57–69, Torino, Italia. ELRA and ICCL.
Cite (Informal): Abstractive Multi-Video Captioning: Benchmark Dataset Construction and Extensive Evaluation (Takahashi et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.5.pdf
