Retrieval-augmented Video Encoding for Instructional Captioning

Yeonjoon Jung, Minsoo Kim, Seungtaek Choi, Jihyuk Kim, Minji Seo, Seung-won Hwang


Abstract
Instructional videos make learning more efficient by providing a detailed multimodal context for each procedure in an instruction. A unique challenge posed by instructional videos is key-object degeneracy, where any single modality fails to sufficiently capture the key objects referred to in the procedure. For machine systems, such degeneracy can degrade performance on downstream tasks such as dense video captioning, leading to incorrect captions that omit key objects. To repair this degeneracy, we propose a retrieval-based framework that augments the model representations when key-object degeneracy is present. We validate the effectiveness and generalizability of our proposed framework over baselines using modalities with key-object degeneracy.
Anthology ID:
2023.findings-acl.543
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8554–8568
URL:
https://aclanthology.org/2023.findings-acl.543
DOI:
10.18653/v1/2023.findings-acl.543
Cite (ACL):
Yeonjoon Jung, Minsoo Kim, Seungtaek Choi, Jihyuk Kim, Minji Seo, and Seung-won Hwang. 2023. Retrieval-augmented Video Encoding for Instructional Captioning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8554–8568, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Retrieval-augmented Video Encoding for Instructional Captioning (Jung et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.543.pdf