Dense Procedure Captioning in Narrated Instructional Videos

Botian Shi; Lei Ji; Yaobo Liang; Nan Duan; Peng Chen; Zhendong Niu; Ming Zhou

doi:10.18653/v1/P19-1641

Abstract

Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by video dense captioning, we propose a model to generate procedure captions from narrated instructional videos, each of which is a sequence of step-wise clips with descriptions. Previous work on video dense captioning learns video segments and generates captions without considering transcripts. We argue that transcripts in narrated instructional videos can enhance video representation by providing fine-grained, complementary semantic textual information. In this paper, we introduce a framework to (1) extract procedures with a cross-modality module, which fuses video content with the entire transcript; and (2) generate captions by encoding video frames as well as a snippet of the transcript within each extracted procedure. Experiments show that our model achieves state-of-the-art performance in procedure extraction and captioning, and ablation studies demonstrate that both the video frames and the transcripts are important for the task.

Anthology ID: P19-1641
Volume: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2019
Address: Florence, Italy
Editors: Anna Korhonen, David Traum, Lluís Màrquez
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6382–6391
URL: https://aclanthology.org/P19-1641/
DOI: 10.18653/v1/P19-1641
Cite (ACL): Botian Shi, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu, and Ming Zhou. 2019. Dense Procedure Captioning in Narrated Instructional Videos. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6382–6391, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): Dense Procedure Captioning in Narrated Instructional Videos (Shi et al., ACL 2019)
PDF: https://aclanthology.org/P19-1641.pdf
