Hierarchical Context-aware Network for Dense Video Event Captioning

Lei Ji; Xianglin Guo; Haoyang Huang; Xilin Chen

doi:10.18653/v1/2021.acl-long.156

Abstract

Dense video event captioning aims to generate a sequence of descriptive captions, one for each event in a long untrimmed video. Video-level context provides important information and helps the model generate consistent, less redundant captions across events. In this paper, we introduce a novel Hierarchical Context-aware Network for dense video event captioning (HCN) that captures context at several levels. Specifically, the model leverages local and global context through different mechanisms and learns them jointly to generate coherent captions. The local context module performs full interaction between neighboring frames, while the global context module selectively attends to previous or future events. In extensive experiments on both the YouCook2 and ActivityNet Captioning datasets, the video-level HCN model outperforms the event-level context-agnostic model by a large margin. The code is available at https://github.com/KirkGuo/HCN.
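The distinction between the two context modules can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it only assumes that "full interaction" can be approximated by every frame attending to every frame within an event, and that "selectively attends" can be approximated by one event's pooled feature attending over the pooled features of the other events. All function names (`attend`, `local_context`, `global_context`, `mean_pool`) and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention of one query over key/value vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return mixed, weights


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def local_context(frames):
    """Assumed local module: each frame attends to all frames of its event."""
    return [attend(f, frames, frames)[0] for f in frames]


def global_context(events, index):
    """Assumed global module: event `index` attends to the other events only.

    Requires at least two events, since the current event is excluded.
    """
    query = mean_pool(events[index])
    others = [mean_pool(e) for j, e in enumerate(events) if j != index]
    return attend(query, others, others)


# Toy video: three events, two frames each, two-dimensional frame features.
events = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
    [[1.0, 0.1], [0.8, 0.0]],
]
local = local_context(events[0])
context, weights = global_context(events, 0)
```

In this toy setup the first event places more attention weight on the third event than on the second, because their pooled features are more similar. That is the sense in which a global module can be selective about which surrounding events it draws context from.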

Anthology ID: 2021.acl-long.156
Volume: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month: August
Year: 2021
Address: Online
Editors: Chengqing Zong, Fei Xia, Wenjie Li, Roberto Navigli
Venues: ACL | IJCNLP
Publisher: Association for Computational Linguistics
Pages: 2004–2013
URL: https://aclanthology.org/2021.acl-long.156/
DOI: 10.18653/v1/2021.acl-long.156
Cite (ACL): Lei Ji, Xianglin Guo, Haoyang Huang, and Xilin Chen. 2021. Hierarchical Context-aware Network for Dense Video Event Captioning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2004–2013, Online. Association for Computational Linguistics.
Cite (Informal): Hierarchical Context-aware Network for Dense Video Event Captioning (Ji et al., ACL-IJCNLP 2021)
PDF: https://aclanthology.org/2021.acl-long.156.pdf
Video: https://aclanthology.org/2021.acl-long.156.mp4
