MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara Berg, Mohit Bansal


Abstract
Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks because it requires not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history to help better predict the next sentence (w.r.t. coreference and repetition aspects), thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets, ActivityNet Captions and YouCookII, show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events.
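The abstract does not give the memory module's update equations, but the recurrent idea it describes (carry a compact memory state across video segments and condition the next sentence on it) can be illustrated with a toy sketch. The sketch below is an assumption-laden simplification, not MART's actual implementation: the slot count, weight names (`Wq`, `Wk`, `Wv`, `Wz`), and gated-update form are hypothetical, chosen to mimic a cross-attention summary of the current segment followed by a gated blend with the previous memory.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 8        # hidden size (toy)
m_slots = 2  # number of memory slots (hypothetical)
T = 5        # hidden states in the current segment

def memory_update(M_prev, H, params):
    """One recurrent memory step: each memory slot attends over the current
    segment's hidden states, then an update gate blends the old memory with
    the new segment summary (a simplified stand-in for MART's memory updater)."""
    Wq, Wk, Wv, Wz = params
    Q = M_prev @ Wq                       # queries from memory   (m_slots, d)
    K = H @ Wk                            # keys from segment     (T, d)
    V = H @ Wv                            # values from segment   (T, d)
    att = softmax(Q @ K.T / np.sqrt(d))   # attention weights     (m_slots, T)
    S = att @ V                           # per-slot segment summary
    z = sigmoid(np.concatenate([M_prev, S], axis=-1) @ Wz)  # update gate
    return z * M_prev + (1.0 - z) * np.tanh(S)

params = (rng.normal(size=(d, d)) * 0.1,
          rng.normal(size=(d, d)) * 0.1,
          rng.normal(size=(d, d)) * 0.1,
          rng.normal(size=(2 * d, d)) * 0.1)

M = np.zeros((m_slots, d))          # memory starts empty
for segment in range(3):            # e.g. three video segments / sentences
    H = rng.normal(size=(T, d))     # hidden states from the transformer encoder
    M = memory_update(M, H, params) # memory is carried across segments
print(M.shape)                      # (2, 8)
```

Because each step is a convex, gated combination of the previous memory and a bounded summary, the memory state stays compact and stable as it recurs over segments, which is the property the paper leverages for coreference and repetition control.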
Anthology ID:
2020.acl-main.233
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
2603–2614
URL:
https://aclanthology.org/2020.acl-main.233
DOI:
10.18653/v1/2020.acl-main.233
Cite (ACL):
Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara Berg, and Mohit Bansal. 2020. MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2603–2614, Online. Association for Computational Linguistics.
Cite (Informal):
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning (Lei et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.233.pdf
Video:
http://slideslive.com/38929078
Code
jayleicn/recurrent-transformer
Data
ActivityNet, ActivityNet Captions