Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video

Haoran Li; Junnan Zhu; Cong Ma; Jiajun Zhang; Chengqing Zong

doi:10.18653/v1/D17-1114

Abstract

The rapid increase in multimedia data on the Internet necessitates multi-modal summarization from collections of text, image, audio and video. In this work, we propose an extractive Multi-modal Summarization (MMS) method which can automatically generate a textual summary given a set of documents, images, audio clips and videos related to a specific topic. The key idea is to bridge the semantic gaps between multi-modal contents. For audio information, we design an approach to selectively use its transcription. For visual information, we learn joint representations of texts and images using a neural network. Finally, all the multi-modal aspects are considered to generate the textual summary by maximizing salience, non-redundancy, readability and coverage through budgeted optimization of submodular functions. We further introduce an MMS corpus in English and Chinese. The experimental results on this dataset demonstrate that our method outperforms other competitive baseline methods.
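To make the phrase "budgeted optimization of submodular functions" concrete, the following is a minimal sketch of the standard cost-scaled greedy heuristic for that class of problems. It is not the objective or algorithm from the paper: the weighted-coverage function, the names `coverage` and `greedy_budgeted`, the scaling exponent `r`, and the toy sentences are all assumptions chosen for illustration, and a real system would combine salience, non-redundancy, readability and coverage terms.

```python
from typing import Dict, List, Set


def coverage(selected: List[int], sentences: List[Set[str]],
             weights: Dict[str, float]) -> float:
    """Weighted concept coverage, a monotone submodular set function.

    Hypothetical stand-in for a summary objective: each distinct word
    contributes its weight once, however many selected sentences contain it.
    """
    covered: Set[str] = set()
    for i in selected:
        covered |= sentences[i]
    return sum(weights.get(w, 0.0) for w in covered)


def greedy_budgeted(sentences: List[Set[str]], costs: List[int],
                    weights: Dict[str, float], budget: int,
                    r: float = 1.0) -> List[int]:
    """Greedily pick sentences by marginal gain per scaled cost.

    At each step, choose the affordable sentence with the largest
    (gain / cost**r), and stop when nothing affordable adds value.
    """
    selected: List[int] = []
    spent = 0
    remaining = set(range(len(sentences)))
    while remaining:
        base = coverage(selected, sentences, weights)
        best, best_ratio = None, 0.0
        for i in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[i] > budget:
                continue  # sentence does not fit in the remaining budget
            gain = coverage(selected + [i], sentences, weights) - base
            ratio = gain / (costs[i] ** r)
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break
        selected.append(best)
        spent += costs[best]
        remaining.discard(best)
    return selected


# Toy example (assumed data): three candidate sentences as word sets.
toy_sentences = [{"flood", "city"}, {"flood", "rescue", "team"}, {"city"}]
toy_costs = [2, 3, 1]  # e.g. sentence lengths
toy_weights = {"flood": 3.0, "city": 2.0, "rescue": 2.0, "team": 1.0}
summary = greedy_budgeted(toy_sentences, toy_costs, toy_weights, budget=5)
print(summary)
```

The greedy rule is a heuristic: it can miss the best subset on its own, which is why it is often paired with a comparison against the best single affordable item when an approximation guarantee is needed.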

Anthology ID: D17-1114
Volume: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month: September
Year: 2017
Address: Copenhagen, Denmark
Editors: Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue: EMNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 1092–1102
URL: https://aclanthology.org/D17-1114/
DOI: 10.18653/v1/D17-1114
Cite (ACL): Haoran Li, Junnan Zhu, Cong Ma, Jiajun Zhang, and Chengqing Zong. 2017. Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1092–1102, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal): Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video (Li et al., EMNLP 2017)
PDF: https://aclanthology.org/D17-1114.pdf
Video: https://aclanthology.org/D17-1114.mp4
