Abstractive Text-Image Summarization Using Multi-Modal Attentional Hierarchical RNN

Jingqiang Chen, Hai Zhuge


Abstract
Rapid growth of multi-modal documents on the Internet makes multi-modal summarization research necessary. Most previous research summarizes texts or images separately. Recent neural summarization research shows the strength of the Encoder-Decoder model in text summarization. This paper proposes an abstractive text-image summarization model using the attentional hierarchical Encoder-Decoder model to summarize a text document and its accompanying images simultaneously, and then to align the sentences and images in summaries. A multi-modal attentional mechanism is proposed to attend to the original sentences, images, and captions when decoding. The DailyMail dataset is extended by collecting images and captions from the Web. Experiments show that our model outperforms neural abstractive and extractive text summarization methods that do not consider images. In addition, our model can generate informative summaries of images.
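
As a rough illustration of the multi-modal attention described in the abstract, the following sketch extends standard additive (Bahdanau-style) attention to three source modalities. The notation ($s_t$ for the decoder state; $h^{txt}_i$, $h^{img}_j$, $h^{cap}_k$ for sentence, image, and caption encodings; weight matrices $W_*$ and vectors $v_*$) is an assumed, simplified formulation for illustration, not the paper's exact equations.

\[
\alpha^{txt}_{t,i} = \operatorname*{softmax}_i\!\Big( v_{txt}^{\top} \tanh\big( W_s s_t + W_{txt}\, h^{txt}_i \big) \Big),
\qquad
c^{txt}_t = \sum_i \alpha^{txt}_{t,i}\, h^{txt}_i
\]

with $c^{img}_t$ and $c^{cap}_t$ computed analogously over the image and caption encodings, and a fused context

\[
c_t = W_f \,\big[\, c^{txt}_t \,;\; c^{img}_t \,;\; c^{cap}_t \,\big]
\]

fed to the decoder at step $t$ to predict the next summary word. Concatenation followed by a projection $W_f$ is only one possible fusion; a second, modality-level attention over the three context vectors is another common choice.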
Anthology ID:
D18-1438
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
4046–4056
URL:
https://aclanthology.org/D18-1438
DOI:
10.18653/v1/D18-1438
Cite (ACL):
Jingqiang Chen and Hai Zhuge. 2018. Abstractive Text-Image Summarization Using Multi-Modal Attentional Hierarchical RNN. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4046–4056, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Abstractive Text-Image Summarization Using Multi-Modal Attentional Hierarchical RNN (Chen & Zhuge, EMNLP 2018)
PDF:
https://aclanthology.org/D18-1438.pdf