Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization

Yunlong Liang; Fandong Meng; Jinan Xu; Jiaan Wang; Yufeng Chen; Jie Zhou

doi:10.18653/v1/2023.acl-long.165

Abstract

The goal of multimodal abstractive summarization (MAS) is to produce a concise summary given multimodal data (text and vision). Existing studies on MAS mainly focus on how to effectively use the extracted visual features, and have achieved impressive success on the high-resource English dataset. However, less attention has been paid to the quality of the visual features with respect to the summary, which may limit model performance, especially in low- and zero-resource scenarios. In this paper, we propose to improve summary quality through summary-oriented visual features. To this end, we devise two auxiliary tasks: a vision-to-summary task and a masked image modeling task. We optimize the MAS model jointly via the training objectives of the main summarization task and both auxiliary tasks. In this way, the MAS model can capture summary-oriented visual features and thereby yield more accurate summaries. Experiments on 44 languages, covering mid-high-, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach, which achieves state-of-the-art performance under all scenarios. Additionally, we will contribute a large-scale multilingual multimodal abstractive summarization (MM-Sum) dataset to the research community.

Anthology ID: 2023.acl-long.165
Volume: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2934–2951
URL: https://aclanthology.org/2023.acl-long.165
DOI: 10.18653/v1/2023.acl-long.165
Cite (ACL): Yunlong Liang, Fandong Meng, Jinan Xu, Jiaan Wang, Yufeng Chen, and Jie Zhou. 2023. Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2934–2951, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization (Liang et al., ACL 2023)
PDF: https://aclanthology.org/2023.acl-long.165.pdf
Video: https://aclanthology.org/2023.acl-long.165.mp4