Exploiting Pseudo Image Captions for Multimodal Summarization

Chaoya Jiang, Rui Xie, Wei Ye, Jinan Sun, Shikun Zhang


Abstract
Multimodal summarization with multimodal output (MSMO) faces a challenging semantic gap between visual and textual modalities due to the lack of reference images for training. Our pilot investigation indicates that image captions, which naturally connect texts and images, can significantly benefit MSMO. However, exposure of image captions during training is inconsistent with MSMO’s task settings, where prior cross-modal alignment information is excluded to guarantee the generalization of cross-modal semantic modeling. To this end, we propose a novel coarse-to-fine image-text alignment mechanism to identify the most relevant sentence of each image in a document, resembling the role of image captions in capturing visual knowledge and bridging the cross-modal semantic gap. Equipped with this alignment mechanism, our method achieves state-of-the-art performance on all intermodality and intramodality metrics (e.g., more than 10% relative improvement on image recommendation precision). Further experiments reveal the correlation between image captions and text summaries, and show that the pseudo image captions we generate are even better than the original ones at promoting multimodal summarization.
Anthology ID:
2023.findings-acl.12
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
161–175
URL:
https://aclanthology.org/2023.findings-acl.12
DOI:
10.18653/v1/2023.findings-acl.12
Cite (ACL):
Chaoya Jiang, Rui Xie, Wei Ye, Jinan Sun, and Shikun Zhang. 2023. Exploiting Pseudo Image Captions for Multimodal Summarization. In Findings of the Association for Computational Linguistics: ACL 2023, pages 161–175, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Exploiting Pseudo Image Captions for Multimodal Summarization (Jiang et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.12.pdf