Visual Enhanced Entity-Level Interaction Network for Multimodal Summarization

Haolong Yan, Binghao Tang, Boda Lin, Gang Zhao, Si Li


Abstract
MultiModal Summarization (MMS) aims to generate a concise summary based on multimodal data like texts and images and has wide application in multimodal fields. Previous works mainly focus on the coarse-level textual and visual features, in which the overall features of the image interact with the whole sentence. However, the entities of the input text and the objects of the image may be underutilized, limiting the performance of current MMS models. In this paper, we propose a novel Visual Enhanced Entity-Level Interaction Network (VE-ELIN) to address the problem of underutilization of multimodal inputs at a fine-grained level in two ways. We first design a cross-modal entity interaction module to better fuse the entity information in text and the object information in vision. Then, we design an object-guided visual enhancement module to fully extract the visual features and enhance the focus of the image on the object area. We evaluate VE-ELIN on two MMS datasets and propose new metrics to measure the factual consistency of entities in the output. Finally, experimental results demonstrate that VE-ELIN is effective and outperforms previous methods under both traditional metrics and ours. The source code is available at https://github.com/summoneryhl/VE-ELIN.
Anthology ID:
2024.findings-naacl.206
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3248–3260
URL:
https://aclanthology.org/2024.findings-naacl.206
Cite (ACL):
Haolong Yan, Binghao Tang, Boda Lin, Gang Zhao, and Si Li. 2024. Visual Enhanced Entity-Level Interaction Network for Multimodal Summarization. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3248–3260, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Visual Enhanced Entity-Level Interaction Network for Multimodal Summarization (Yan et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.206.pdf
Copyright:
2024.findings-naacl.206.copyright.pdf