@inproceedings{hashem-etal-2024-generating-faithful,
title = "Generating Faithful and Salient Text from Multimodal Data",
author = "Hashem, Tahsina and
Wang, Weiqing and
Wijaya, Derry Tanti and
Ali, Mohammed Eunus and
Li, Yuan-Fang",
editor = "Mahamood, Saad and
Minh, Nguyen Le and
Ippolito, Daphne",
booktitle = "Proceedings of the 17th International Natural Language Generation Conference",
month = sep,
year = "2024",
address = "Tokyo, Japan",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.inlg-main.50",
pages = "646--662",
abstract = "While large multimodal models (LMMs) have obtained strong performance on many multimodal tasks, they may still hallucinate while generating text. Their performance on detecting salient features from visual data is also unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data ( represented in knowledge graphs or tables). Specifically, we train a vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in the post editing step to improve the generation quality. Experiments on two datasets show that our framework improves LMMs{'} generation quality on both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination. The dataset and code are available at https://github.com/TahsinaHashem/FaithD2T.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="hashem-etal-2024-generating-faithful">
<titleInfo>
<title>Generating Faithful and Salient Text from Multimodal Data</title>
</titleInfo>
<name type="personal">
<namePart type="given">Tahsina</namePart>
<namePart type="family">Hashem</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Weiqing</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Derry</namePart>
<namePart type="given">Tanti</namePart>
<namePart type="family">Wijaya</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammed</namePart>
<namePart type="given">Eunus</namePart>
<namePart type="family">Ali</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yuan-Fang</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-09</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 17th International Natural Language Generation Conference</title>
</titleInfo>
<name type="personal">
<namePart type="given">Saad</namePart>
<namePart type="family">Mahamood</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Nguyen</namePart>
<namePart type="given">Le</namePart>
<namePart type="family">Minh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Daphne</namePart>
<namePart type="family">Ippolito</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Tokyo, Japan</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>While large multimodal models (LMMs) have obtained strong performance on many multimodal tasks, they may still hallucinate while generating text. Their performance on detecting salient features from visual data is also unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data (represented in knowledge graphs or tables). Specifically, we train a vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in the post-editing step to improve the generation quality. Experiments on two datasets show that our framework improves LMMs’ generation quality on both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination. The dataset and code are available at https://github.com/TahsinaHashem/FaithD2T.</abstract>
<identifier type="citekey">hashem-etal-2024-generating-faithful</identifier>
<location>
<url>https://aclanthology.org/2024.inlg-main.50</url>
</location>
<part>
<date>2024-09</date>
<extent unit="page">
<start>646</start>
<end>662</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Generating Faithful and Salient Text from Multimodal Data
%A Hashem, Tahsina
%A Wang, Weiqing
%A Wijaya, Derry Tanti
%A Ali, Mohammed Eunus
%A Li, Yuan-Fang
%Y Mahamood, Saad
%Y Minh, Nguyen Le
%Y Ippolito, Daphne
%S Proceedings of the 17th International Natural Language Generation Conference
%D 2024
%8 September
%I Association for Computational Linguistics
%C Tokyo, Japan
%F hashem-etal-2024-generating-faithful
%X While large multimodal models (LMMs) have obtained strong performance on many multimodal tasks, they may still hallucinate while generating text. Their performance on detecting salient features from visual data is also unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data (represented in knowledge graphs or tables). Specifically, we train a vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in the post-editing step to improve the generation quality. Experiments on two datasets show that our framework improves LMMs’ generation quality on both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination. The dataset and code are available at https://github.com/TahsinaHashem/FaithD2T.
%U https://aclanthology.org/2024.inlg-main.50
%P 646-662
Markdown (Informal)
[Generating Faithful and Salient Text from Multimodal Data](https://aclanthology.org/2024.inlg-main.50) (Hashem et al., INLG 2024)
ACL
- Tahsina Hashem, Weiqing Wang, Derry Tanti Wijaya, Mohammed Eunus Ali, and Yuan-Fang Li. 2024. Generating Faithful and Salient Text from Multimodal Data. In Proceedings of the 17th International Natural Language Generation Conference, pages 646–662, Tokyo, Japan. Association for Computational Linguistics.