Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs

Yuxin Zuo; Bei Li; Chuanhao Lv; Tong Zheng; Tong Xiao (肖桐); Jingbo Zhu

doi:10.18653/v1/2023.findings-emnlp.978

Abstract

This paper presents an in-depth study of multimodal machine translation (MMT), examining the prevailing understanding that MMT systems exhibit decreased sensitivity to visual information when text inputs are complete. We attribute this phenomenon to insufficient cross-modal interaction rather than to redundancy of image information. We propose a novel approach that generates parallel Visual Question-Answering (VQA) style pairs from the source text, fostering more robust cross-modal interaction. Using Large Language Models (LLMs), we explicitly model the probing signal in MMT and convert it into VQA-style data, creating the Multi30K-VQA dataset. We introduce an MMT-VQA multitask learning framework that incorporates explicit probing signals from the dataset into the MMT training process. Experimental results on two widely-used benchmarks demonstrate the effectiveness of this approach. Our code and data will be available at: https://github.com/libeineu/MMT-VQA.

Anthology ID: 2023.findings-emnlp.978
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 14689–14701
URL: https://aclanthology.org/2023.findings-emnlp.978/
DOI: 10.18653/v1/2023.findings-emnlp.978
Cite (ACL): Yuxin Zuo, Bei Li, Chuanhao Lv, Tong Zheng, Tong Xiao, and JingBo Zhu. 2023. Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14689–14701, Singapore. Association for Computational Linguistics.
Cite (Informal): Incorporating Probing Signals into Multimodal Machine Translation via Visual Question-Answering Pairs (Zuo et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.978.pdf
