@inproceedings{zhang-etal-2025-vqa,
title = "{VQA}-Augmented Machine Translation with Cross-Modal Contrastive Learning",
author = "Zhang, Zhihui and
Sun, Shiliang and
Zhao, Jing and
Song, Tengfei and
Yang, Hao",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.536/",
pages = "10113--10124",
ISBN = "979-8-89176-335-7",
abstract = "Multimodal machine translation (MMT) aims to enhance translation quality by integrating visual information. However, existing methods often extract visual features using pre-trained models while learning text features from scratch, leading to representation imbalance. These methods are also prone to being misled by redundant visual information, which results in suboptimal performance. To address these challenges, we propose CAMT, a novel cross-modal VQA-augmented MMT method. CAMT aligns image-source text pairs and image-question text pairs through dual-text contrastive learning, thereby improving semantic consistency across modalities. Additionally, we design an effective strategy for generating question{--}answer pairs to enhance fine-grained alignment and filter out irrelevant visual noise, while also addressing the scarcity of VQA annotations. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of the proposed CAMT framework, which consistently outperforms state-of-the-art MMT methods across multiple evaluation metrics."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="zhang-etal-2025-vqa">
<titleInfo>
<title>VQA-Augmented Machine Translation with Cross-Modal Contrastive Learning</title>
</titleInfo>
<name type="personal">
<namePart type="given">Zhihui</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shiliang</namePart>
<namePart type="family">Sun</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jing</namePart>
<namePart type="family">Zhao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tengfei</namePart>
<namePart type="family">Song</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hao</namePart>
<namePart type="family">Yang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: EMNLP 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Christos</namePart>
<namePart type="family">Christodoulopoulos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Carolyn</namePart>
<namePart type="family">Rose</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Violet</namePart>
<namePart type="family">Peng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-335-7</identifier>
</relatedItem>
<abstract>Multimodal machine translation (MMT) aims to enhance translation quality by integrating visual information. However, existing methods often extract visual features using pre-trained models while learning text features from scratch, leading to representation imbalance. These methods are also prone to being misled by redundant visual information, which results in suboptimal performance. To address these challenges, we propose CAMT, a novel cross-modal VQA-augmented MMT method. CAMT aligns image-source text pairs and image-question text pairs through dual-text contrastive learning, thereby improving semantic consistency across modalities. Additionally, we design an effective strategy for generating question–answer pairs to enhance fine-grained alignment and filter out irrelevant visual noise, while also addressing the scarcity of VQA annotations. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of the proposed CAMT framework, which consistently outperforms state-of-the-art MMT methods across multiple evaluation metrics.</abstract>
<identifier type="citekey">zhang-etal-2025-vqa</identifier>
<location>
<url>https://aclanthology.org/2025.findings-emnlp.536/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>10113</start>
<end>10124</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T VQA-Augmented Machine Translation with Cross-Modal Contrastive Learning
%A Zhang, Zhihui
%A Sun, Shiliang
%A Zhao, Jing
%A Song, Tengfei
%A Yang, Hao
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Findings of the Association for Computational Linguistics: EMNLP 2025
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-335-7
%F zhang-etal-2025-vqa
%X Multimodal machine translation (MMT) aims to enhance translation quality by integrating visual information. However, existing methods often extract visual features using pre-trained models while learning text features from scratch, leading to representation imbalance. These methods are also prone to being misled by redundant visual information, which results in suboptimal performance. To address these challenges, we propose CAMT, a novel cross-modal VQA-augmented MMT method. CAMT aligns image-source text pairs and image-question text pairs through dual-text contrastive learning, thereby improving semantic consistency across modalities. Additionally, we design an effective strategy for generating question–answer pairs to enhance fine-grained alignment and filter out irrelevant visual noise, while also addressing the scarcity of VQA annotations. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of the proposed CAMT framework, which consistently outperforms state-of-the-art MMT methods across multiple evaluation metrics.
%U https://aclanthology.org/2025.findings-emnlp.536/
%P 10113-10124
Markdown (Informal)
[VQA-Augmented Machine Translation with Cross-Modal Contrastive Learning](https://aclanthology.org/2025.findings-emnlp.536/) (Zhang et al., Findings 2025)
ACL
Zhihui Zhang, Shiliang Sun, Jing Zhao, Tengfei Song, and Hao Yang. 2025. VQA-Augmented Machine Translation with Cross-Modal Contrastive Learning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 10113–10124, Suzhou, China. Association for Computational Linguistics.
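
As a rough illustration of the dual-text contrastive alignment described in the abstract (aligning image features with both source-text and question-text features), here is a minimal sketch of a symmetric InfoNCE-style objective. This is not the authors' implementation; the function names, the temperature, and the weighting hyperparameter `alpha` are assumptions made for illustration only.

```python
# Minimal sketch (not the paper's code) of a dual-text cross-modal
# contrastive loss: image embeddings are aligned with source-sentence
# embeddings and with generated-question embeddings via symmetric InfoNCE.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings of shape (B, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def dual_text_contrastive_loss(img_emb: torch.Tensor,
                               src_emb: torch.Tensor,
                               qst_emb: torch.Tensor,
                               alpha: float = 0.5) -> torch.Tensor:
    """Combine image<->source-text and image<->question-text alignment terms.

    `alpha` is a hypothetical weighting hyperparameter, not taken from the paper.
    """
    return alpha * info_nce(img_emb, src_emb) + (1 - alpha) * info_nce(img_emb, qst_emb)


if __name__ == "__main__":
    B, D = 8, 512  # toy batch of projected image / source-text / question-text embeddings
    loss = dual_text_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```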