Assessing the Role of Imagery in Multimodal Machine Translation

Nicholas Kashani Motlagh, Jim Davis, Jeremy Gwinnup, Grant Erdmann, Tim Anderson


Abstract
In Multimodal Machine Translation (MMT), the use of visual data has yielded only marginal improvements over text-only models. Previously, the CoMMuTE dataset and its associated metric were proposed to score models on tasks where imagery is necessary to disambiguate between two possible translations of each ambiguous source sentence. In this work, we introduce new metrics within the CoMMuTE domain to provide deeper insights into image-aware translation models. Our proposed metrics differ from the previous CoMMuTE scoring method by 1) assessing the impact of multiple images on individual translations and 2) evaluating a model's ability to jointly select the correct translation for each image context. Our results challenge the conventional view that MMT models comprehend visual information poorly, showing that models can indeed meaningfully interpret visual information, though they may not leverage it sufficiently in the final translation decision.
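The paper's precise metric definitions are given in the full text; purely as an illustration of the comparisons the abstract describes, below is a minimal NumPy sketch. It assumes each CoMMuTE example yields a 2x2 matrix of perplexities (image context x candidate translation, lower is better), with translation i being the correct choice for image i. The function names and exact decision rules here are hypothetical stand-ins, not the paper's formulations.

import numpy as np

def commute_accuracy(scores):
    """CoMMuTE-style scoring: for each image context, check whether the
    model assigns a lower perplexity to the matching translation than to
    the mismatched one. scores has shape (N, 2, 2), where scores[n, i, j]
    is the perplexity of translation j given image i for example n."""
    scores = np.asarray(scores, dtype=float)
    preferred = scores.argmin(axis=2)            # best translation per image
    return (preferred == np.arange(2)).mean()    # correct pairing is j == i

def per_translation_accuracy(scores):
    """Hypothetical analogue of comparison 1): fix a translation and ask
    whether the matching image lowers its perplexity more than the
    mismatched image does."""
    scores = np.asarray(scores, dtype=float)
    preferred = scores.argmin(axis=1)            # best image per translation
    return (preferred == np.arange(2)).mean()

def joint_accuracy(scores):
    """Hypothetical analogue of comparison 2): an example counts as
    correct only if the model matches *both* translations to *both*
    image contexts simultaneously."""
    scores = np.asarray(scores, dtype=float)
    per_image_correct = scores.argmin(axis=2) == np.arange(2)
    return per_image_correct.all(axis=1).mean()

For example, a model that resolves one of the two image contexts correctly in most examples would score around 0.5 under commute_accuracy yet near zero under joint_accuracy; that gap is the kind of behavior a joint metric is meant to expose.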
Anthology ID: 2024.wmt-1.130
Volume: Proceedings of the Ninth Conference on Machine Translation
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 1428–1439
URL: https://aclanthology.org/2024.wmt-1.130
Cite (ACL): Nicholas Kashani Motlagh, Jim Davis, Jeremy Gwinnup, Grant Erdmann, and Tim Anderson. 2024. Assessing the Role of Imagery in Multimodal Machine Translation. In Proceedings of the Ninth Conference on Machine Translation, pages 1428–1439, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Assessing the Role of Imagery in Multimodal Machine Translation (Kashani Motlagh et al., WMT 2024)
PDF: https://aclanthology.org/2024.wmt-1.130.pdf