Nicholas Kashani Motlagh
2024
Assessing the Role of Imagery in Multimodal Machine Translation
Nicholas Kashani Motlagh | Jim Davis | Jeremy Gwinnup | Grant Erdmann | Tim Anderson
Proceedings of the Ninth Conference on Machine Translation
In Multimodal Machine Translation (MMT), adding visual data has yielded only marginal improvements over text-only models. Previously, the CoMMuTE dataset and its associated metric were proposed to score models on tasks where imagery is necessary to disambiguate between two possible translations of each ambiguous source sentence. In this work, we introduce new metrics within the CoMMuTE domain to provide deeper insights into image-aware translation models. Our proposed metrics differ from the previous CoMMuTE scoring method by 1) assessing the impact of multiple images on individual translations and 2) evaluating a model’s ability to jointly select the appropriate translation for each image context. Our results challenge the conventional view that MMT models have poor visual comprehension and show that models can indeed meaningfully interpret visual information, though they may not leverage it sufficiently in the final decision.
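To make the distinction between these scoring views concrete, the following is a minimal sketch of one possible reading of them, not the paper's actual definitions. It assumes each ambiguous source sentence comes with two images and two candidate translations, and that a model assigns a score (higher means preferred) to every image and translation pairing. The function names, the 2x2 grid layout, and the example scores are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical layout: grid[i][j] is the model's score for translation j
# given image i, where translation i is the correct one for image i.
Grid = Tuple[Tuple[float, float], Tuple[float, float]]


def per_image_accuracy(grids: List[Grid]) -> float:
    """For each image, does the correct translation outscore the other one?"""
    hits = 0
    for g in grids:
        hits += g[0][0] > g[0][1]
        hits += g[1][1] > g[1][0]
    return hits / (2 * len(grids))


def per_translation_accuracy(grids: List[Grid]) -> float:
    """For each translation, does the matching image score it higher than
    the mismatched image does? (An assumed reading of 'impact of multiple
    images on individual translations'.)"""
    hits = 0
    for g in grids:
        hits += g[0][0] > g[1][0]
        hits += g[1][1] > g[0][1]
    return hits / (2 * len(grids))


def joint_accuracy(grids: List[Grid]) -> float:
    """Credit an example only if the correct translation wins under BOTH
    images. (An assumed reading of 'jointly select'.)"""
    hits = sum(g[0][0] > g[0][1] and g[1][1] > g[1][0] for g in grids)
    return hits / len(grids)


# Made-up scores, for illustration only.
image_sensitive = ((0.9, 0.1), (0.2, 0.8))  # image fully decides the choice
text_biased = ((0.7, 0.3), (0.6, 0.4))      # image shifts scores, but
                                            # translation 0 always wins
grids = [image_sensitive, text_biased]

print(per_image_accuracy(grids))        # 0.75
print(per_translation_accuracy(grids))  # 1.0
print(joint_accuracy(grids))            # 0.5
```

In the `text_biased` example, each translation scores higher with its matching image, so the image is clearly influencing the model, yet the final choice under the second image is still wrong. Under the stated assumptions, this is the kind of case where a model interprets visual information meaningfully without leveraging it sufficiently in the final decision.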