%0 Conference Proceedings
%T Supervised Visual Attention for Multimodal Neural Machine Translation
%A Nishihara, Tetsuro
%A Tamura, Akihiro
%A Ninomiya, Takashi
%A Omote, Yutaro
%A Nakayama, Hideki
%Y Scott, Donia
%Y Bel, Nuria
%Y Zong, Chengqing
%S Proceedings of the 28th International Conference on Computational Linguistics
%D 2020
%8 December
%I International Committee on Computational Linguistics
%C Barcelona, Spain (Online)
%F nishihara-etal-2020-supervised
%X This paper proposed a supervised visual attention mechanism for multimodal neural machine translation (MNMT), trained with constraints based on manual alignments between words in a sentence and their corresponding regions of an image. The proposed visual attention mechanism captures the relationship between a word and an image region more precisely than a conventional visual attention mechanism trained through MNMT in an unsupervised manner. Our experiments on English-German and German-English translation tasks using the Multi30k dataset and on English-Japanese and Japanese-English translation tasks using the Flickr30k Entities JP dataset show that a Transformer-based MNMT model can be improved by incorporating our proposed supervised visual attention mechanism and that further improvements can be achieved by combining it with a supervised cross-lingual attention mechanism (up to +1.61 BLEU, +1.7 METEOR).
%R 10.18653/v1/2020.coling-main.380
%U https://aclanthology.org/2020.coling-main.380
%U https://doi.org/10.18653/v1/2020.coling-main.380
%P 4304-4314