%0 Conference Proceedings
%T Supervised Visual Attention for Multimodal Neural Machine Translation
%A Nishihara, Tetsuro
%A Tamura, Akihiro
%A Ninomiya, Takashi
%A Omote, Yutaro
%A Nakayama, Hideki
%Y Scott, Donia
%Y Bel, Nuria
%Y Zong, Chengqing
%S Proceedings of the 28th International Conference on Computational Linguistics
%D 2020
%8 December
%I International Committee on Computational Linguistics
%C Barcelona, Spain (Online)
%F nishihara-etal-2020-supervised
%X This paper proposed a supervised visual attention mechanism for multimodal neural machine translation (MNMT), trained with constraints based on manual alignments between words in a sentence and their corresponding regions of an image. The proposed visual attention mechanism captures the relationship between a word and an image region more precisely than a conventional visual attention mechanism trained through MNMT in an unsupervised manner. Our experiments on English-German and German-English translation tasks using the Multi30k dataset and on English-Japanese and Japanese-English translation tasks using the Flickr30k Entities JP dataset show that a Transformer-based MNMT model can be improved by incorporating our proposed supervised visual attention mechanism and that further improvements can be achieved by combining it with a supervised cross-lingual attention mechanism (up to +1.61 BLEU, +1.7 METEOR).
%R 10.18653/v1/2020.coling-main.380
%U https://aclanthology.org/2020.coling-main.380
%U https://doi.org/10.18653/v1/2020.coling-main.380
%P 4304-4314