A Visual Attention Grounding Neural Model for Multimodal Machine Translation
Mingyang Zhou
Runxiang Cheng
Yong Jae Lee
Zhou Yu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
We introduce a novel multimodal machine translation model that utilizes parallel visual and textual information. Our model jointly optimizes the learning of a shared visual-language embedding and a translator. The model leverages a visual attention grounding mechanism that links the visual semantics with the corresponding textual semantics. Our approach achieves competitive state-of-the-art results on the Multi30K and the Ambiguous COCO datasets. We also collected a new multilingual multimodal product description dataset to simulate a real-world international online shopping scenario. On this dataset, our visual attention grounding model outperforms other methods by a large margin.