Image Pivoting for Learning Multilingual Multimodal Representations

Spandana Gella, Rico Sennrich, Frank Keller, Mirella Lapata


Abstract
In this paper we propose a model to learn multimodal multilingual representations for matching images and sentences in different languages, with the aim of advancing multilingual versions of image search and image understanding. Our model learns a common representation for images and their descriptions in two different languages (which need not be parallel) by considering the image as a pivot between two languages. We introduce a new pairwise ranking loss function which can handle both symmetric and asymmetric similarity between the two modalities. We evaluate our models on image-description ranking for German and English, and on semantic textual similarity of image descriptions in English. In both cases we achieve state-of-the-art performance.
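The pairwise ranking loss and the image-as-pivot idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: cosine similarity stands in for the symmetric case, a negative order-violation penalty stands in for the asymmetric case, the margin value of 0.2 is arbitrary, and the function names (`cosine_sim`, `order_sim`, `pairwise_ranking_loss`, `pivot_loss`) are hypothetical rather than taken from the authors' code.

```python
import math


def cosine_sim(u, v):
    """Symmetric similarity (assumed): cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def order_sim(a, b):
    """Asymmetric similarity (assumed): negative squared order violation.

    The score is 0 when a >= b in every coordinate and negative otherwise,
    so order_sim(a, b) generally differs from order_sim(b, a).
    """
    return -sum(max(0.0, y - x) ** 2 for x, y in zip(a, b))


def pairwise_ranking_loss(xs, ys, sim, margin=0.2):
    """Max-margin ranking loss over matched pairs (xs[k], ys[k]).

    Every other item in the batch serves as a negative example, in both
    directions: a wrong y for each x, and a wrong x for each y.
    """
    loss = 0.0
    for k, (x, y) in enumerate(zip(xs, ys)):
        pos = sim(x, y)
        for j in range(len(xs)):
            if j == k:
                continue
            loss += max(0.0, margin - pos + sim(x, ys[j]))
            loss += max(0.0, margin - pos + sim(xs[j], y))
    return loss


def pivot_loss(images, lang1, lang2, sim, margin=0.2):
    """Image as pivot: each language is ranked against the same images.

    lang1[k] and lang2[k] both describe images[k]; they do not need to be
    translations of each other.
    """
    return (pairwise_ranking_loss(images, lang1, sim, margin)
            + pairwise_ranking_loss(images, lang2, sim, margin))


# Toy embeddings: two images with one description each in two languages.
images = [[1.0, 0.0], [0.0, 1.0]]
english = [[1.0, 0.0], [0.0, 1.0]]
german = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = pivot_loss(images, english, german, cosine_sim)
shuffled_loss = pivot_loss(images, english[::-1], german, cosine_sim)
```

In this toy setup the correctly aligned descriptions incur no loss, while reversing the English descriptions is penalised. Passing `order_sim` instead of `cosine_sim` switches the same loss to the asymmetric variant without other changes.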
Anthology ID:
D17-1303
Volume:
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Martha Palmer, Rebecca Hwa, Sebastian Riedel
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
2839–2845
URL:
https://aclanthology.org/D17-1303/
DOI:
10.18653/v1/D17-1303
Cite (ACL):
Spandana Gella, Rico Sennrich, Frank Keller, and Mirella Lapata. 2017. Image Pivoting for Learning Multilingual Multimodal Representations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2839–2845, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Image Pivoting for Learning Multilingual Multimodal Representations (Gella et al., EMNLP 2017)
PDF:
https://aclanthology.org/D17-1303.pdf
Video:
https://aclanthology.org/D17-1303.mp4
Data
Flickr30k