Learning Joint Multilingual Sentence Representations with Neural Machine Translation

Holger Schwenk, Matthijs Douze


Abstract
In this paper, we use the framework of neural machine translation to learn joint sentence representations across six very different languages. Our intuition is that a representation which is independent of the language is likely to capture the underlying semantics. We define a new cross-lingual similarity measure, compare up to 1.4M sentence representations, and study the characteristics of close sentences. We provide experimental evidence that sentences that are close in embedding space are indeed semantically highly related, but often have quite different structure and syntax. These relations also hold when comparing sentences in different languages.
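The abstract's core claim is that semantically related sentences, regardless of language, land near each other in a shared embedding space, so relatedness can be measured by vector similarity. A minimal sketch of such a nearest-neighbor comparison is below; the random vectors stand in for the paper's NMT-derived sentence embeddings, which are not reproduced here, and the function name is illustrative, not from the paper.

```python
import numpy as np

# Hypothetical illustration: one language-independent vector per sentence.
# Random unit vectors stand in for real multilingual sentence embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))  # 1000 sentences, dimension 512
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def nearest_neighbors(query: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k rows of `index` most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    scores = index @ q  # dot product of unit vectors = cosine similarity
    return np.argsort(-scores)[:k]

# Sanity check: a sentence's nearest neighbor is itself (cosine = 1.0).
hits = nearest_neighbors(embeddings[42], embeddings)
print(hits[0])  # 42
```

With real cross-lingual embeddings, querying with an English sentence and searching an index of French sentences would retrieve translations or paraphrases, which is the behavior the paper evaluates at the scale of 1.4M sentences.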
Anthology ID:
W17-2619
Volume:
Proceedings of the 2nd Workshop on Representation Learning for NLP
Month:
August
Year:
2017
Address:
Vancouver, Canada
Editors:
Phil Blunsom, Antoine Bordes, Kyunghyun Cho, Shay Cohen, Chris Dyer, Edward Grefenstette, Karl Moritz Hermann, Laura Rimell, Jason Weston, Scott Yih
Venue:
RepL4NLP
SIG:
SIGREP
Publisher:
Association for Computational Linguistics
Pages:
157–167
URL:
https://aclanthology.org/W17-2619
DOI:
10.18653/v1/W17-2619
Cite (ACL):
Holger Schwenk and Matthijs Douze. 2017. Learning Joint Multilingual Sentence Representations with Neural Machine Translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 157–167, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal):
Learning Joint Multilingual Sentence Representations with Neural Machine Translation (Schwenk & Douze, RepL4NLP 2017)
PDF:
https://aclanthology.org/W17-2619.pdf