Bilingual Lexicon Induction through Unsupervised Machine Translation

Mikel Artetxe, Gorka Labaka, Eneko Agirre


Abstract
A recent research line has obtained strong results on bilingual lexicon induction by aligning independently trained word embeddings in two languages and using the resulting cross-lingual embeddings to induce word translation pairs through nearest neighbor or related retrieval methods. In this paper, we propose an alternative approach to this problem that builds on the recent work on unsupervised machine translation. This way, instead of directly inducing a bilingual lexicon from cross-lingual embeddings, we use them to build a phrase-table, combine it with a language model, and use the resulting machine translation system to generate a synthetic parallel corpus, from which we extract the bilingual lexicon using statistical word alignment techniques. As such, our method can work with any word embedding and cross-lingual mapping technique, and it does not require any additional resource besides the monolingual corpus used to train the embeddings. When evaluated on the exact same cross-lingual embeddings, our proposed method obtains an average improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS retrieval, establishing a new state-of-the-art in the standard MUSE dataset.
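The two baseline retrieval methods named in the abstract, nearest neighbor and CSLS, can be illustrated with a minimal sketch. It assumes the embeddings are already mapped into a shared cross-lingual space and are given as plain lists of floats. The function names, the toy vectors, and the default k=10 are illustrative assumptions, not taken from the paper or from the monoses code. CSLS is written here in its commonly used form, 2·cos(x, y) − r_T(x) − r_S(y), where each r term is the mean cosine similarity of a word to its k nearest neighbors in the other language.

```python
import math

def cosine_matrix(src, tgt):
    """Cosine similarity between every source and every target vector."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    s = [unit(v) for v in src]
    t = [unit(v) for v in tgt]
    return [[sum(a * b for a, b in zip(u, w)) for w in t] for u in s]

def top_k_mean(values, k):
    """Mean of the k largest values (of all values if fewer than k)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)

def nearest_neighbor(src, tgt):
    """For each source word, index of the most similar target word."""
    sims = cosine_matrix(src, tgt)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]

def csls(src, tgt, k=10):
    """CSLS retrieval: penalize words that are close to many others (hubs)."""
    sims = cosine_matrix(src, tgt)
    # Mean similarity of each source word to its k nearest target words.
    r_src = [top_k_mean(row, k) for row in sims]
    # Mean similarity of each target word to its k nearest source words.
    r_tgt = [top_k_mean(col, k) for col in zip(*sims)]
    result = []
    for i, row in enumerate(sims):
        scores = [2 * sim - r_src[i] - r_tgt[j] for j, sim in enumerate(row)]
        result.append(max(range(len(scores)), key=scores.__getitem__))
    return result

# Toy example (hypothetical 2-dimensional embeddings).
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.0, 1.0], [1.0, 0.0]]
print(nearest_neighbor(src, tgt))  # [1, 0]
print(csls(src, tgt, k=1))         # [1, 0]
```

Both functions induce a lexicon directly from the embeddings. The method proposed in the abstract instead routes the same embeddings through a phrase-table, a language model, and statistical word alignment over a synthetic parallel corpus, none of which this sketch covers.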
Anthology ID:
P19-1494
Volume:
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2019
Address:
Florence, Italy
Editors:
Anna Korhonen, David Traum, Lluís Màrquez
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5002–5007
URL:
https://aclanthology.org/P19-1494
DOI:
10.18653/v1/P19-1494
Cite (ACL):
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2019. Bilingual Lexicon Induction through Unsupervised Machine Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5002–5007, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Bilingual Lexicon Induction through Unsupervised Machine Translation (Artetxe et al., ACL 2019)
PDF:
https://aclanthology.org/P19-1494.pdf
Video:
https://aclanthology.org/P19-1494.mp4
Code:
artetxem/monoses