Learning bilingual word embeddings with (almost) no bilingual data

Mikel Artetxe, Gorka Labaka, Eneko Agirre


Abstract
Most methods to learn bilingual word embeddings rely on large parallel corpora, which are difficult to obtain for most language pairs. This has motivated an active line of research that relaxes this requirement, with methods that use document-aligned corpora or bilingual dictionaries of a few thousand words instead. In this work, we further reduce the need for bilingual resources using a very simple self-learning approach that can be combined with any dictionary-based mapping technique. Our method exploits the structural similarity of embedding spaces, and works with as little bilingual evidence as a 25-word dictionary or even an automatically generated list of numerals, obtaining results comparable to those of systems that use richer resources.
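The abstract does not spell out the procedure, so the following is only a minimal sketch of what a self-learning loop of this kind could look like. It assumes an orthogonal (Procrustes) mapping as the dictionary-based mapping technique and nearest-neighbor retrieval by dot product for dictionary induction; both are illustrative choices, and any other dictionary-based mapping could be swapped in. All function names are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def normalize_rows(m):
    """Scale every row (word vector) to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)


def learn_orthogonal_map(x, z, pairs):
    """Fit an orthogonal W minimizing ||x[src] @ W - z[tgt]|| over the dictionary.

    This is the standard orthogonal Procrustes solution (an assumed choice of
    mapping technique): W = U @ Vt, where U, S, Vt = svd(x[src].T @ z[tgt]).
    """
    src = [i for i, _ in pairs]
    tgt = [j for _, j in pairs]
    u, _, vt = np.linalg.svd(x[src].T @ z[tgt])
    return u @ vt


def induce_dictionary(x, z, w):
    """Pair each source word with its most similar target word after mapping.

    With unit-length rows and an orthogonal W, the dot product is the cosine.
    """
    sims = (x @ w) @ z.T
    return [(i, int(j)) for i, j in enumerate(sims.argmax(axis=1))]


def self_learn(x, z, seed_pairs, iterations=5):
    """Alternate between fitting the mapping and re-inducing the dictionary.

    x, z: row-normalized source and target embedding matrices.
    seed_pairs: small initial dictionary of (source_index, target_index) pairs.
    A fixed iteration count is used here for simplicity (an assumption).
    """
    pairs = list(seed_pairs)
    for _ in range(iterations):
        w = learn_orthogonal_map(x, z, pairs)
        pairs = induce_dictionary(x, z, w)
    return w, pairs
```

A call such as `self_learn(x, z, seed_pairs)` with a 25-entry `seed_pairs` list returns the final mapping together with the induced dictionary. In practice a convergence criterion would probably be preferable to the fixed iteration count used in this sketch.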
Anthology ID:
P17-1042
Volume:
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2017
Address:
Vancouver, Canada
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
451–462
URL:
https://aclanthology.org/P17-1042
DOI:
10.18653/v1/P17-1042
Cite (ACL):
Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal):
Learning bilingual word embeddings with (almost) no bilingual data (Artetxe et al., ACL 2017)
PDF:
https://aclanthology.org/P17-1042.pdf
Presentation:
 P17-1042.Presentation.pdf
Video:
 https://vimeo.com/234954663