Learning Unsupervised Multilingual Word Embeddings with Incremental Multilingual Hubs

Geert Heyman, Bregt Verreet, Ivan Vulić, Marie-Francine Moens


Abstract
Recent research has discovered that a shared bilingual word embedding space can be induced by projecting monolingual word embedding spaces from two languages using a self-learning paradigm without any bilingual supervision. However, it has also been shown that for distant language pairs, such fully unsupervised self-learning methods are unstable and often get stuck in poor local optima due to reduced isomorphism between the starting monolingual spaces. In this work, we propose a new robust framework for learning unsupervised multilingual word embeddings that mitigates the instability issues. We learn a shared multilingual embedding space for a variable number of languages by incrementally adding new languages one by one to the current multilingual space. Through this gradual language addition, the method can leverage the interdependencies between the new language and all other languages in the current multilingual space. We find that it is beneficial to project more distant languages later in the iterative process. Our fully unsupervised multilingual embedding spaces yield results that are on par with state-of-the-art methods in the bilingual lexicon induction (BLI) task, and simultaneously obtain state-of-the-art scores on two downstream tasks: multilingual document classification and multilingual dependency parsing, outperforming even supervised baselines. This finding also accentuates the need to establish evaluation protocols for cross-lingual word embeddings beyond the omnipresent intrinsic BLI task in future work.
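As a rough illustration of the incremental idea described in the abstract, the following Python sketch adds languages to a shared space one at a time, fitting each new language against all languages already in the hub. It is not the authors' implementation: it assumes seed translation pairs are already given (the paper induces them without supervision through self-learning), it swaps in a plain orthogonal Procrustes solution for the mapping step, and it treats the first language as the initial hub. The names `procrustes`, `add_language`, and the toy data are hypothetical.

```python
import numpy as np

def procrustes(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def add_language(hub, name, emb, pairs):
    """Map a new language into the shared space and add it to the hub.

    hub:   dict lang -> (n, d) matrix already living in the shared space
    emb:   (n, d) monolingual embeddings of the new language
    pairs: dict hub_lang -> list of (new_idx, hub_idx) translation pairs
           (ASSUMPTION: given here; the paper induces them unsupervised)
    """
    src, tgt = [], []
    for lang, idx in pairs.items():
        new_idx, hub_idx = zip(*idx)
        src.append(emb[list(new_idx)])
        tgt.append(hub[lang][list(hub_idx)])
    # One mapping is fitted jointly against every language already in the hub.
    W = procrustes(np.vstack(src), np.vstack(tgt))
    hub[name] = emb @ W
    return W

# Toy data: each "language" is a random rotation of the same base space.
rng = np.random.default_rng(0)
n, d = 50, 8
base = rng.standard_normal((n, d))

def random_rotation(rng, d):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

langs = {
    "en": base,
    "de": base @ random_rotation(rng, d),
    "fi": base @ random_rotation(rng, d),
}
identity_pairs = [(i, i) for i in range(n)]

hub = {"en": langs["en"]}  # the first language seeds the hub
W_de = add_language(hub, "de", langs["de"], {"en": identity_pairs})
W_fi = add_language(hub, "fi", langs["fi"],
                    {"en": identity_pairs, "de": identity_pairs})
```

In this toy setup the languages are exactly isomorphic, so each added language should land on top of the hub. With real embeddings the fit is only approximate, which is where the order of language addition discussed in the abstract would matter.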
Anthology ID:
N19-1188
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
1890–1902
URL:
https://aclanthology.org/N19-1188
DOI:
10.18653/v1/N19-1188
Cite (ACL):
Geert Heyman, Bregt Verreet, Ivan Vulić, and Marie-Francine Moens. 2019. Learning Unsupervised Multilingual Word Embeddings with Incremental Multilingual Hubs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1890–1902, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
Learning Unsupervised Multilingual Word Embeddings with Incremental Multilingual Hubs (Heyman et al., NAACL 2019)
PDF:
https://aclanthology.org/N19-1188.pdf