Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

Takashi Wada; Tomoharu Iwata; Yuji Matsumoto; Timothy Baldwin; Jey Han Lau

doi:10.18653/v1/2021.mrl-1.2

Abstract

We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations even in extremely low-resource conditions. Furthermore, our model also works well on high-resource conditions, achieving state-of-the-art performance on a German-English word-alignment task.
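The idea of combining word and subword embeddings, mentioned in the abstract, can be illustrated with a minimal sketch. It assumes character n-grams as the subword units and a simple sum of the word vector with the mean of its n-gram vectors; the abstract does not specify either choice, so the paper's actual scheme may differ. The names `char_ngrams`, `make_table` and `combined_embedding` are hypothetical, and the random vectors stand in for learned parameters.

```python
import random


def char_ngrams(word, n=3):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def make_table(dim, seed):
    """Build a lazily filled embedding table (random stand-in for learned weights)."""
    rng = random.Random(seed)
    table = {}

    def lookup(key):
        if key not in table:
            table[key] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        return table[key]

    return lookup


def combined_embedding(word, word_lookup, subword_lookup, n=3):
    """Sum the word vector and the mean of its n-gram vectors (assumed combination)."""
    word_vec = word_lookup(word)
    gram_vecs = [subword_lookup(g) for g in char_ngrams(word, n)]
    mean_vec = [sum(col) / len(gram_vecs) for col in zip(*gram_vecs)]
    return [w + s for w, s in zip(word_vec, mean_vec)]


# One subword table shared by all languages (an assumption of this sketch):
# similarly spelled words reuse the same n-gram vectors.
word_lookup = make_table(dim=4, seed=0)
subword_lookup = make_table(dim=4, seed=1)
vec_en = combined_embedding("night", word_lookup, subword_lookup)
vec_de = combined_embedding("nicht", word_lookup, subword_lookup)
shared = set(char_ngrams("night")) & set(char_ngrams("nicht"))
```

Because the subword table is shared, orthographically similar words in different languages receive partly overlapping representations even when their word-level vectors are unrelated.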

Anthology ID: 2021.mrl-1.2
Volume: Proceedings of the 1st Workshop on Multilingual Representation Learning
Month: November
Year: 2021
Address: Punta Cana, Dominican Republic
Editors: Duygu Ataman, Alexandra Birch, Alexis Conneau, Orhan Firat, Sebastian Ruder, Gozde Gul Sahin
Venue: MRL
Publisher: Association for Computational Linguistics
Pages: 16–31
URL: https://aclanthology.org/2021.mrl-1.2
DOI: 10.18653/v1/2021.mrl-1.2
Cite (ACL): Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, and Jey Han Lau. 2021. Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 16–31, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora (Wada et al., MRL 2021)
PDF: https://aclanthology.org/2021.mrl-1.2.pdf
Video: https://aclanthology.org/2021.mrl-1.2.mp4
Code: twadada/multilingual-nlm
