Do not neglect related languages: The case of low-resource Occitan cross-lingual word embeddings

Lisa Woller, Viktor Hangya, Alexander Fraser


Abstract
Cross-lingual word embeddings (CLWEs) have proven indispensable for various natural language processing tasks, e.g., bilingual lexicon induction (BLI). However, the lack of data often impairs the quality of representations. Various approaches requiring only weak cross-lingual supervision have been proposed, but current methods still fail to learn good CLWEs for languages with only a small monolingual corpus. We therefore claim that it is necessary to explore further datasets to improve CLWEs in low-resource setups. In this paper, we propose to incorporate data from related high-resource languages. In contrast to previous approaches, which leverage independently pre-trained embeddings of languages, we (i) train CLWEs for the low-resource language and a related language jointly and (ii) map them to the target language to build the final multilingual space. In our experiments, we focus on Occitan, a low-resource Romance language which is often neglected due to lack of resources. We leverage data from French, Spanish and Catalan for training and evaluate on the Occitan-English BLI task. By incorporating supporting languages, our method outperforms previous approaches by a large margin. Furthermore, our analysis shows that the degree of relatedness between an incorporated language and the low-resource language is critically important.
Anthology ID:
2021.mrl-1.4
Volume:
Proceedings of the 1st Workshop on Multilingual Representation Learning
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Duygu Ataman, Alexandra Birch, Alexis Conneau, Orhan Firat, Sebastian Ruder, Gozde Gul Sahin
Venue:
MRL
Publisher:
Association for Computational Linguistics
Pages:
41–50
URL:
https://aclanthology.org/2021.mrl-1.4
DOI:
10.18653/v1/2021.mrl-1.4
Cite (ACL):
Lisa Woller, Viktor Hangya, and Alexander Fraser. 2021. Do not neglect related languages: The case of low-resource Occitan cross-lingual word embeddings. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 41–50, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Do not neglect related languages: The case of low-resource Occitan cross-lingual word embeddings (Woller et al., MRL 2021)
PDF:
https://aclanthology.org/2021.mrl-1.4.pdf
Video:
https://aclanthology.org/2021.mrl-1.4.mp4