Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages

Olga Pelloni, Anastassia Shaitarova, Tanja Samardzic


Abstract
Pre-trained multilingual models, such as mBERT, XLM-R and mT5, are used to improve performance on various tasks in low-resource languages via cross-lingual transfer. In this framework, English is usually seen as the most natural choice of transfer language (for fine-tuning or continued training of a multilingual pre-trained model), but it has recently been shown that this is often not the best choice. The success of cross-lingual transfer seems to depend on properties of languages that are currently hard to explain: successful transfer often happens between unrelated languages, and it often cannot be explained by data-dependent factors. In this study, we show that languages written in non-Latin and non-alphabetic scripts (mostly Asian languages) are the best choices for improving performance on the task of Masked Language Modelling (MLM) in a diverse set of 30 low-resource languages, and that the success of the transfer is well predicted by our novel measure of Subword Evenness (SuE). Transferring language models from languages that score low on our measure results in the lowest average perplexity over target low-resource languages. Our correlation coefficients, obtained with three different pre-trained multilingual models, are consistently higher than those of all other predictors, including text-based measures (type-token ratio, entropy) and linguistically motivated choices (genealogical and typological proximity).
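The abstract compares SuE against two text-based baseline predictors, type-token ratio and entropy. As a rough illustration only (not the authors' exact implementation, which may operate on characters, words, or subwords and use different preprocessing), both quantities can be sketched in a few lines:

```python
import math
from collections import Counter

def type_token_ratio(text: str) -> float:
    """Type-token ratio: distinct word forms divided by total tokens."""
    tokens = text.split()
    return len(set(tokens)) / len(tokens)

def char_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character distribution."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "the cat sat on the mat"
print(round(type_token_ratio(sample), 2))  # 5 distinct types / 6 tokens -> 0.83
print(round(char_entropy(sample), 2))
```

Both measures depend heavily on corpus size and tokenisation, which is one reason such data-dependent baselines can fail to explain transfer success, as the abstract notes.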
Anthology ID:
2022.emnlp-main.503
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
7428–7445
URL:
https://aclanthology.org/2022.emnlp-main.503
DOI:
10.18653/v1/2022.emnlp-main.503
Cite (ACL):
Olga Pelloni, Anastassia Shaitarova, and Tanja Samardzic. 2022. Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7428–7445, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages (Pelloni et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.503.pdf