Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection

Tolulope Ogunremi, Dan Jurafsky, Christopher Manning


Abstract
With the prominence of large pretrained language models, low-resource languages are rarely modelled monolingually and become victims of the “curse of multilinguality” in massively multilingual models. Recently, AfriBERTa showed that training transformer models from scratch on 1GB of data from many unrelated African languages outperforms massively multilingual models on downstream NLP tasks. Here we extend this direction, focusing on the use of related languages. We propose that training on smaller amounts of data but from related languages could match the performance of models trained on large, unrelated data. We test our hypothesis on the Niger-Congo family and its Bantu and Volta-Niger sub-families, pretraining models with data solely from Niger-Congo languages and finetuning on 4 downstream tasks: NER, part-of-speech tagging, sentiment analysis and text classification. We find that models trained on genetically related languages achieve equal performance on downstream tasks in low-resource languages despite using less training data. We recommend selecting training data based on language-relatedness when pretraining language models for low-resource languages.
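Below is a minimal sketch of the data-selection idea the abstract describes: restrict the pretraining corpus to a hand-picked set of related (here, Niger-Congo) languages and pretrain a masked language model from scratch on only that data. The language codes, corpus paths, tokenizer choice, and model size are illustrative assumptions, not the authors' exact setup; in particular, the sketch reuses the XLM-R tokenizer for brevity rather than training one on the selected data.

```python
# Minimal sketch (not the paper's exact pipeline): select pretraining data
# by language relatedness, then train a small masked LM from scratch.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    XLMRobertaConfig,
    XLMRobertaForMaskedLM,
)

# Hypothetical selection: a subset of Niger-Congo language codes, rather than
# a massively multilingual mix of unrelated languages.
NIGER_CONGO_LANGS = ["yo", "ig", "sw", "zu", "xh", "rw"]  # illustrative

# Assumed corpus layout: one plain-text file per language, named by ISO code.
data_files = [f"corpus/{lang}.txt" for lang in NIGER_CONGO_LANGS]
raw = load_dataset("text", data_files=data_files)["train"]

# Reuse an existing multilingual tokenizer for brevity; a from-scratch setup
# would instead train a tokenizer on the selected data.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Small transformer initialized from scratch (random weights), not from a
# pretrained checkpoint.
config = XLMRobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=514,
)
model = XLMRobertaForMaskedLM(config)

# Standard masked-language-modeling objective with 15% masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mlm-niger-congo",
        per_device_train_batch_size=16,
        num_train_epochs=1,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The resulting checkpoint would then be finetuned on the downstream tasks named in the abstract (NER, POS tagging, sentiment analysis, text classification) in the usual way.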
Anthology ID: 2023.findings-eacl.93
Volume: Findings of the Association for Computational Linguistics: EACL 2023
Month: May
Year: 2023
Address: Dubrovnik, Croatia
Editors: Andreas Vlachos, Isabelle Augenstein
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1251–1266
URL: https://aclanthology.org/2023.findings-eacl.93
DOI: 10.18653/v1/2023.findings-eacl.93
Cite (ACL): Tolulope Ogunremi, Dan Jurafsky, and Christopher Manning. 2023. Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1251–1266, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal): Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection (Ogunremi et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-eacl.93.pdf
Video: https://aclanthology.org/2023.findings-eacl.93.mp4