@inproceedings{adebara-etal-2023-serengeti,
    title = "{SERENGETI}: Massively Multilingual Language Models for {A}frica",
    author = "Adebara, Ife and
      Elmadany, AbdelRahim and
      Abdul-Mageed, Muhammad and
      Alcoba Inciarte, Alcides",
    editor = "Rogers, Anna and
      Boyd-Graber, Jordan and
      Okazaki, Naoaki",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.97",
    doi = "10.18653/v1/2023.findings-acl.97",
    pages = "1498--1537",
    abstract = "Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only {\textasciitilde}31 out of {\textasciitilde}2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a set of massively multilingual language models that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms other models on 11 datasets across the eight tasks, achieving 82.27 average F{\_}1. We also perform analyses of errors from our models, which allows us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="adebara-etal-2023-serengeti">
    <titleInfo>
      <title>SERENGETI: Massively Multilingual Language Models for Africa</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Ife</namePart>
      <namePart type="family">Adebara</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">AbdelRahim</namePart>
      <namePart type="family">Elmadany</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Muhammad</namePart>
      <namePart type="family">Abdul-Mageed</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Alcides</namePart>
      <namePart type="family">Alcoba Inciarte</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2023-07</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Findings of the Association for Computational Linguistics: ACL 2023</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Anna</namePart>
        <namePart type="family">Rogers</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Jordan</namePart>
        <namePart type="family">Boyd-Graber</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Naoaki</namePart>
        <namePart type="family">Okazaki</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Toronto, Canada</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a set of massively multilingual language models that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms other models on 11 datasets across the eight tasks, achieving 82.27 average F_1. We also perform analyses of errors from our models, which allows us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research.</abstract>
<identifier type="citekey">adebara-etal-2023-serengeti</identifier>
<identifier type="doi">10.18653/v1/2023.findings-acl.97</identifier>
<location>
<url>https://aclanthology.org/2023.findings-acl.97</url>
</location>
<part>
<date>2023-07</date>
<extent unit="page">
<start>1498</start>
<end>1537</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T SERENGETI: Massively Multilingual Language Models for Africa
%A Adebara, Ife
%A Elmadany, AbdelRahim
%A Abdul-Mageed, Muhammad
%A Alcoba Inciarte, Alcides
%Y Rogers, Anna
%Y Boyd-Graber, Jordan
%Y Okazaki, Naoaki
%S Findings of the Association for Computational Linguistics: ACL 2023
%D 2023
%8 July
%I Association for Computational Linguistics
%C Toronto, Canada
%F adebara-etal-2023-serengeti
%X Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a set of massively multilingual language models that covers 517 African languages and language varieties. We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages. SERENGETI outperforms other models on 11 datasets across the eight tasks, achieving 82.27 average F_1. We also perform analyses of errors from our models, which allows us to investigate the influence of language genealogy and linguistic similarity when the models are applied under zero-shot settings. We will publicly release our models for research.
%R 10.18653/v1/2023.findings-acl.97
%U https://aclanthology.org/2023.findings-acl.97
%U https://doi.org/10.18653/v1/2023.findings-acl.97
%P 1498-1537
Markdown (Informal)
[SERENGETI: Massively Multilingual Language Models for Africa](https://aclanthology.org/2023.findings-acl.97) (Adebara et al., Findings 2023)
ACL
Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, and Alcides Alcoba Inciarte. 2023. SERENGETI: Massively Multilingual Language Models for Africa. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1498–1537, Toronto, Canada. Association for Computational Linguistics.