SwahBERT: Language Model of Swahili

Gati Martin, Medard Edmund Mswahili, Young-Seob Jeong, Jiyoung Woo


Abstract
The rapid development of social networks, electronic commerce, mobile Internet, and other technologies has influenced the growth of Web data. Social media and Internet forums are valuable sources of citizens’ opinions, which can be analyzed to support community development and to study user behavior. Unfortunately, the scarcity of resources (i.e., datasets or language models) becomes a barrier to the development of natural language processing applications in low-resource languages. Thanks to the recent growth of Swahili online forums and news platforms, we introduce two Swahili datasets in this paper: a pre-training dataset of approximately 105MB with 16M words and an annotated dataset of 13K instances for the emotion classification task. The emotion classification dataset is manually annotated by two native Swahili speakers. We pre-trained a new monolingual language model for Swahili, namely SwahBERT, using our collected pre-training data, and tested it on four downstream tasks, including emotion classification. We found that SwahBERT outperforms multilingual BERT, a well-known existing language model, on almost all downstream tasks.
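As a rough illustration of how a SwahBERT-style checkpoint could be applied to the emotion classification task described in the abstract, the sketch below fine-tunes a BERT-style encoder for sequence classification with the Hugging Face transformers library. The model identifier "swahbert-base-uncased" and the six-way label count are assumptions for illustration only; they are not confirmed by this page.

```python
# Minimal sketch (not from the paper): applying a BERT-style checkpoint to
# Swahili emotion classification with Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "swahbert-base-uncased"  # hypothetical identifier; substitute the released checkpoint
NUM_LABELS = 6                        # assumed number of emotion classes

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

# Encode a Swahili sentence ("I am very happy today") and predict an emotion label.
inputs = tokenizer("Nina furaha sana leo", return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)
```

In practice the classification head would first be fine-tuned on the 13K-instance annotated dataset before predictions are meaningful; the snippet only shows the loading and inference plumbing.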
Anthology ID:
2022.naacl-main.23
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
303–313
URL:
https://aclanthology.org/2022.naacl-main.23
DOI:
10.18653/v1/2022.naacl-main.23
Cite (ACL):
Gati Martin, Medard Edmund Mswahili, Young-Seob Jeong, and Jiyoung Woo. 2022. SwahBERT: Language Model of Swahili. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 303–313, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
SwahBERT: Language Model of Swahili (Martin et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.23.pdf
Video:
https://aclanthology.org/2022.naacl-main.23.mp4
Data:
DailyDialog, ISEAR, MasakhaNER