MasakhaNEWS: News Topic Classification for African languages

African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically-diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and the Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as little as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach.



INTRODUCTION
News topic classification is a text classification task in NLP that involves categorizing news articles into different categories such as sports, business, entertainment or politics. It has shaped the development of several machine learning algorithms over the years, such as topic modeling (Blei et al., 2001; Dieng et al., 2020) and deep learning models (Zhang et al., 2015; Joulin et al., 2017). Similarly, news topic classification is a popular downstream task for evaluating the performance of large language models (LLMs) in both fine-tuning and prompt-tuning setups (Yang et al., 2019; Sun et al., 2019; Brown et al., 2020; Liu et al., 2023).
In the recent "prompting" paradigm, it has been shown that with as little as 5 or 10 labelled examples, one can obtain impressive predictive performance for text classification by leveraging LLMs (Schick & Schütze, 2021a; Sanh et al., 2022; Scao et al., 2022). However, most of these evaluations have only been performed in English and a few other high-resource languages. It is unclear how this approach extends to pre-trained multilingual language models for low-resource languages. For instance, BLOOM (Scao et al., 2022) was pre-trained on 46 languages, including 22 African languages (mostly from the Niger-Congo family). However, extensive evaluation on this set of African languages was not performed due to a lack of evaluation datasets. In general, only a handful of NLP tasks, such as machine translation (Adelani et al., 2022a; NLLB-Team et al., 2022), named entity recognition (Adelani et al., 2021; 2022b), and sentiment classification (Muhammad et al., 2023), have standardized benchmark datasets covering several geographically and typologically-diverse African languages. Another popular task for evaluating the downstream performance of language models is news topic classification, but human-annotated datasets for benchmarking topic classification using language models for African languages are scarce.
In this paper, we address two problems: the lack of evaluation datasets and the lack of extensive evaluation of LLMs for African languages. We create MasakhaNEWS -- a large-scale news topic classification dataset covering 16 typologically-diverse languages widely spoken in Africa, including English and French, with the same label categories across all languages. We provide several baseline models using both classical machine learning approaches and fine-tuning of LLMs. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning (e.g. 5 examples per label), such as cross-lingual parameter-efficient fine-tuning (like MAD-X (Pfeiffer et al., 2020)), pattern exploiting training (PET) (Schick & Schütze, 2021a), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit (Tunstall et al., 2022a) and the Cohere Embedding API).
Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as little as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach. We hope that MasakhaNEWS encourages the NLP community to benchmark and evaluate LLMs on more low-resource languages. For reproducibility, the data and code are available on GitHub.

RELATED WORK
Topic classification, an application of text classification, is a popular task in natural language processing. For this task, several datasets for various languages (Zhang et al., 2015), including African languages, have been created using either manual or automatic annotation techniques. However, these efforts are currently limited to a small number of African languages. For example, Hedderich et al. (2020) created a dataset that was manually annotated for the Hausa and Yoruba languages, sourced from VOA Hausa and BBC Yoruba, with 7 and 5 categories respectively. Niyongabo et al. (2020) also developed a moderately large news topic classification dataset for Kinyarwanda and Kirundi, using human annotators to reclassify news from various Rwandan news websites into 14 categories for Kinyarwanda and 12 categories for Kirundi, down from the initial 48 and 26 categories. Similarly, Azime & Mohammed (2021) curated a 6-category topic classification dataset for Amharic by gathering topics and their predefined labels from several websites, then manually reviewing and removing any inconsistencies. Another news topic classification dataset is the ANTC dataset (Alabi et al., 2022), an automatically created dataset collected from various sources such as VOA, BBC, Global Voices, and Isolezwe newspapers. It contains five African languages -- Lingala, Somali, Naija, Malagasy, and isiZulu -- and uses the predefined labels from the different websites.
To the best of our knowledge, these are the few publicly available topic classification datasets for African languages, covering approximately 11 languages. These datasets, however, have limitations because they were created with little or no human supervision and use different labeling schemes. In contrast, in this work, we present news topic classification data for 16 typologically-diverse African languages, with a consistent labeling scheme applied across all languages.
Prompting language models using manually designed prompts to guide text generation has recently been applied to a myriad of NLP tasks, including topic classification. Models such as GPT-3 (Brown et al., 2020) and T5 (Raffel et al., 2020) are able to learn structural and semantic relationships between words and have shown impressive results, even in multilingual scenarios, when tuned for different tasks. One approach to prompt-tuning a language model for topic classification is to design a "template" for classification and insert a sequence of text into the template. This is then used to condition the language model to generate the corresponding class for that span of text. Using this approach, Le Scao & Rush (2021) show that the effectiveness of prompting is heavily dependent on the quality of the designed prompts and that a prompt is potentially worth 100 data points. This means that prompting might represent a new approach to learning in low-resource settings, commonly known as few-shot learning.
There are other exciting approaches to few-shot learning without prompting. One of them is SetFit (Tunstall et al., 2022a), which takes advantage of sentence transformers to generate dense representations for input sequences. These representations are then passed through a classifier to predict class labels. The sentence transformers are trained on a few examples using contrastive learning, where positive and negative training pairs are sampled by in-class and out-of-class sampling. Another common approach is Pattern-Exploiting Training, also known as PET (Schick & Schütze, 2021a). PET is a semi-supervised training approach that uses restructured input sequences to condition language models to better understand a given task, while iPET (Schick & Schütze, 2021b) is an iterative variant of PET that has also been shown to perform well in few-shot scenarios. In this work, we benchmark the performance of all these approaches for topic classification in African languages.

LANGUAGES
Table 1 presents the languages covered in MasakhaNEWS along with information on their language families, their primary geographic regions in Africa, and the number of speakers. Our dataset consists of a total of 16 typologically-diverse languages, selected based on the availability of publicly available news corpora in each language, the availability of native-speaking annotators, geographical diversity and, most importantly, because they are widely spoken in Africa. English and French are official languages in 42 African countries, Swahili is native to 12 countries, and Hausa is native to 6 countries. In terms of geographical diversity, we have four languages spoken in West Africa, seven languages spoken in East Africa, two languages spoken in Central Africa (i.e. Lingala and Kiswahili), and two spoken in Southern Africa (i.e. chiShona and isiXhosa). We also cover four language families: Niger-Congo (8), Afro-Asiatic (5), Indo-European (2), and English Creole (1). The only English creole language is Nigerian-Pidgin, also known as Naija. Each language is spoken by at least 10 million people, according to Ethnologue (Eberhard et al., 2021).

DATA SOURCE
The data used in this study were sourced from multiple reputable news outlets. The collection process involved crawling the British Broadcasting Corporation (BBC) and Voice of America (VOA) websites. We crawled between 2k and 12k articles per language, depending on the number of articles available on the websites. Some of the websites already have pre-defined categories; we make use of these to filter out articles that do not belong to the categories we plan to annotate. We took inspiration for our news categorization from BBC English, with six (6) pre-defined and well-defined categories ("business", "entertainment", "health", "politics", "sports", and "technology"), each with over 500 articles. For English, we only crawled articles belonging to these categories, while for the other languages, we crawled all articles. Our target was to have around 3,000 articles per language for annotation, but three languages (Lingala, Rundi, and Somali) have fewer than that. Table 2 shows the news source per language and the number of articles crawled.

DATA ANNOTATION
We recruited volunteers from the Masakhane community -- an African grassroots community focused on advancing NLP for African languages. The annotators were asked to label 3k articles into eight categories: "business", "entertainment", "health", "politics", "religion", "sports", "technology", and "uncategorized". Six of the categories are based on the major BBC English news categories; the "religion" label was added since many African news websites frequently cover this topic. Articles that do not belong to the first seven categories are assigned the "uncategorized" label.
For each language, the annotation followed two stages. In the first stage, we randomly shuffled the entire dataset and asked annotators to label the first 200 articles manually. In the second stage, we made use of active learning by combining the first 200 annotated articles with articles with pre-defined labels from news websites (when available) and trained a classifier (i.e. by fine-tuning the AfroXLMR-base LLM (Alabi et al., 2022)). We ran predictions on the rest of the articles and asked annotators to correct the mistakes of the classifier. This approach helped to speed up the annotation process.
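The two-stage procedure can be sketched as follows. Here `human_label`, `train_classifier`, and `human_correct` are hypothetical callables standing in for the annotators and the fine-tuned AfroXLMR-base classifier; they are illustrative and not part of any released code.

```python
import random

def two_stage_annotation(articles, human_label, train_classifier,
                         human_correct, seed_size=200):
    """Stage 1: annotators manually label a random seed set.
    Stage 2: a classifier trained on the seed pre-labels the rest,
    and annotators only correct its mistakes."""
    articles = list(articles)
    random.Random(0).shuffle(articles)
    seed = [(a, human_label(a)) for a in articles[:seed_size]]
    # The paper fine-tunes AfroXLMR-base here; any text classifier works
    # for this sketch.
    classify = train_classifier(seed)
    rest = [(a, human_correct(a, classify(a))) for a in articles[seed_size:]]
    return seed + rest
```

Because annotators only verify or fix pre-labels in stage 2, each article after the seed set costs much less annotation time than labeling from scratch.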
Annotation tool We make use of an in-house annotation tool built for text classification to label the articles. Appendix A shows an example of the tool's interface. To further reduce annotator effort, we asked annotators to label articles based on the headlines instead of the entire article. However, since some headlines are not very descriptive, we decided to concatenate the headline and the first two sentences of the news text to provide additional context to annotators.
Inter-agreement score We report the Fleiss Kappa score (Fleiss et al., 1971) to measure annotation agreement. The scores range from moderate to high (i.e. 0.55 -- 0.85), which shows a high agreement among the annotators recruited for each language. Languages with only one annotator (i.e. Luganda and Rundi) were excluded from this evaluation.
Deciding a single label per article After annotation, we assign the final label to each article by majority voting: a label needs to be agreed upon by at least two annotators to be assigned to an article. The only exceptions were Luganda and Rundi, since they had a single annotator.
Our final dataset for each language consists of a minimum of 72 articles per topic and a maximum of 500, except for English, where the classes are roughly balanced. We excluded the infrequent labels so that we do not have a highly unbalanced dataset. The choice of a minimum of 72 articles ensures a minimum of 50 articles in the training set. Our target was to have at least four topics per language with a minimum of 72 articles. This approach worked smoothly except for two languages: Lingala ("politics", "health" and "sports") and chiShona ("business", "health" and "politics"), where we had only three topics with more than 72 articles. To obtain more articles per class for Lingala, we resolved the annotation conflicts between annotators to gain more labels for the "business" category. For chiShona, which still had infrequent classes, we crawled additional "sports" articles from a local chiShona website (Kwayedza), followed by manual filtering of unrelated sports news.
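The frequency filter described above can be sketched as a small helper; `min_count=72` follows the paper, while the function name is ours:

```python
from collections import Counter

def keep_frequent_topics(labelled_articles, min_count=72):
    """Drop topics with fewer than `min_count` articles so the final
    dataset is not highly unbalanced (the paper uses a minimum of 72)."""
    counts = Counter(label for _, label in labelled_articles)
    kept = {topic for topic, c in counts.items() if c >= min_count}
    return [(article, label)
            for article, label in labelled_articles if label in kept]
```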
Data Split Table 2 provides the data split for the MasakhaNEWS languages, along with the distribution of articles by topic. We divided the annotated data into TRAIN, DEV and TEST splits following a 70% / 10% / 20% ratio.
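A minimal sketch of such a split (a plain shuffled split; the paper does not state whether the split is stratified by topic):

```python
import random

def train_dev_test_split(articles, seed=42):
    """Shuffle and split into TRAIN/DEV/TEST with a 70/10/20 ratio."""
    articles = list(articles)
    random.Random(seed).shuffle(articles)
    n = len(articles)
    # Integer arithmetic keeps the 70/10/20 proportions exact.
    n_train, n_dev = (7 * n) // 10, n // 10
    return (articles[:n_train],
            articles[n_train:n_train + n_dev],
            articles[n_train + n_dev:])
```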

BASELINE EXPERIMENTS
We trained baseline text classification models on the concatenation of the news headline and news text, using several different approaches.

BASELINE MODELS
We trained three classical ML models: Naive Bayes, a multi-layer perceptron (MLP), and XGBoost, using the popular scikit-learn library. We employed the "CountVectorizer" method to represent the text data, which converts a collection of text documents into a matrix of token counts. This allows us to convert text data into numerical feature vectors.
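The CountVectorizer step can be illustrated with a small standard-library sketch; scikit-learn's actual implementation adds configurable tokenization, n-grams, and sparse matrices on top of this idea:

```python
import re
from collections import Counter

def build_vocabulary(docs):
    """Assign each token a column index (mirrors CountVectorizer.fit)."""
    vocab = {}
    for doc in docs:
        for tok in re.findall(r"\b\w+\b", doc.lower()):
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_count_matrix(docs, vocab):
    """Turn documents into rows of token counts (CountVectorizer.transform)."""
    matrix = []
    for doc in docs:
        counts = Counter(re.findall(r"\b\w+\b", doc.lower()))
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return matrix
```

The resulting count matrix is what the Naive Bayes, MLP, and XGBoost baselines consume as input features.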
Furthermore, we fine-tune nine kinds of multilingual text encoders. Seven of them are BERT/RoBERTa-based, i.e. XLM-R (base & large) (Conneau et al., 2020), among others, and AfroLM (Dossou et al., 2022); the other two are mDeBERTaV3 (He et al., 2021a) and LaBSE (Feng et al., 2022). mDeBERTaV3 pre-trained a DeBERTa-style model (He et al., 2021b) with the replaced token detection objective proposed in ELECTRA (Clark et al., 2020). LaBSE, on the other hand, is a multilingual sentence transformer model that is popular for mining parallel corpora for machine translation.

Table 4: Baseline results on MasakhaNEWS. We compare several ML approaches using both classical ML and LLMs. Averages are over 5 runs. Evaluation is based on weighted F1-score. Africa-centric models are in gray.
The LLMs evaluated were both massively multilingual (i.e. typically trained on over 100 languages around the world) and Africa-centric (i.e. trained mostly on languages spoken in Africa). The Africa-centric multilingual text encoders are all modeled after XLM-R. AfriBERTa was pre-trained from scratch on 11 African languages; AfroXLMR was adapted to African languages by fine-tuning the original XLM-R model on 17 African languages and 3 languages commonly spoken in Africa; and AfroLM was pre-trained on 23 African languages utilizing active learning. Similar to the PLMs, the text-to-text (T2T) models used in this study were pre-trained on hundreds of languages, and they are all based on the T5 model (Raffel et al., 2020), an encoder-decoder model trained with the span-mask denoising objective. mT5 is a multilingual version of T5, and Flan-T5 was fine-tuned on multiple tasks using T5 as a base. The study also included adaptations of the original models, such as AfriMT5-base, as well as AfriTeVA-base, a T5 model pre-trained on 10 African languages.

BASELINE RESULTS
Table 4 shows the results of training several models on the MasakhaNEWS TRAIN split and evaluating on the TEST split for each language. Our evaluation shows that classical ML models are in general worse than fine-tuned multilingual LLMs on average; however, their performance is sometimes comparable to an LLM's if the language was not covered during pre-training of the LLM. The best result is achieved by AfroXLMR-base/large, with over 4.0 F1 improvement over AfriBERTa. The larger variant gave the overall best result due to its size. The AfroXLMR models benefited from being pre-trained on most of the languages we evaluate on. We also tried multilingual text-to-text models, but none of them reached the performance of AfroXLMR-large despite their larger size. We observe the same trend in that the adapted mT5 model (i.e. AfriMT5) gave better results than mT5, similar to how AfroXLMR gave better results than XLM-R. We found Flan-T5-base to be competitive with AfriMT5 despite having seen few African languages; however, its performance was very low for amh and tir, probably because the model does not support the Ge'ez script.

Headline-only training
We compare our results using headline+text (as shown in Table 4) with training on the article headline only. We find that fine-tuned LLMs give impressive performance with only headlines, while classical ML methods struggle due to the shorter content. Figure 1 shows the result of our comparison. AfroXLMR-base and AfroXLMR-large improve by 2.3 and 1.5 F1 points respectively when we use headline+text instead of headline only. Classical ML models improve the most when we make use of headline+text: MLP, NaiveBayes and XGBoost improve by large margins (i.e. 7.4 -- 9.7 F1 points). Thus, for the remainder of this paper, we make use of headline+text. Appendix B provides the breakdown of the results by language for the comparison of headline and headline+text.
ZERO AND FEW-SHOT LEARNING

METHODS
Here, we compare different zero-shot and few-shot methods:

1. Fine-tune: fine-tune a model (AfroXLMR-base) on a source language and evaluate it on a target language. This is only used in the zero-shot setting.
2. MAD-X 2.0 (Pfeiffer et al., 2020; 2021) -- a parameter-efficient approach for cross-lingual transfer leveraging the modularity and portability of adapters (Houlsby et al., 2019). We followed the same zero-shot setup as Alabi et al. (2022); however, we make use of hau and swa as source languages since they cover all the news topics used by all languages.
3. PET/iPET (Schick & Schütze, 2021a;b), also known as (Iterative) Pattern Exploiting Training, is a semi-supervised approach that makes use of a few labelled examples and a prompt/pattern to an LLM for few-shot learning. It involves three steps: (1) design a prompt/pattern and a verbalizer (which maps each label to a word from the LLM vocabulary); (2) train an LLM on each pattern based on the few labelled examples; (3) distill the knowledge of the LLMs on unlabelled data. PET therefore leverages unlabelled examples to improve few-shot learning. iPET, on the other hand, repeats steps 2 and 3 iteratively. We make use of the same set of patterns used for the AGNEWS English dataset provided by the PET/iPET authors; the patterns combine a, the news headline, and b, the news text. In evaluation, we take the average over all patterns.

4. SetFit (Tunstall et al., 2022b) is a few-shot learning framework based on sentence transformer models (Reimers & Gurevych, 2019) like LaBSE, following two steps. Step 1 fine-tunes the sentence transformer model on a few labelled examples with contrastive learning, where positive pairs are K examples from a class c and negative pairs are labelled examples from other classes; contrastive learning enlarges the size of the training data in few-shot scenarios. In Step 2, the fine-tuned sentence transformer model is used to extract a rich sentence representation for each labelled example, followed by logistic regression for classification. The advantage of this approach is that it is faster and requires no prompt, unlike PET/iPET. We use it in both the zero- and few-shot settings. For the zero-shot setting, SetFit creates dummy examples N times (we set N = 8) like "this sentence is {}" where {} can be any news topic like "sports".

5. Co:here multilingual sentence transformer: Co:here introduced a multilingual embedding model, multilingual-22-12, which supports over a hundred languages, including most of the languages in MasakhaNEWS. This is only used in the few-shot setting.
6. OpenAI ChatGPT API is an LLM trained on a large amount of text to predict the next word, like GPT-3 (Brown et al., 2020), followed by instruction tuning in a prompt based on human feedback. It leverages Reinforcement Learning from Human Feedback (RLHF), similar to InstructGPT (Ouyang et al., 2022), to make the LLM interact in a conversational way. We prompt the OpenAI API based on GPT-3.5 Turbo-0301 to categorize articles into news topics. Our initial experiments showed that it did not work for the Ge'ez script; thus, we make use of the open-sourced NLLB (NLLB-Team et al., 2022) machine translation model to translate Amharic and Tigrinya articles to English before evaluation.
For prompting, we make use of a simple template from Sanh et al. (2022): 'Is this a piece of news regarding {{"business, entertainment, health, politics, religion, sports or technology"}}? {{INPUT}}'. We use the first 100 tokens of headline+text as {{INPUT}}. The completion of the LLM can be a single word, a sentence, or multiple sentences. We check whether a descriptive word relating to any of the news topics has been predicted; for example, "economy", "economic", or "finance" is mapped to "business" news. We provide more details on the ChatGPT evaluation in Appendix C.
For all few-shot settings, we tried K samples/shots per class where K = 5, 10, 20, 50.We make use of LaBSE as the sentence transformer for SetFit, and AfroXLMR-large as the LLM for PET/iPET.
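The prompt construction described above can be sketched as follows; the 100-token truncation here uses simple whitespace tokens, which is an assumption since the paper does not specify the tokenizer:

```python
TOPICS = ("business, entertainment, health, politics, religion, "
          "sports or technology")

def build_prompt(headline, text, max_tokens=100):
    """Fill the Sanh et al. (2022) template with the first `max_tokens`
    tokens of headline+text, as used when prompting ChatGPT."""
    snippet = " ".join(f"{headline} {text}".split()[:max_tokens])
    return f'Is this a piece of news regarding "{TOPICS}"? {snippet}'
```

The resulting string is sent as the user message to the chat completion endpoint; the free-form completion is then mapped back to a label with the verbalizer described in Appendix C.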

ZERO-SHOT EVALUATION
Table 5 shows the results of zero-shot evaluation using FINETUNE, MAD-X, PET, SETFIT and CHATGPT. Our results show that cross-lingual zero-shot transfer from a source language works well, and that prompting ChatGPT demonstrates the superior capabilities of instruction-tuned LLMs over smaller LLMs. Surprisingly, the ChatGPT results were comparable to the FINETUNE approach for some languages (Amharic, English, Luganda, Oromo, Naija, Somali, isiXhosa, and Yorùbá), without leveraging any additional technique apart from prompting the LLM.
In general, it may be advantageous to leverage knowledge from other languages with available training data when no labelled data is available for the target language. We also observe that Swahili (swa) achieves better results as a source language than Hausa (hau), especially when transferring to fra (+13.8), lug (+9.0), and eng (+3.6). The impressive transfer from Swahili to Luganda might be due to both languages belonging to the same Great Lakes Bantu sub-group, but it is unclear why Hausa gave worse results than Swahili when adapting to English or French. However, with a few examples, the PET and SetFit methods are powerful without leveraging training data and models from other languages.

FEW-SHOT EVALUATION
Table 6 shows the results of the few-shot learning approaches. With only 5 shots, we find all the few-shot approaches to be better than the usual FINETUNE baseline for most languages. However, as the number of shots increases, FINETUNE has comparable results with SETFIT and the COHERE API, especially for K = 20, 50 shots. We found that PET achieved very impressive results even with 5 shots (81.9 F1 on average), matching the performance of SETFIT/COHERE API with 50 shots. The results are even better with more shots, i.e. (K = 10, 86.0 F1), (K = 20, 87.9 F1), and (K = 50, 89.9 F1). Surprisingly, with 50 shots, PET gave competitive results to the fully-supervised setting (i.e. fine-tuning on all TRAIN data), which achieved 92.6 F1 (see Table 4). It is important to note that PET/iPET makes use of additional unlabelled data while SetFit and the Cohere API do not. In general, our results highlight the importance of obtaining a few labelled examples for a new language, even as few as 10 examples per class, which is not time-consuming for native speakers to produce (Lauscher et al., 2020; Hedderich et al., 2020).

CONCLUSION
In this paper, we created the largest news topic classification dataset for 16 typologically-diverse languages spoken in Africa. We provide an extensive evaluation using both fully-supervised and few-shot learning settings. Furthermore, we study different techniques for adapting prompt-based and prompt-free tuning of LLMs to African languages. Our experimental results show the potential of prompt-based few-shot learning approaches like PET/iPET for African languages. In the future, we plan to extend this dataset to more African languages, include bigger multilingual LLMs like BLOOM, mT0 (Muennighoff et al., 2022) and XGLM (Lin et al., 2022) in our evaluation, and extend our analysis to other text classification tasks like sentiment classification (Shode et al., 2022; Muhammad et al., 2023).

B COMPARING DIFFERENT ARTICLE CONTENT TYPES
Table 7 provides the comparison between using only the news headline and headline+text for training. We find significant improvements on average when we make use of headline+text across all models and languages, especially for the classical ML methods (MLP, NaiveBayes, and XGBoost).

C CHATGPT EVALUATION
We prompted ChatGPT for news topic classification using the following template: 'Is this a piece of news regarding {{"business, entertainment, health, politics, religion, sports or technology"}}? {{INPUT}}'. The completion may take different forms, e.g. a single word, a sentence, or multiple sentences. Examples of such predictions are:

1. sports

2. This is a piece of news regarding sports.

3. This is a piece of sports news regarding the CHAN 2021 football tournament in Cameroon. It reports that the Mali national football team has advanced to the semi-finals after defeating the Congo national team in a match that ended in a penalty shootout.

4. This is a piece of news regarding sports. It talks about the recent match between Tunisia and Angola in the African Cup of Nations. Both teams scored a goal, and the article mentions some of the details of the game, such as the penalty and missed chances.

5. I'm sorry, but I'm having trouble understanding this piece of news as it appears to be in a language I don't recognize. Can you please provide me with news in English so I can assist you better?
To extract the right category, we make use of a simple verbalizer that maps several indicative words (capitalization ignored) to each category. When the right category is not obvious, as in example 5 above, we choose a random category before computing the F1-score.
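A minimal sketch of this verbalizer follows. The indicative word lists are illustrative (the paper's full lists are not reproduced here); only the "economy"/"economic"/"finance" → "business" mapping is taken directly from the text.

```python
import random

# Illustrative indicative words per category (lowercase); the paper's
# complete verbalizer lists more words per topic.
VERBALIZER = {
    "business": ["business", "economy", "economic", "finance"],
    "entertainment": ["entertainment", "music", "movie"],
    "health": ["health", "disease", "hospital"],
    "politics": ["politics", "political", "election"],
    "religion": ["religion", "religious", "church"],
    "sports": ["sports", "sport", "football"],
    "technology": ["technology", "tech", "internet"],
}

def extract_category(completion, seed=0):
    """Map a free-form ChatGPT completion to a news topic; fall back to a
    random category when no indicative word is found."""
    text = completion.lower()
    for category, words in VERBALIZER.items():
        if any(w in text for w in words):
            return category
    return random.Random(seed).choice(sorted(VERBALIZER))
```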
Figure 2 provides an example of the interface of our in-house annotation tool.

Table 1: Languages covered in MasakhaNEWS and data sources, including language family, region, number of L1 & L2 speakers, and number of articles from each news source.

Table 2: MasakhaNEWS dataset statistics. We provide the size of the annotated data, news topics, and number of annotators. All languages have a moderate to perfect Fleiss Kappa score. The topics are labelled by their prefixes in the table: business, entertainment, health, politics, religion, sport, technology.

Table 3: Languages covered by the different multilingual models and their sizes.

Comparison of article content types used for training news topic classification models. We report the average across all languages when either headline or headline+text is used.

For example, MLP, NaiveBayes and XGBoost have better performance than AfriBERTa on fra and sna, since these languages were not seen during pre-training of the LLM. Similarly, AfroLM had worse results on fra for the same reason. On average, XLM-R-base, AfroLM, mDeBERTaV3, and XLM-R-large gave 83.0 F1, 86.1 F1, 86.0 F1, and 86.1 F1 respectively, worse than the other LLMs (87.8 -- 92.6 F1), because they do not cover some of the African languages during pre-training (see Table 3) or were pre-trained on a small amount of data (e.g. AfroLM was pre-trained on less than 0.8GB of data despite seeing 23 African languages). Larger models such as LaBSE and RemBERT that cover more languages performed better than the smaller models; for example, LaBSE achieved an improvement of over 2.5 F1 points over AfriBERTa.

Table 5: Zero-shot learning on MasakhaNEWS. We compare several approaches, such as MAD-X, PET and SetFit. We excluded the source languages hau and swa from the average (AVG src). ChatGPT results with † are based on texts translated from Amharic/Tigrinya to English.