MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition

African languages are spoken by over a billion people, but they are under-represented in NLP research and development. Multiple challenges exist, including the limited availability of annotated training and evaluation datasets, as well as a lack of understanding of which settings, languages, and recently proposed methods like cross-lingual transfer will be effective. In this paper, we aim to move towards solutions for these challenges, focusing on the task of named entity recognition (NER). We present the creation of the largest human-annotated NER dataset to date for 20 African languages. We study the behaviour of state-of-the-art cross-lingual transfer methods in an Africa-centric setting, empirically demonstrating that the choice of source transfer language significantly affects performance. While much previous work defaults to using English as the source language, our results show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points over the 20 languages compared to using English.


Introduction
Many African languages are spoken by millions or tens of millions of speakers. However, these languages are poorly represented in NLP research, and the development of NLP systems for African languages is often limited by the lack of datasets for training and evaluation (Adelani et al., 2021b). Additionally, while there has been much recent work in using zero-shot cross-lingual transfer (Ponti et al., 2020; Pfeiffer et al., 2020; Ebrahimi et al., 2022) to improve performance on tasks for low-resource languages with multilingual pretrained language models (PLMs) (Devlin et al., 2019a; Conneau et al., 2020), the settings under which contemporary transfer learning methods work best are still unclear (Pruksachatkun et al., 2020; Lauscher et al., 2020; Xia et al., 2020). For example, several methods use English as the source language because of the availability of training data across many tasks (Hu et al., 2020; Ruder et al., 2021), but there is evidence that English is often not the best transfer language (Lin et al., 2019; de Vries et al., 2022; Oladipo et al., 2022), and the process of choosing the best source language to transfer from remains an open question.
There has been recent progress in creating benchmark datasets for training and evaluating models in African languages for several tasks, such as machine translation (∀ et al., 2020; Reid et al., 2021; Adelani et al., 2021a, 2022; Abdulmumin et al., 2022) and sentiment analysis (Yimam et al., 2020; Muhammad et al., 2022). In this paper, we focus on the standard NLP task of named entity recognition (NER) because of its utility in downstream applications such as question answering and information extraction. For NER, annotated datasets exist only in a few African languages (Adelani et al., 2021b; Yohannes and Amagasa, 2022), the largest of which is the MasakhaNER dataset (Adelani et al., 2021b) (which we call MasakhaNER 1.0 in the remainder of the paper). While MasakhaNER 1.0 covers 10 African languages spoken mostly in West and East Africa, it does not include any languages spoken in Southern Africa, which have distinct syntactic and morphological characteristics and are spoken by 40 million people.
In this paper, we tackle two current challenges in developing NER models for African languages: (1) the lack of typologically- and geographically-diverse evaluation datasets for African languages; and (2) choosing the best transfer language for NER in an Africa-centric setting, which has not been previously explored in the literature.
To address the first challenge, we create the MasakhaNER 2.0 corpus, the largest human-annotated NER dataset for African languages. MasakhaNER 2.0 contains annotated text data from 20 languages widely spoken in Sub-Saharan Africa and is complementary to the languages present in previously existing datasets (e.g., Adelani et al., 2021b). We discuss our annotation methodology as well as perform benchmarking experiments on our dataset with state-of-the-art NER models based on multilingual PLMs.
In addition, to better understand the effect of the source language on transfer learning, we extensively analyze different features that contribute to cross-lingual transfer, including linguistic characteristics of the languages (i.e., typological, geographical, and phylogenetic features) as well as data-dependent features such as entity overlap across source and target languages (Lin et al., 2019). We demonstrate that choosing the best transfer language(s) in both single-source and co-training setups leads to large improvements in NER performance in zero-shot settings; our experiments show an average 14 point increase in F1 score compared to using English as the source language across 20 target African languages. We release the data, code, and models at https://github.com/masakhane-io/masakhane-ner/tree/main/MasakhaNER2.0.

Related Work
African NER Datasets  There are some human-annotated NER datasets for African languages, such as the SADiLaR NER corpus (Eiselen, 2016) covering 10 South African languages, LORELEI (Strassel and Tracey, 2016), which covers nine African languages but is not open-sourced, and some individual language efforts for Amharic (Jibril and Tantug, 2022), Yorùbá (Alabi et al., 2020), Hausa (Hedderich et al., 2020), and Tigrinya (Yohannes and Amagasa, 2022). Closest to our work is the MasakhaNER 1.0 corpus (Adelani et al., 2021b), which covers 10 widely spoken languages in the news domain, but excludes languages from the southern region of Africa like isiZulu, isiXhosa, and chiShona, whose distinct syntactic features (e.g., noun prefixes and capitalization within words) limit transfer learning from other languages. We include five languages from Southern Africa in our new corpus.
Cross-lingual Transfer  Leveraging cross-lingual transfer has the potential to drastically improve model performance without requiring large amounts of data in the target language (Conneau et al., 2020), but it is not always clear which language to transfer from (Lin et al., 2019; de Vries et al., 2022). To this end, recent work investigates methods for selecting good transfer languages and informative features. For instance, token overlap between the source and target language is a useful predictor of transfer performance for some tasks (Lin et al., 2019; Wu and Dredze, 2019). Linguistic distance (Lin et al., 2019; de Vries et al., 2022), word order (K et al., 2020; Pires et al., 2019), script differences (de Vries et al., 2022), and syntactic similarity (Karamolegkou and Stymne, 2021) have also been shown to impact performance. Another research direction attempts to build models of transfer performance that predict the best transfer language for a target language using linguistic and data-dependent features (Lin et al., 2019; Ahuja et al., 2022).
Languages and Their Characteristics

Focus Languages
Table 1 provides an overview of the languages in our MasakhaNER 2.0 corpus. We focus on 20 Sub-Saharan African languages with varying numbers of speakers (between 1M and 100M) that are spoken by over 500M people in around 27 countries in the Western, Eastern, Central, and Southern regions of Africa. The selected languages cover four language families. 17 languages belong to the Niger-Congo language family, and one language belongs to each of the Afro-Asiatic (Hausa), Nilo-Saharan (Luo), and English Creole (Naija) families. Although many languages belong to the Niger-Congo language family, they have different linguistic characteristics. For instance, Bantu languages (eight in our selection) make extensive use of affixes, unlike many languages of non-Bantu subgroups such as Gur, Kwa, and Volta-Niger.

Language Characteristics
Script and Word Order  African languages mainly employ four major writing scripts: Latin, Arabic, N'ko, and Ge'ez. Our focus languages mostly make use of the Latin script. While N'ko is still actively used by Mande languages like Bambara, the most widely used writing script for the language is Latin. However, some languages use additional letters that go beyond the standard Latin script, e.g., "ɛ", "ɔ", "ŋ", and "ẹ", as well as multi-character letters like "bv", "gb", "mpf", and "ntsh". 17 of the languages are tonal; the exceptions are Naija, Kiswahili, and Wolof. Nine of the languages make use of diacritics (e.g., é, ë, ñ). All languages use the SVO word order, while Bambara additionally uses the SOV word order.
Morphology and Noun classes  Many African languages are morphologically rich. According to the World Atlas of Language Structures (WALS; Nichols and Bickel, 2013), 16 of our languages employ strong prefixing or suffixing inflections.
Niger-Congo languages are known for their system of noun classification. 12 of the languages actively make use of between 6 and 20 noun classes, including all Bantu languages, Ghomálá', Mossi, Akan, and Wolof (Nurse and Philippson, 2006; Payne et al., 2017; Bodomo and Marfo, 2002; Babou and Loporcaro, 2016). While noun classes are often marked using affixes on the head word in Bantu languages, some non-Bantu languages, e.g., Wolof, make use of a dependent such as a determiner that is not attached to the head word. For the other Niger-Congo languages such as Fon, Ewe, Igbo, and Yorùbá, the use of noun classes is merely vestigial (Konoshenko and Shavarina, 2019). Three of our languages from the Southern Bantu family (chiShona, isiXhosa, and isiZulu) capitalize proper names after the noun class prefix, as in the language names themselves. This characteristic may limit transfer from languages without this feature, as NER models overfit on capitalization (Mayhew et al., 2019). Appendix B provides more details regarding the languages' linguistic characteristics.

Data source and collection
We annotate news articles from local sources. The choice of the news domain is based on the availability of data for many African languages and the variety of named entity types (e.g., person names and locations), as illustrated by popular datasets such as CoNLL-03 (Tjong Kim Sang and De Meulder, 2003). Table 1 shows the sources and sizes of the data we use for annotation. Overall, we collected between 4.8K and 11K sentences per language from either a monolingual or a translation corpus.

Monolingual corpus
We collect a large monolingual corpus for nine languages, mostly from local news articles, except for chiShona and Kiswahili texts, which were crawled from Voice of America (VOA) websites. As Yorùbá text was missing diacritics, we asked native speakers to manually add diacritics before annotation. During data collection, we ensured that the articles cover a variety of topics, e.g., politics, sports, culture, technology, society, and education. In total, we collected between 8K and 11K sentences per language.
Translation corpus  For the remaining languages, for which we were unable to obtain sufficient amounts of monolingual data, we use a translation corpus, MAFAND-MT (Adelani et al., 2022), which consists of French and English news articles translated into 11 languages. We note that translationese may lead to undesired properties, e.g., unnaturalness. However, we did not observe serious issues during annotation. The number of sentences is constrained by the size of the MAFAND-MT corpus, which is between 4,800 and 8,000 sentences per language.

NER Annotation Methodology
We annotated the collected texts with the ELISA annotation tool (Lin et al., 2018) using four entity types: personal name (PER), location (LOC), organization (ORG), and date and time (DATE), similar to MasakhaNER 1.0 (Adelani et al., 2021b). We made use of the MUC-6 annotation guide. The annotation was carried out by three native speakers per language, recruited from AI/NLP communities in Africa. To ensure high-quality annotation, we recruited a language coordinator to supervise annotation in each language. We organized two online workshops to train language coordinators on NER annotation. As part of the training, each coordinator annotated 100 English sentences, which were verified. Each coordinator then trained the three annotators in their team using both English and African language texts, with the support of the workshop organizers. All annotators and language coordinators received appropriate remuneration. At the end of annotation, language coordinators worked with their team to resolve disagreements using the adjudication function of ELISA, which ensures a high inter-annotator agreement score.

Quality Control
As discussed in subsection 4.2, language coordinators helped resolve several disagreements in annotation prior to quality control. Table 2 reports the Fleiss Kappa score after the intervention of the language coordinators (i.e., the post-intervention score).
The pre-intervention Fleiss Kappa score was much lower. For example, for pcm, the pre-intervention Fleiss Kappa score was 0.648 and improved to 0.966 after the language coordinator discussed the disagreements with the annotators.
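For reference, a mention-level agreement score of this kind can be computed directly from per-mention label counts; below is a minimal sketch using statsmodels, with illustrative counts rather than our actual annotations.

```python
# Minimal sketch: Fleiss' kappa from per-mention label counts.
# Requires `pip install statsmodels numpy`; counts are illustrative.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# One row per annotated mention, one column per label
# (e.g., PER, LOC, ORG, DATE, O); each row sums to the
# number of annotators (3 in our setup).
counts = np.array([
    [3, 0, 0, 0, 0],  # all three annotators chose PER
    [0, 2, 1, 0, 0],  # two chose LOC, one chose ORG
    [0, 0, 3, 0, 0],
    [0, 0, 0, 3, 0],
    [1, 0, 0, 0, 2],
])

print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")
```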
For the quality control, annotations were automatically adjudicated when there was agreement, but were flagged for further review when annotators disagreed on mention spans or types.The process for reviewing and fixing quality control issues was voluntary and so not all languages were further reviewed (see Table 2).
We automatically identified positions in the annotation that were more likely to be annotation errors and flagged them for further review and correction. The automatic process flags tokens that are commonly annotated as a named entity but were not marked as a named entity in a specific position. For example, the token Province may commonly appear as part of a named entity and infrequently not as a named entity, so when it is seen unmarked, it is flagged. Similarly, we flagged tokens that had near-zero entropy with regard to a certain entity type, for example, a token almost always annotated as ORG but very rarely annotated as PER. We also flagged potential sentence boundary errors by identifying sentences with few tokens or sentences which end in a token that appears to be an abbreviation or acronym. As shown in Table 2, before further adjudication and correction there was already relatively high inter-annotator agreement, measured by Fleiss' Kappa at the mention level.
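The sketch below illustrates the near-zero-entropy flagging heuristic described above; the function names and thresholds are ours, chosen for illustration, and do not reproduce our exact tooling.

```python
# Sketch: flag occurrences where a token's label deviates from an
# otherwise near-deterministic label distribution (e.g., "Province"
# almost always inside a LOC entity but unlabeled in one sentence).
from collections import Counter, defaultdict
from math import log2

def label_entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def flag_suspicious(annotated_corpus, min_count=10, max_entropy=0.3):
    """annotated_corpus: list of sentences, each a list of (token, tag)."""
    stats = defaultdict(Counter)
    for sent in annotated_corpus:
        for token, tag in sent:
            stats[token.lower()][tag.split("-")[-1]] += 1  # strip B-/I-

    flags = []
    for i, sent in enumerate(annotated_corpus):
        for j, (token, tag) in enumerate(sent):
            counts = stats[token.lower()]
            if sum(counts.values()) < min_count:
                continue  # too rare to judge
            majority = counts.most_common(1)[0][0]
            # Token is almost always one label, but not here: flag it.
            if (label_entropy(counts) <= max_entropy
                    and tag.split("-")[-1] != majority):
                flags.append((i, j, token, tag, majority))
    return flags
```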
After quality control, we divided the annotation into training, development, and test splits consisting of 70%, 10%, and 20% of the data, respectively. Appendix A provides details on the number of tokens per entity type (PER, LOC, ORG, and DATE) and the fraction of entities in the tokens.
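For concreteness, a minimal sketch of producing such a split; the shuffling and the fixed seed are our assumptions, as the exact procedure is not spelled out here.

```python
# Sketch: a reproducible 70/10/20 split (shuffle + seed are assumed).
import random

def split_70_10_20(sentences, seed=42):
    sentences = list(sentences)
    random.Random(seed).shuffle(sentences)
    n = len(sentences)
    n_train, n_dev = int(0.7 * n), int(0.1 * n)
    return (sentences[:n_train],                 # train (70%)
            sentences[n_train:n_train + n_dev],  # dev (10%)
            sentences[n_train + n_dev:])         # test (20%)
```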

Massively multilingual PLMs

Table 3 shows the language coverage and sizes of the PLMs we fine-tune. mBERT (Devlin et al., 2019a) was pre-trained with masked language modelling (MLM) and next sentence prediction on 104 languages. RemBERT was trained with a similar objective, but makes use of a larger output embedding size during pre-training and covers more African languages. XLM-R was trained only with MLM on 100 languages and on a larger pre-training corpus. mDeBERTaV3 makes use of ELECTRA-style (Clark et al., 2020) pre-training, i.e., a replaced token detection (RTD) objective instead of MLM.

Africa-centric multilingual PLMs
We also obtained NER models by fine-tuning two PLMs that are pre-trained on African languages. AfriBERTa (Ogueji et al., 2021) was pre-trained on less than 1 GB of text covering 11 African languages, including six of our focus languages, and has shown impressive performance on NER and sentiment classification for languages in its pre-training data (Adelani et al., 2021b; Muhammad et al., 2022). AfroXLM-R (Alabi et al., 2022) is a language-adapted (Pfeiffer et al., 2020) version of XLM-R that was fine-tuned on 17 African languages and three high-resource languages widely spoken in Africa ("eng", "fra", and "ara"). Appendix J provides the model hyper-parameters for fine-tuning the PLMs.

Baseline Results
Table 4 shows the results of training NER models on each language using the eight multilingual and Africa-centric PLMs. All PLMs provided good performance in general. However, we observed worse results for mBERT and AfriBERTa, especially for languages they were not pre-trained on. For instance, both models performed 6 to 12 F1 points worse for bbj, wol, or zul compared to XLM-R-base. We hypothesize that the performance drop is largely due to the small number of African languages covered by mBERT as well as AfriBERTa's comparatively small model capacity. XLM-R-base gave much better performance (> 1.0 F1) on average compared to mBERT and AfriBERTa. We found the larger variants of mBERT and XLM-R, i.e., RemBERT and XLM-R-large, to give much better performance (> 2.0 F1) than the smaller models. Their larger capacity facilitates positive transfer, yielding better performance for unseen languages. Surprisingly, mDeBERTaV3 provided slightly better results than XLM-R-large and RemBERT despite its smaller size, demonstrating the benefits of RTD pre-training (Clark et al., 2020).

Error Analysis with ExplainaBoard
Furthermore, using ExplainaBoard (Liu et al., 2021), we analysed the three best baseline NER models: AfroXLM-R-large, mDeBERTaV3, and XLM-R-large. We discovered that 2-token entities were easier to predict accurately than lengthier entities (4 or more words). Moreover, the results show that all the models have difficulty predicting zero-frequency entities (entities with no occurrences in the training set). Interestingly, AfroXLM-R-large is significantly better than the other models on zero-frequency entities, suggesting that training PLMs on African languages promotes generalization to unseen entities. Finally, we observed that the three models perform better when predicting PER and LOC entities than ORG and DATE entities (by up to 5%). Appendix D provides more details on the error analysis.

Dataset Geography of Entities
Next, we analyse the geographical representativeness of the entities in our dataset; specifically, we measure the count of entities based on the countries they originate from. Following the approach of Faisal et al. (2022), we first performed entity linking of the named entities present in our dataset to Wikidata IDs using mGENRE (De Cao et al., 2022), followed by mapping the Wikidata IDs to countries.
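A minimal sketch of the entity-linking step using the publicly released mGENRE checkpoint on the HuggingFace Hub (facebook/mgenre-wl); the example sentence is ours, and the subsequent lookups from the predicted Wikipedia title to a Wikidata ID and then to a country are plain table lookups, omitted here.

```python
# Sketch: link one mention with the released mGENRE checkpoint.
# Mapping the predicted Wikipedia title to a Wikidata ID and then
# to a country is a separate table lookup, omitted here.
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("facebook/mgenre-wl")
model = MBartForConditionalGeneration.from_pretrained("facebook/mgenre-wl").eval()

# The mention to link is wrapped in [START] ... [END].
sentence = "[START] Nairobi [END] ni mji mkuu wa Kenya."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, num_return_sequences=3)

# mGENRE generates "<Wikipedia title> >> <language code>" strings,
# which can then be resolved to Wikidata IDs and countries.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```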
Figure 1 shows the number of entities per continent and the top-10 countries with the largest representation of entities. Over 50% of the entities are from Africa, followed by Europe. This shows that the entities in MasakhaNER 2.0 properly represent the African continent. Seven of the top-10 countries are African, but the list also includes the USA, the United Kingdom, and France.

Transfer Between African NER Datasets
African languages have a diverse set of linguistic characteristics. To demonstrate this heterogeneity, we perform a transfer learning experiment where we compare the performance of multilingual NER models jointly trained on the languages of MasakhaNER 1.0 or MasakhaNER 2.0 and perform zero-shot evaluation on both test sets. We consider three experimental settings: (a) training on all languages in MasakhaNER 1.0 using the MasakhaNER 1.0 training data; (b) training on the languages in MasakhaNER 1.0 (excl. "amh") using the MasakhaNER 2.0 training data; and (c) training on all languages in MasakhaNER 2.0 using the MasakhaNER 2.0 training data. Table 5 shows the results of the three settings on both test sets.

Cross-Lingual Transfer

The success of cross-lingual transfer in either zero- or few-shot settings depends on several factors, including an appropriate selection of the best source language. Several attempts at cross-lingual transfer make use of English as the source language due to the availability of training data. However, English is unrepresentative of African languages, and transfer performance is often lower for distant languages (Adelani et al., 2021b).

Choosing Transfer Languages for NER
Here, we follow the approach of Lin et al. (2019), who propose LangRank, a model that ranks candidate transfer languages using linguistic features (geographic, genetic, inventory, syntactic, and phonological distances) and data-dependent features (dataset sizes, size ratio, and entity overlap). LangRank is trained using these features to determine the best transfer language in a leave-one-out setting where, for each target language, we train on all other languages except the target language. We compute transfer F1 scores from a set of N transfer (source) languages and evaluate on N target languages, yielding N × N transfer scores.

Choice of Transfer Languages  We selected 22 human-annotated NER datasets of diverse languages by searching the web and the HuggingFace Dataset Hub (Lhoest et al., 2021). We required each dataset to contain at least the PER, ORG, and LOC types, and we limit our analysis to these types. We also added our MasakhaNER 2.0 dataset with 20 languages. In total, the datasets cover 42 languages (21 African). Each language is associated with a single dataset. Appendix C provides details about the languages, datasets, and data splits.
To compute zero-shot transfer scores, we fine-tune mDeBERTaV3 on the NER dataset of a source language and perform zero-shot transfer to the target languages. We choose mDeBERTaV3 because it supports 100 languages and has the best performance among the PLMs trained on a similar number of languages.
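As a rough illustration of this setup, the condensed sketch below fine-tunes mDeBERTaV3 on one source language's NER data; evaluating the resulting model on each target language's test set is the zero-shot step. This is a sketch, not our exact training script: the dataset column names follow the common "tokens"/"ner_tags" convention and are an assumption, while the hyper-parameters follow Appendix J.

```python
# Condensed sketch of the fine-tune-then-transfer loop.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "microsoft/mdeberta-v3-base"

def tokenize_and_align_labels(batch, tokenizer):
    # Align word-level NER tags with subword tokens; continuation
    # pieces get -100 so they are ignored by the loss.
    tokenized = tokenizer(batch["tokens"], truncation=True, max_length=200,
                          is_split_into_words=True)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous, ids = None, []
        for word_id in tokenized.word_ids(batch_index=i):
            ids.append(-100 if word_id is None or word_id == previous
                       else tags[word_id])
            previous = word_id
        labels.append(ids)
    tokenized["labels"] = labels
    return tokenized

def train_source_model(source_train_dataset, num_labels):
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_NAME, num_labels=num_labels)
    encoded = source_train_dataset.map(
        lambda b: tokenize_and_align_labels(b, tokenizer), batched=True)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ner_out", learning_rate=5e-5,
                               per_device_train_batch_size=16,
                               gradient_accumulation_steps=2,
                               num_train_epochs=50),
        train_dataset=encoded,
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
    trainer.train()
    return trainer

# Zero-shot transfer: evaluate the source-trained model on each target
# language's test set without any target-language training data.
# Repeating over all (source, target) pairs yields the N x N matrix.
```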

Single-source Transfer Results
Figure 2 shows the zero-shot evaluation of training on the 42 NER datasets and evaluating on the test sets of the 20 MasakhaNER 2.0 languages. On average, we find transfer from non-African languages to be slightly worse (51.7 F1) than transfer from African languages (57.3 F1). The worst transfer result comes from using bbj as the source language (41.0 F1), while the best is sna (64 F1), followed by yor (63 F1). We identify German (deu) and Finnish (fin) as the top-2 transfer languages among the non-African languages. In most cases, languages that are geographically and syntactically close tend to benefit most from each other. For example, sna, xho, and zul have very good transfer among themselves due to both syntactic and geographical closeness. Similarly, for the Nigerian languages (hau, ibo, pcm, yor) and the East African languages (kin, lug, luo, swa), geographical proximity plays an important role. While most African languages prefer transfer from another African language, there are a few exceptions, like swa preferring transfer from deu or ara. The latter can be explained by the presence of Arabic loanwords in Swahili (Versteegh, 2001). Similarly, nya and tsn also prefer deu. Appendix G provides results for transfer to non-African languages.

LangRank and Co-training Results
We also investigate the benefit of training on the second-best language in addition to the languages selected by LangRank. We jointly train on the combined data of the top-2 transfer languages or the top-2 languages predicted by LangRank and evaluate zero-shot performance on the target language. Table 6 shows the results for the top-2 transfer languages using the best of the 42 × 42 transfer F1 scores and the LangRank model predictions. LangRank predicted the right language as one of the top-2 best transfer languages for 13 target languages. The target languages with incorrect predictions are fon, ibo, kin, lug, luo, nya, and swa. The transfer languages predicted as alternatives are often in the top-5 transfer languages or are less than 5 F1 worse than the best transfer language. For example, the best transfer language for lug is kin (81 F1), but LangRank predicted luo (76 F1). Appendix H gives results for non-African languages.
Features that are important for transfer  The most important features for the selection of the best language by LangRank are geographic distance (d_geo) and entity overlap (eo). d_geo is influential because named entities (e.g., the name of a politician or a city) are often shared across languages spoken in the same country (e.g., Nigeria, with four languages in MasakhaNER 2.0) or region (e.g., East African languages). Similarly, we find entity overlap to have a positive Spearman correlation (R = 0.6) with the transfer F1 score; Appendix F provides more details on the correlation results. d_geo occurred among the top-3 features for 15 of the best transfer languages and 16 of the second-best languages. Similarly, eo appeared 11-13 times among the features for the top-2 transfer languages. Interestingly, dataset size was not among the most important features, highlighting the need for typologically diverse training data.
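To make the entity overlap feature concrete, the sketch below implements the overlap definition given in Appendix F and correlates it with transfer F1 using SciPy; the overlap and F1 values shown are illustrative, not numbers from the paper.

```python
# Sketch: entity overlap |T_X ∩ T_Y| / (|T_X| + |T_Y|) over unique
# named entities, and its Spearman correlation with transfer F1.
from scipy.stats import spearmanr

def entity_overlap(source_entities, target_entities):
    t_x, t_y = set(source_entities), set(target_entities)
    return len(t_x & t_y) / (len(t_x) + len(t_y))

# Per-pair overlaps and zero-shot transfer F1 scores
# (illustrative values, not results from the paper):
overlaps = [0.12, 0.03, 0.25, 0.08, 0.18]
f1_scores = [58.0, 42.5, 66.1, 49.3, 61.7]
rho, p = spearmanr(overlaps, f1_scores)
print(f"Spearman R = {rho:.2f} (p = {p:.3f})")
```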

Best Transfer Language Outperforms English
We compare the zero-shot transfer performance of the top-2 transfer languages to using eng as the transfer language. They significantly outperform the eng average of 56.9 F1, by +14 and +12 F1 for the first- and second-best source language, respectively.

Co-training of Top-2 Transfer Languages Improves Performance
We find that co-training the top-2 transfer languages further improves zero-shot performance over the best single transfer language by around +3 F1. The improvement is most significant for fon, ibo, kin, and twi, with gains of 3-7 F1. Co-training the top-2 transfer languages predicted by LangRank is better than using the second-best transfer language, but often performs worse than the best transfer language.

Sample Efficiency Results
Figure 3 shows the performance when the model is trained on a few target-language samples, compared to when the best transfer language is used prior to fine-tuning on the same number of target-language samples. We show the results for four languages (which reflect common patterns across all languages) and an average (ave) over the 20 languages. As seen in the figure, models achieve less than 50 F1 when we train on 100 sentences and over 75 F1 when training on 500 sentences.

Figure 3: Sample Efficiency Results for 100 and 500 samples in the target language; models are fine-tuned from a PLM (e.g., FT-100: trained on 100 samples from the target language) or fine-tuned from the best transfer language NER model (e.g., BT-Lang-0: trained on 0 samples from the target language, i.e., zero-shot).
In practice, annotating 100 sentences takes about 30 minutes, while annotating 500 sentences takes around 2 hours and 30 minutes; therefore, slightly more annotation effort can yield a substantial quality improvement. We also find that using the best transfer language in zero-shot settings gives performance very close to annotating 500 samples in most cases, showing the importance of transfer language selection. By additionally fine-tuning the model on 100 or 500 target language samples, we can further improve NER performance. Appendix I provides the sample efficiency results for individual languages.

Conclusion
In this paper, we present the creation of MasakhaNER 2.0, the largest NER dataset for 20 diverse African languages, and provide strong baseline results on the corpus by fine-tuning multilingual PLMs on in-language NER and multilingual datasets. Additionally, we analyze cross-lingual transfer in an Africa-centric setting, showing the importance of choosing the best transfer language in both zero-shot and few-shot scenarios. Using English as the default transfer language can have detrimental effects, and choosing a more appropriate language substantially improves fine-tuned NER models. By analyzing data-dependent, geographical, and typological features for transfer in NER, we conclude that geographical distance and entity overlap contribute most to transfer performance.

Acknowledgements

We thank Mengzhou Xia, Antonis Anastasopoulos, and Fahim Faisal for their help with the LangRank code and the data geography analysis. We thank Haneul Yoo for her comment on the initial Korean NER result. We thank Heng Ji and Ying Lin for providing the ELISA NER tool used for annotation. We thank Daniel D'souza for helping to set up the ELISA NER annotation tool. We thank Kelechi Ogueji for providing the monolingual corpus for isiZulu. We thank Google for providing GCP credits to run some of the experiments. We thank the ML Group, Luleå University of Technology, for the compute resources for running some of the experiments. Finally, we thank the Masakhane leadership, Melissa Omino, Davor Orlič, and Knowledge4All for their administrative support throughout the project.

Limitations
Some language families not covered  While we tried to cover 20 typologically diverse languages and language families, a few locations in Africa and smaller language family groups were not covered. For example, languages from the Khoisan and Austronesian (like Malagasy) families were not covered. Also, languages spoken in Central African countries like South Sudan, Chad, and the DRC were not covered.

News Domain Data
As the data we annotated belonged to the news domain, models trained from this data may not generalize well to other domains.
In particular, the models may not perform well on more casual text that may use different vocabulary, discuss different entities, and contain more orthographic variation. This limitation also applies to the English NER corpus.

Generalizability of Transfer Learning Findings
As we only experimented with one task (NER), our findings regarding effective approaches to transfer learning for African languages and PLMs may not generalize to other tasks (e.g., machine translation, part-of-speech tagging); other features of language similarity may be more important for other tasks.
Explaining Transfer Learning Findings  We found that the LangRank model could not predict the top transfer languages with 100% accuracy. This suggests that there are other, unknown factors affecting transfer performance that we did not explore. For example, there is still work to be done to understand the sociolinguistic connections and language contact conditions that may correlate with effective transfer.

Ethics Statement
Our research process has been deeply rooted in the principles of participatory AI research (∀ et al., 2020), where the populations most affected by the research (in this case, the native speakers of the languages) are involved throughout the project as stakeholders.
We believe our work will be of benefit to the speakers of the included languages by enabling better language technology for their languages. By keeping them engaged throughout the process as collaborators in this work, we have been able to remain aware of potential harms. As the data we use for annotation is news data that was already publicly available, the release of our annotations is unlikely to cause unintended harm.
However, there are always potential unintended consequences when creating NER data and models.The data selection, annotation, adjudication, and model training process can all introduce biases that may have negative effects.Specifically, within each language, the models trained may perform better when processing names that commonly appear in newswire, and worse when processing names belonging to entities less well-represented in the news domain, propagating biases to downstream tasks.

A Data Source and Splits
Table 7 shows, for each MasakhaNER 2.0 language, the data source, the train/dev/test split, and the number of tokens per entity type.

B Language Characteristics
Table 8 provides the details about the language characteristics.

B.1 Morphology and Noun classes
Many African languages are morphologically rich.
According to the World Atlas of Language Structures (WALS; Nichols and Bickel, 2013), 16 of our languages employ strong prefixing or suffixing inflections. Niger-Congo languages are known for their system of noun classification. 12 of the languages actively make use of between 6 and 20 noun classes, including all Bantu languages and Ghomálá', Mossi, Akan, and Wolof (Nurse and Philippson, 2006; Payne et al., 2017; Bodomo and Marfo, 2002; Babou and Loporcaro, 2016). While noun classes are often marked using affixes on the head word in Bantu languages, some non-Bantu languages, e.g., Wolof, make use of a dependent such as a determiner that is not attached to the head word. For the other Niger-Congo languages such as Fon, Ewe, Igbo, and Yorùbá, the use of noun classes is merely vestigial (Konoshenko and Shavarina, 2019). For example, Yorùbá only distinguishes between human and non-human nouns. Bambara is the only Niger-Congo language without noun classes, and some have argued that the Mande family should be regarded as an independent language family. Three of our languages from the Southern Bantu family (chiShona, isiXhosa, and isiZulu) capitalize proper names after the noun class prefix, as in the language names themselves. This characteristic limits the transfer learning of NER from languages without this feature, since NER models overfit on capitalization (Mayhew et al., 2019).

Table 8: Linguistic Characteristics of the Languages.

B.2 IsiXhosa and isiZulu morphological structure
IsiXhosa and isiZulu are agglutinative languages with a complex morphology. Each root or stem can take a variety of affixes to form new inflections and derivations, varying the meaning and conveying syntactic agreement. The noun class system and the concord agreement system are the foundations of isiXhosa and isiZulu noun grammar. This section offers an overview of these two principles and their applicability to the realization of NEs. First, we briefly describe the noun class system, after which we discuss prefixing and capitalization in both languages.
According to the Meinhof system (Melzian, 1933), nouns in African languages are classified into one of 18 numbered classes based on their prefix. As shown in the following example, singular nouns in class 1 take the prefix um-, while the associated plural nouns in class 2 take the prefix aba-.

B.2.1 Prefix
All named entities are nouns, since they designate a distinct entity, and noun class designations are therefore critical in identifying NEs. According to Oosthuysen (2016), the purpose of the noun class prefix is to indicate the class to which the noun belongs; it shows whether the noun is singular or plural. The derivation of all significant prefixes and concordial agreements is based on this.
In isiXhosa, named entities referring to personal nouns with the prefix um- belong to noun class 1, with noun class 2 being its plural form. Named entities such as jobs, objects, and concepts belong to noun class 3, e.g., umpheki (cook) and umthwalo (burden). Lastly, in isiXhosa, words borrowed from English and Afrikaans, such as ibhanka (bank) and ihamire (hammer), belong to class 9. In isiZulu, noun class 1 is a singular class which uses the prefix umu-/um-. The allomorph umu- occurs when the noun stem consists of one syllable, e.g., umuntu (person), and the allomorph um- occurs when the noun stem has more than one syllable, e.g., umfana (boy). Noun class 2 is a plural class, with its singular in class 1. Noun class 2 uses the prefix aba-/ab-, e.g., abantu (people), abafana (boys). Noun classes 1 and 2 are personal classes, only containing personal nouns.
Noun class 1a is a subclass of noun class 1. This class contains personal nouns referring to family relationships, professions, proper names, and personalized nouns. This class uses the prefix u- with no allomorphs, e.g., ugogo (grandmother), unesi (nurse) or uSipho (a personal name). Noun class 2a is the regular plural of class 1a and uses the prefix o-, e.g., ogogo (grandmothers), onesi (nurses) or oSipho (Sipho and company).

B.2.2 Capitalization
Capitalization is a very common feature for a number of natural language processing tools, such as named entity recognition systems that identify people's names and locations (De Waal et al., 2006).

F Overlap Results
In Figure 4, we examine the word overlap between different languages and how this correlates with transfer performance. In general, these two quantities are strongly correlated (Spearman's R = 0.6, p < 0.05), echoing a similar result described by Beukman (2022). Note that the entity overlap feature used by the ranking model in the main text was calculated in a slightly different way; namely, considering all tokens instead of just the four named entity types, and not normalizing the overlap. This case still shows a positive correlation, although it is slightly smaller, with Spearman's R = 0.49.

G Zero-shot Transfer
Figure 5 shows N × N transfer results to languages in MasakhaNER 2.0. We see that English is not the best transfer language in general; it is better to choose a more geographically close African language.
Figure 6 shows N × N transfer results to languages not in MasakhaNER 2.0. We see that English appears to be the best transfer language on average, which is not the case for African languages. The reason is that many of the non-African languages we evaluated on are from the Indo-European family, like English.

H Best Transfer Language for Other Languages
Table 12 provides the result of the best transfer language for other languages not in MasakhaNER 2.0.

I Sample Efficiency Results
Figure 7 shows the result of training NER models using 100 and 500 samples for each language.

J Model Hyper-parameters for Reproducibility
For training NER models, we fine-tune each PLM with a maximum sequence length of 200, a batch size of 16, gradient accumulation of 2, a learning rate of 5e-5, and 50 training epochs. The experiments with the large PLMs were performed on an Nvidia V100 GPU; for AfriBERTa and mBERT, we make use of an Nvidia GeForce RTX-2080Ti. For evaluation, we make use of the micro-averaged F1 score (a short example follows the figure caption below).

Figure 7: Sample Efficiency Results for 100 and 500 samples in the target language; models are fine-tuned from a PLM (e.g., FT-100: trained on 100 samples from the target language) or fine-tuned from the best transfer language NER model (e.g., BT-Lang-0: trained on 0 samples from the target language, i.e., zero-shot).
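For reference, a minimal example of the span-level, micro-averaged F1 score as computed by seqeval; the label sequences are illustrative, not drawn from our data.

```python
# Micro-averaged span-level F1 with seqeval (illustrative sequences).
from seqeval.metrics import f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"]]
# One of the two predicted spans is correct: precision = recall = 0.5.
print(f1_score(y_true, y_pred))  # 0.5
```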

Figure 1 :
Figure 1: Number of entities per continent and the top-10 countries with the largest number of entities

Figure 2 :
Figure 2: Zero-shot Transfer from several source languages to African languages for languages in MasakhaNER 2.0 and the average (ave) over all 20 languages. Appendix G shows results for each of the 20 languages.

Figure 6 :
Figure 6: Zero-shot Transfer from several source languages to other languages not in MasakhaNER 2.0.

Table 1 :
Languages and Data Splits for the MasakhaNER 2.0 Corpus. Language, family (NC: Niger-Congo), number of speakers, news source, and data split in number of sentences.

Table 3 :
Language coverage and size for PLMs.

Table 4 :
NER Baselines on MasakhaNER 2.0. We compare several multilingual PLMs, including ones trained on African languages. Average is over 5 runs.

Table 5 :
Multilingual evaluation on African NER datasets. We compare the performance of AfroXLM-R-large trained on the languages of MasakhaNER 2.0 and MasakhaNER 1.0 and evaluated both on the same and on the other dataset. The first column indicates the languages used for training (the 10 languages from MasakhaNER 1.0 or the 20 languages from MasakhaNER 2.0). The second column indicates the training data. Average is over 5 runs.

Table 6 :
Choosing the best transfer languages for NER. The best zero-shot result is bolded; numbers that are not significantly different are underlined. The ranking model features are based on the definitions in Lin et al. (2019): geographic distance (d_geo), genetic distance (d_gen), inventory distance (d_inv), syntactic distance (d_syn), phonological distance (d_pho), transfer language dataset size (s_tf), transfer over target size ratio (sr), and entity overlap (eo). The languages highlighted in gray have very good transfer performance (> 70 F1) using the best transfer language.

Table 7 :
Languages and Data Splits for the MasakhaNER 2.0 Corpus. Distribution of the number of entities.

Table 9 :
Languages and Data Splits for Other NER Datasets.
Figure 4: Correlation between the data overlap and F1 transfer performance. For a source language X and a target language Y, denote the sets of unique named entities (PER, ORG, LOC, DATE) by T_X and T_Y respectively. The overlap here was calculated as |T_X ∩ T_Y| / (|T_X| + |T_Y|), as in Lin et al. (2019).

Table 12 :
Choosing the best transfer language for NER for languages not in MasakhaNER 2.0. The ranking model features are based on the definitions in Lin et al. (2019): geographic distance (d_geo), genetic distance (d_gen), inventory distance (d_inv), syntactic distance (d_syn), phonological distance (d_pho), transfer language dataset size (s_tf), target language dataset size (s_tg), transfer over target size ratio (sr), and entity overlap (eo).