AfroLID: A Neural Language Identification Tool for African Languages

Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves 95.89 F1-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to showcase both AfroLID's powerful capabilities and its limitations.


Introduction
Language identification (LID) is the task of identifying the human language a piece of text or speech segment belongs to. The proliferation of social media has allowed greater access to multilingual data, making automatic LID an important first step in processing human language appropriately (Tjandra et al., 2021; Thara and Poornachandran, 2021). This includes applications in speech, sign language, handwritten text, and other modalities of language. It also includes distinguishing languages in code-mixed datasets (Abdul-Mageed et al., 2020; Thara and Poornachandran, 2021). Unfortunately, for the majority of languages in the world, including most African languages, we do not have the resources for developing LID tools.
⋆ Authors contributed equally. ¹ AfroLID is publicly available at https://github.com/UBC-NLP/afrolid.

This situation has implications for future NLP technologies. For instance, LID has facilitated development of widely multilingual models such as mT5 (Xue et al., 2021) and large multilingual datasets such as CCAligned (El-Kishky et al., 2020), ParaCrawl (Esplà et al., 2019), WikiMatrix (Schwenk et al., 2021), OSCAR (Ortiz Suárez et al., 2020), and mC4 (Xue et al., 2021), which have advanced research in NLP. Comparable resources are completely unavailable for the majority of the world's 7000+ languages today, with only poor coverage of the so-called low-resource (LR) languages. This is partly due to the absence of LID tools, and it impedes future NLP progress on these languages (Adebara and Abdul-Mageed, 2022). The state of African languages is not any better than that of other regions: Kreutzer et al. (2021) perform a manual evaluation of 205 datasets involving African languages, such as those in CCAligned, ParaCrawl, WikiMatrix, OSCAR, and mC4, and show that at least 15 corpora were completely erroneous, a significant fraction contained less than 50% correct data, and 82 corpora were mislabelled or used ambiguous language codes. These issues consequently affect the quality of models built with these datasets. Alabi et al. (2020) find that 135K out of 150K words in the fastText embeddings for Yorùbá belong to other languages such as English, French, and Arabic. New embedding models created by Alabi et al. (2020) with a curated high-quality dataset outperform off-the-shelf fastText embeddings, even though the curated data is smaller.
In addition to resource creation, lack (or poor performance) of LID tools negatively impacts preprocessing of LR languages, since LID can be a prerequisite for determining, e.g., appropriate tokenization (Duvenhage et al., 2017a). Furthermore, some preprocessing approaches may be necessary for certain languages but may hurt performance in other languages (Adebara and Abdul-Mageed, 2022). Developing LID tools is thus vital for all NLP. In this work, we focus on LID for African languages and introduce AfroLID.
AfroLID is a neural LID tool that covers 517 African languages and language varieties² across 14 language families. The languages covered belong to 50 African countries and are written in five diverse scripts. We show the countries covered by AfroLID in Figure 1. Examples of the different scripts involved in the 517 languages are displayed in Figure 2. To the best of our knowledge, AfroLID supports the largest subset of African languages to date. AfroLID is also usable without any end-user training, and it exploits data from a variety of domains to ensure robustness. We manually curate our clean training data, which is of special significance in low-resource settings. We show the utility of AfroLID in the wild by applying it to two Twitter datasets and compare its performance with existing LID tools that cover any number of African languages, namely CLD2 (McCandless, 2010), CLD3 (Salcianu et al., 2018), Franc, LangDetect (Shuyo, 2010), and Langid.py (Lui and Baldwin, 2012). Our results show that AfroLID consistently outperforms all other LID tools for almost all languages, and serves as the new SOTA for language identification for African languages.
² Our dataset involves different forms that can arguably be viewed as varieties of the same language, such as Twi and Akan.

To summarize, we offer the following main contributions:
1. We develop AfroLID, a SOTA LID tool for 517 African languages and language varieties.
To facilitate NLP research, we make our models publicly available.
2. We carry out a study of LID tool performance on African languages where we compare our models in controlled settings with several tools such as CLD2, CLD3, Franc, LangDetect, and Langid.py.
3. Our models exhibit highly accurate performance in the wild, as demonstrated by applying AfroLID on Twitter data.
4. We provide a wide range of controlled case studies and carry out a linguistically-motivated error analysis of AfroLID. This allows us to motivate plausible directions for future research, including potentially beyond African languages.
The rest of the paper is organized as follows: In Section 2 we discuss a number of typological features of our supported languages. We describe AfroLID's training data in Section 3. Next, we introduce AfroLID in Section 4. This includes our experimental datasets and their splits, preprocessing, vocabulary, implementation and training details, and our evaluation settings. We present the performance of AfroLID in Section 5 and compare it to other LID tools. Our analyses show that AfroLID outperforms other models for most languages. In the same section, we also describe the utility of AfroLID on non-Latin scripts, Creole languages, and languages in close geographical proximity. Although AfroLID is not trained on Twitter data, we experiment with tweets in Section 6 in order to investigate the performance of AfroLID in out-of-domain scenarios. Through two diagnostic studies, we demonstrate AfroLID's robustness. We provide an overview of related work in Section 7. We conclude in Section 8, and outline a number of limitations of our work in Section 9.

Typological Information
Language Families. We experiment with 517 African languages and language varieties across 50 African countries. These languages belong to 14 language families (Eberhard et al., 2021), as follows: Afro-Asiatic, Austronesian, Creole (English based), Creole (French based), Creole (Kongo based), Creole (Ngbadi based), Creole (Portuguese based), Indo-European, Khoe-Kwadi (Hainum), Khoe-Kwadi (Nama), Khoe-Kwadi (Southwest), Niger-Congo, and Nilo-Saharan. The large and typologically diverse data we exploit hence endow our work with wide coverage. We show in Figure 1 a map of Africa with the countries AfroLID covers. We also show the number of languages we cover, per country, in Figure E.2 in the Appendix. Table E.1, Table E.2, and Table E.3 in the Appendix also provide a list of the languages AfroLID handles. We represent the languages using ISO-3 codes³ for both individual languages and macro-languages. We use a macro-language tag when the language is known but the specific dialect is unknown. For this reason we specify that AfroLID supports 517 African languages and language varieties.

Sentential Word Order. There are seven categories of word order across human languages around the world. These are subject-verb-object (SVO), subject-object-verb (SOV), object-verb-subject (OVS), object-subject-verb (OSV), verb-object-subject (VOS), verb-subject-object (VSO), and languages lacking a dominant order (which often have a combination of two or more orders within their grammar) (Dryer and Haspelmath, 2013). Again, our dataset is very diverse: we cover five out of these seven types of word order. Table 1 shows sentential word order in our data, with some representative languages for each category.

Diacritics. Diacritic marks are used to overcome the inadequacies of an alphabet in capturing important linguistic information by adding a distinguishing mark to a character in an alphabet. Diacritics are often used to indicate tone, length, case, nasalization, or even to distinguish different letters of a language's alphabet (Wells, 2000; Hyman, 2003; Creissels et al., 2008). Diacritics can be placed above, below, or through a character. Diacritics are common features of the orthographies of African languages. Out of the 517 languages/language varieties in our training data, 295 use some diacritics in their orthographies. We also provide a list of languages with diacritics in our training data in Table C.3 in the Appendix.

Scripts. Our dataset consists of 14 languages written in four different non-Latin scripts and 499 languages written in Latin scripts. The non-Latin scripts are Ethiopic, Arabic, Vai, and Coptic.

³ https://glottolog.org/glottolog/language

Curating an African Language Dataset
AfroLID is trained using a multi-domain, multi-script language identification dataset that we manually curated for building our tool. To collect the dataset, we perform an extensive manual analysis of African language presence on the web, identifying as much publicly available data from the 517 language varieties we treat as possible. We adopt this manual curation approach since only a few African languages have any LID tool coverage. In addition, available LID tools that treat African languages tend to perform unreliably (Kreutzer et al., 2021). We therefore consult research papers focusing on African languages, such as Adebara and Abdul-Mageed (2022), or providing language data (Muhammad et al., 2022; Alabi et al., 2020), sifting through references to find additional African data sources. Moreover, we search for newspapers across all 54 African countries. We also collect data from social media such as blogs and web fora written in African languages, as well as databases that store African language data. These include

AfroLID
Experimental Dataset and Splits. From our manually-curated dataset, we randomly select 5,000, 50, and 100 sentences for train, development, and test, respectively, for each language. Overall, AfroLID data comprises 2,496,980 sentences for training (Train), 25,850 for development (Dev), and 51,400 for test (Test) across the 517 languages and language varieties.

Preprocessing. We ensure that our data represent naturally occurring text by performing only minimal preprocessing. Specifically, we tokenize our data into character, byte-pair, and word units. We do not remove diacritics, and we use both precomposed and decomposed characters to cater for the inconsistent use of precomposed and decomposed characters by many African languages in digital media. We create our character-level tokenization scripts and generate our vocabulary using Fairseq. We use the SentencePiece tokenizer for the word-level and byte-pair tokens before we preprocess in Fairseq.
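The paper does not include its preprocessing scripts; the following is a minimal sketch of the kind of SentencePiece BPE step described above, assuming a plain-text file (here called train.txt, one sentence per line) and the 64K BPE vocabulary reported in the next paragraph. The file name and the example sentence are illustrative assumptions, not artifacts from the paper.

```python
# Minimal sketch of a SentencePiece BPE step like the one described above.
# "train.txt" and the example sentence are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.txt",          # one sentence per line, all 517 languages pooled
    model_prefix="afrolid_bpe",
    vocab_size=64000,           # BPE vocabulary size reported in the paper
    model_type="bpe",
    character_coverage=1.0,     # keep all characters, including diacritics
)

sp = spm.SentencePieceProcessor(model_file="afrolid_bpe.model")
print(sp.encode("Mo fẹ́ràn èdè Yorùbá", out_type=str))   # subword pieces
```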
Vocabulary. We experiment with byte-pair (BPE), word, and character-level encodings. We use vocabulary sizes of 64K, 100K, and 2,260 for the BPE, word, and character-level models, respectively, across the 517 language varieties. The characters include letters, diacritics, and symbols from the non-Latin scripts of the respective languages. We select our best models on Dev based on F1. For all our models, we report the average of three runs.
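As a concrete illustration of the evaluation just described, the snippet below computes accuracy and macro-averaged F1 over ISO-3 labels with scikit-learn; the label lists are placeholders, not AfroLID outputs.

```python
# Sketch of the evaluation protocol: accuracy and macro-averaged F1 over
# ISO-3 language labels. y_true / y_pred are placeholder values.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["yor", "hau", "amh", "swh", "yor"]
y_pred = ["yor", "hau", "amh", "swh", "ibo"]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```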

Model Performance and Analysis
AfroLID in Comparison. Using our Dev and Test data, we compare our best AfroLID model (the BPE model) with the following LID tools: CLD2, CLD3, Franc, LangDetect, and Langid.py. Since these tools do not support all our AfroLID languages, we compare accuracy and F1-scores of our models only on languages supported by each of these tools. As Tables A.1 and 4 show, AfroLID outperforms the other tools on 7 and 8 out of 16 languages on the Dev set and Test set, respectively. We also compare the F1-scores of Franc on the 88 African languages Franc supports with the F1-scores of AfroLID on those languages. As shown in Tables 5 and 6, AfroLID outperforms Franc on 78 languages and has a similar F1-score on five languages on the Dev set. AfroLID also outperforms Franc on 76 languages, and has a similar F1-score on five languages, on the Test set.

Effect of Non-Latin Script. We investigate the performance of AfroLID on languages that use one of the Arabic, Ethiopic, Vai, and Coptic scripts. Specifically, we investigate the performance of AfroLID on Amharic (amh), Basketo (bst), Maale (mdy), Sebat Bet Gurage (sgw), Tigrinya (tir), Xamtanga (xan), Fulfude Adamawa (fub), Fulfude Caka (fuv), Tarifit (rif), Vai (vai), and Coptic (cop). Vai and Coptic, the two unique scripts in AfroLID, have an F1-score of 100 each. This corroborates research findings that languages written in unique scripts within an LID tool can be identified with up to 100% recall, F1-score, and/or accuracy even using a small training dataset (Jauhiainen et al., 2017a). We assume this to be the reason Langid.py outperforms AfroLID on Amharic, as seen in Table 4, since Amharic is the only language that employs an Ethiopic script in Langid.py. AfroLID, on the other hand, has 8 languages using Ethiopic scripts. However, it is not clear why Basketo, which uses an Ethiopic script, has a 100 F1-score. We did, however, find errors in Amharic, Sebat Bet Gurage, and Xamtanga (which use Ethiopic scripts) as well as Fulfude Adamawa and Fulfude Caka (which use Arabic scripts). We find that languages using Ethiopic scripts are often confused with other languages using Ethiopic scripts (except for 2% of the time when Amharic is labelled as Wolof). We categorize this example under "others" in Figures 5 and B.1. On the other hand, Fulfude languages are wrongly labelled as other dialects of Fulfude that use Latin scripts. We visualize further details of these errors in Figure B.1 (in the Appendix) and Figure 5 for our Dev and Test sets, respectively.

Creole Languages. We investigate the performance of AfroLID on Creole languages. Creole languages are vernacular languages that emerged as a result of trade interactions between speakers of mutually unintelligible languages (Lent et al., 2022). A Creole language therefore shares lexical items and grammatical structures with one or more different languages. We find that Guinea-Bissau Creole (pov), which is Portuguese based, is wrongly labelled as Kabuverdianu (kea), another Portuguese-based Creole, 1% of the time. Cameroonian Pidgin (wes) is also wrongly labelled as Nigerian Pidgin (pcm) 7% of the time. Since both Cameroonian and Nigerian Pidgin are English based, we assume lexical and/or grammatical similarities are responsible for these errors. It is also interesting to find cases where the wrong labels are languages spoken in the same geographical regions as the Creoles. For example, Kituba is wrongly labelled as Yombe, and both languages are spoken in Congo. Mauritian Creole (mfe), which is French based, is also wrongly labelled as Seychelles Creole (crs, another French-based Creole) and two Indigenous languages spoken in Francophone Africa,
Ngiemboon and Masana. We now further investigate the role of geographical proximity in our results.
Effect of Geographic Proximity. We evaluate the performance of AfroLID on languages that share a large number of lexical items, or that are spoken within the same country. In this analysis, we focus on 10 South African languages: Afrikaans (afr), Ndebele (nbl), Sepedi (nso), Sotho (sot), Swati (ssw), Tswana (tsn), Tsonga (tso), Tshivenda (ven), Xhosa (xho), and Zulu (zul). We select South Africa because most South Africans are multilingual, and it is not uncommon to find code-mixing using a combination of Indigenous languages within the same text (Finlayson and Slabbert, 1997; Mabule, 2015). Figures B.3 (in the Appendix) and 7 show the types of errors AfroLID makes in identifying these languages on our Dev and Test datasets, respectively. We find that about ∼70% of the errors involve other South African languages. Another 16% involve dialects from neighbouring countries, including Tswa, a dialect of Tsonga; Ndebele (Zimbabwe), which is similar to Zulu; and Ronga, a dialect of Tsonga. We now provide a number of case studies we carry out to further probe AfroLID performance.

Diagnostic Case Studies
Although AfroLID is not trained on Twitter data, we evaluate its performance on Twitter to investigate the robustness of our models in out-of-domain scenarios. Namely, we carry out two diagnostic case studies using Twitter data. In the first study, which we refer to as Twitter in the wild, we use unannotated tweets crawled from the web. In the second, we use annotated tweets. We now turn to the details of these studies.

Case Study I: AfroLID in the Wild
In order to evaluate the utility of AfroLID in a real-world scenario, we collect 700M tweets from Africa. For this, we use the Twitter streaming API from 2021-2022 with four geographical bounding boxes (central, eastern, western, and southern Africa). We extract a random sample of 1M tweets from this larger Twitter dataset for our analysis. Twitter currently automatically labels a total of 65 languages. Only one of these languages, i.e., Amharic, is an African language among our 517 languages. In the 1M sample, 110 tweets were tagged as "Amharic" and 6,940 as "undefined" by Twitter. We run our model on the "undefined" data. In all, the 6,940 tweets were identified as belonging to 242 African languages by AfroLID. Since the tweets we used were unannotated, we are not able to determine the number of tweets wrongly classified by AfroLID for each language. For this reason, we only evaluate a subset of the predicted languages: we ask native speakers of three languages (Yorùbá, Hausa, and Nigerian Pidgin) to help identify each tweet that was classified by AfroLID as belonging to their language. We provide details of this annotation study and examples of annotated samples in Table D.1 (Appendix D). We find that AfroLID is able to correctly identify Yorùbá both with and without diacritics, as well as in code-mixed examples. A total of 16 tweets are classified as Yorùbá by AfroLID, of which 7 are correct (43.75%), 2 are mixed with English, and 7 are wrongly labelled.
Of the wrongly labelled tweets, one is identified as Nigerian Pidgin, while the others are in unknown languages. For Nigerian Pidgin, of the 28 tweets predicted, 2 are correct (12.50%), 1 is mixed with an unknown language, and the others are wrongly classified. We find that in most cases, tweets classified as Nigerian Pidgin are code-mixed with English and another Indigenous language. This gives us an indication that AfroLID identifies Nigerian Pidgin as an English-based Creole. Finally, a total of 333 tweets are classified as Hausa. Of these, 105 examples are correct (37.50%), 18 are mixed, while the others are wrongly labelled.

Case Study II: AfroLID on AfriSenti
We also test the performance of AfroLID on the recently released AfriSenti Twitter dataset of African languages. AfriSenti (Muhammad et al., 2022; Yimam et al., 2020) contains ∼56,000 tweets annotated for sentiment in Amharic, Hausa, Igbo, Nigerian Pidgin, Swahili, and Yorùbá. We run AfroLID and the Franc tool on AfriSenti. As Figure 8 shows, AfroLID outperforms Franc on all languages except Nigerian Pidgin. We assume this is because Franc supports English and may have learnt some lexical/grammatical information from English to aid the identification of Nigerian Pidgin (although AfroLID outperforms Franc on Nigerian Pidgin on our Dev and Test sets, as shown in Tables 5 and 6).

Conclusion
We introduced our novel African language identification tool, AfroLID. To the best of our knowledge, AfroLID is the first publicly available tool that covers a large number of African languages and language varieties. AfroLID also has the advantages of wide geographical coverage (50 African countries) and linguistic diversity. We demonstrated the utility of AfroLID on non-Latin scripts, Creoles, and languages in close geographical proximity. We also empirically showed AfroLID's superiority to five available tools, including performance in the wild on the much-needed Twitter domain. In the future, we plan to extend AfroLID to cover the top 100 most popular languages of the world as well as code-switched texts.

Limitations
We can identify a number of limitations for our work, as follows:

• AfroLID does not cover high-resource, popular languages that are in wide use by large populations. This makes it insufficient as a stand-alone tool in real-world scenarios where many languages are used side-by-side. Extending AfroLID to more languages, however, should be straightforward since training data is available. Indeed, it is our plan to develop AfroLID in this direction in the future.
• AfroLID recognizes only Indigenous African languages in monolingual settings. This limits our tool's utility in code-mixed scenarios (although Creoles are like code-mixed languages). This is undesirable especially because many African languages are commonly code-mixed with foreign languages due to historical reasons (Adebara and Abdul-Mageed, 2022). Again, to improve accuracy in the future, it would be beneficial to add support for foreign languages in code-mixed settings, such as English, French, and Portuguese.
• Although we strive to test AfroLID in real-world scenarios, we were not able to identify native speakers except for a small number of languages. In the future, we plan to work more with the community to enable wider analyses of our predictions.

Ethical Considerations
Although LID tools are useful for a wide range of applications, they can also be misused. We release AfroLID hoping that it will be beneficial to wide audiences, such as native speakers in need of better services like health and education. Our tool is also developed using publicly available datasets that may carry biases. Although we strive to perform analyses and diagnostic case studies to probe the performance of our models, our investigations are by no means comprehensive, nor do they guarantee the absence of bias in the data. In particular, we do not have access to native speakers of most of the languages covered in AfroLID. This hinders our ability to investigate samples from each (or at least the majority) of the languages. We hope that future users of the tool will be able to make further investigations to uncover AfroLID's utility in wide real-world situations.

Table A.1: A comparison of results on AfroLID with CLD2, CLD3, Langid.py, LangDetect, and Franc using F1-score on the Dev set. A dash ("−") indicates that the tool does not support the language.

B Analysis of AfroLID
We perform the experiments on non-Latin scripts, Creoles, and languages in close geographical proximity on the Dev set, as in Section 5. We show the results on the performance of AfroLID on non-Latin scripts in Table B.1, Creole languages in Table B.2, and geographical proximity in Table B.3, respectively.

C Extended Literature Review

C.1 Datasets

Datasets for LID are often created using various genres of data for one or more languages. For multilingual LID, which is the focus of our work, documents are gathered from web pages containing multiple languages. Web pages of multilingual organizations are also often desirable because the same text is translated into various languages.
Most datasets for multilingual LID cover European languages and many other high-resource languages, making the AfroLID dataset a significant contribution to AfricaNLP. To the best of our knowledge, the AfroLID dataset is the first publicly available dataset for multilingual language identification for African languages. We provide details of some other publicly available corpora for LID. The DSL Corpus Collection (Tan et al., 2014; Malmasi et al., 2016; Zampieri et al., 2014, 2015) is a multilingual collection of short excerpts of journalistic texts. Another dataset is collected from three different learner corpora of Portuguese, including COPLE2, the Leiria corpus, and PEAPL. The three corpora contain written productions from learners of Portuguese with different proficiency levels and native languages. The dataset includes all the data in COPLE2 and sections of PEAPL2 and the Leiria corpus, with details of the dataset in Table C.1. The dataset therefore includes texts corresponding to the following 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Japanese, Korean, Polish, Romanian, Russian, Swedish, Spanish, and Tetum.
The Wanca 2017 Web Corpora (Jauhiainen et al., 2020) is made up of re-crawls performed by the SUKI project. The target of the re-crawl was to download and check the availability of the then-current version of the Wanca service of about 106,000 pages. This list of 106,000 HTTP addresses was the result of several earlier web crawls, in which the authors had identified the language of a total of 3,753,672,009 pages.
EUROGOV, TCL, and WIKIPEDIA (Baldwin and Lui, 2010) consist of documents with a single encoding across 10 European languages; shorter documents across different encodings for 60 languages; and Wikipedia web crawls for 67 languages, respectively. These collections cover different genres, with EuroGov collected from government documents, TCL from online news sources, and Wikipedia from encyclopedia dumps.
The UMass Global English on Twitter Dataset (Blodgett et al., 2017) contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages, annotated for being in English, being non-English, containing code-switching, being linguistically ambiguous, or having been automatically generated. It includes messages sent from 130 different countries.

C.2 Features
Different features can be used for training a LID system, including the following:

• Bytes and Encoding: Some encodings use a fixed number of bytes (e.g., ASCII), while others use variable-length encoding. Some languages also use specific encodings (GuoBiao 18030 or Big5 for Chinese), while the same encoding can be used for different languages (e.g., UTF-8).
• Characters: Non-alphabetic characters, alphabets, capitalization, and the number of characters in words and word combinations have been used as features. Non-alphabetic characters have been used to detect languages like Arabic, as well as emojis and other languages that use non-alphabetic characters (Samih, 2017; Bestgen, 2017; Dongen, 2017). Alphabets can also be used to exclude languages when a unique character is absent from the test document.
• Character combination: Co-occurrences of certain characters can be used to detect some languages. Linguistically, some languages abhor certain combinations of characters that other languages allow. For example, some Niger-Congo languages abhor vowel hiatus, and every consonant must be followed by a vowel. This feature has been found useful for developing LID systems (van der Lee and van den Bosch, 2017; Dongen, 2017; Martinc et al., 2017); see the character n-gram sketch after this list.
• Morphemes, Syllables, and Chunks: Different morphological features, including prefixes, suffixes, and character n-grams, have been used (Gomez et al., 2017). Syllables, chunks, and chunks of syllables/n-grams have also been used for LID. This also has linguistic significance in that the prefixes, suffixes, and morphological information embedded in a language can provide information about the etymology of a language.
• Words: The position of words (Adouane and Dobnik, 2017), the string edit distance and n-gram overlap between the word to be identified and words in dictionaries, a dictionary of unique words in a language, a basic dictionary of a language, most common words, and word clusters, among others, are some discriminating features used for LID.
• Combination of words: Here, the length of words and the ratios to the total number of words of once-occurring words, twice-occurring words, short words, long words, function words, adjectives and adverbs, personal pronouns, and question words are some of the features used (van der Lee and van den Bosch, 2017). This feature is linguistically significant since the ratio of certain categories of words can be useful for identifying some languages.
• Syntax and Part-of-Speech (POS) tags: Syntactic features can be used to identify languages. Identifying an adjective before a noun, for instance, may be a good indication for some languages, and even the available tags can be a useful feature. Syntactic parsers together with dictionaries and morpheme lexicons, as well as n-grams composed of POS tags and function words, have all been used as features for LID (Adouane and Dobnik, 2017).
• Languages identified for surrounding words in word-level LID: The language of surrounding words can also be a useful feature, since there may be a higher likelihood of some languages being used together. This is especially true in the case of code-switching, where some languages are more likely to be used together than others (Dongen, 2017).
• Feature smoothing: Feature smoothing is required in order to handle cases where not all features in a test document have been attested in the training corpora. Feature smoothing is used in low-resource scenarios and when the frequency of some features is high. Different types of feature smoothing are possible; one of them is additive smoothing, where an extra number of occurrences is added to every possible feature in the language model (Jauhiainen et al., 2019). The sketch below illustrates character n-gram features combined with additive smoothing.
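To make the character-combination and smoothing features above concrete, here is a small, self-contained sketch (not AfroLID's method): character n-gram language models with additive smoothing, scored on a toy example. The training sentences and the α value are illustrative assumptions.

```python
# Illustrative character n-gram language models with additive smoothing.
# This is a toy sketch of the features discussed above, not AfroLID's method.
from collections import Counter
from math import log

def char_ngrams(text, n=3):
    padded = f" {text} "                 # pad so word boundaries form n-grams
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train_lm(sentences, n=3):
    return Counter(g for s in sentences for g in char_ngrams(s, n))

def log_prob(text, lm, vocab_size, alpha=0.5, n=3):
    total = sum(lm.values())
    return sum(log((lm[g] + alpha) / (total + alpha * vocab_size))
               for g in char_ngrams(text, n))

# Toy training data; a real system would use thousands of sentences per language.
lms = {
    "yor": train_lm(["mo fẹ́ràn rẹ", "báwo ni o ṣe wà"]),
    "swh": train_lm(["ninakupenda sana", "habari ya asubuhi"]),
}
vocab_size = len(set().union(*lms.values()))
test = "habari yako"
print(max(lms, key=lambda lang: log_prob(test, lms[lang], vocab_size)))  # expected: swh
```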

C.3 Methods
Algorithms for LID work by first extracting one or more features and then using a classification algorithm to determine the appropriate language for a text (Grothe et al., 2008; Jauhiainen et al., 2019).
Hidden Markov Models (HMM) Hidden Markov Models (HMMs) are commonly used in spoken language identification (Zissman and Berkling, 2001; Yan and Barnard, 1995) as well as for written language (Guzman et al., 2016). Language models are first trained for each language that the system must know about using text corpora, and stored for later comparison with unidentified text. In these models, the parameters of the HMM are the transition probabilities and the initial probabilities. Probabilities are calculated using the relative frequency of each transition or initial state in the training data.
After training, the system calculates the sequence probability using each language model that has been trained (Padró and Padró, 2004).
N-Gram-Based Text Categorization This method, introduced by Cavnar and Trenkle (1994) (see also Grothe et al., 2008), is based on comparing n-gram frequency profiles. Frequencies are sorted in decreasing order over all unique n-grams. N-gram profiles are created for each language to be trained, with n = 1 to 5. To classify a piece of text, the n-gram frequency profile for that text is built and compared to the n-gram profiles calculated during the training phase. This is done by computing the distance between the n-gram profile of the text and that of each language model. The computation also penalizes the total score of a language for each missing n-gram. The language with the lowest score is selected as the identified language (Jauhiainen et al., 2017a; Padró and Padró, 2004). A minimal sketch of this rank-order comparison follows.
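The snippet below is a condensed illustration of the out-of-place distance just described; the training phrases, profile size, and penalty value are illustrative assumptions rather than values from the original paper.

```python
# Toy illustration of rank-order ("out-of-place") n-gram profile comparison.
from collections import Counter

def profile(text, n_max=5, top_k=300):
    grams = Counter()
    padded = f" {text} "
    for n in range(1, n_max + 1):
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    # map each of the top_k n-grams to its rank in the frequency profile
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top_k))}

def out_of_place(doc_profile, lang_profile, penalty=300):
    # missing n-grams receive the maximum penalty
    return sum(abs(rank - lang_profile.get(g, penalty))
               for g, rank in doc_profile.items())

lang_profiles = {
    "hau": profile("ina kwana lafiya kalau yaya kake"),
    "yor": profile("báwo ni ṣé o wà dáadáa ni mo wà"),
}
doc = profile("yaya kake ina lafiya")
print(min(lang_profiles, key=lambda l: out_of_place(doc, lang_profiles[l])))  # expected: hau
```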
LIGA This is a graph-based n-gram approach that was originally used for sentiment analysis (Tromp, 2011) and adapted for LID (Vogel and Tresner-Kirsch, 2012). The language models use the relative frequencies of character trigrams and 4-grams. To identify the language of a text, the relative frequency of each trigram and 4-gram found in a language model is added to the score of that language. The language with the highest score is selected as the language of the text.

HeLI Method
The HeLI method (Jauhiainen et al., 2017b) uses character n-gram based language models for each language. The n-gram orders are hyperparameters ranging from one to a specific maximum number N_max. The model selects one language model when classifying the language of a text, based on the most applicable model for the specified text. The model gradually backs off to lower-order n-grams whenever an n-gram of order N_max cannot be applied, until an applicable n-gram is found. The validation set is used during evaluation to determine the best values for N_max, the maximum number of features to be included in the language models, and the penalty for languages without the selected feature. The penalty functions like a smoothing parameter by transferring some of the probability mass to unseen features in the language model (Jauhiainen et al., 2017a). A simplified sketch of the back-off idea follows.
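The following is a deliberately simplified sketch of the back-off idea only (per-position, longest-match-first scoring); it omits HeLI's word-level normalization and model-selection details, and the toy model values are assumptions.

```python
# Simplified HeLI-style back-off: try the longest n-gram first, then fall back
# to shorter ones; apply a fixed penalty when no n-gram of any order is found.
def backoff_score(word, lm, n_max=6, penalty=7.0):
    padded = f" {word} "
    total = 0.0
    for i in range(len(padded)):
        for n in range(n_max, 0, -1):            # back off: long -> short
            gram = padded[i:i + n]
            if len(gram) == n and gram in lm:
                total += lm[gram]                # lm maps n-gram -> negative log-prob
                break
        else:
            total += penalty                     # unseen at every order
    return total / len(padded)

toy_lm = {" ha": 1.2, "hab": 1.0, "bar": 1.4, "ari": 1.5, "a": 2.0, "h": 2.2,
          "b": 2.5, "r": 2.4, "i": 2.1, " ": 0.5}
print(backoff_score("habari", toy_lm))
```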
Whatlang program This uses language models built with n-grams of variable byte lengths between 3 and 12 (Brown, 2013). The K most frequent n-grams and their relative frequencies are extracted and calculated for each language. Once the first model is generated, substrings of larger n-grams are filtered out if the larger n-gram has a frequency not less than 62% of the frequency of the shorter n-grams. The model weights are computed for each language such that shorter n-grams with the same relative frequency have lower weights than larger n-grams. This is because larger n-grams are more informative but less common.
C.4.1 CLD2

CLD2 (McCandless, 2010) covers 83 languages and is trained on web-page text, using one of three different token algorithms. CLD2 probabilistically detects over 86 languages, including Afrikaans and Swahili. Input is Unicode UTF-8 text, either plain text or HTML/XML; legacy encodings must be converted to valid UTF-8. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes (e.g., 80% English and 20% French out of 1000 bytes of text means about 800 bytes of English and 200 bytes of French). Optionally, it also returns a vector of text spans with each language identified.
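As an illustration, CLD2 can be queried from Python through the pycld2 binding; the call below is an assumption about that binding's interface rather than anything shown in the paper, and the example sentence is illustrative.

```python
# Querying CLD2 via the pycld2 binding (assumed interface; not from the paper).
import pycld2

is_reliable, text_bytes_found, details = pycld2.detect(
    "Hierdie sin is in Afrikaans geskryf."
)
print(is_reliable)   # whether CLD2 considers the prediction reliable
print(details)       # up to three (language name, code, percent, score) tuples
```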

C.4.3 EquiLID
EquiLID (Jurgens et al., 2017) is a character-based DNN encoder-decoder model (Cho et al., 2014; Sutskever et al., 2014) with an attention mechanism (Bahdanau et al., 2015). EquiLID is a general-purpose language identification library and command-line utility built to identify a broad range of languages, recognize language in social media (with a particular emphasis on short text), recognize dialectal speech from a language's speakers, identify code-switched text in any language pairing at least at the phrase level, and provide whole-message and per-word predictions. EquiLID covers 70 languages, including Amharic.

C.4.5 Franc
Franc supports 403 languages, including 88 African languages. It is built using Universal Declaration of Human Rights (UDHR) documents translated into multiple languages. Details of the model architecture are not available; however, there is an indication that n-grams are used in the model.

C.4.6 LangDetect
LangDetect (Shuyo, 2010) covers 49 languages, including Afrikaans and Swahili. LangDetect uses a huge dictionary of inflections and compound words over a naive Bayes model with character n-grams.
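For reference, the Python port of LangDetect can be exercised as below; the seed is set because its output is otherwise non-deterministic, and the example sentences are illustrative.

```python
# Example use of the langdetect package (Python port of LangDetect).
from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0  # make the probabilistic detector deterministic
print(detect("Hii ni sentensi fupi ya Kiswahili."))          # e.g. 'sw'
print(detect_langs("Hierdie sin is in Afrikaans geskryf."))  # ranked candidates
```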

C.4.7 Langid.py
Langid.py (Lui and Baldwin, 2012) covers 97 languages, including Afrikaans, Amharic, Malagasy, Kinyarwanda, Swahili, and Zulu. The model is trained with a naive Bayes classifier with a multinomial event model using a mixture of byte n-grams. Langid.py was designed to be used off-the-shelf. It comes with an embedded model using training data drawn from five domains (government documents, software documentation, newswire, online encyclopedia, and an internet crawl), though no domain covers the full set of languages by itself, and some languages are present only in a single domain. Different aspects of langid.py are evaluated in different ways. For cross-lingual feature selection evaluation, each dataset is partitioned into two sets of equal size. The first partition is used for training a classifier while the second is used for evaluation. Since each dataset covers a different set of languages, there may be languages in the evaluation dataset that are not present in the training dataset (Lui and Baldwin, 2011). The langid.py module, on the other hand, is evaluated on different datasets and its accuracy is compared with those of CLD, TextCat, and LangDetect. The accuracy of Langid.py exceeded those of the other tools on two Twitter datasets (Lui and Baldwin, 2012). Langid.py can be used as a command-line tool, Python library, or web service. Other LID tools without representation of African languages include LDIG and the Microsoft LID-tool (Gella et al., 2013, 2014), which is a word-level language identification tool for identifying code-mixed text of languages (like Hindi) written in Roman script and mixed with English.
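For comparison with the description above, langid.py can be used off-the-shelf as follows; the example sentences are illustrative, and restricting the candidate set is optional.

```python
# Example use of langid.py; classify() returns an ISO 639-1 code and a score.
import langid

print(langid.classify("Lo mhlangano uzoqala ngehora lesithathu."))  # e.g. ('zu', ...)

# Optionally restrict predictions to the African languages langid.py supports.
langid.set_languages(["af", "am", "mg", "rw", "sw", "zu"])
print(langid.classify("Habari ya leo?"))                            # e.g. ('sw', ...)
```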

D Twitter Analysis
For the Twitter in the wild analysis, we ask for annotations of yes, no, or mixed on each tweet, where yes indicates agreement with the predicted label, no indicates disagreement, and mixed indicates that the tweet contains one or more languages other than the predicted one. We also ask for further annotations if the tweet is not in the predicted language or is mixed with other language(s). In these cases, respondents are asked to identify the correct language (or mixed languages) if they know the language(s). We provide example annotations from the in-the-wild analysis in Table D.1.

E Languages Covered in AfroLID
AfroLID supports 517 African languages and language varieties. We show a large map indicating the countries and languages represented in Figure E.1. Figures E.2 and E.3 show the number of languages covered in each country and the language family information for the languages. We also show the languages and language codes in Tables E.1, E.2, and E.3.

Table D.1: Some example annotations for the Twitter in the wild analysis. We show for each language the four possible annotations.

Figure 1: All 50 African countries in our data, with our 517 languages/language varieties in colored circles overlaid within respective countries. More details are in Appendix E.

Figure 2: Examples from the five scripts in our data.

Figure 5: Errors on the different scripts in the AfroLID Test set. We use ISO-3 codes to represent the languages. "Others" refers to languages AfroLID identifies as outside the list of languages selected for analysis.

Figure 6: Errors on the different Creoles in AfroLID. We use ISO-3 codes to represent the languages. "Others" refers to languages AfroLID identifies as outside the list of languages selected for analysis.

Figure 7: Errors on Indigenous South African languages in the AfroLID Test data. "Others" refers to languages AfroLID identifies as outside the list of languages selected for analysis.

Figure B.1: Errors on the different scripts in the AfroLID Dev set. We use ISO-3 codes to represent the languages. "Others" refers to languages AfroLID identifies as outside the list of languages selected for analysis.

Figure B.2: Errors on the different Creoles in AfroLID. We use ISO-3 codes to represent the languages. "Others" refers to languages AfroLID identifies as outside the list of languages selected for analysis.

Figure B.3: Errors on Indigenous South African languages in the AfroLID Dev data. "Others" refers to languages AfroLID identifies as outside the list of languages selected for analysis.

Figure E.1: All 50 African countries in our data, with our 517 languages/language varieties in colored circles overlaid within respective countries.

Figure E.3: Percentage of languages per family in the training dataset.

Table 1: Sentential word order in our data.


Table 2: Non-Latin scripts in AfroLID data. ⋆ Oromo is available in Latin script as well.

Implementation. AfroLID is built using a Transformer architecture trained from scratch. We use 12 attention layers with 12 heads in each layer and 768 hidden dimensions, making up ∼200M parameters.

Hyperparameter Search and Training. To identify our best hyperparameters, we use a subset of our training data and the full development set. Namely, we randomly sample 200 examples from each language in our training data to create a smaller train set, while using our full Dev set. We train for up to 100 epochs, with early stopping. We search over a range of hyperparameter values, picking the best-performing ones as our final configuration.
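The released training recipe (Fairseq) is not reproduced in the paper; the PyTorch sketch below only makes the stated architecture sizes concrete (12 layers, 12 heads, 768 dimensions, 64K BPE vocabulary, 517 classes). Details such as the feed-forward width, embedding tying, and pooling are assumptions, so its parameter count will not exactly match the reported ∼200M.

```python
# Rough sketch of a Transformer encoder classifier with the stated sizes.
# This is NOT the released Fairseq recipe; it only illustrates the model scale.
import torch
import torch.nn as nn

class LIDClassifier(nn.Module):
    def __init__(self, vocab_size=64000, num_langs=517, d_model=768,
                 n_heads=12, n_layers=12, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, num_langs)

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.embed(token_ids) + self.pos(positions)
        h = self.encoder(h)
        return self.out(h.mean(dim=1))   # mean-pool over tokens, predict language

model = LIDClassifier()
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```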

As Table 3 shows, our BPE model outperforms both the char and word models on both Dev and Test data. On Dev, our BPE model acquires 96.14 F1 and 96.19 accuracy, compared to 85.75 F1 and 85.85 accuracy for the char model, and 90.22 F1 and 90.34 accuracy for the word model, respectively. Our BPE model similarly excels on Test, with 95.95 F1 and 96.01 accuracy. A further set of languages are between 95-99 F1, while 56 languages (10.83%) are between 90-95 F1.

Table 4: A comparison of results on AfroLID with CLD2, CLD3, Langid.py, LangDetect, and Franc using F1-score on the Test set. A dash ("−") indicates that the tool does not support the language.

Table 5: F1-scores on our Dev dataset for the 88 languages supported by both AfroLID and Franc.

Table 6: F1-scores on our Test dataset for the 88 languages supported by both AfroLID and Franc.
William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161-175.

Appendices

A Results of AfroLID on Dev Set

Table C.3: Language varieties that use diacritics in our training data.