LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

Knowing the language of an input text/audio is a necessary first step for using almost every natural language processing (NLP) tool such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, most of the world's 7000 languages are not supported by current systems. This lack of representation affects large-scale data mining efforts and further exacerbates data shortage for low-resource languages. We take a step towards tackling the data bottleneck by compiling a corpus of over 50K parallel children's stories in 350+ languages and dialects, and the computation bottleneck by building lightweight hierarchical models for language identification. Our data can serve as benchmark data for language identification of short texts and for understudied translation directions such as those between Indian or African languages. Our proposed method, Hierarchical LIMIT, uses limited computation to expand coverage into excluded languages while maintaining prediction quality.


Introduction
Building natural language processing (NLP) tools like machine translation, language identification, and part-of-speech (POS) taggers increasingly requires more data and computational resources. To attain good performance on a large number of languages, model complexity and data quantity must be increased. However, for a majority of the world's 7000 languages, large amounts of data are often unavailable, which creates a high barrier to entry (Blasi et al., 2022; Joshi et al., 2020; Khanuja et al., 2023). Increasing model complexity for large-scale models also requires a disproportionate amount of computational resources, further disincentivizing researchers from working to include these languages in modern NLP systems.

Figure 1: Most languages in our dataset are from the Indian Subcontinent and Sub-Saharan Africa, with significant minorities from Europe (primarily in the role of the high-resource-language parallel translation available for each story). Color broadly indicates continent or region (North America, South America, Africa, Europe, Asia, Oceania) and size indicates the number of languages per country in our dataset.
A popular data collection approach is large-scale web mining (Tiedemann and Nygaard, 2004; Bañón et al., 2020; Schwenk et al., 2021b), where large parts of the internet are scoured to find training data for data-hungry NLP algorithms. When faced with a sentence or phrase, such algorithms must reliably sort this text into the appropriate language bucket. Since the web is replete with content in a variety of languages, a model needs to recognize text in a sufficiently large number of these languages with high accuracy. Identifying parallel bitext is even more demanding, as a machine translation system must also be available to correctly identify and align parallel data (Vegi et al., 2022; Kunchukuttan et al., 2018). This data-collection paradigm becomes inaccessible for low-resource languages because high-quality translation models usually require substantial amounts of parallel data for training, which is often unavailable. Without high-quality language identification and translation systems, it becomes practically impossible to mine the internet for relevant text during such collection efforts. Additionally, mispredictions by language identification and data collection algorithms can increase inter-class noise, reducing the crawled data's quality and harming performance in downstream tasks without strong quality evaluation metrics (Kocyigit et al., 2022).
How can we address these challenges and build high-quality identification and translation for lowresource languages?
Resource Creation Highlighting the need for resource creation in low-resource languages, we first share a new parallel children's stories dataset, MCS-350, created using two resources: the African Storybooks Initiative and the Indian non-profit publisher Pratham Books' digital repository Storyweaver (both available under permissive Creative Commons licenses). The combined dataset includes original and human-translated parallel stories in over 350 languages (visualized in Figure 1), and we merge, preprocess, and structure it so it is easily utilizable by NLP researchers for training and benchmarking ( §2).
Machine Translation Armed with parallel stories in many low-resource African and Indian languages, we next tackle machine translation in resource-constrained settings. If we aim to collect parallel data in low-resource languages, language identification itself is insufficient; we need high-quality translation models as well. We utilize a pretrained multilingual translation model (Alam and Anastasopoulos, 2022) and explore training with hierarchical language-level and language-family-level adapter units to translate children's stories at the page level ( §3).
Language Identification Finally, we take on the biggest bottleneck in low-resource language data collection efforts: language identification. We propose LIMIT, a misidentification-based hierarchical modeling approach for language identification that uses data and computational resources efficiently and shows cross-domain generalization. The proposed approach is exciting because, unlike previously published language identification models such as AfroLID (Adebara et al., 2022), CLD3 (Salcianu et al., 2020), and Franc, LIMIT avoids training large multilingual models for a new set of languages and still outperforms existing systems. Large multilingual models often require thousands of sentences for training; e.g., AfroLID (Adebara et al., 2022) collects and trains on over 4000 sentences per language. On the other hand, for many low-resource languages in India and Africa, we may not even be able to collect 1000 sentences at first. Also, in contrast with other recent work in hierarchical language identification (Goutte et al., 2014; Lui et al., 2014; Bestgen, 2017; Jauhiainen et al., 2019), our work stands out because it accounts for mispredictions made by existing trained models. Unlike other work, it does not predict a group/language family first, but rather directly learns confusion relationships between language pairs (which may not be from the same language family). By leveraging hierarchically organized units on top of a root model, we avoid complete retraining, saving computational resources while increasing coverage of many new and understudied languages and language pairs (especially those between two low-resource languages) ( §4).
To summarize, our main contributions are:
1. We compile MCS-350, a dataset of 50K+ parallel children's stories from the African Storybooks Initiative and Storyweaver in 350+ languages ( §2).
2. We share a machine translation benchmark enabling translation evaluation in more than 1400 new translation directions ( §3).
3. We introduce LIMIT, a misprediction-based hierarchical language identification method that expands coverage to new languages without retraining large multilingual models ( §4).

Parallel Dataset
The African Storybooks Initiative hosts parallel translated and human-verified children's stories in over 200 African languages. Pratham Books is a non-profit Indian publisher that aims to increase literacy of children and adults alike in Indian languages. Their digital repository, Storyweaver, publishes parallel translated stories in 300+ languages, including not only Indian languages but also African, European, and Indigenous languages from the Americas.

We collect stories through a mix of web scraping and public APIs, preprocess them to remove mismatched/incorrect text, and extract monolingual text for language identification and parallel text for machine translation. We maintain metadata about authors, translators, illustrators, reading level, parallel translations, and copyrights for each story. We remove stories that are either empty or that come from non-English languages yet have over 50% of pages containing majority-English text, detected with 90% confidence using langdetect (Nakatani, 2010). This leaves us with ∼52K stories. Note that both the African Storybooks Initiative and Pratham Storyweaver human-verify stories and their languages. However, there are several abandoned translation projects and completed but unverified stories that need automated checking. Our preprocessing is therefore aimed at unverified stories and may introduce noise into the collected data. By improving the preprocessing filters, we can likely further improve the quality of the unverified stories in the corpus. Collected stories in the pre-merge stage are available with their associated metadata in the repository.
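As a rough illustration of this filter, the sketch below is our own minimal reconstruction (not the released preprocessing script; the page-list structure is assumed): a story tagged with a non-English language is flagged when more than half of its pages are detected as English by langdetect with at least 0.9 confidence.

```python
# Minimal sketch of the English-contamination filter described above.
from langdetect import detect_langs
from langdetect.lang_detect_exception import LangDetectException

def is_mostly_english(pages, confidence=0.9, page_fraction=0.5):
    """pages: list of page texts for one story (assumed structure)."""
    if not pages:
        return False
    english_pages = 0
    for page in pages:
        try:
            guesses = detect_langs(page)  # list of Language(lang, prob), sorted by prob
        except LangDetectException:       # empty or undetectable page: skip
            continue
        if guesses[0].lang == "en" and guesses[0].prob >= confidence:
            english_pages += 1
    return english_pages / len(pages) > page_fraction

# A non-English story would be dropped when is_mostly_english(story_pages) is True.
```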


Multilingual Documents
MCS-350 contains multilingual stories, with language identifiers denoted by L1_L2 for a story multilingual in L1 and L2. Such stories include text in multiple languages within the same page; the text may be code-mixed or presented consecutively. To extract as many parallel sentences as possible, both to support vulnerable languages and to create new translation directions, we employ string-similarity based matching to identify the segments corresponding to the high-resource language in the pair, thereby automatically generating parallel sentences from 10K pages across 52 languages. For example, through this process we extracted 1000+ sentences in Kui (0 sentences pre-extraction), a minority Dravidian language with about 900K native speakers. We manually verified all extracted monolingual text after applying string matching to the multilingual stories.
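The paper specifies only "string-similarity based matching", so the following sketch is one plausible instantiation using Python's difflib; the segment/sentence inputs and the 0.8 threshold are illustrative assumptions, not the released extraction code.

```python
# Minimal sketch: on a bilingual page, segments that closely match the story's
# high-resource translation are treated as the high-resource side; the rest is
# kept as low-resource text, yielding parallel candidates.
from difflib import SequenceMatcher

def best_match(segment, high_resource_sentences):
    """Return (score, sentence) for the most similar high-resource sentence."""
    scored = ((SequenceMatcher(None, segment, s).ratio(), s) for s in high_resource_sentences)
    return max(scored, default=(0.0, None))

def split_bilingual_page(segments, high_resource_sentences, threshold=0.8):
    low_resource, aligned = [], []
    for seg in segments:
        score, match = best_match(seg, high_resource_sentences)
        if score >= threshold:
            aligned.append((seg, match))   # segment belongs to the high-resource side
        else:
            low_resource.append(seg)       # keep as low-resource (e.g., Kui) text
    return low_resource, aligned
```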

Language Varieties/Lects
We attempt to separate language varieties/lects into unique prediction classes if there is sufficient training data for them (≥ 1000 sentences). If an ISO code is unavailable for the lect, we assign a class name with the parent ISO code and the subdivision specified as ISO_SUBDIVISION. For instance, we separated Gondi's South Bastar lect (GON_BASTAR, 4000+ sentences) from the generic language code for Gondi (GON). For fair evaluation and comparison, we provide manual mappings for any non-standard identifiers from the output space of various language identification tools. Lects with too little data are merged into their parent language, e.g., "Bangla (Bangladesh)" is merged into "Bengali".

Data Overview
MCS-350 covers over 350 languages from a diverse pool of language families. In Table 1, we share the prominent language families in the dataset. Compared to existing benchmarks such as FLORES-200 (n-way, 200 languages; NLLB Team et al., 2022) or OPUS-100 (parallel data for 99 languages to/from English; Aharoni et al., 2019), our benchmark introduces up to 82 new languages, leading to more than 1400 new language pairs (see Table 2).

Machine Translation Benchmark
While resource creation in low-resource languages requires fine-grained and high-quality language identification, collecting parallel data additionally requires high-quality MT ( §1). In this section, we explore phylogeny-based hierarchical adapter units to improve translation quality between two African languages, and between African languages and English/French.

Data
We exploit the parallel nature of children's stories in MCS-350 and ensure that all training stories are separate from the test stories (1000 pages per language). This gives a more realistic estimate of translation quality on new stories. For languages with fewer than 1000 pages across stories, we use 500-page test sets.
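A minimal sketch of such a story-level split is shown below; the story dictionary fields, the random shuffling, and the greedy assignment are assumptions for illustration rather than the released split script.

```python
# Minimal sketch: test pages never come from a story seen in training; each language
# gets a ~1000-page test set (500 pages if the language has fewer than 1000 pages).
import random

def split_language(stories, seed=0):
    """stories: list of dicts like {"story_id": ..., "pages": [...]} for one language."""
    total_pages = sum(len(s["pages"]) for s in stories)
    target = 1000 if total_pages >= 1000 else 500
    rng = random.Random(seed)
    rng.shuffle(stories)  # note: shuffles in place

    train, test, test_pages = [], [], 0
    for story in stories:               # assign whole stories until the target is met
        if test_pages < target:
            test.append(story)
            test_pages += len(story["pages"])
        else:
            train.append(story)
    return train, test
```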

Experimental Settings
As our baseline, we used the model from Alam and Anastasopoulos (2022), which is the best-performing publicly available model from the WMT Shared Task on Large Scale Evaluation for African Languages (Adelani et al., 2022). They first fine-tuned the DeltaLM model (Ma et al., 2021) on 26 languages. After that, they added lightweight language-specific adapter layers (Pfeiffer et al., 2022) and fine-tuned only the adapters on those 26 languages. We can either use a single adapter per language (L-Fine) or organize the adapters in a phylogenetically-informed hierarchy (F-Fine) so that similar languages share language-family and genus-level adapters (Faisal and Anastasopoulos, 2022). We perform both L-Fine and F-Fine experiments using the publicly available code and also provide an additional baseline by fine-tuning the DeltaLM model without adapters. Details on the phylogenetic trees and reproducibility are in Appendix §A.3.
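The sketch below illustrates the general idea of phylogeny-informed adapter stacking with bottleneck adapters; it is not the DeltaLM-based implementation from the released code, and the phylogeny mapping shown is a hypothetical stand-in for the tree in Appendix §A.3.

```python
# Minimal sketch of F-Fine-style adapter sharing: each language reuses family- and
# genus-level adapters with its relatives and adds its own language-level adapter.
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, d_model=768, d_bottleneck=64):
        super().__init__()
        self.down, self.up, self.act = nn.Linear(d_model, d_bottleneck), nn.Linear(d_bottleneck, d_model), nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class HierarchicalAdapterStack(nn.Module):
    def __init__(self, phylogeny, d_model=768):
        super().__init__()
        self.phylogeny = phylogeny  # e.g. {"hau": ("afro-asiatic", "chadic")} (hypothetical)
        nodes = {n for fam_gen in phylogeny.values() for n in fam_gen} | set(phylogeny)
        self.adapters = nn.ModuleDict({n: Adapter(d_model) for n in nodes})

    def forward(self, hidden, lang):
        family, genus = self.phylogeny[lang]
        for node in (family, genus, lang):  # shared adapters first, language adapter last
            hidden = self.adapters[node](hidden)
        return hidden
```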

Evaluation
In Table 4, we show the performance of our L-Fine and F-Fine models compared to the baseline on our test set. We evaluate using three well-known MT metrics: BLEU (Papineni et al., 2002), CHRF++ (Popović, 2017), and spBLEU (NLLB Team et al., 2022). For spBLEU, we use the FLORES-200 SPM model to create subwords.
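For reference, these scores can be reproduced with sacrebleu roughly as in the sketch below (a minimal sketch, not the exact evaluation script; the "flores200" spBLEU tokenizer requires a recent sacrebleu release).

```python
# Minimal sketch of corpus-level BLEU, chrF++, and spBLEU scoring with sacrebleu.
import sacrebleu

hyps = ["the cat sat on the mat"]           # system translations (toy example)
refs = [["the cat is sitting on the mat"]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs, word_order=2)          # word_order=2 -> chrF++
spbleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="flores200")  # FLORES-200 SPM subwords

print(f"BLEU={bleu.score:.1f} chrF++={chrf.score:.1f} spBLEU={spbleu.score:.1f}")
```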
Based on all three metrics, our L-Fine model consistently outperforms the Baseline model by 4.0-11.5 spBLEU points by fine-tuning only language-specific adapters. Our F-Fine model outperforms the L-Fine model by 5.0-7.5 spBLEU points by fine-tuning only some parameters shared among languages in addition to the language-specific adapters. We also test our models on a public benchmark, FLORES-200 (Appendix §B), and observe that, due to the domain shift, the L-Fine and F-Fine models under-perform the Baseline.
Despite this domain shift, several low-resource language pairs benefit from adapter fine-tuning across domains. We report these language pairs and their respective spBLEU gains for the F-Fine model in Table 5. We obtain the highest gains for English-Xhosa (20.1 points) and English-Hausa (18.8 points) across domains, both of which performed poorly with the Baseline model (spBLEU of 3.5 and 4.5, respectively). We also notice cross-domain improvement in some translation directions involving two African languages, such as Ganda-Kinyarwanda (2.9 points) and Northern Sotho-Ganda (3.0 points). Exhaustive results for other language pairs can be found in Appendix §B.

Language (Mis)Identification Benchmark
Language identification (LID) severely affects low-resource language resource creation efforts (Jauhiainen et al., 2019; Schwenk et al., 2021a): to collect data, we need accurate language identifiers, which themselves need high-quality data to train (Burchell et al., 2023), creating a vicious cycle. Low-quality systems often make mispredictions, which increases inter-class noise and reduces the crawled data's quality (Kocyigit et al., 2022; Burchell et al., 2023) both for the predicted language and the true language. To correct mispredictions and improve accuracy in supported languages with limited data, we propose a hierarchical modeling approach. Hierarchical modeling is an extremely popular choice for a wide variety of algorithmic tasks, and it has been explored for language identification as well (Goutte et al., 2014; Lui et al., 2014; Bestgen, 2017; Jauhiainen et al., 2019). However, previous work has focused on predicting the language group/family first, followed by finer-grained predictions within a smaller set of classes. Our work departs from this paradigm in two ways: first, we focus on expanding the language identification coverage of pre-trained or off-the-shelf systems without retraining, and second, we predict a prior and a posterior language based on the confusion and misprediction patterns of the model directly (without predicting a language family/group first).
Under our technique, we first choose a well-performing, high-coverage root model that provides us with the base/prior prediction. Such base predictions are obtained for a sample of MCS-350's training set, allowing us to identify systemic confusion patterns embedded within the model using a confusion matrix. Based on the identified misprediction patterns (which may or may not be between languages in the same family), we train lightweight confusion-resolution subunits that can be attached onto the root model to make the posterior prediction. Our results show that, with this architecture, a small sample of data is sufficient to investigate pretrained, off-the-shelf, or black-box commercial models and identify systemic misprediction patterns across domains.
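A minimal sketch of this two-stage inference is given below; the function interfaces and cluster labels are illustrative assumptions, not the released LIMIT code.

```python
# Minimal sketch of hierarchical inference: the root model makes the prior prediction;
# if that label falls in a known confusion cluster, the matching confusion-resolution
# unit makes the posterior prediction, otherwise the root's label is kept.
from typing import Callable, Dict, FrozenSet

def limit_predict(
    text: str,
    root_predict: Callable[[str], str],                 # e.g. a wrapper around Franc
    units: Dict[FrozenSet[str], Callable[[str], str]],  # cluster -> trained subunit
) -> str:
    prior = root_predict(text)
    for cluster, unit_predict in units.items():
        if prior in cluster:
            return unit_predict(text)  # posterior prediction within the cluster
    return prior  # no known confusion pattern: keep the root prediction

# Example wiring (hypothetical labels):
# units = {frozenset({"guj", "kutchi", "bhilori"}): gujarati_cluster_unit}
```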

Experimental Settings
Wide-Coverage root Model To pick an appropriate root model to test our misidentification-based hierarchical approach, we compare several state-of-the-art pre-trained models ( §4.2) and choose the system with the highest macro-F1 score, giving equal importance to all languages in MCS-350.
Traditional Hierarchical group-first Model Classical hierarchical models predict the language family/group first, followed by the specific language (Goutte et al., 2014; Lui et al., 2014; Bestgen, 2017; Jauhiainen et al., 2019). These groups often have phylogenetic backing and are not learned from the output distribution of the root model.
For benchmarking, we train this traditional hierarchical group model as well (Table 7).

N-Way Multilingual multi Model
To contrast our work with typical large-scale multilingual modeling, where there is no architectural class hierarchy, we train a large fastText multilingual model on all 350+ languages (multi). With a large number of classes, low-resource languages suffer from class imbalance, even with upsampling. We nevertheless include performance results for multi in Table 7 to compare it with the two hierarchical approaches.
LIMIT's Confusion-resolution Units We use fastText (Joulin et al., 2017) to train small models that each specialize in distinguishing between 2-3 highly-confused languages. Up to 1000 sentences per language are used for training, and 100 randomly selected sentences across stories are reserved as the final test set. We train our own embeddings because existing multilingual embeddings (Devlin et al., 2019) are not trained on sufficiently broad low-resource language data.
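Training one such confusion-resolution unit with fastText might look roughly like the sketch below; the file layout, label names, and hyperparameters are illustrative assumptions rather than the exact training recipe.

```python
# Minimal sketch of one LIMIT confusion-resolution unit: a small fastText classifier
# that only separates the languages inside one highly-confused cluster,
# e.g. Gujarati / Kutchi / Bhilori.
import fasttext

# train.txt holds up to 1000 lines per language in fastText supervised format, e.g.
# __label__guj <sentence>
# __label__kutchi <sentence>   (hypothetical labels for illustration)
model = fasttext.train_supervised(
    input="clusters/gujarati_kutchi_bhilori/train.txt",
    epoch=25, lr=0.5, wordNgrams=2, minn=2, maxn=5,  # character n-grams help short texts
)
model.save_model("clusters/gujarati_kutchi_bhilori/unit.bin")

labels, probs = model.predict("તમારું નામ શું છે?")  # posterior prediction within the cluster
print(labels[0], probs[0])
```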

Evaluation Metric
To select a root model, performance is compared based on aggregated macro-F1 scores across languages (Table 6). To compare the performance of the root model and LIMIT (our proposed approach) on MCS-350, our benchmark dataset, and on the existing FLORES-200 benchmark, we report language-level F1 scores (Table 7).
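For concreteness, the macro-averaged F1 used for root-model selection can be computed as in the sketch below (scikit-learn is used purely for illustration; any F1 implementation works).

```python
# Minimal sketch: macro-averaged F1 weighs every language class equally, so
# low-resource languages count as much as high-resource ones.
from sklearn.metrics import f1_score

y_true = ["guj", "kutchi", "amh", "tir", "ben"]  # gold language labels (toy example)
y_pred = ["guj", "guj", "amh", "amh", "ben"]     # root-model predictions
print(f1_score(y_true, y_pred, average="macro"))
```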

Pre-trained root Models
In Table 6, we show macro-F1 scores across all 350+ languages for popular pretrained identification systems like Google's CLD3, langid.py, Franc, fastText (Joulin et al., 2017), and HeLI-OTS (Jauhiainen et al., 2022a). Franc, built using the Universal Declaration of Human Rights (UDHR) data, comes out with the best macro-F1, covering 30% of our languages (105/356 languages). It is derived from guess-language (https://github.com/kent37/guess-language), which uses a mix of writing system detection and character-level trigrams. Hence, we use Franc as the root system for our misprediction-based hierarchical modeling experiments. The overall low scores on human-written sentences in MCS-350 (all systems achieve an F1 score < 0.20) are worth noting, and indicate that off-the-shelf systems ultimately tend to perform well only on some languages, despite officially supporting hundreds of languages.

Language (Mis)identification
Next, we inspect the best-performing root model's confusion matrix on MCS-350's training set (a representative example is shown in Figure 2) to understand and identify misprediction patterns. For each test language, we divide the root model's predictions by the total number of tested examples, giving us a hit ratio for each pair; e.g., (Gujarati, Kutchi) would represent the ratio of Kutchi sentences that were misidentified as Gujarati. Upon inspection of the confusion matrix, we identified the following 9 clusters with a high confusion ratio (> 0.7). Following the approach outlined in §4.1, we train a lean fastText classifier for each of these clusters, specialized in differentiating between its highly-confused languages (a minimal sketch of this clustering step follows the list):
1. Gujarati, Kutchi, Bhilori
2. Amharic, Tigrinya, Silt'e
3. Koda, Bengali, Assamese
4. Mandarin, Yue Chinese
5. Konda, Telugu
6. Kodava, Kannada
7. Tsonga, Tswa
8. Dagaare, Mumuye
9. Bats, Georgian
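The sketch below shows one way such clusters can be read off the confusion matrix; the pair representation and threshold handling are our own illustrative choices, not the released analysis script.

```python
# Minimal sketch: normalize each true-language row into hit ratios and flag pairs
# whose off-diagonal ratio exceeds 0.7; flagged pairs form the confusion clusters.
from collections import Counter, defaultdict

def confused_pairs(predictions, threshold=0.7):
    """predictions: iterable of (true_lang, predicted_lang) from the root model."""
    counts = defaultdict(Counter)
    for true_lang, pred_lang in predictions:
        counts[true_lang][pred_lang] += 1

    flagged = []
    for true_lang, preds in counts.items():
        total = sum(preds.values())
        for pred_lang, n in preds.items():
            ratio = n / total  # hit ratio for the (pred_lang, true_lang) pair
            if pred_lang != true_lang and ratio > threshold:
                flagged.append((true_lang, pred_lang, ratio))
    return flagged

# e.g. ("kutchi", "guj", 0.92) would flag Kutchi being misread as Gujarati, so the
# two languages are placed in the same confusion-resolution cluster.
```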

Expanded Language Coverage
We report F1 scores for each of the 9 highly confused clusters' languages (Table 7) and observe that the languages in each cluster share writing systems and are often phylogenetically related. Our misidentification-based model, LIMIT, improves F1 scores on both our newly collected MCS-350 dataset and the public benchmark FLORES-200. On MCS-350, LIMIT improves the F1 score from 0.29 to 0.68, a 55% error reduction. On the cluster languages also available in FLORES-200 (11/21 languages), LIMIT improves F1 from 0.77 to 0.86, a 40% error reduction, demonstrating that our method's utility is not restricted to the training data's domain.
Note that hierarchical modeling could be viewed as further complicating a simple root model, but we contend that it is valuable when retraining is not an option due to lack of data, closed-source code, etc. ( §5). This simple extension allows us to extend a high-coverage root model to new languages or domains that have small amounts of training data, while maintaining high-quality predictions. Furthermore, our hierarchical method LIMIT also outperforms multi, a system trained on all the languages in the test set.

Sentence Length and Domain
For several languages like Gujarati, Amharic, Bengali, and Mandarin, low F1 scores on MCS-350 compared to high F1 scores on FLORES-200 indicate that shorter texts in the children's stories domain are much harder to identify. This is expected due to the limited feature signal in shorter texts, but it is worth noting that this is the opposite of our findings in the machine translation task ( §3.3), where translating shorter texts in MCS-350 proved easier than translating FLORES-200 data. Our misprediction-based hierarchical approach is not only easier to train with limited data, but also brings valuable cross-domain language identification improvements.

Related Work

Datasets

Language identification models tend to use popular training datasets like UDHR (Vatanen et al., 2010), Blodgett et al. (2017) for social media, King and Abney (2013) (a web crawl in 30 languages), FLORES-200, and JW-300 (Agić and Vulić, 2019) (multilingual articles from the Jehovah's Witnesses' website). A recently published dataset, BLOOM (Leong et al., 2022), leverages text and audio in children's stories from similar sources (African Storybooks, The Asia Foundation, Little Zebra Books, etc.) to create benchmarks for image captioning and speech recognition. However, their data is monolingual, unaligned, and cannot be used for machine translation. We leveraged the highly parallel nature of the collected storybooks (five times the number of stories in BLOOM) and created test sets and baselines for understudied translation directions.
It is also important to us to avoid representation washing (Caswell et al., 2020), so we clearly highlight the sources of noise from unverified stories in our merged dataset. With stricter preprocessing filters applied at the pre-merge stage, a 'cleaner' dataset could be produced, as in Burchell et al. (2023). We provide access to our data at all such timesteps in the preprocessing pipeline, so researchers are not required to use the final dataset but may use an earlier raw version and preprocess it themselves according to their needs.

Machine Translation
Thousands of languages are spoken worldwide, so representing them with bilingual models would require thousands of models; neither scalability nor adaptability makes this an ideal solution. A variety of research has attempted to improve multilingual translation models through training methods (Aharoni et al., 2019; Wang et al., 2020), model structures (Wang et al., 2018; Zhang et al., 2021), and data augmentation (Tan et al., 2019; Pan et al., 2021). Adapter units were initially proposed for lightweight domain adaptation (Vilar, 2018) and were later also used to extend large pre-trained models to downstream tasks and as bilingual adapters (Houlsby et al., 2019; Bapna and Firat, 2019).

Language Identification
Text-based language identification is usually modelled as a classification task. As the number of languages a classifier must predict increases, average accuracy generally tends to decrease (Jauhiainen et al., 2017), a problem we propose to tackle by leveraging a misprediction-based hierarchical approach. To distinguish between closely related languages, a lot of exciting research has been published at various editions of VarDial, the Workshop on NLP for Similar Languages, Varieties and Dialects (Aepli et al., 2022; Scherrer et al., 2022; Chakravarthi et al., 2021; Zampieri et al., 2020, 2014).
Over the last 3 iterations of VarDial, from 2019 to 2022, many new datasets and techniques were published to identify Romance languages (Jauhiainen et al., 2022b; Zaharia et al., 2021), Nordic languages (Haas and Derczynski, 2021), Uralic languages (Jauhiainen et al., 2020), German lects (Mihaela et al., 2021; Siewert et al., 2020), and the Slavic language continuum (Popović et al., 2020; Abdullah et al., 2020). In contrast, we see only a handful of papers and tasks on Indian languages at the venue, with 2 focusing on Indo-Aryan and 2 on Dravidian languages (Nath et al., 2022; Bhatia et al., 2021; Jauhiainen et al., 2021; Chakravarthi et al., 2020), and, to our knowledge, no papers or tasks on African languages. Outside the venue, recently published models like AfroLID (Adebara et al., 2022) for language identification and IndicTrans2 (AI4Bharat et al., 2023) for Indic-language translation are great large-scale efforts in the low-resource language space. Brown (2014), a notable technique, trains richer embeddings with non-linear mappings and achieves substantial improvements in downstream language identification on 1400+ languages. However, we do not benchmark against this technique because the paper does not contain any experiments in low-resource training setups: its training data is about 2.5 million bytes per language, while we work with under 50K bytes per language. We therefore leave exploring non-linear embedding mappings in low-resource settings (Brown, 2014) to future work.

Hierarchical Modeling
Hierarchical approaches have proved successful for a myriad of computational problems, and have previously proved useful for language identification. The widely used approach first predicts a preliminary language group/family, and then makes a fine-grained prediction from the smaller set of output classes contained within that group/family (Goutte et al., 2014; Lui et al., 2014; Bestgen, 2017; Jauhiainen et al., 2019). In contrast, our work extends the architecture to account for mispredictions made by existing trained models, and it does not predict a group/language family first but rather directly learns confusion relationships between language pairs. Then, similar to Bestgen (2017) and Goutte et al. (2014), we train smaller classifiers for a fine-grained posterior prediction. However, our approach departs from their paradigm in that our classifiers may also distinguish between highly-confused languages that belong to different language families.

Conclusion
In this work, we tackle the lack of resources for many of the world's languages and release MCS-350, a large, massively parallel children's stories dataset covering languages from diverse language families, writing systems, and reading levels. Since translation is crucial for parallel resource creation, we explore adapter-based networks fine-tuned with a phylogenetic architecture and utilize MCS-350 to create new translation benchmarks for vulnerable and low-resource languages. We demonstrate large improvements in the children's story domain and cross-domain improvement for several language pairs (on the FLORES benchmark dataset). On the algorithmic front, we introduce LIMIT, a hierarchical, misprediction-based approach to counter the inaccuracies of pre-trained language identification systems. Our method increases language coverage and prediction accuracy while bypassing complete retraining, and shows cross-domain generalization despite being trained on our MCS-350 dataset.
In the future, we hope to further investigate misprediction-based hierarchical language identification across more datasets, with more configurations and with extensions such as probabilistic branching and automated construction. As a natural next step, we will utilize LIMIT in a web crawl to find and collect more low-resource language data.

Limitations
Our dataset covers 350+ languages in text form. However, of the 7000 languages in the world, many are primarily spoken and do not have a written presence in the form of articles, textbooks, stories, etc. Therefore, language identification for speech is crucial, and we plan to extend our text-based work to speech in the future.
While our proposed method LIMIT shows cross-domain improvements, we acknowledge that our system, like other language identification systems, is not perfect and may still make classification errors on new domains, text lengths, or orthographies. We encourage researchers to keep this in mind when applying our proposed method to their work.

Ethics Statement
Data used, compiled, and preprocessed in this project is freely available online under Creative Commons licenses (CC BY 4.0). Stories from the African Storybooks Initiative (ASI) are openly licensed and can be used without asking for permission and without paying any fees. We acknowledge the writers, authors, translators, and illustrators of each of the books, and the ASI team for creating such a valuable repository of parallel storybooks in African languages. Stories from Pratham Books' Storyweaver portal are available under open licensing as well, and we preserve metadata for the author, illustrator, translator (where applicable), publisher, copyright information, and donor/funder for each book, in accordance with Storyweaver's guidelines. Since stories hosted on the African Storybooks Initiative and Pratham Books' Storyweaver are intended for children, and most of them are vetted or human-verified, we do not explicitly check for offensive content.
Our language identification models, by design, are meant to provide an alternative to training resource-hungry, large-scale multilingual models that require a lot of training data. Such models are inaccessible to many researchers since they require access to specialized computing hardware. Our models are built with sustainability and equity in mind, and can be trained in a matter of minutes on a CPU on standard laptops.

References

Tommi Jauhiainen, Heidi Jauhiainen, and Krister Lindén. 2022b. Italian language and dialect identification and regional French variety detection using adaptive naive Bayes. In Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 119-129, Gyeongju, Republic of Korea. Association for Computational Linguistics.

Tommi Jauhiainen, Krister Lindén, and Heidi Jauhiainen. 2017. Evaluation of language identification methods using 285 languages. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 183-191, Gothenburg, Sweden. Association for Computational Linguistics.

A Reproducibility
In this section, we outline how to reproduce the different aspects of our work. Data collection, data preprocessing, machine translation experiments and evaluation, and language identification experiments have all been completed in a manner that is fully reproducible.

A.1 Data Curation
All data can be replicated and reproduced through code/data-collection.
Intermediate preprocessing steps can be applied through code/preprocessing, data can be merged through code/merging, and summary statistics can be produced through code/summary-stats. Data paths are set up so that any retrieved, preprocessed, or merged data is located in data/.
A.2 Language ID

code/language-id/ contains the relevant scripts to replicate all language identification experiments, training, model architecture, and results. Relevant language identification data is decoupled from the code directory and is located in data/language-id.

A.3 Machine Translation
Our machine translation experiments are performed using publicly available code from https://github.com/mahfuzibnalam/large-scale_MT_African_languages.
To produce results regarding novel translation directions enabled by our data, please refer to code/new_lang_pairs. Table A.1 shows the phylogeny configuration we use to fine-tune the MT system.

B Supplementary Machine Translation Benchmarks
On the following pages, we report the aggregate evaluation results of our MT models on the FLORES200 devtest across 176 translation directions (BLEU, CHRF++, spBLEU). We also report baseline, language-fine, and family-fine BLEU, CHRF++, and spBLEU scores for all language pairs on which we perform machine translation experiments (the WMT shared task's African focus languages).


Figure 2: Subset of the multilingual root model's confusion matrix (6 languages). Using the confusion matrix, clusters of highly confused languages are identified and confusion-resolution units trained according to the tree shown on the right. The tree, for demonstration purposes, is a subset of the entire tree, which has 9 confusion-resolution units.

Table 1: Our compiled dataset MCS-350 contains stories from a diverse set of language families, mostly from Africa and India. Prominent language families with 20K+ sentences across languages are shown.

Table 2: MCS-350 enables MT evaluation between 1400+ new language pairs compared to existing benchmarks.

Table 3: Our dataset contains stories in many writing systems other than Latin, especially those from the Indian Subcontinent. Prominent non-Latin writing systems in MCS-350 are shown above.


Table 4: Averaging spBLEU across 176 translation directions involving African languages, we see that including phylogenetic information helps in translation, with the family-based F-Fine model showing the best performance on average. Avg AFRI→AFRI denotes the overall average spBLEU of translation between two African languages. Avg X/Y→ENG/FRA and Avg ENG/FRA→X/Y denote translating into and out of English/French, respectively. Parentheses below the averages represent standard deviations. Baseline refers to a DeltaLM model fine-tuned on 26 languages without adapters. We can see that it is harder to translate out of English than into English.

Table 6: Different popular language identification systems, their F1 scores on MCS-350, supported languages, common languages, and total coverage with LIMIT. Franc, trained on UDHR data, outperforms the other systems on both performance and coverage, and serves as the root model for our experiments. The macro F1 score is computed across all 355+ languages to identify the system with the best overall coverage and accuracy.

Model | F1 | Supported | Common | Total (with LIMIT)
CLD3 (Salcianu et al., 2020) | 0.11 | 101 | 81 | 376

Table 7: LIMIT improves F1 scores over the root, multi, and group models on both our children's stories dataset and the out-of-domain FLORES-200. The traditional hierarchical approach group underperforms the multilingual model multi on both MCS-350 and FLORES-200. Empty entries indicate unsupported languages and bolded entries indicate noteworthy differences in F1 scores. Nested languages are misidentified as the parent in root. Note that for FLORES-200, the root model gets a 0 F1 score on ASM and YUE although both languages are covered by the dataset.

Table B.1: Evaluation results on our test set of 176 language directions. Avg X→ENG denotes the average score of directions between other languages and English. Avg ENG→X denotes the average score of directions between English and other languages. Avg AFRICAN→AFRICAN denotes the average score of directions between African languages and other African languages. Avg Y→FRA denotes the average score of directions between other languages and French. Avg FRA→Y denotes the average score of directions between French and other languages. Avg ALL denotes the average result over all translation directions.

Table B.2: Evaluation results on the FLORES200 devtest of 176 language directions. Column definitions are the same as in Table B.1.

Table B.5: Results for all language pairs on the FLORES200 devtest.