Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, “help” from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.


Introduction
The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., deepening their understanding of high-resource languages by scaling up parameters and training data. While this approach has revolutionized NLP, the achievements are largely limited to high-resource languages. Examples of "vertical" LLMs are GPT3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022) and Bloom (BigScience et al., 2022). In this paper, we create Glot500-m, a model that instead focuses on scaling multilingual LLMs horizontally, i.e., scaling to a large number of languages, almost all of which are low-resource. As LLMs are essential for progress in NLP, the lack of LLMs supporting low-resource languages is an enormous impediment to bringing NLP to all of the world's languages and cultures. Our goal is to address this need with the creation of Glot500-m. Existing multilingual LLMs support at most 104 (Conneau et al., 2020) out of the 7000 languages of the world. These 104 languages are the ones for which large amounts of training data are available through projects such as Oscar (Suárez et al., 2019) and the Wikipedia dumps. We refer to these 104 languages as head languages and to the remaining languages as tail languages. One key problem we address is that the availability of data for tail languages is limited compared to head languages. As a result, tail languages have often been ignored by language technologies (Joshi et al., 2020).
Although there exists some work on machine translation for a large number of tail languages (Bapna et al., 2022), existing LLMs for tail languages are limited to a relatively small number of languages (Wang et al., 2019; Alabi et al., 2022; Wang et al., 2022). In this paper, we address this gap. Our work has three parts. (i) Corpus collection. We collect Glot2000-c, a corpus covering thousands of tail languages. (ii) Model training. Using Glot500-c, a subset of Glot2000-c, we train Glot500-m, an LLM covering 511 languages. (iii) Validation. We conduct an extensive evaluation of the quality of Glot500-m's representations of tail languages on a diverse suite of tasks.
In more detail, corpus collection considers three major sources: websites that are known to publish content in specific languages, corpora with classified multilingual content and datasets published in specific tail languages. The resulting dataset Glot2000-c comprises 700 GB in 2266 language-script pairs collected from ≈150 sources. After cleaning and deduplication, we create the subset Glot500-c, consisting of 534 language-script pairs corresponding to 511 languages (those with more than 30,000 sentences), to train Glot500-m.
Model training. To train Glot500-m, we employ vocabulary extension and continued pretraining techniques. The vocabulary of XLM-R is extended with new tokens trained on Glot500-c. Subsequently, we perform continued pretraining of XLM-R using the MLM objective described by Devlin et al. (2019).
Validation. We comprehensively evaluate Glot500-m on a diverse suite of natural language understanding, sequence labeling and multilingual downstream tasks to assess the quality of the learned representations for hundreds of languages. The results demonstrate that Glot500-m performs better than XLM-R-B for tail languages by a large margin while performing comparably (or better) for head languages.
Previous work on multilinguality has been hindered by the lack of LLMs supporting a large number of languages. This limitation has led to studies being conducted in settings dissimilar from real-world scenarios. For example, Dufter and Schütze (2020) use synthetic language data. And the curse of multilinguality has been primarily studied for a set of high-resource languages (Conneau et al., 2020). By creating Glot500-m, we can investigate these issues in a more realistic setting. We make code, data, and trained models available in the hope of fostering more research by the community on how to include the many hundreds of languages that are currently ill-served by NLP technology.
Contributions. (i) We train the multilingual model Glot500-m on a 700 gigabyte corpus, covering more than 500 diverse languages, and make it publicly available at https://github.com/cisnlp/Glot500. (ii) We collect and clean Glot500-c, a corpus that covers these diverse languages and allows us to train Glot500-m, and will make as much of it publicly available as possible. (iii) We evaluate Glot500-m on pseudoperplexity and on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. (iv) Our extensive analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, "help" from related languages and the total capacity of the model. (v) Our work addresses an important goal of NLP research: we should not limit NLP to a relatively small number of high-resource languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures.

Related Work
Training unsupervised multilingual models using the masked language modeling (MLM) objective has proven to be effective to achieve cross-lingual representations (Devlin et al., 2019;Conneau et al., 2020). These models can be further improved by incorporating techniques such as discriminative pretraining (Chi et al., 2022) and the use of parallel data (Yang et al., 2020;Chi et al., 2021). However, these improvements primarily benefit a limited set of languages with large training corpora.
Recent research has attempted to extend existing LLMs to languages with limited resources. Alternatively, parameter-efficient fine-tuning methods adapt pretrained models to new languages by training only a small set of weights (Zhao et al., 2020; Pfeiffer et al., 2021; Ansell et al., 2022). Pfeiffer et al. (2022) address the "curse of multilinguality" by sharing a part of the model among all languages and having separate modules for each language. We show that the common perception, namely that multilingual quality improves as more languages are added until, at some point, it starts to degrade, is naive. The amount of available data per language and the similarity between languages also play important roles (§6.8).
Another approach trains LLMs from scratch for a limited number of tail languages; e.g., AfriBERTa (Ogueji et al., 2021a) and IndicNLPSuite (Kakwani et al., 2020) are LLMs for 11 African languages and 11 Indic languages, respectively. In concurrent work, Adebara et al. (2022) train a multilingual model for 517 African languages on a 42 gigabyte corpus, but without making the model available and with an evaluation that covers a much smaller number of languages than ours.
Closely related to our work on corpus creation, Bapna et al. (2022) and Costa-jussà et al. (2022) also create NLP resources for a large number of tail languages. They train a language identifier model and extract textual data for tail languages from large-scale web crawls. This approach is effective, but it requires significant computational resources and native speakers for all tail languages. This is only feasible for large corporations. Bapna et al. (2022) have not made their data available. Costa-jussà et al. (2022) have only released a portion of their data in around 200 languages.
One of the key benefits of "horizontally" scaled multilingual LLMs is transfer from high-resource to low-resource languages. Our evaluation suggests that Glot500-m excels at this, but this is not the main focus of our paper. There is a large body of work on crosslingual transfer: (Lauscher et al., 2020;Turc et al., 2021;Choenni and Shutova, 2022), inter alia.

Data Collection
One of the major challenges in developing language technologies for tail languages is the scarcity of high-quality training data. In this work, we propose a lightweight methodology that is easily replicable for academic laboratories. We identify tail language data previously published by researchers, publishers and translators and then crawl or download them. By crawling a few websites and compiling data from around 150 different datasets, we amass more than 700 GB of text in 2266 language-scripts, where we define a language-script as a unique combination of ISO 639-3 and script. From now on, we will refer to these sources of data as data sources. Our data covers many domains, including religious texts, news articles and scientific papers. Some of the data sources are high-quality, verified by native speakers, translators and linguists. Others, such as web crawls and Wikipedia dumps, are less reliable. It is therefore necessary to clean the data. For a list of data sources, see §C.

Language-Scripts
Some languages are written in multiple scripts; e.g., Tajik is written in both Cyrillic and Arabic scripts. Some data sources indicate the script, but others either do not or provide mixed text in multiple scripts. We detect the script for each sentence and treat each language-script as a separate entity.

Ngram LMs and Language Divergence
We train a 3-gram character-level language model M_i for each language-script L_i, using KenLM (Heafield, 2011). We refer to the perplexity calculated for the corpus of language L_i using language model M_j as PP(M_j, L_i). Similar to Gamallo et al. (2017), we define a perplexity-based divergence measure of languages L_i and L_j as:

D(L_i, L_j) = \max\left(PP(M_j, L_i),\; PP(M_i, L_j)\right)

We use D to filter out noisy data in §3.4 and study the effect of similar languages in LLM training in §6.7 and §6.8. For more details, see §A.
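The following is a minimal sketch of this divergence computation, assuming that 3-gram character-level KenLM models have already been built (e.g., with lmplz) for each language-script; the file paths, the character-tokenization convention and the per-line averaging are illustrative assumptions, not the exact pipeline used for Glot500-c.

```python
import kenlm


def char_tokenize(line: str) -> str:
    # KenLM treats whitespace-separated items as tokens, so for a
    # character-level model we expose individual characters as tokens.
    return " ".join(line.strip().replace(" ", "_"))


def corpus_perplexity(model: kenlm.Model, corpus_lines) -> float:
    # Average per-line perplexity of a corpus under a given model.
    ppls = [model.perplexity(char_tokenize(line)) for line in corpus_lines]
    return sum(ppls) / max(len(ppls), 1)


def divergence(model_i, corpus_i, model_j, corpus_j) -> float:
    # D(L_i, L_j) = max(PP(M_j, L_i), PP(M_i, L_j)): the maximum makes the
    # measure symmetric and robust to languages that are inherently "easy"
    # to model (see Appendix A).
    return max(corpus_perplexity(model_j, corpus_i),
               corpus_perplexity(model_i, corpus_j))


# Hypothetical usage with model/corpus files that are assumed to exist:
# m_tgk = kenlm.Model("lm/tgk_Cyrl.arpa")
# m_tat = kenlm.Model("lm/tat_Cyrl.arpa")
# d = divergence(m_tgk, open("corpora/tgk_Cyrl.txt"),
#                m_tat, open("corpora/tat_Cyrl.txt"))
```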

Data Cleaning
To detect and remove noise, we use both corpus-level and chunk-level filters. A corpus-level filter detects if the majority of the corpus of a language-script is noisy; e.g., the corpus is in another language or consists of non-meaningful content such as tabular data. Chunk-level filters are applied without performing sentence splitting. Instead, each chunk of text received from a data source is processed as a unit. While some sources are sentence-split in the original, others provide multiple sentences (e.g., a paragraph) as one chunk. As chunk-level filters, we employ the sentence-level filters from BigScience ROOTS (Laurençon et al., 2022).
Sentence-Level Filters Some sentence-level filters are based on the notion of a word: we use whitespace tokenization whenever possible, or resort to a pretrained SentencePiece tokenizer (Kudo and Richardson, 2018).
(SF1) Character repetition. If the ratio of repeated characters in a sentence is too high, it is likely that the sentence does not have enough textual content.
(SF2) Word repetition. A high ratio of repeated words indicates non-useful repetitive content.
(SF3) Special characters. Sentences with a high ratio of special characters are likely to be crawling artifacts or computer code.
(SF4) Insufficient number of words. Since training language models requires enough context, very small chunks of text are not useful.
(SF5) Deduplication. If two sentences are identical after eliminating punctuation and white space, one is removed.
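The sketch below illustrates how filters SF1-SF5 could be implemented. It assumes whitespace tokenization and uses illustrative thresholds; the actual ROOTS filters compute the ratios somewhat differently and use per-language settings.

```python
import re
import string

PUNCT = set(string.punctuation)


def char_repetition_ratio(text: str) -> float:
    # SF1: fraction of characters that are part of a run of identical characters.
    repeated = sum(len(m.group(0)) for m in re.finditer(r"(.)\1+", text))
    return repeated / max(len(text), 1)


def word_repetition_ratio(words) -> float:
    # SF2: 1 - (unique words / total words); high values mean repetitive content.
    return 1.0 - len(set(words)) / max(len(words), 1)


def special_char_ratio(text: str) -> float:
    # SF3: share of characters that are neither alphanumeric nor whitespace.
    special = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return special / max(len(text), 1)


def dedup_key(text: str) -> str:
    # SF5: key used for deduplication (punctuation and whitespace removed).
    return "".join(c for c in text if not c.isspace() and c not in PUNCT)


def keep_chunk(text: str, seen: set) -> bool:
    words = text.split()
    if len(words) < 5:                       # SF4: too little context
        return False
    if char_repetition_ratio(text) > 0.3:    # SF1
        return False
    if word_repetition_ratio(words) > 0.5:   # SF2
        return False
    if special_char_ratio(text) > 0.4:       # SF3
        return False
    key = dedup_key(text)
    if key in seen:                          # SF5
        return False
    seen.add(key)
    return True
```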

Corpus-Level Filters
(CF1) Language script mismatch. In case of a mismatch between language and script, the corpus is removed; e.g., text written in Arabic script and labeled as Chinese is unlikely to actually be Chinese.
(CF2) Perplexity mismatch. For each language-script L1, we find its nearest neighbor language-script L2: the language-script with the lowest perplexity divergence (§3.3). If L1 and L2 are not in the same typological family, we check L1/L2 manually and take appropriate action such as removing the corpus (e.g., if it is actually English) or correcting the ISO code assigned to the corpus.

Training Data: Glot500-c
Among the 2000+ language-scripts we collected data for, after cleaning, most languages have too little data for pretraining language models. It is difficult to quantify the minimum amount needed for pretraining. Therefore, we pick a relatively high "safe" threshold, 30,000 sentences, for inclusion of language-scripts in model training. This allows us to train the model effectively and cover many low-resource languages. Table 1 gives Glot500-c statistics. See §B for a list of language-scripts. We train Glot500-m on Glot500-c, with the addition of Wikipedia data for head languages to mitigate catastrophic forgetting. We divide the corpus for each language into train/dev/test, reserving 1000 sentences each for dev and test.

Vocabulary Extension

We use SentencePiece (Kudo, 2018) to train a tokenizer with a vocabulary size of 250K on Glot500-c. We sample data from different language-scripts according to a multinomial distribution with α = 0.3. The amount we sample for head languages is the same as for the tail languages with the lowest amount; this favors tail languages, since head languages are already well learned by XLM-R. We merge the obtained tokens with XLM-R's vocabulary. About 100K of the new tokens were in fact old tokens, i.e., already part of XLM-R's vocabulary. We take the probabilities of the (genuinely) new tokens directly from SentencePiece. After adding the 151K new tokens to XLM-R's vocabulary (which has size 250K), the vocabulary size of Glot500-m is 401K. Another approach to merging vocabularies is to calculate probabilities of existing and new tokens over a mixture of the original XLM-R training corpus and Glot500-c (Chung et al., 2020). For head languages, the percentage of changed tokens using the new tokenizer compared to the original tokenizer ranges from 0.2% to 50%. However, we found no relationship between the percentage of changed tokens and the change in performance on downstream tasks. Thus, tokenization had little effect in our experiments.
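The following sketch illustrates the merging step of vocabulary extension with the HuggingFace tokenizer API. The list of new pieces is a toy placeholder (in the real setup it comes from the 250K-piece SentencePiece model trained on Glot500-c), and the new token embeddings are simply randomly initialized and then learned during continued pretraining, which is one common approach rather than necessarily the exact procedure used for Glot500-m.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Placeholder for pieces produced by SentencePiece on Glot500-c
# ("▁" marks a word-initial piece, as in XLM-R's own vocabulary).
new_pieces = ["▁bonjou", "▁saluton", "▁dunyo", "lɔ́ɔ̀"]

# add_tokens skips pieces already present in the vocabulary, so only genuinely
# new tokens enlarge it (in the paper: ~151K out of the 250K trained pieces).
num_added = tokenizer.add_tokens(new_pieces)

# Resize the embedding matrix; rows for the added tokens are newly initialized.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```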

Continued Pretraining
We create Glot500-m by continued pretraining of XLM-R-B (XLM-R-base) with the MLM objective. The optimizer is Adam with betas (0.9, 0.999) and an initial learning rate of 5e-5. Each training step contains a batch of 384 training samples randomly picked from all language-scripts. The sampling strategy across language-scripts is the same as for vocabulary extension (§4.1). We save checkpoints every 10K steps and select the checkpoint with the best average performance on downstream tasks by early stopping. Table 2 lists the respective sizes of XLM-R-B, XLM-R-L and Glot500-m. Glot500-m and XLM-R-B have the same size except that Glot500-m has a larger vocabulary (§4.1). We train Glot500-m on a server with eight NVIDIA RTX A6000 GPUs for two weeks.
Similar to XLM-R, we concatenate sentences of a language-script and feed them as a continuous stream to the tokenizer. The resulting output is then divided into chunks of 512 tokens and fed to the model.
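A minimal sketch of the continued pretraining setup with the HuggingFace Trainer is shown below. The stated hyperparameters (Adam with default betas, learning rate 5e-5, 384 samples per step, checkpoints every 10K steps) are taken from the text; the tokenizer choice, toy dataset, masking probability (the standard 15%) and step budget are illustrative assumptions.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# In the real setup, the vocabulary-extended tokenizer from Section 4.1 is
# used instead of the original XLM-R tokenizer.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.resize_token_embeddings(len(tokenizer))

# Toy stand-in for the real corpus: in practice the input is a stream of
# 512-token chunks sampled across language-scripts (multinomial, alpha = 0.3).
texts = ["Bonjou tout moun.", "Salom, dunyo!"]
train_dataset = [tokenizer(t, truncation=True, max_length=512) for t in texts]

# Dynamic masking with the standard 15% masking probability (an assumption).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="glot500-continued-pretraining",
    learning_rate=5e-5,                # Adam with default betas (0.9, 0.999)
    per_device_train_batch_size=48,    # 48 x 8 GPUs = 384 samples per step
    save_steps=10_000,                 # checkpoint every 10K steps
    max_steps=100_000,                 # illustrative; best checkpoint chosen by early stopping
)

trainer = Trainer(model=model, args=args,
                  data_collator=collator, train_dataset=train_dataset)
trainer.train()
```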

Experimental Setup
For most tail languages, there are no manually labeled evaluation data. We therefore adopt a mixed evaluation strategy: based partly on human labels, partly on evaluation methods that are applicable to many languages without requiring gold data. Table 3 lists all our evaluation tasks.

Table 3: Evaluation tasks and measures. |head|/|tail|: number of head/tail languages per task. Sent. = Sentence.
Perplexity Following Salazar et al. (2020), we calculate pseudoperplexity (PPPL) over the held-out test set. PPPL is based on masking tokens one by one (not left to right). Salazar et al. (2020) give evidence that PPPL is a better measure of linguistic acceptability compared to standard left-to-right perplexity.
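The sketch below shows how pseudo-perplexity can be computed: each token is masked in turn and its log-probability under the masked LM is accumulated. The model name and the unbatched loop are illustrative; the real evaluation runs over the held-out test set of every language-script.

```python
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base").eval()


def pseudo_perplexity(sentence: str) -> float:
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    nll, count = 0.0, 0
    # Skip the special tokens at positions 0 and -1 (<s>, </s>).
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nll -= log_probs[input_ids[i]].item()
        count += 1
    return math.exp(nll / max(count, 1))


print(pseudo_perplexity("Hello world, how are you?"))
```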

Roundtrip Alignment
For assessing the quality of multilingual representations for a broad range of tail languages without human gold data, we adopt roundtrip evaluation (Dufter et al., 2018). We first word-align sentences in a parallel corpus based on the multilingual representations of an LLM. We then start from a word w in a sentence in language L1, follow the alignment links to its translations in language L2, then the alignment links from L2 to L3 and so on, until in the end we follow alignment links back to L1. If this "roundtrip" gets us back to w, then it indicates that the LLM has similar representations for the meaning of w in languages L1, L2, L3, etc. In other words, the cross-lingual quality of representations is high. Vice versa, failures to get back to w are signs of poor multilingual representations.
We use SimAlign (Jalili Sabet et al., 2020) and align on the sub-word level on the Bible part of the test set, based on the representations of the LLM computed at Transformer layer 8, as suggested in the original paper. We use the intersection symmetrization method, in which each word in a sentence is aligned to at most one word in the other sentence.
As evaluation measure we compute the percentage of roundtrips that were successes, i.e., the roundtrip starts at w in L1 and returns back to w. For each language-script in the test set, we randomly select three language-scripts as intermediate points L2, L3, L4 in the roundtrip. Since the intermediate points influence the results, we run the experiment five times with different intermediate points and report the average. All models are evaluated with the same five sets of three intermediate languages.
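A minimal sketch of the roundtrip success measure follows. Alignments are assumed to be dicts mapping a token position in the source sentence to the set of aligned positions in the target sentence (as could be derived from SimAlign output); the paper's roundtrips use three intermediate languages (L1 to L2 to L3 to L4 and back to L1), while the toy example below uses two intermediates for brevity.

```python
from typing import Dict, List, Set

Alignment = Dict[int, Set[int]]


def roundtrip_success_rate(chain: List[Alignment], sent_len_l1: int) -> float:
    successes = 0
    for start in range(sent_len_l1):
        frontier = {start}
        for align in chain:                    # follow alignment links leg by leg
            next_frontier: Set[int] = set()
            for pos in frontier:
                next_frontier |= align.get(pos, set())
            frontier = next_frontier
        if start in frontier:                  # did we get back to the start token?
            successes += 1
    return successes / max(sent_len_l1, 1)


# Toy roundtrip L1 -> L2 -> L3 -> L1: tokens 0 and 1 survive, token 2 does not.
l1_l2: Alignment = {0: {1}, 1: {0}, 2: {2}}
l2_l3: Alignment = {0: {0}, 1: {1}}
l3_l1: Alignment = {0: {1}, 1: {0}}
print(roundtrip_success_rate([l1_l2, l2_l3, l3_l1], 3))  # 2/3
```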
Sequence Labeling We consider two sequence labeling tasks: Named Entity Recognition (NER) and Part-Of-Speech (POS) tagging. We use the WikiANN dataset (Pan et al., 2017) for NER and version v2.11 of Universal Dependencies (UD) (de Marneffe et al., 2021) for POS. Since training data does not exist for some languages, we finetune on English (with early stopping based on dev) and evaluate zero-shot transfer on all languages covered by WikiANN/UD. We set the learning rate to 2e-5 with Adam.

Sentence Retrieval

Following Hu et al. (2020), we use up to 1000 English-aligned sentences from Tatoeba (Artetxe and Schwenk, 2019) to evaluate sentence retrieval. We also use 500 English-aligned sentences from the Bible part of the test set. We find nearest neighbors using cosine similarity based on the average word embeddings in layer l = 8, following Jalili Sabet et al. (2020), and compute top-10 accuracy. For fair comparison, and because the architectures are the same, we do not optimize the hyperparameter l for Glot500-m and XLM-R-B.
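The sketch below illustrates this retrieval setup: sentences are represented by the average of their token embeddings from Transformer layer 8, and a query counts as correct if its English translation is among the 10 nearest neighbors by cosine similarity. The model name and batching are illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_hidden_states=True).eval()


def embed(sentences, layer=8):
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer]     # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)         # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)


def top10_accuracy(src_sents, tgt_sents):
    # src_sents[i] and tgt_sents[i] are translations of each other.
    src, tgt = embed(src_sents), embed(tgt_sents)
    sims = torch.nn.functional.cosine_similarity(
        src.unsqueeze(1), tgt.unsqueeze(0), dim=-1)    # (n_src, n_tgt)
    top10 = sims.topk(k=min(10, sims.size(1)), dim=1).indices
    gold = torch.arange(sims.size(0)).unsqueeze(1)
    return (top10 == gold).any(dim=1).float().mean().item()
```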
Text Classification We evaluate on Taxi1500 (Ma et al., 2023). It provides gold data for text classification with six classes in a large number of languages, of which Glot500-m supports 363. We finetune on English (with early stopping on dev) and evaluate zero-shot on the test sets of the target languages. Learning rate: 2e-5, batch size: 16 (following Ma et al. (2023)).

Experiments
In this section, we discuss aggregated results. For detailed results, see §D and §E.

Results

Table 4 summarizes our results. Glot500-m outperforms XLM-R-B on all tasks, for both head and tail languages, except for POS on head languages. That Glot500-m outperforms XLM-R-B is expected for tail languages (i.e., those not covered by XLM-R). For these languages the improvement margin is large. But the outperformance is counterintuitive for head languages (those covered by XLM-R), since Glot500-m has the same number of (non-embedding) parameters as XLM-R-B while the number of covered languages has greatly increased, leaving less capacity per language. There are a few possible explanations. First, XLM-R may be undertrained, and the inclusion of more head language training data may improve their representations. Second, having more languages may improve multilinguality by allowing languages to synergize and enhance each other's representations and cross-lingual transfer. Third, there may be languages similar to head languages among the tail languages, which in turn can aid the head languages.

The gap between Glot500-m and the baselines for tail languages in sequence labeling is smaller. These tasks do not require as deep an understanding of language and thus transfer from head to tail languages is easier through shared tokens.
Glot500-m also outperforms XLM-R-L for tail languages on all tasks and on three of the tasks for head languages. This suggests that scaling up model size is not the only way to improve. We can also improve the quality of multilingual LLM representations by increasing the number of languages.

Figure 1: Progression of training for sentence retrieval and sequence labeling. x-axis: epochs. The improvement is fast in the beginning for tail languages, then gets slower and reaches a plateau after around 50 epochs. This pattern is partially observed for head languages.

Language Coverage
We compare the pseudoperplexity of Glot500-m and XLM-R-B; for this comparison, we use word-level normalization. For 69 of the head languages, Glot500-m performs worse than XLM-R-B, which is expected since Glot500-m's training data is small for these languages. Glot500-m performs better than XLM-R-B for 420 of the tail languages. There are eight tail languages for which Glot500-m performs worse than XLM-R-B. Five of them are tail languages with a similar head language where the two languages share a macrolanguage: ekk/Standard Estonian (est/Estonian is a head language), aln/Gheg Albanian (sqi/Albanian), nob/Norwegian Bokmal (nor/Norwegian), hbs/Serbo-Croatian (srp/Serbian), lvs/Standard Latvian (lav/Latvian). Since XLM-R-B's pretraining corpus is large for the five head languages, its performance is good for the close tail languages.
The other three languages all have a unique script: sat/Santali (Ol Chiki script), div/Dhivehi (Thaana script), iku/Inuktitut (Inuktitut syllabics). For these languages, XLM-R-B's tokenizer returns many UNK tokens since it is not trained on these scripts, resulting in an unreasonably optimistic estimate of pseudoperplexity by our implementation.
Glot500-m's token-level normalized pseudoperplexity ranges from 1.95 for lhu/Lahu to 94.4 for tok/Toki Pona. The average is 13.5, the median 10.6. We analyze the five language-scripts with the highest pseudoperplexity: tok_Latn, luo_Latn, acm_Arab, ach_Latn, and teo_Latn. tok/Toki Pona is a constructed language with a vocabulary size of 120. Its high perplexity could be due to either its small vocabulary size, which can result in a high level of ambiguity, or a lack of standardization and high diversity in writing styles among its writers. acm/Mesopotamian Arabic contains a large number of tweets in raw form. This may result in a larger number of difficult-to-predict tokens in the test set than for other languages.

Table 4: Evaluation of XLM-R base and large (XLM-R-B and XLM-R-L) and Glot500-m on six multilingual tasks across 5 seeds. Each number is an average over head languages, tail languages and all languages. See §D and §E for detailed results per task and language. Glot500-m outperforms XLM-R-B in all tasks for head (except for POS) and tail languages and XLM-R-L for tail languages. Best result per task/language set is in bold.

Number of language-scripts for which each model has better pseudoperplexity:
                      head languages   tail languages
Glot500-m is better         37              420
XLM-R-B is better           69                8
luo/Luo, ach/Acoli and teo/Teso are related Nilotic languages spoken in Kenya, Tanzania, Uganda and South Sudan. They are tonal languages. However, in Glot500-c, the tones are removed, as simple Latin letters without diacritics are used to write these languages. This may increase ambiguity and result in high model perplexity. It is an open question what the right tradeoff is between text normalization (including removal of diacritics), which increases ambiguity on the one hand, and keeping the original text as unchanged as possible on the other, which makes tokenization more difficult and can result in words being tokenized into sequences of characters.

Training Progression
To gain a deeper understanding of the training procedure, we evaluate Glot500-m on sequence labeling and sentence retrieval at intervals of 10,000 steps. Figure 1 shows that performance improves rapidly at the onset of training, but then the rate of improvement slows down. This trend is particularly pronounced for tail languages in sentence retrieval. In comparison, sequence labeling is relatively straightforward, with the baseline (XLM-R-B, epoch 0) achieving high performance by correctly transferring prevalent classes such as verb and noun through shared vocabulary, resulting in a smaller improvement of Glot500-m vs. XLM-R-B.
For sentence retrieval, we observe larger improvements for the Bible than for Tatoeba. This is likely due to the higher proportion of religious data in Glot500-c, compared to XLM-R's training data (i.e., CC100).
The average performance on downstream tasks reaches a maximum at 48 steps. We have taken a snapshot of Glot500-m at this stage and released it.

Analysis across Language-Scripts
To analyze the effect of language-scripts, we select five tail languages each with the largest and smallest gain when comparing Glot500-m vs. XLM-R-B for four downstream tasks: Sentence Retrieval Tatoeba, Sentence Retrieval Bible, NER and POS. Table 6 shows that Glot500-m improves languages with scripts not covered by XLM-R (e.g., div/Dhivehi, Thaana script, see §6.2) by a large margin since XLM-R simply regards the uncovered scripts as unknown tokens and cannot compute meaningful representations for the input. The large amount of data we collected in Glot500-c also contributes to the improvement for tail languages, e.g., for tat_Cyrl (Tatar) in Sentence Retrieval Tatoeba and mlt_Latn (Maltese) in POS. See §6.7 for a detailed analysis of the effect of corpus size.
On the other hand, Glot500-m achieves just comparable or even worse results for some language-scripts. We see at least three possible explanations. (i) As discussed in §6.2, some tail languages (e.g., nob/Norwegian Bokmal) are close to a head language (e.g., nor/Norwegian), so Glot500-m has no advantage over XLM-R-B. (ii) A language is at the low end of the corpus size range we consider (i.e., 30,000 sentences). This is the case for xav_Latn, Xavánte. (iii) Some languages are completely distinct from all other languages in Glot500-c and thus receive no support from any similar language. An example is mau_Latn, Huautla Mazatec. Glot500-m has a much harder time learning good representations in these cases. Table 7 compares Sentence Retrieval Bible performance of XLM-R-B vs. Glot500-m for six languages with two scripts. Unsurprisingly, XLM-R performs much better for a language-script it was pretrained on ("head") than for one it was not ("tail"). If we collect enough data for the script not covered by XLM-R, we can improve the performance of the corresponding language, even surpassing the language-script covered by XLM-R. For languages with two scripts not covered by XLM-R, the performance is better for the script for which we collect a larger corpus. For example, kaa_Cyrl (Kara-Kalpak) has about three times as much data as kaa_Latn. This explains why kaa_Cyrl outperforms kaa_Latn by 30%. In contrast, Dufter and Schütze (2020) found that, after training a multilingual model with two scripts for English (natural English and "fake English"), the model performed well at zero-shot transfer if the capacity of the model was of the right size (i.e., not too small, not too large). Our experiments with real data show the complexity of the issue: even if there is a "right" size for an LLM that supports both full acquisition of languages and multilingual transfer, this size is difficult to determine and it may be different for different language pairs in a large horizontally scaled model like Glot500-m.

Analysis across Language Families
Table 8 compares Sentence Retrieval Bible performance of Glot500-m vs. XLM-R-B for seven language families that have ten or more language-scripts in Glot500-c. We assign languages to families based on Glottolog (http://glottolog.org/glottolog/family). Generally, XLM-R has better (resp. worse) performance the more (resp. fewer) language-scripts from a language family are represented in its training data; e.g., performance is better for indo1319 and worse for maya1287. The results suggest that Glot500-m's improvement over XLM-R is larger the better our training corpus Glot500-c's coverage of that family is.

Table 8: Average Sentence Retrieval Bible performance of Glot500-m and XLM-R-B for seven language families. The difference in coverage of a family by Glot500-m vs. XLM-R-B appears to be partially predictive of the performance difference. |L_G|: number of language-scripts from the family covered by Glot500-m. |L_X|: number of language-scripts from the family covered by XLM-R.

Effect of Amount of Training Data
We now examine Pearson's r correlation between pretraining corpus size and Glot500-m performance on zero-shot tasks. We focus on Sentence Retrieval Bible ( §5) since it supports more head and tail languages than any other task. We find that Pearson's r = .34, i.e., corpus size and performance are moderately, but clearly correlated. We suspect that the correlation is not larger because, in addition to corpus size of language l itself, corpus size of languages closely related to l is also an important factor (a similar finding for Norwegian is in §6.4.) We therefore also compute Pearson's r between (i) performance of language l on Sentence Retrieval Bible and (ii) joint corpus size of l and its k nearest neighbors (according to perplexity divergence, §3.3). In this case, Pearson's r = .44 (for both k = 3 and k = 4), indicating that the corpus size of nearest neighbor languages does play a role.
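A minimal sketch of this correlation analysis follows: Pearson's r between downstream performance and corpus size, optionally augmented by the corpus sizes of the k nearest-neighbor languages under the perplexity divergence of §3.3. The function name, variable names and toy data are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr


def size_performance_corr(sizes, scores, divergence=None, k=0):
    # sizes, scores: arrays of length n; divergence: (n, n) matrix of D values.
    sizes = np.asarray(sizes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if divergence is None or k == 0:
        return pearsonr(sizes, scores)[0]
    joint = []
    for i in range(len(sizes)):
        neighbors = np.argsort(divergence[i])[1:k + 1]   # skip the language itself
        joint.append(sizes[i] + sizes[neighbors].sum())
    return pearsonr(joint, scores)[0]


# Toy usage with three languages.
sizes = [40_000, 300_000, 1_200_000]
scores = [0.31, 0.52, 0.74]
divergence = np.array([[0.0, 2.1, 5.3],
                       [2.1, 0.0, 4.8],
                       [5.3, 4.8, 0.0]])
print(size_performance_corr(sizes, scores))                    # corpus size only
print(size_performance_corr(sizes, scores, divergence, k=1))   # size + nearest neighbor
```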

Support through Related Languages
Building on §6.7, there is another way to investigate the positive effect of closely related languages on performance: we compare performance (again on Sentence Retrieval Bible) of continued pretraining on just one language (we refer to this model as Glot+1) vs. on all 511 languages represented in Glot500-c (i.e., Glot500-m). Table 9 presents results for six language-scripts selected from various language families and suggests that some languages do not receive support from related languages (top three). In that case, Glot+1 can fully concentrate on learning the isolated language and does better than Glot500-m. Other languages (bottom three) do receive support from related languages. For example, Southern Quechua (quh) seems to receive support in Glot500-m from closely related Cuzco Quechua (quz), resulting in Glot500-m outperforming Glot+1.

Table 9: Performance on Sentence Retrieval Bible of continued pretraining on just one language (Glot+1) vs. on Glot500-c (Glot500-m). Glot500-m underperforms on the top three and outperforms on the bottom three. Our explanation is that the second group is supported by closely related languages in Glot500-c; e.g., for Southern Quechua (quh), Glot500-m also covers closely related Cuzco Quechua (quz). For the first group this is not the case; e.g., the Wa language (wbm) has no close relative in Glot500-c.

Conclusion and Future Work
We collect and clean Glot500-c, a large corpus of hundreds of usually neglected tail (i.e., long-tail) languages, and create Glot500-m, an LLM that is trained on Glot500-c and covers these languages.
We evaluate Glot500-m on six tasks that allow us to evaluate almost all languages. We observe large improvements for both head and tail languages compared to XLM-R. Our analysis shows that no single factor fully explains the quality of the representation of a language in a multilingual model. Rather, a combination of factors is important, including corpus size, script, "help" from related languages and the total capacity of the model. This work is the first to create a language model on a dataset of several hundreds of gigabytes and to make it publicly available for such a large and diverse number of low-resource languages. In future research, we would like to train larger models to further investigate the effect of model size, distill highly multilingual models for resource-efficient deployment, explore alternatives to continued pretraining and use models for more tail language downstream tasks.

Limitations
(1) We did not perform any comprehensive hyperparameter search, which would have further consolidated our results. This decision was made due to the high cost of training multiple models. (2) Compared to current very large models, Glot500-m is small. (3) Although we have tried to minimize the amount of noise in our data, some noise is still present.

Ethics Statement
There are two issues worth mentioning in regards to this project. First, it was not feasible for us to thoroughly examine the content of the data for all languages, thus we cannot confirm the absence of discrimination based on factors such as race or sexuality. The data was solely utilized as a textual corpus, and the content should not be interpreted as an endorsement by our team. If the model is subsequently utilized for generation, it is possible that the training data may be reflected in the generated output. However, addressing potential biases within the data is an area for future research. Second, it is important to note that while the data sources utilized in this study do not explicitly prohibit the reuse of data for research purposes, some sources do have copyright statements indicating that such use is permissible while others do not. Additionally, certain sources prohibit the redistribution of data. As such, data from these sources is omitted from the published version of Glot2000-c.

A N-gram LMs and Language Divergence
Perplexity and Language Divergence. Perplexity measures how well a model predicts a sample of test data. Assuming the test data contains a sequence of characters S = ch_1, ch_2, ..., ch_T, the perplexity (PP) of S given an n-gram character-level language model M is computed as follows:

PP(S, M) = P_M(ch_1, ch_2, \ldots, ch_T)^{-\frac{1}{T}}

where P_M(ch_t \mid ch_1^{t-1}) is computed by dividing the observed frequency (C) of ch_1^{t-1} ch_t by the observed frequency of ch_1^{t-1} in M's training data:

P_M(ch_t \mid ch_1^{t-1}) = \frac{C(ch_1^{t-1} ch_t)}{C(ch_1^{t-1})}

Given the definition of perplexity, we can determine how well a language model trained on language L_1 predicts the test text of language L_2 and vice versa. The divergence between two languages is computed as the maximum of the perplexity values in both directions. Two reasons lead to the use of the maximum: first, a symmetrical divergence is required; second, languages differ in their complexity, so one direction of computing perplexity may result in a much lower perplexity than the other, which makes comparing perplexity results difficult. As an example, the Kuanua language (ksd_Latn) has short words and a simple structure, which results in 3-gram models getting lower perplexity on its text compared to other languages. The lower the perplexity, the smaller the divergence between languages. The divergence D between languages L_i and L_j, with trained language models M_{L_z} and test texts S_{L_z}, where L_z is the corresponding language, is computed as follows:

D(L_i, L_j) = \max\left(PP(S_{L_i}, M_{L_j}),\; PP(S_{L_j}, M_{L_i})\right)

Runs and Data. The data used to train and test the character-level n-gram models is the same data used for training and testing Glot500-m. The training of the models was limited to 100,000 sentences per language-script. We use the KenLM library (Heafield, 2011) to build the n-gram models. This library uses interpolated modified Kneser-Ney smoothing to estimate unseen n-grams. Our evaluation has been performed over 7 n-gram models (3 ≤ n ≤ 9).

Baseline and Evaluation. Language family trees were used as a baseline for evaluating the divergence measures of the proposed approach. We obtained language family tree data from the Ethnologue online version (Eberhard et al., 2022). For each language, the family tree follows the general order from the largest typological language family group to the smallest. There is only one family tree for each language in the baseline data. Nodes in the family tree represent typological language family groups. Each node has only one parent, so if a node is common to the family trees of two languages, its parent is also common. We evaluate our perplexity method on the following binary classification task: do the majority of a language L_z's k nearest neighbors belong to the same typological language family group as L_z? For example, consider two languages L_i and L_j whose family trees share the first two levels: they belong to the same typological family group at family tree levels l ∈ {1, 2}, but not at levels l = 3 and higher.
Result. When it comes to language families, the majority of studies only refer to the largest typological language family group (level l = 1). Here, we also assess our methodology at other levels. The classification accuracy of the 3-gram model for k ∈ {1, 3, 7, 13, 21} and l ∈ {1, 2, 3, max} is shown in Table 10. In cases where the maximum level of a tree is less than the l parameter, the maximum level for that language is used. Languages without a family or with no other family member in our data are excluded. We only report the 3-gram model results, as it achieves the best results in most configurations among the n-gram models. With increasing l, the accuracy decreases, since more languages fall outside the same typological family. As k increases, the accuracy also decreases, because languages with faraway neighbors are included while the number of languages in the typological family group remains the same. Sometimes languages have many loan words from other languages because of geographical proximity or historical reasons (e.g., colonization), which makes them similar, under our method, to the languages they borrowed words from. However, they differ in their typological families, and our method fails in these cases. Aymara (macrolanguage: aym_Latn) and Quechua (macrolanguage: que_Latn), for example, had a great deal of contact and influence on each other, but they do not belong to the same typological group. Also, some of the typological families are not that large, which makes our results worse when k increases. This is the case, for instance, for the Tarascan typological family, which only has two members.

B Languages
The list of languages used to train Glot500-m with the amount of available data for each language is available in Tables 11, 12 and 13.
On Macrolanguages The presence of language codes that are supersets of other language codes within datasets is not uncommon (Kreutzer et al., 2022). This issue becomes more prevalent in extensive collections. Within the ISO 639-3 standard, these languages are referred to as macrolanguages. When confronted with macrolanguages, if it is not feasible to ascertain the specific individual language contained within a dataset, the macrolanguage code is retained. Consequently, it is possible that in Glot2000-c and Glot500-c both the corpora for the macrolanguage and its individual languages have been included.

D Results for Each Task and Language
We report the detailed results for all tasks and languages in the following tables.

E Perplexity Results for all Languages
Perplexity numbers for all languages are presented in the following tables.