How to Adapt Your Pretrained Multilingual Model to 1600 Languages

Pretrained multilingual models (PMMs) enable zero-shot learning via cross-lingual transfer, performing best for languages seen during pretraining. While methods exist to improve performance for unseen languages, they have almost exclusively been evaluated using amounts of raw text only available for a small fraction of the world’s languages. In this paper, we evaluate the performance of existing methods to adapt PMMs to new languages using a resource available for close to 1600 languages: the New Testament. This is challenging for two reasons: (1) the small corpus size, and (2) the narrow domain. While performance drops for all approaches, we surprisingly still see gains of up to 17.69% accuracy for part-of-speech tagging and 6.29 F1 for NER on average over all languages as compared to XLM-R. Another unexpected finding is that continued pretraining, the simplest approach, performs best. Finally, we perform a case study to disentangle the effects of domain and size and to shed light on the influence of the finetuning source language.


Introduction
Pretrained multilingual models (PMMs) are a straightforward way to enable zero-shot learning via cross-lingual transfer, thus eliminating the need for labeled data for the target task and language. However, downstream performance is highest for languages that are well represented in the pretraining data or linguistically similar to a well represented language. Performance degrades as representation decreases, with languages not seen during pretraining generally having the worst performance.
In the most extreme case, when a language's script is completely unknown to the model, zero-shot performance is effectively random. While multiple methods have been shown to improve the performance of transfer to underrepresented languages (cf. Section 2.3), previous work has evaluated them using unlabeled data from sources available for a relatively small number of languages, such as Wikipedia or Common Crawl, which cover 316 and 160 languages, respectively. Due to this low coverage, the languages that would most benefit from these methods are precisely those which do not have the necessary amounts of monolingual data to implement them as-is. To enable the use of PMMs for truly low-resource languages, where they can, e.g., assist language documentation or revitalization, it is important to understand how state-of-the-art adaptation methods act in a setting more broadly applicable to many languages.
In this paper, we ask the following question: Can we use the Bible, a resource available for roughly 1600 languages, to improve a PMM's zero-shot performance on an unseen target language? And, if so, which adaptation method works best? We investigate the performance of XLM-R (Conneau et al., 2020) when combined with continued pretraining (Chau et al., 2020), vocabulary extension (Wang et al., 2020), and adapters (Pfeiffer et al., 2020b), making the following assumptions: (1) the only text available in a target language is the New Testament, and (2) no annotated training data exists in the target language.
We present results on two downstream tasks, part-of-speech (POS) tagging and named entity recognition (NER), on a typologically diverse set of 30 languages, all of which are unseen during the pretraining of XLM-R. We find that, surprisingly, even though we use a small corpus from a narrow domain, most adaptation approaches improve over XLM-R's base performance, showing that the Bible is a valuable source of data for our purposes. We further observe that in our setting the simplest adaptation method, continued pretraining, performs best for both tasks, achieving gains of up to 17.69% accuracy for POS tagging and 6.29 F1 for NER on average across languages.
Additionally, we seek to disentangle the effects of two aspects of our experiments on downstream performance: the selection of the source language, and the restricted domain of the New Testament. Towards this, we conduct a case study focusing on three languages with Cyrillic script: Bashkir, Chechen, and Chuvash. In order to understand the effect of the choice of source language, we use a more similar language, Russian, as our source of labeled data. To explore the effect of the New Testament's domain, we conduct our pretraining experiments with an equivalent amount of data sampled from the Wikipedia in each language. We find that changing the source language to Russian increases average baseline performance by 18.96 F1, and we achieve the highest results across all settings when using both Wikipedia and Russian data.

Background
Prior to the introduction of PMMs, cross-lingual transfer was often based on word embeddings (Mikolov et al., 2013). Bojanowski et al. (2017) present monolingual embeddings for 294 languages using Wikipedia, succeeded by Grave et al. (2018), who present embeddings for 157 languages trained on additional data from Common Crawl. For cross-lingual transfer, monolingual embeddings can then be aligned using existing parallel resources, or in a completely unsupervised way (Artetxe et al., 2016; Artetxe et al., 2017). Although they use transformer-based models, Artetxe et al. (2020) also transfer in a monolingual setting. Another method for cross-lingual transfer involves multilingual embeddings, where languages are jointly learned as opposed to being aligned (Ammar et al., 2016; Artetxe and Schwenk, 2019). For a more in-depth look at cross-lingual word embeddings, we refer the reader to Ruder et al. (2019). While the above works deal with generally improving cross-lingual representations, task-specific cross-lingual systems often show strong performance in a zero-shot setting. For POS tagging, in a setting similar to ours, Eskander et al. (2020) achieve strong zero-shot results by using unsupervised projection (Yarowsky et al., 2001) with aligned Bibles. Recent work on cross-lingual NER includes Mayhew et al. (2017), who use dictionary translations to create target-language training data, as well as Xie et al. (2018), who use a bilingual dictionary in addition to self-attention. Bharadwaj et al. (2016) use phoneme conversion to aid cross-lingual NER in a zero-shot setting. More recently, Bari et al. (2020) propose a model using only monolingual data for each language, and Qi et al. (2020) propose a language-agnostic toolkit supporting NER for 66 languages. In contrast to these works, we focus on the improvements offered by adaptation methods for pretrained models on general tasks.

Pretrained Multilingual Models
PMMs can be seen as the natural extension of multilingual embeddings to pretrained transformer-based models. mBERT (Devlin et al., 2019) was the first PMM, covering the 104 languages with the largest Wikipedias. It uses a 110k byte-pair encoding (BPE) vocabulary (Sennrich et al., 2016) and is pretrained on both a next sentence prediction and a masked language modeling (MLM) objective. Languages with smaller Wikipedias are upsampled, and highly represented languages are downsampled. XLM (Lample and Conneau, 2019) is a PMM trained on 15 languages. XLM similarly trains on Wikipedia data, using a BPE vocabulary with 95k subwords, and up- and downsamples languages similarly to mBERT. XLM also introduces translation language modeling (TLM), a supervised pretraining objective where tokens are masked as for MLM, but parallel sentences are concatenated such that the model can rely on subwords in both languages for prediction. Finally, XLM-R is an improved version of XLM. Notable differences include the larger vocabulary of 250k subwords, created using SentencePiece tokenization (Kudo and Richardson, 2018), and the training data, which is taken from Common Crawl and is considerably larger than that of mBERT and XLM. XLM-R relies solely on MLM for pretraining and achieves state-of-the-art results on multiple benchmarks (Conneau et al., 2020). We therefore focus solely on XLM-R in our experiments.
Downstream Performance of PMMs While Pires et al. (2019) and Wu and Dredze (2019) show the strong zero-shot performance of mBERT, Wu and Dredze (2020) shed light on the difference in performance between well and poorly represented languages after finetuning on target-task data. Muller et al. (2020) observe varying zero-shot performance of mBERT on languages not present in its pretraining data. They group them into 'easy' languages, on which mBERT performs well without any modification, 'medium' languages, on which mBERT performs well after additional pretraining on monolingual data, and 'hard' languages, on which mBERT performs poorly even after modification. They additionally note the importance of script, finding that transliterating into Latin offers improvements for some languages. As transliteration involves language-specific tools, we consider it out of scope for this work and leave the investigation of how best to utilize transliteration for future work. Lauscher et al. (2020) focus on PMM finetuning and find that, for unseen languages, gathering labeled data for few-shot learning may be more effective than gathering large amounts of unlabeled data.
Additionally, Chau et al. (2020), Wang et al. (2020), and Pfeiffer et al. (2020b) present the adaptation methods whose performance we investigate here in a setting where only the Bible is available. We give a general overview of these methods in the remainder of this section, before describing their application in our experiments in Section 3.

Adaptation Methods
Continued Pretraining In a monolingual setting, continued pretraining of a language representation model on an MLM objective has been shown to help downstream performance on tasks involving text from a domain distant from the pretraining corpora (Gururangan et al., 2020). In a multilingual setting, it has been found that, given a target language, continued pretraining on monolingual data from that language can lead to improvements on downstream tasks (Chau et al., 2020; Muller et al., 2020).
Vocabulary Extension Many pretrained models make use of a subword vocabulary, which strongly reduces the issue of out-of-vocabulary tokens. However, when the pretraining and target-task domains differ, important domain-specific words may be over-fragmented, which reduces performance. In the monolingual setting, extending the vocabulary with in-domain tokens has been shown to yield performance gains. A result similar to that for continued pretraining holds in the multilingual setting: downstream performance on an underrepresented language benefits from additional tokens in the vocabulary, which allow for a better representation of that language. Wang et al. (2020) find that extending the vocabulary of mBERT with new tokens and training on a monolingual corpus yields improvements for a target language, regardless of whether the language was seen or unseen. Chau et al. (2020) report similar results and introduce tiered vocabulary augmentation, where new embeddings are learned with a higher learning rate. While both approaches start from a random initialization, they differ in the number of new tokens added: Wang et al. (2020) limit new subwords to 30,000, while Chau et al. (2020) set a limit of 99, selecting the subwords which most reduce the number of unknown tokens while keeping the subword-to-token ratio similar to that of the original vocabulary.
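The token selection step shared by these approaches can be sketched as follows; the toy vocabularies are illustrative, and the 30,000 cap follows Wang et al. (2020):

```python
def select_new_subwords(candidate_vocab, base_vocab, limit=30000):
    # Keep only subwords the base model's vocabulary lacks, preserving
    # the order of the newly trained tokenizer, then cap the count.
    new = [tok for tok in candidate_vocab if tok not in base_vocab]
    return new[:limit]

# Toy example: pieces from a hypothetical target-language tokenizer.
base = {"_the", "_of", "believ", "able"}
candidate = ["_the", "_Iisus", "believ", "_aminj"]
print(select_new_subwords(candidate, base))  # ['_Iisus', '_aminj']
```

The surviving subwords are then appended to the base vocabulary and given freshly initialized embedding rows, which are trained with MLM on the target-language corpus.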
Adapters Adapters are layers with a small number of parameters, injected into models to help transfer learning (Rebuffi et al., 2017). Houlsby et al. (2019) demonstrate the effectiveness of task-specific adapters in comparison to standard finetuning. Pfeiffer et al. (2020b) present invertible adapters and MAD-X, a framework utilizing them along with language and task adapters for cross-lingual transfer. After freezing the model weights, invertible and language adapters for each language, including English, are trained together using MLM. The English-specific adapters are then used along with a task adapter to learn from labeled English data. For zero-shot transfer, the invertible and language adapters are replaced with those trained on the target language, and the model is subsequently evaluated.
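As a rough illustration of the bottleneck architecture these adapters share, a single adapter layer down-projects the hidden state, applies a nonlinearity, up-projects, and adds a residual connection. The dimensions and the zero initialization of the up-projection below are illustrative, not taken from the paper:

```python
import random

def make_adapter(hidden, bottleneck, seed=0):
    # Small random down-projection; up-projection initialized to zero
    # so the adapter starts out as the identity function.
    rng = random.Random(seed)
    down = [[rng.uniform(-0.01, 0.01) for _ in range(bottleneck)]
            for _ in range(hidden)]
    up = [[0.0] * hidden for _ in range(bottleneck)]
    return down, up

def adapter_forward(x, adapter):
    down, up = adapter
    # Down-project, ReLU, up-project, then add the residual.
    h = [sum(x[i] * down[i][j] for i in range(len(x)))
         for j in range(len(down[0]))]
    h = [max(0.0, v) for v in h]
    out = [sum(h[j] * up[j][i] for j in range(len(h)))
           for i in range(len(x))]
    return [xi + oi for xi, oi in zip(x, out)]
```

Only these small matrices are updated during language-adapter training; the frozen transformer weights are shared across all languages.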

Data and Languages
Unlabeled Data We use the Johns Hopkins University Bible Corpus (JHUBC) from McCarthy et al. (2020), which covers 1611 languages, providing verse-aligned translations of both the Old and New Testament. However, the New Testament is much more widely translated: 86% of translations do not include the Old Testament. We therefore limit our experiments to the New Testament, which amounts to about 8000 verses in total, although specific languages may not have translations of all verses. For the 30 languages we consider, this averages around 402k subword tokens per language. The specific versions of the Bible we use are listed in Table 5.
Labeled Data For NER, we use the splits of Rahimi et al. (2019), which are created from the WikiAnn dataset (Pan et al., 2017). For POS tagging, we use data taken from the Universal Dependencies Project (Nivre et al., 2020). As XLM-R utilizes a subword vocabulary, we perform sequence labeling by assigning labels to the last subword token of each word. For all target languages, we only finetune on labeled data in English.
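The last-subword labeling scheme can be sketched as follows; the three-character toy tokenizer stands in for XLM-R's SentencePiece model, and -100 is the ignore index commonly used to mask subwords out of the loss:

```python
def align_labels_to_last_subword(words, labels, tokenize):
    # Assign each word's label to its final subword; earlier subwords
    # get the ignore index so the loss function skips them.
    IGNORE = -100
    subwords, sub_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        subwords.extend(pieces)
        sub_labels.extend([IGNORE] * (len(pieces) - 1) + [label])
    return subwords, sub_labels

# Toy tokenizer splitting every 3 characters.
toy = lambda w: [w[i:i + 3] for i in range(0, len(w), 3)]
print(align_labels_to_last_subword(["Chuvash", "is"], ["B-LANG", "O"], toy))
```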
Language Selection To select the languages for our experiments, we first compile a list of all languages for which both a Bible and a test set for either downstream task are available. We then filter out those languages present in the pretraining data of XLM-R. See Table 1 for a summary of the languages, their attributes, and the downstream task we use them for.

PMM Adaptation Methods
Our goal is to analyze state-of-the-art PMM adaptation approaches in a true low-resource setting, where the only raw text available comes from the New Testament and no labeled target-language data exists at all. We now describe our implementation of these methods. We focus on the Base version of XLM-R (Conneau et al., 2020) as our baseline PMM.

Continued Pretraining
We consider three models based on continued pretraining. In the simplest case, +MLM, we continue training XLM-R with an MLM objective on the available verses of the New Testament. Additionally, as Bible translations are a parallel corpus, we also consider a model, +TLM, trained using translation language modeling. Finally, following the findings of Lample and Conneau (2019), we also consider a model using both TLM and MLM, +{M|T}LM. For this model, we alternate between batches consisting solely of verses from the target Bible and batches consisting of aligned verses of the target-language and source-language Bibles. For NER, we pretrain +MLM and +TLM models for 40 epochs, and +{M|T}LM models for 20 epochs. For POS tagging, we follow a similar pattern, training +MLM and +TLM for 80 epochs, and +{M|T}LM for 40 epochs.
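The data construction for these variants can be sketched as follows; the separator token and the toy inputs are illustrative:

```python
def tlm_examples(target_verses, source_verses, sep="</s>"):
    # TLM: concatenate aligned verses so a masked token can be
    # predicted from context in either language.
    return [f"{t} {sep} {s}" for t, s in zip(target_verses, source_verses)]

def mixed_schedule(mono_batches, parallel_batches):
    # +{M|T}LM: alternate monolingual (MLM) batches with parallel
    # (TLM) batches during continued pretraining.
    schedule = []
    for m, p in zip(mono_batches, parallel_batches):
        schedule.append(("mlm", m))
        schedule.append(("tlm", p))
    return schedule

print(tlm_examples(["verse in target"], ["verse in source"]))
```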
Vocabulary Extension To extend the vocabulary of XLM-R, we implement the process of Wang et al. (2020), denoted +Extend. For each target language, we train a new SentencePiece (Kudo and Richardson, 2018) tokenizer on the Bible of that language with a maximum vocabulary size of 30,000. To prevent adding duplicates, we filter out any subword already present in the vocabulary of XLM-R. We then add pieces representing the new subwords to the tokenizer of XLM-R and enlarge XLM-R's embedding matrix accordingly, using a random initialization for the new embeddings. Finally, we train the embeddings using MLM on the Bible. For NER, we train +Extend models for 40 epochs, and for POS tagging, we train for 80 epochs.
Adapters For adapters, we largely follow the full MAD-X framework (Pfeiffer et al., 2020b), using language, invertible, and task adapters. This is denoted as +Adapters. To train task adapters, we download language and invertible adapters for the source language from AdapterHub (Pfeiffer et al., 2020a). We train a single task adapter for each task, and use it across all languages. We train language and invertible adapters for each target language by training on the target Bible with an MLM objective. As before, for NER we train for 40 epochs, and for POS we train for 80 epochs.

Hyperparameters and Training Details
For finetuning, we train using one Nvidia V100 32GB GPU, and use an additional GPU for the adaptation methods. Experiments for NER and POS tagging take around 1 and 2 hours respectively, totaling 165 training hours and 21.38 kg CO2eq emitted (Lacoste et al., 2019). All experiments are run using the Huggingface Transformers library (Wolf et al., 2020). We limit sequence lengths to 256 tokens.
We select initial hyperparameters for finetuning by using the English POS development set. We then fix all hyperparameters other than the number of epochs, which we tune using the 3 languages which have development sets, Ancient Greek, Maltese, and Wolof. We do not use early stopping. For our final results, we finetune for 5 epochs with a batch size of 32, and a learning rate of 2e-5. We use the same hyperparameters for both tasks.
For each task and adaptation approach, we search over {10, 20, 40, 80} epochs, and select the epoch which gives the highest average performance across the development languages. We use the same languages as above for POS. For NER we use 4 languages with varying baseline performances: Bashkir, Kinyarwanda, Maltese, and Scots. We pretrain with a learning rate of 2e-5 and a batch size of 32, except for +Adapters, for which we use a learning rate of 1e-4 (Pfeiffer et al., 2020b).
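This epoch selection amounts to a small grid search; a sketch with made-up development scores:

```python
def best_epoch(dev_scores):
    # dev_scores maps an epoch count to one score per development
    # language; pick the epoch with the highest average score.
    return max(dev_scores,
               key=lambda e: sum(dev_scores[e]) / len(dev_scores[e]))

# Hypothetical F1 scores for two development languages.
scores = {10: [60.0, 55.0], 20: [64.0, 58.0], 40: [63.0, 61.0], 80: [62.0, 60.0]}
print(best_epoch(scores))  # 40
```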

Results
We present results for NER and POS tagging in Tables 2 and 3, respectively. We additionally provide plots of the methods' performances as compared to the XLM-R baseline in Figures 2 and 3.
NER Considering average performance, the continued pretraining approaches (+MLM, +TLM, +{M|T}LM) perform best, with improvements of 3.93 to 6.29 F1 over XLM-R. Both +Extend and +Adapters obtain a lower average F1 than the XLM-R baseline, which shows that they are not a good choice in our setting: either the size or the domain of the Bible causes them to perform poorly. Focusing on the script of the target language (cf. Table 1), the average performance gain across all models is higher for Cyrillic languages than for Latin languages. Thus, relative to the source-language script, the performance gain is higher for target languages written in a more distant script. For the approaches which introduce new parameters, +Extend and +Adapters, performance increases only for Cyrillic languages and decreases for all others. For the continued pretraining approaches, in contrast, we find a performance increase for all scripts. Looking at Figure 2, we see that the lower the baseline F1, the larger the improvement the adaptation methods provide, with all methods increasing performance for the language with the weakest baseline. As baseline performance increases, the benefit provided by these methods diminishes, and all methods underperform the baseline for Scots, the language with the highest baseline performance. We hypothesize that at higher levels of baseline performance, the small size and narrow domain of the Bible provide little information beyond what the model already has.
POS Tagging Our POS tagging results largely follow the same trend as those for NER, with continued pretraining methods achieving the highest increase in performance: between 15.81 and 17.61 points. Also following NER and as shown in Figure 3, the largest performance gain can be seen for languages with a low baseline performance, and, as the latter increases, the benefits obtained from adaptation become smaller.
However, unlike for NER, all methods show a net increase in performance, with +Adapters, the lowest-performing adaptation method, achieving a gain of 9.01 points. We hypothesize that a likely reason is the domain and style of the Bible: while it may be too restrictive to significantly boost downstream NER performance, it is still a linguistically rich resource for POS tagging, a task that is less sensitive to domain in general.
Additionally, there is a notable outlier language, Coptic, on which no model performs better than random choice (which corresponds to 6% accuracy). This is because the script of this language is almost completely unknown to XLM-R, and practically all non-whitespace subwords map to the unknown token: non-whitespace tokens make up 50% of all tokens, and 95% of them are unknown. While +Extend solves this issue, we believe that for a language with a completely unseen script, the Bible alone is not enough to learn representations usable in a zero-shot setting.

Case Study
As previously stated, using the Bible as the corpus for adaptation is limiting in two ways: the extremely restricted domain as well as the small size.
To separate the effects of these two aspects, we repeat our experiments with a different set of data: we sample sentences from the Wikipedia of each target language to simulate a corpus of similar size to the Bible which is not restricted to the Bible's domain or content. To further minimize the effect of domain, we focus solely on NER, such that the domain of the adaptation data matches that of the target task. Additionally, we investigate how the downstream gains of these adaptation methods change when the source language is more similar to the target language. To this end, we focus our case study on three languages written in Cyrillic: Bashkir, Chechen, and Chuvash.
We break the case study up into three settings, depending on the data used. In the first setting, we change the language of our labeled training data from English to Russian. While Russian is not necessarily similar to the target languages or mutually intelligible with them, we consider it more similar than English: Russian is written in the same script as the target languages, and there is a greater likelihood of lexical overlap and loanwords. In the second setting, we pretrain using Wikipedia, and in the third setting we use both Wikipedia data and labeled Russian data. To create our Wikipedia training data, we extract sentences with WikiExtractor (Attardi, 2015) and split them with the Moses sentence splitter (Koehn et al., 2007). To create a comparable training set for each language, we first calculate the total number of subword tokens found in the New Testament and sample sentences from Wikipedia until we have an equivalent amount. In the setting where we use data from the New Testament and labeled Russian data, we additionally substitute the English Bible with the Russian Bible for +TLM and +{M|T}LM. When using Wikipedia, we omit results for +TLM and +{M|T}LM, as they rely on a parallel corpus.
Table 4: Case study: Cyrillic NER (F1). Setting describes the source of data for adaptation, either the (B)ible or (W)ikipedia, as well as the language of the finetuning data, (E)nglish or (R)ussian.
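The size-matched sampling of Wikipedia sentences can be sketched as follows; `count_subwords` is a hypothetical stand-in for the model tokenizer's length function, and the sentences are assumed to be pre-shuffled:

```python
def sample_to_budget(sentences, count_subwords, budget):
    # Take sentences until the subword count reaches the size of the
    # target language's New Testament, yielding a comparable corpus.
    sampled, total = [], 0
    for sent in sentences:
        if total >= budget:
            break
        sampled.append(sent)
        total += count_subwords(sent)
    return sampled

# Toy budget of 5 whitespace tokens.
print(sample_to_budget(["a b c", "d e", "f g"], lambda s: len(s.split()), 5))
```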

Results
We present the results of our case study in Table 4. In the sections below, we refer to the case study settings as described in the table caption.
Effects of the Finetuning Language We find that using Russian as the source language (the "Russian baseline"; B-R w/ XLM-R) increases performance over the English baseline (B-E w/ XLM-R) by 18.96 F1. Interestingly, all of the adaptation methods utilizing the Bible do poorly in this setting (B-R), with +MLM improving over the Russian baseline by only 1.09 F1 and all other methods decreasing performance. We hypothesize that when the adaptation data is limited in domain, as the source language approaches the target language in similarity, the language adaptation is mainly done in the finetuning step, and any performance gain from the unlabeled data is minimized. This is supported by the previous NER results, where we find that, when using English as the source language, the adaptation methods lead to a higher average performance gain over the baseline for Cyrillic languages, i.e., the more distant languages, than for Latin languages. The adaptation methods show a larger improvement when switching to Wikipedia data (W-R), with +MLM improving performance by 11.85 F1 over the Russian baseline. Finally, the performance of +Extend when using Russian labeled data is similar on average regardless of the adaptation data (B-R, W-R), but noticeably improves over the setting which uses Wikipedia and English labeled data.

Effects of the Domain Used for Adaptation
Fixing English as the source language and changing the pretraining domain from the Bible to Wikipedia (W-E) yields strong improvements, with +Adapters improving over the English baseline by 20.9 F1 and +MLM improving by 14.68 F1. However, we note that, while the average of +Adapters is higher than that of +MLM, this is due to higher performance on only a single language. When compared to the best performing pretraining methods that use the Bible (B-E), these methods improve by 11.29 F1 and 5.30 F1 respectively. When using both Wikipedia and Russian data, we see the highest overall performance, and +MLM increases over the English baseline by 30.81 F1 and the Russian baseline by 11.85 F1.

Limitations
One limitation of this work, and of other works involving a high number of languages, is task selection. While part-of-speech tagging and named entity recognition are important, they are both low-level tasks largely based on sentence structure, with no requirement for higher levels of reasoning, unlike tasks such as question answering or natural language inference. While XTREME (Hu et al., 2020) is a diverse benchmark covering such higher-level tasks, it is still limited to 40 languages, all of which have Wikipedia data available. Extending these benchmarks to truly low-resource languages by introducing datasets for these tasks would further motivate research on such languages and provide a more comprehensive evaluation of progress. Additionally, while the Bible is currently available in some form for 1611 languages, the available text for certain languages may differ in quantity and quality from the Bible text we use in our experiments. Therefore, although we make no language-specific assumptions, our findings may not fully generalize to all 1611 languages. Furthermore, this work analyzes the effects of adaptation methods for only a single multilingual transformer model. Although we make no model-specific assumptions in our methods, the set of unseen languages differs from model to model. Moreover, although we show improvements on the two tasks, we do not claim state-of-the-art results. In a low-resource setting, the best performance is often achieved through task-specific models. Similarly, translation-based approaches, as well as few-shot learning, may offer additional benefits over a zero-shot setting. We also do not perform an extensive analysis of the target languages, or of the selected source language for finetuning.
A better linguistic understanding of the languages in question would allow for a better selection of source language, as well as the ability to leverage linguistic features potentially leading to better results.
Finally, by using a PMM, we inherit all of that model's biases. The biases captured by word embeddings are well known, and recent work has shown that contextual models are not free of biases either (Caliskan et al., 2017; Kurita et al., 2019). The use of the Bible, and religious texts in general, may introduce additional biases. Lastly, we acknowledge the environmental impact of training models at the scale of XLM-R (Strubell et al., 2019).

Conclusion
In this work, we evaluate the performance of continued pretraining, vocabulary extension, and adapters on languages unseen by XLM-R in a realistic low-resource setting. Using only the New Testament, we show that continued pretraining is the best-performing adaptation approach, leading to gains of 6.29 F1 on NER and 17.69% accuracy on POS tagging. We therefore conclude that the Bible can be a valuable resource for adapting PMMs to unseen languages, especially when no other data exists. Furthermore, we conduct a case study on three languages written in Cyrillic script. Changing the source language to one more similar to the target languages reduces the effect of adaptation, but the relative performance of the adaptation methods is preserved. Changing the domain of the adaptation data to one more similar to the target task, while keeping its size constant, improves performance.

A Appendix
In Table 5, we provide the number of subwords created by the XLM-R tokenizer from the New Testament of each target language, in addition to the specific version of the Bible we use, as found in the JHU Bible Corpus. In Tables 6 and 7, we provide the relative performance of all adaptation methods as compared to baseline performance.