Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability

Pretrained multilingual models enable zero-shot learning even for unseen languages, and that performance can be further improved via adaptation prior to finetuning. However, it is unclear how the number of pretraining languages influences a model’s zero-shot learning for languages unseen during pretraining. To fill this gap, we ask the following research questions: (1) How does the number of pretraining languages influence zero-shot performance on unseen target languages? (2) Does the answer to that question change with model adaptation? (3) Do the findings for our first question change if the languages used for pretraining are all related? Our experiments on pretraining with related languages indicate that choosing a diverse set of languages is crucial. Without model adaptation, surprisingly, increasing the number of pretraining languages yields better results up to adding related languages, after which performance plateaus. In contrast, with model adaptation via continued pretraining, pretraining on a larger number of languages often gives further improvement, suggesting that model adaptation is crucial to exploit additional pretraining languages.


Introduction
Pretrained multilingual language models (Devlin et al., 2019; Conneau et al., 2020) are now a standard approach for cross-lingual transfer in natural language processing (NLP). However, there are multiple, potentially related issues with pretraining multilingual models. Conneau et al. (2020) find the "curse of multilinguality": for a fixed model size, zero-shot performance on target languages seen during pretraining increases with additional pretraining languages only until a certain point, after which performance decreases. Wang et al. (2020b) also report "negative interference", where monolingual models achieve better results than multilingual models, both on subsets of high- and low-resource languages. However, those findings are limited to target languages seen during pretraining.
Current multilingual models cover only a small subset of the world's languages. Furthermore, due to data sparsity, monolingual pretrained models are not likely to obtain good results for many low-resource languages. In those cases, multilingual models can transfer zero-shot to unseen languages with above-chance performance, which can be further improved via model adaptation with target-language text (Wang et al., 2020a), even when only limited amounts are available (Ebrahimi and Kann, 2021). However, it is poorly understood how the number of pretraining languages influences performance in those cases. Does the "curse of multilinguality" or "negative interference" also impact performance on unseen target languages? And, if we want a model to be applicable to as many unseen languages as possible, how many languages should it be trained on?
Specifically, we ask the following research questions: (1) How does pretraining on an increasing number of languages impact zero-shot performance on unseen target languages? (2) Does the effect of the number of pretraining languages change with model adaptation to target languages? (3) Does the answer to the first research question change if the pretraining languages are all related to each other?
We pretrain a variety of monolingual and multilingual models, which we then finetune on English and apply to three zero-shot cross-lingual downstream tasks in unseen target languages: part-of-speech (POS) tagging, named entity recognition (NER), and natural language inference (NLI). Experimental results suggest that choosing a diverse set of pretraining languages is crucial for effective transfer. Without model adaptation, increasing the number of pretraining languages improves accuracy on unrelated unseen target languages at first and plateaus thereafter. Last, with model adaptation, additional pretraining languages beyond English generally help.
We are aware of the intense computational cost of pretraining and its environmental impact (Strubell et al., 2019). Thus, our experiments in Section 4 are on a relatively small scale with a fixed computational budget for each model and on relatively simple NLP tasks (POS tagging, NER, and NLI), but we validate our most central findings in Section 5 on large publicly available pretrained models.

Cross-lingual Transfer via Pretraining
Pretrained multilingual models are a straightforward cross-lingual transfer approach: a model pretrained on multiple languages is fine-tuned on target-task data in the source language. Subsequently, the model is applied to target-task data in the target language. Most commonly, the target language is part of the model's pretraining data. However, cross-lingual transfer is possible even if this is not the case, though performance tends to be lower. Prior work explores how the cross-lingual transfer abilities of pretrained models on seen target languages depend on the number of pretraining languages; this paper extends that line of work to unseen target languages. We now describe transfer via pretrained multilingual models and introduce the models and methods vetted in our experiments.

Background and Methods
Pretrained Language Models Contextual representations such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) are not just useful for monolingual representations. Multilingual BERT (Devlin et al., 2019, mBERT), XLM (Lample and Conneau, 2019), and XLM-RoBERTa (Conneau et al., 2020, XLM-R) have surprisingly high cross-lingual transfer performance compared to the previous best practice: static cross-lingual word embeddings (Pires et al., 2019; Wu and Dredze, 2019). Multilingual models are also practical: why have hundreds of separate models for each language when you could do better with just one? Furthermore, Wu and Dredze (2020) report that models pretrained on 100+ languages are better than bilingual or monolingual language models in zero-shot cross-lingual transfer.

Model Adaptation to Unseen Languages
Adapting pretrained multilingual models such as mBERT and XLM-R to unseen languages is one way to use such models beyond the languages covered during pretraining time. Several methods for adapting pretrained multilingual language models to unseen languages have been proposed, including continuing masked language model (MLM) training (Chau et al., 2020; Müller et al., 2020), optionally adding Adapter modules (Pfeiffer et al., 2020), or extending the vocabulary of the pretrained models (Artetxe et al., 2020; Wang et al., 2020a). However, such adaptation methods assume the existence of sufficient monolingual corpora in the target languages. Some spoken languages, dialects, or extinct languages lack monolingual corpora to conduct model adaptation, which motivates us to look into languages unseen during pretraining. We leave investigation on the effect of target language-specific processing, e.g., transliteration into Latin scripts (Muller et al., 2021), for future work.

Research Questions
A single pretrained model that can be applied to any language, including those unseen during pretraining, is both more efficient and more practical than pretraining one model per language. Moreover, it is the only practical option for unknown target languages or for languages without enough resources for pretraining. Thus, models that can be applied or at least easily adapted to unseen languages are an important research focus. This work addresses the following research questions (RQ), using English as the source language for finetuning.

RQ1:
How does the number of pretraining languages influence zero-shot cross-lingual transfer of simple NLP tasks on unseen target languages?
We first explore how many languages a model should be pretrained on if the target language is unknown at test time or has too limited monolingual resources for model adaptation. On the one hand, we hypothesize that increasing the number of pretraining languages will improve performance, as the model sees a more diverse set of scripts and linguistic phenomena. Also, the more pretraining languages there are, the better the chance that one of them is related to the target language. On the other hand, multilingual training can cause interference: other languages could distract from English, the finetuning source language, and thus lower performance.

RQ2:
How does the answer to RQ1 change with model adaptation to the target language?
This question is concerned with settings in which we have enough monolingual data to adapt a pretrained model to the target language. Like our hypothesis for RQ1, we expect that having seen more pretraining languages should make adaptation to unseen target languages easier. However, another possibility is that adapting the model makes any languages other than the finetuning source language unnecessary; performance stays the same or decreases when adding more pretraining languages.

RQ3:
Does the answer to RQ1 change if all pretraining languages are related to each other?
We use a diverse set of pretraining languages when exploring RQ1, since we expect that to be maximally beneficial. However, the results might change depending on the exact languages. Thus, as a case study, we repeat all experiments using a set of closely related languages. On the one hand, we hypothesize that benefits due to adding more pretraining languages (if any) will be smaller with related languages, as we reduce the diversity of linguistic phenomena in the pretraining data. On the other hand, if English is all we use during fine-tuning, performance might increase with related languages, as this will approximate training on more English data more closely.

Experimental Setup
Pretraining Corpora All our models are pretrained on the CoNLL 2017 Wikipedia dump (Ginter et al., 2017). To use equal amounts of data for all pretraining languages, we downsample all Wikipedia datasets to an equal number of sequences. We standardize to the smallest corpus, Hindi. The resulting pretraining corpus size is around 200MB per language. We hold out 1K sequences with around 512 tokens per sequence after preprocessing as a development set to track the models' performance during pretraining.
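The downsampling step can be illustrated with a minimal sketch. It assumes uniform random sampling without replacement, which the text does not specify; the function name `downsample_to_smallest` and the toy corpora are hypothetical and stand in for the per-language lists of sequences.

```python
import random


def downsample_to_smallest(corpora, seed=0):
    """Keep the same number of sequences per language as the smallest corpus.

    Assumption: sequences are drawn uniformly at random without replacement.
    """
    rng = random.Random(seed)
    target = min(len(seqs) for seqs in corpora.values())
    return {lang: rng.sample(seqs, target) for lang, seqs in corpora.items()}


# Toy data; Hindi is the smallest corpus, as in the setup described above.
toy = {
    "en": [f"en-{i}" for i in range(10)],
    "ru": [f"ru-{i}" for i in range(7)],
    "hi": [f"hi-{i}" for i in range(4)],
}
balanced = downsample_to_smallest(toy)
sizes = {lang: len(seqs) for lang, seqs in balanced.items()}
```

After this step every language contributes the same number of sequences, so differences between models reflect the choice of languages rather than the amount of data per language.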
Corpora for Model Adaptation For model adaptation (RQ2), we select unseen target languages contained in both XNLI (Conneau et al., 2018b) and Universal Dependencies 2.5 (Nivre et al., 2019): Farsi (FA), Hebrew (HE), French (FR), Vietnamese (VI), Tamil (TA), and Bulgarian (BG). Model adaptation is typically done for low-resource languages not seen during pretraining because monolingual corpora are too small (Wang et al., 2020a). Therefore, we use the Johns Hopkins University Bible corpus by McCarthy et al. (2020) following Ebrahimi and Kann (2021).
Tasks We evaluate our pretrained models on the following downstream tasks from the XTREME dataset (Hu et al., 2020): POS tagging and NLI. For the former, we select 29 languages from Universal Dependencies v2.5 (Nivre et al., 2019). For the latter, we use all fifteen languages in XNLI (Conneau et al., 2018b). We follow the default train, validation, and test split in XTREME.
languages and facilitate comparability between all pretraining setups, we use XLM-R's vocabulary and the SentencePiece (Kudo and Richardson, 2018) tokenizer by Conneau et al. (2020). We use masked language modeling (MLM) as our pretraining objective and, like Devlin et al. (2019), mask 15% of the tokens. We pretrain all models for 150K steps, using AdamW (Loshchilov and Hutter, 2019) with a learning rate of 1 × 10⁻⁴ and a batch size of two on either NVIDIA RTX2080Ti or GTX1080Ti 12GB; training each model took approximately four days. When pretraining, we preprocess sentences together to generate sequences of approximately 512 tokens. For continued pretraining, we use a learning rate of 2 × 10⁻⁵ and train for forty epochs, otherwise following the setup for pretraining. For finetuning, we use a learning rate of 2 × 10⁻⁵ and train for an additional ten epochs for both POS tagging and NER, and an additional five epochs for NLI, following Hu et al. (2020).
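The two preprocessing steps named above, grouping sentences into sequences of roughly 512 tokens and masking 15% of the tokens, can be sketched as follows. This is a simplified illustration, not the training code: the greedy packing rule, the `<mask>` symbol, and the function names are assumptions, and every selected token is replaced by the mask symbol, whereas the BERT recipe also leaves some selected tokens unchanged or swaps in random tokens.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, assumed for this sketch


def pack_sentences(sentences, max_len=512):
    """Greedily concatenate tokenized sentences into sequences.

    A sequence is closed as soon as adding the next sentence would exceed
    max_len tokens; a single overlong sentence is kept whole.
    """
    sequences, current = [], []
    for sent in sentences:
        if current and len(current) + len(sent) > max_len:
            sequences.append(current)
            current = []
        current.extend(sent)
    if current:
        sequences.append(current)
    return sequences


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace about mask_prob of the tokens with MASK.

    Returns the masked copy and the sorted list of masked positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(mask_prob * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK
    return masked, positions


# Three toy "sentences" of 200 tokens each.
sentences = [["tok"] * 200 for _ in range(3)]
packed = pack_sentences(sentences)
masked, positions = mask_tokens(packed[0])
```

In the toy run, the first two sentences fit into one sequence and the third starts a new one; the model would then be trained to recover the original tokens at `positions`.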

Models and Hyperparameters
Languages Table 1 shows the languages used in our experiments. English is part of the pretraining data of all models. It is also the finetuning source language for all tasks, following Hu et al. (2020). We use two different sets of pretraining languages: "Diverse (Div)" and "Related (Rel)" (Table 2). We mainly focus on pretraining on up to five languages, except for POS tagging where the trend is not clear and we further experiment on up to ten.
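The nested pretraining sets can be written down explicitly. The sketch below assumes that languages are added one at a time in the order given in the captions of Figure 10 and Table 6 (EN, RU, ZH, AR, HI, ES, EL, FI, ID, TR for Diverse; EN, DE, SV, NL, DA for Related); the helper name `incremental_sets` is hypothetical.

```python
def incremental_sets(order, prefix):
    """Map model names to nested language lists, adding one language at a time.

    The first entry is the monolingual model, named after its language;
    later entries are named "<prefix>-<k>" for k pretraining languages.
    """
    sets = {}
    for k in range(1, len(order) + 1):
        name = order[0].upper() if k == 1 else f"{prefix}-{k}"
        sets[name] = order[:k]
    return sets


# Orders assumed from the figure and table captions.
DIVERSE_ORDER = ["en", "ru", "zh", "ar", "hi", "es", "el", "fi", "id", "tr"]
RELATED_ORDER = ["en", "de", "sv", "nl", "da"]

div = incremental_sets(DIVERSE_ORDER, "Div")
rel = incremental_sets(RELATED_ORDER, "Rel")
```

Because each set contains the previous one, any change in accuracy between Div-k and Div-(k+1) can be attributed to the single language that was added.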

Results
We now present experimental results for each RQ.

Findings for RQ1
POS Tagging Figure 1 shows the POS tagging accuracy averaged over the 17 languages unseen during pretraining. On average, models pretrained on multiple languages have higher accuracy on unseen languages than the model pretrained exclusively on English, showing that the model benefits from a more diverse set of pretraining data. However, the average accuracy only increases up to six languages. This indicates that our initial hypothesis "the more languages the better" might not be true. Figure 2 provides a more detailed picture, showing the accuracy for different numbers of pretraining languages for all seen and unseen target languages. As expected, accuracy jumps when a language itself is added as a pretraining language. Furthermore, accuracy rises if a pretraining language from the same language family as a target language is added: for example, the accuracy of Marathi goes up by 9.3% after adding Hindi during pretraining, and the accuracy of Bulgarian increases by 31.2% after adding Russian. This shows that related languages are indeed beneficial for transfer learning. Also, (partially) sharing the same script with a pretraining language (e.g., ES and ET, AR and FA) helps with zero-shot cross-lingual transfer even for languages which are not from the same family.
But how important are scripts compared to other features? To quantify their importance, we conduct a linear regression analysis on the POS tagging results. Table 3 shows the linear regression analysis results using typological features among target and pretraining languages. For the script and family features, we follow Xu et al. (2019) and encode them as binary values, set to one if a language with the same script or from the same family is included as one of the pretraining languages. For syntax and phonology features, we derive those vectors from the URIEL database using lang2vec (Littell et al., 2017) following Lauscher et al. (2020).
We take the maximum cosine similarity between the target language and any of the pretraining languages. Table 3 further confirms that having a pretraining language which shares the same script contributes the most to positive cross-lingual transfer.
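The regression features can be sketched in a few lines. The example assumes that each language has a feature vector (for syntax or phonology) and a script or family label; the vectors below are made up for illustration and are not URIEL values, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_similarity(target_vec, pretraining_vecs):
    """Maximum cosine similarity between the target and any pretraining language."""
    return max(cosine(target_vec, vec) for vec in pretraining_vecs)


def shares_attribute(target_value, pretraining_values):
    """Binary feature: 1 if some pretraining language has the same script or family."""
    return int(target_value in pretraining_values)


# Made-up feature vectors for a target language and two pretraining languages.
target = [1.0, 0.0, 1.0]
pretraining = [[1.0, 0.0, 0.0], [1.0, 0.0, 1.0]]
syntax_feature = max_similarity(target, pretraining)

# Bulgarian is written in Cyrillic, so the script feature fires once Russian is included.
script_feature = shares_attribute("Cyrillic", {"Latin", "Cyrillic"})
no_script_match = shares_attribute("Arabic", {"Latin", "Cyrillic"})
```

Each (model, target language) pair then yields one row of features, and the POS tagging accuracy is regressed on these rows.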
We sadly cannot give a definitive optimal number of pretraining languages. One consistent finding is that, for the large majority of languages, using only English yields the worst results for unseen languages. However, adding pretraining languages does not necessarily improve accuracy (Figure 1). This indicates that, while we want more than one pretraining language, using a smaller number than the 100 commonly used pretraining languages is likely sufficient unless we expect them to be closely related to one of the potential target languages.
Table 3: Regression analysis on the POS tagging accuracy with coefficients (Coef.), p-value, and 95% confidence interval (CI). A large coefficient with a low p-value indicates that the feature significantly contributes to better cross-lingual transfer, which shows that the same script is the most important feature.
NER Our NER results show a similar trend. Therefore, we only report the average performance in the main part of this paper (Figure 3), and full details are available in Appendix A. For NER, transfer to unseen languages is more limited, likely due to the small subset of tokens which are labeled as entities when compared to POS tags.
[Figure legend: en, Div-2 (+ru), Div-3 (+zh), Div-4 (+ar), Div-5 (+hi), Div-6 (+es), Div-7 (+el), Div-8 (+fi), Div-9 (+id), Div-10]
NLI Our NLI results in Figure 4 show a similar trend: accuracy on unseen languages plateaus at a relatively small number of pretraining languages. Specifically, Div-4 has the highest accuracy for 8 target languages, while Div-5 is best only for two target languages. Accuracy again increases with related languages, such as an improvement of 3.7% accuracy for Bulgarian after adding Russian as a pretraining language. Full results are available in Appendix B.

Findings for RQ2
POS Tagging Figure 5a shows the POS tagging results for six languages after adaptation of the pretrained models via continued pretraining. As expected, accuracy is overall higher than in Figure 2. Importantly, there are accuracy gains in Farsi when adding Turkish (+9.8%) and in Hebrew when adding Greek (+7.7%), which were not observed before adapting the models. We investigate this further in Section 5.
NER NER results in Figure 5b are similar to those for POS tagging (e.g., improvement on Bulgarian after adding Russian). However, there is limited improvement on Farsi after adding Arabic despite partially shared scripts between the two languages. This indicates that the effect of adding related pretraining languages is partially task-dependent.

Findings for RQ3
POS Tagging In contrast to RQ1, POS tagging accuracy changes for most languages are limited when increasing the number of pretraining languages (Figure 6). The unseen languages on which we observe gains belong to the Germanic, Romance, and Uralic language families, which are relatively close to English compared to the other language families. The accuracy on languages from other language families changes by < 10%, which is smaller than the change for a diverse set of pretraining languages. This indicates that the models pretrained on similar languages struggle to transfer to unrelated languages.
NER F1 scores of EN, Rel-2, Rel-3, .219, .227, .236, and .237 respectively. As with Div-X, pretraining on related languages improves results up to five languages. However, these models bring a smaller improvement, similar to POS tagging.
NLI Figure 7 shows a similar trend for NLI: when adding related pretraining languages, accuracy on languages far from English either does not change much or decreases. In fact, for nine out of thirteen unseen target languages, Rel-5 is the worst.

More Pretraining Languages
Our main takeaways from the last section are: when using more than one pretraining language, diversity is important. However, there are limitations in the experimental settings in Section 4. We assume the following: (1) relatively small pretraining corpora; (2) the target languages are included when building the model's vocabulary; (3) fixed computational resources; and (4) only up to ten pretraining languages. We now explore if our findings for RQ1 and RQ2 hold without such limitations. For this, we use two publicly available pretrained XLM models (Lample and Conneau, 2019), which have been pretrained on full-size Wikipedia in 17 (XLM-17) and 100 (XLM-100) languages, and the XLM-R base model trained on a larger Common Crawl corpus (Conneau et al., 2020) in 100 languages. We conduct a case study on low-resource languages unseen for all models, including unseen vocabularies: Maltese (MT), Wolof (WO), Yoruba (YO), Erzya (MYV), and Northern Sami (SME). All pretraining languages used in Div-X are included in XLM-17 except for Finnish, and all 17 pretraining languages for XLM-17 are a subset of the pretraining languages for XLM-100. We report the averages with standard deviations from three random seeds.
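Aggregating over seeds can be sketched as below. The scores are made-up placeholders, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is reported; `statistics.pstdev` would give the population variant.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return mean(scores), stdev(scores)


# Hypothetical accuracies from three random seeds.
seed_scores = [0.50, 0.52, 0.54]
avg, std = summarize(seed_scores)
```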

RQ1
For models without adaptation, accuracy does not improve for increasing numbers of source languages (Figure 8a). Indeed, the accuracies of XLM-17 and XLM-100 are on par even though the former uses 17 pretraining languages and the latter uses 100. One exception is Northern Sami (a Uralic language with Latin script): XLM-17 does not see any Uralic languages during pretraining, whereas XLM-100 does. When further comparing Div-10 and XLM-17, the increase in accuracy from additional pretraining languages is limited. Accuracy on Erzya remains constant from five to 100 languages (except for XLM-R), even when increasing the pretraining corpus size from downsampled (Div-X) to full Wikipedia (XLM-17 and XLM-100).
RQ2 For the models with adaptation (Figure 8b), there is a significant gap between XLM-17 and XLM-100. This confirms our findings in the last section: more pretraining languages are beneficial if the pretrained models are adapted to the target languages. A possible explanation is that one or more of XLM-100's pretraining languages is similar to our target languages and that such languages can only be exploited through continued pretraining (e.g., Ukrainian is included in XLM-100 but not in Div-X). Therefore, having the model see more languages during pretraining is better when the models can be adapted to each target language.

Related Work
Static Cross-lingual Word Embeddings Static cross-lingual word embeddings (Mikolov et al., 2013; Conneau et al., 2018a) embed and align words from multiple languages for downstream NLP tasks (Lample et al., 2018; Gu et al., 2018), including a massive one trained on 50+ languages (Ammar et al., 2016). Static cross-lingual embedding methods can be classified into two groups: supervised and unsupervised. Supervised methods use bilingual lexica as the cross-lingual supervision signal. On the other hand, pretrained multilingual language models and unsupervised cross-lingual embeddings are similar because they do not use a bilingual lexicon. Lin et al. (2019) explore the selection of transfer language using both data-independent (e.g., typological) features, and data-dependent features (e.g., lexical overlap). Their work is on static supervised cross-lingual word embeddings, whereas this paper explores pretrained language models.
Analysis of Pretrained Multilingual Models on Seen Languages Starting from Pires et al. (2019), analysis of the cross-lingual transferability of pretrained multilingual language models has been a topic of interest. Pires et al. (2019) hypothesize that cross-lingual transfer occurs due to shared tokens across languages, but Artetxe et al. (2020) show that cross-lingual transfer can be successful even among languages without shared scripts. Other work investigates the relationship between zero-shot cross-lingual learning and typological features (Lauscher et al., 2020), encoding language-specific features (Libovický et al., 2020), and mBERT's multilinguality (Dufter and Schütze, 2020). However, the majority of analyses have either been limited to large public models (e.g., mBERT, XLM-R), to up to two pretraining languages (K et al., 2020; Wu and Dredze, 2020), or to target languages seen during pretraining. One exception is the concurrent work by de Vries et al. (2022) on analyzing the choice of language for the task-specific training data on unseen languages. Here, we analyze the ability of models to benefit from an increasing number of pretraining languages.

Conclusion
This paper explores the effect which pretraining on different numbers of languages has on unseen target languages after finetuning on English. We find: (1) if not adapting the pretrained multilingual language models to target languages, a set of diverse pretraining languages which covers the script and family of unseen target languages (e.g., 17 languages used for XLM-17) is likely sufficient; and (2) if adapting the pretrained multilingual language model to target languages, then one should pretrain on as many languages as possible up to at least 100.
Future directions include analyzing the effect of multilingual pretraining from different perspectives such as different pretraining tasks and architectures, e.g., mT5 (Xue et al., 2021), and more complex tasks beyond classification or sequence tagging.

A NER Results
We show additional experimental results on NER in Figures 9 and 10.

B NLI Results
Tables 5 and 6 show the results without model adaptation, and Table 4 shows the full results with model adaptation.

C Notes on the Experimental Setup for Model Adaptation
The following are additional notes on the setup for model adaptation:
• No vocabulary augmentation is conducted, unlike Wang et al. (2020a). We use XLM-R's vocabulary throughout all experiments in this paper.
• The Bible is used instead of Wikipedia for continued pretraining (model adaptation) to minimize inconsistency in corpus size and content across languages.
Figure 10: NER F1 score on diverse pretraining languages (EN, RU, ZH, AR, HI, ES, EL, FI, ID, TR) grouped by families of target languages, with Indo-European (IE) languages further divided into subgroups following XTREME. The accuracy gain is significant for seen pretraining languages, and also for languages from the same family as a pretraining language when it is added.
Table 6: NLI accuracy on the 13 unseen languages using the models pretrained on related languages (EN, DE, SV, NL, DA), incrementally added one language at a time up to five languages.