UNKs Everywhere: Adapting Multilingual Language Models to New Scripts

Massively multilingual language models such as multilingual BERT offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks. However, due to limited capacity and large differences in pretraining data sizes, there is a profound performance gap between resource-rich and resource-poor target languages. The ultimate challenge is dealing with under-resourced languages not covered at all by the models and written in scripts unseen during pretraining. In this work, we propose a series of novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts. Relying on matrix factorization, our methods capitalize on the existing latent knowledge about multiple languages already available in the pretrained model's embedding matrix. Furthermore, we show that learning a new dedicated embedding matrix in the target language can be improved by leveraging a small number of vocabulary items (i.e., the so-called lexically overlapping tokens) shared between mBERT's and the target language's vocabularies. Our adaptation techniques offer substantial performance gains for languages with unseen scripts. We also demonstrate that they can yield improvements for low-resource languages written in scripts covered by the pretrained model.


Introduction
Massively multilingual language models pretrained on large multilingual data, such as multilingual BERT (mBERT; Devlin et al., 2019) and XLM-R (Conneau et al., 2020a), are the current state-of-the-art vehicle for effective cross-lingual transfer (Hu et al., 2020). However, while they exhibit strong transfer performance between resource-rich and similar languages (Conneau et al., 2020a; Artetxe et al., 2020), these models struggle with transfer to low-resource languages (Wu and Dredze, 2020) and languages not represented at all in their pretraining corpora (Pfeiffer et al., 2020b; Müller et al., 2021; Ansell et al., 2021). The most extreme challenge is dealing with unseen languages with unseen scripts (i.e., the scripts are not represented in the pretraining data; see Figure 1), where the pretrained models are bound to fail entirely if they are used off-the-shelf without any further model adaptation.
Existing work focuses on the embedding layer and learns either a new embedding matrix for the target language (Artetxe et al., 2020) or adds new tokens to the pretrained vocabulary. While the former has only been applied to high-resource languages, the latter approaches have been limited to languages with seen scripts (Chau et al., 2020;Müller et al., 2021) and large pretraining corpora (Wang et al., 2020). Another line of work adapts the embedding layer as well as other layers of the model via adapters (Pfeiffer et al., 2020b;Üstün et al., 2020). Such methods, however, cannot be directly applied to languages with unseen scripts.
In this work, we first empirically verify that the original tokenizer and the original embedding layer of a pretrained multilingual model fail for languages with unseen scripts. This implies that dedicated in-language tokenizers and embeddings are a crucial requirement for any successful model adaptation. The key challenge is aligning new target language embeddings to the pretrained model's representations while leveraging knowledge encoded in the existing embedding matrix. We systematize existing approaches based on the pretrained information they utilize, and identify lexically overlapping tokens that are present in both vocabularies as key carriers of such information (Søgaard et al., 2018). We then present novel, effective, and data-efficient methods for adapting pretrained multilingual language models to low-resource languages written in different scripts. Beyond lexical overlap, our methods rely on factorized information from the embedding matrix and on token groupings.
We evaluate our approaches on the named entity recognition (NER) task using the standard WikiAnn dataset (Rahimi et al., 2019) and on dependency parsing (DP; Nivre et al., 2016). We use 4 diverse resource-rich languages as source languages, and transfer to 17 and 6 resource-poor target languages, respectively, including 5 languages with unseen scripts (Amharic, Tibetan, Khmer, Divehi, Sinhala). We show that our adaptation techniques offer unmatched performance for languages with unseen scripts. They also yield improvements for low-resource and under-represented languages written in scripts covered by the pretrained model.

Contributions. 1) We systematize and compare current model adaptation strategies for low-resource languages with seen and unseen scripts. 2) We measure the impact of initialization when learning new embedding layers, and demonstrate that non-random initialization starting from a subset of seen lexical items (i.e., lexically overlapping vocabulary items) has a strong positive impact on task performance for resource-poor languages. 3) We propose methods for learning low-dimensional embeddings, which reduce the number of trainable parameters and yield more efficient model adaptation. Our approach, based on matrix factorization and language clusters, extracts relevant information from the pretrained embedding matrix. 4) We show that our methods outperform previous approaches for both resource-rich and resource-poor languages. They substantially reduce the gap between random and lexically-overlapping initialization, enabling better model adaptation to unseen scripts.
The code for this work is released at github.com/Adapter-Hub/UNKs_everywhere.

Background

Massively multilingual models have been shown to enable effective zero-shot cross-lingual transfer across a wide range of tasks (e.g., Wu et al., 2020; Hu et al., 2020; K et al., 2020). However, recent studies have also indicated that even current state-of-the-art models such as XLM-R (Large) still do not yield reasonable transfer performance across a large number of target languages (Hu et al., 2020). The largest drops are reported for resource-poor target languages (Lauscher et al., 2020), and (even more dramatically) for languages not covered at all during pretraining (Pfeiffer et al., 2020b).
Standard Cross-Lingual Transfer Setup. The standard transfer setup with a state-of-the-art pretrained multilingual model such as mBERT or XLM-R consists of 1) fine-tuning it on labelled data of a downstream task in a source language, and then 2) applying it directly to perform inference in a target language (Hu et al., 2020). However, as the model must balance between many languages in its representation space, it is not suited to excel at a specific language at inference time without further adaptation (Pfeiffer et al., 2020b).
Adapters for Cross-lingual Transfer. Adapter-based approaches have been proposed as a remedy (Rebuffi et al., 2017, 2018; Houlsby et al., 2019; Stickland and Murray, 2019; Bapna and Firat, 2019; Pfeiffer et al., 2020a, 2021). In cross-lingual setups, the idea is to increase the multilingual model capacity by storing language-specific knowledge of each language in dedicated parameters (Pfeiffer et al., 2020b; Vidoni et al., 2020). We start from MAD-X (Pfeiffer et al., 2020b), a state-of-the-art adapter-based framework for cross-lingual transfer; for completeness, we provide a brief overview of the framework in what follows. MAD-X comprises three adapter types: language, task, and invertible adapters; this enables learning language- and task-specific transformations in a modular and parameter-efficient way. As in prior work (Rebuffi et al., 2017; Houlsby et al., 2019), adapters are trained while keeping the parameters of the pretrained multilingual model fixed. Language adapters are trained via masked language modeling (MLM) on unlabelled target language data. Task adapters are trained via task-specific objectives on labelled task data in a source language, while also keeping the language adapters fixed. Task and language adapters are stacked: this enables the adaptation of the pretrained multilingual model to languages not covered in its pretraining data. At inference, MAD-X keeps the same task adapter while substituting the source language adapter with the target language adapter.
In brief, the adapter A_l at layer l consists of a down-projection D ∈ R^{h×d}, where h is the hidden size of the Transformer model and d is the dimension of the adapter, followed by a GeLU activation (Hendrycks and Gimpel, 2016) and an up-projection U ∈ R^{d×h} at every layer l:

A_l(h_l, r_l) = U(GeLU(D(h_l))) + r_l,

where h_l and r_l are the Transformer hidden state and the residual at layer l, respectively. The residual connection r_l is the output of the Transformer's feed-forward layer, whereas h_l is the output of the subsequent layer normalization. For further technical details, we refer the reader to Pfeiffer et al. (2020b).
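As a concrete illustration, the bottleneck computation above can be sketched in a few lines (a minimal numpy sketch using a row-vector convention; function and variable names are ours, not those of the MAD-X implementation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GeLU activation (Hendrycks and Gimpel, 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def adapter_forward(h_l, r_l, D, U):
    """One adapter block: down-project the hidden state, apply the
    non-linearity, up-project, and add the residual.
    h_l: hidden states (batch, h); r_l: residuals (batch, h);
    D: (h, d) down-projection; U: (d, h) up-projection."""
    return gelu(h_l @ D) @ U + r_l
```

Stacking a task adapter on top of a language adapter then amounts to applying this transformation twice, with different parameter sets, at each layer.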
Current model adaptation approaches (Chau et al., 2020;Wang et al., 2020) generally fine-tune all model parameters on target language data. Instead, we follow the more computationally efficient adapter-based paradigm where we keep model parameters fixed, and only train language adapters and target language embeddings. Crucially, while the current adapter-based methods offer extra capacity, they do not offer mechanisms to deal with extended vocabularies of many resource-poor target languages, and do not adapt their representation space towards the target language adequately. This problem is exacerbated when dealing with unseen languages and scripts. 2

Cross-lingual Transfer of Lexical Information
The embedding matrix of large multilingual models makes up around 50% of the entire parameter budget (Chung et al., 2021). However, it is not clear how to most effectively leverage this large amount of information for languages that are not adequately represented in the shared multilingual vocabulary due to a lack of pretraining data.
A key challenge in using the lexical information encoded in the embedding matrix is to overcome the vocabulary mismatch between the pretrained model and the target language. To outline this issue, in Table 1 we show, for the languages in our NER evaluation, the proportion of tokens in each language that are effectively unknown (UNK) to mBERT: they occur in the vocabulary of a separately trained monolingual tokenizer (§4.3), but cannot even be composed from subword tokens in the original mBERT vocabulary. Table 1 also provides the proportion of lexically overlapping tokens, i.e., tokens that are present in both mBERT's and the monolingual in-language vocabularies. The zero-shot performance of mBERT generally deteriorates with less lexical overlap and more UNKs in a target language (see Figure 2): Pearson's ρ correlation scores between NER performance and the lexical overlap and the proportion of UNKs (see Table 1) are 0.443 and −0.798, respectively.

2 An alternative approach based on transliteration (Müller et al., 2021) side-steps script adaptation but relies on language-specific heuristics, which are not available for most languages.

Table 2: Overview of our methods and related approaches together with the pretrained knowledge they utilize. We calculate the number of new parameters per language with |V'| = 10k, D = 768, and D' = 100. We do not include the up-projection matrices G as these are learned only once and make up a comparatively small number of parameters.
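To make the notion of "effectively unknown" concrete, the following toy sketch mimics mBERT-style greedy longest-match-first WordPiece segmentation with a hypothetical miniature vocabulary; a word is mapped to a single [UNK] as soon as no subword in the vocabulary covers the current position:

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece segmentation (BERT-style):
    non-initial subwords carry the '##' continuation prefix; if at any
    position no vocabulary subword matches, the whole word becomes [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical miniature vocabulary: the Latin subwords are covered,
# but the characters of the Amharic word below are not.
vocab = {"wo", "##lof", "አ"}
print(wordpiece("wolof", vocab))  # -> ['wo', '##lof']
print(wordpiece("ሰላም", vocab))    # -> ['[UNK]']  (unseen script)
```

For a language with an unseen script, essentially every word falls into the second case, which is why a dedicated in-language tokenizer and embedding matrix are required.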
Recent approaches such as invertible adapters (Pfeiffer et al., 2020b) that adapt embeddings in the pretrained multilingual vocabulary may be able to deal with lesser degrees of lexical overlap; still, they cannot deal with UNK tokens. In the following, we systematize existing approaches and present novel ways of adapting a pretrained model to the vocabulary of a target language that can handle this challenging setting, which is most acute when adapting to languages with unseen scripts. We summarize the approaches in Table 2 based on which types of pretrained information they utilize. All approaches rely on a new vocabulary V', learned on the target language data.

Target-Language Embedding Learning
A straightforward way to adapt a pretrained model to a new language is to learn new embeddings for that language. Given the new vocabulary V', we initialize new embeddings X' ∈ R^{|V'|×D} for all V' vocabulary items, where D is the dimensionality of the existing embeddings X ∈ R^{|V|×D}; only the special tokens (e.g., [CLS], [SEP]) are initialized with their pretrained representations. We train the new embeddings X' with the pretraining task. This approach, termed EL-RAND, was proposed by Artetxe et al. (2020): they show that it allows learning aligned representations for a new language, but they only evaluate on high-resource languages. The shared special tokens allow the model to access a minimum amount of lexical information, which can be useful for transfer (Dufter and Schütze, 2020). Beyond this, the approach leverages knowledge from the existing embedding matrix only implicitly, to the extent that the higher-level hidden representations are aligned to the lexical representations.

Initialization with Lexical Overlap
To leverage more lexical information, we can apply shared initialization not only to the special tokens but to all lexically overlapping tokens. Let us denote this vocabulary subset with V'_lex, and let V'_rand = V' \ V'_lex. In particular, we initialize the embeddings X'_lex of all lexically overlapping tokens from V'_lex with their pretrained representations from the original matrix X, while the tokens from V'_rand receive randomly initialized embeddings X'_rand. We then fine-tune all target language embeddings X' = X'_lex ∪ X'_rand on the target language data. Wang et al. (2020) cast this as extending the original vocabulary V with new tokens. In contrast, we seek to disentangle the impact of vocabulary size and pretrained information. As one variant of this approach, Chau et al. (2020) only add the 99 most common tokens of a new language to V.
Initialization with lexical overlap, termed EL-LEX, allows us to selectively leverage the information from the pretrained model on a per-token level, based on surface-level similarity. Intuitively, this should be most useful for languages that are lexically similar to those seen during pretraining and have a substantial proportion of lexically overlapping tokens. However, such lexical overlap is far rarer for languages that are written in different scripts. For such languages, relying on surface-level string similarity alone may not be enough.
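The two initialization schemes differ only in which rows of the new embedding matrix are copied from the pretrained one; a minimal numpy sketch (vocabularies and sizes are illustrative; under EL-RAND, only special tokens such as [CLS] and [SEP] would be treated as overlapping):

```python
import numpy as np

def init_new_embeddings(new_vocab, old_vocab, X_old, rng, std=0.02):
    """Build the new embedding matrix X' for vocabulary V'.
    new_vocab / old_vocab map token strings to row indices.
    Rows for lexically overlapping tokens (V'_lex) are copied from the
    pretrained matrix X_old; all remaining rows (V'_rand) are drawn
    from a normal distribution, as in EL-RAND."""
    X_new = rng.normal(0.0, std, size=(len(new_vocab), X_old.shape[1]))
    for token, row in new_vocab.items():
        if token in old_vocab:  # lexically overlapping token
            X_new[row] = X_old[old_vocab[token]]
    return X_new
```

All rows, copied or random, are subsequently fine-tuned with the MLM objective on target language data.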

Embedding Matrix Factorization
We therefore propose to identify latent semantic concepts in the pretrained model's embedding matrix that are general across languages and useful for transfer. Further, to allow modeling flexibility, we propose to learn a grouping of similar tokens. We achieve this by factorizing the pretrained embedding matrix X ∈ R^{|V|×D} into lower-dimensional word embeddings F ∈ R^{|V|×D'} and C shared up-projections G_1, ..., G_C ∈ R^{D'×D} that encode general cross-lingual information, where D' is the dimensionality of the lower-dimensional embeddings:

x_v ≈ f_v G_{c(v)},  (2)

where x_v and f_v are the rows of X and F for token v, and c(v) is the cluster (up-projection) associated with token v.
When C = 1, Eq. (2) simplifies to X ≈ FG. As X is unconstrained, we follow Ding et al. (2008) and interpret this as a semi-non-negative matrix factorization (Semi-NMF) problem: in Semi-NMF, G is restricted to be non-negative, while no restrictions are placed on the signs of F. The Semi-NMF is computed via an iterative algorithm that alternately updates F and G such that the Frobenius norm of the reconstruction error is minimized. G is shared across all tokens and thus encodes general properties of the original embedding matrix X, whereas F stores token-specific information. G only needs to be pretrained once and can be used and fine-tuned for every new language. To this end, we simply learn new low-dimensional embeddings F' ∈ R^{|V'|×D'} with the pretraining task; these are up-projected with G and fed to the model.
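For illustration, the alternating Semi-NMF updates can be sketched as follows (following the multiplicative update rules of Ding et al. (2008); this is an illustrative numpy reimplementation under our notation, not the exact toolbox used in our experiments):

```python
import numpy as np

def semi_nmf(X, k, n_iter=100, eps=1e-9, seed=0):
    """Semi-NMF: X (n x d) ~ F (n x k) @ G (k x d), where G is non-negative
    and F is unconstrained (Ding et al., 2008). Alternating updates reduce
    the Frobenius norm of the reconstruction error."""
    rng = np.random.default_rng(seed)
    G = rng.random((k, X.shape[1]))        # non-negative initialization
    pos = lambda A: (np.abs(A) + A) / 2.0  # element-wise positive part
    neg = lambda A: (np.abs(A) - A) / 2.0  # element-wise negative part
    for _ in range(n_iter):
        # F update: unconstrained least squares given the current G
        F = X @ G.T @ np.linalg.pinv(G @ G.T)
        # G update: multiplicative rule that preserves non-negativity
        FtX, FtF = F.T @ X, F.T @ F
        G = G * np.sqrt((pos(FtX) + neg(FtF) @ G) /
                        (neg(FtX) + pos(FtF) @ G + eps))
    return F, G
```

Here F plays the role of the low-dimensional embeddings and the non-negative G that of a single shared up-projection (the C = 1 case).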

MF^C_KMEANS-*. When C > 1, each token is associated with one of C up-projection matrices. Grouping tokens and using a separate up-projection matrix per group may help balance sharing information across typologically similar languages with learning a robust representation for each token (Chung et al., 2020). We propose two approaches to automatically learn such a clustering.
In our first, pipeline-based approach, we first cluster X into C clusters using KMeans. For each cluster, we then factorize the subset of embeddings X_c associated with the c-th cluster separately using Semi-NMF, exactly as for MF^1-*.
For a new language, we learn new low-dimensional embeddings F' ∈ R^{|V'|×D'} and a randomly initialized matrix Z ∈ R^{|V'|×C}, which allows us to compute the cluster assignment matrix I ∈ R^{|V'|×C}. Specifically, for token v, we obtain its cluster assignment as the arg max of z_{v,·}. As arg max is not differentiable, we employ the Straight-Through Gumbel-Softmax estimator (Jang et al., 2017), defined as:

i_{v,c} = exp((z_{v,c} + g_c) / τ) / Σ_{j=1}^{C} exp((z_{v,j} + g_j) / τ),  (3)

where τ is a temperature parameter, and g ∈ R^C corresponds to samples from the Gumbel distribution, g_j ∼ −log(−log(u_j)) with u_j ∼ U(0, 1) drawn from the uniform distribution. z_{v,·} can be seen as the "logits" used for assigning the v-th token to a cluster. As τ → 0, the softmax becomes an arg max and the Gumbel-Softmax distribution approximates the categorical distribution more closely. In the forward pass, I ∈ R^{|V'|×C} represents the one-hot-encoded indicator function over possible clusters, with learnable parameters Z; as before, i_{v,c} = 1 iff the new token v is associated with up-projection c, else 0.
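The forward pass of this cluster assignment can be sketched as follows (a numpy illustration of the sampling step only; in training, the straight-through estimator uses the hard one-hot I in the forward pass and the gradient of the soft probabilities in the backward pass):

```python
import numpy as np

def gumbel_softmax_assign(Z, tau, rng):
    """Z: (|V'|, C) logits over clusters. Returns the hard one-hot
    assignment I (used in the forward pass) and the soft probabilities Y
    (whose gradient the straight-through estimator uses backward)."""
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=Z.shape)
    g = -np.log(-np.log(u))                           # Gumbel(0, 1) noise
    logits = (Z + g) / tau
    Y = np.exp(logits - logits.max(axis=1, keepdims=True))
    Y /= Y.sum(axis=1, keepdims=True)                 # soft assignment
    I = np.zeros_like(Y)
    I[np.arange(Z.shape[0]), Y.argmax(axis=1)] = 1.0  # hard one-hot
    return I, Y
```

Each token's low-dimensional embedding is then up-projected with the G_c selected by its row of I.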
MF^C_NEURAL-*. We can also learn the cluster assignments and up-projections jointly. Specifically, we parameterize G in Eq. (2) using a neural network, where we learn the indicator matrix I equivalently to Eq. (3). The objective minimizes the L2-norm between the original and the predicted embeddings:

min_{F, Z, G_1, ..., G_C} Σ_v || x_v − f_v Σ_{c=1}^{C} i_{v,c} G_c ||_2 .  (4)

For a new language, we proceed analogously.
MF^*_*-RAND and MF^*_*-LEX. Finally, we can combine the different initialization strategies (see §3.1 and §3.2) with the embedding matrix factorization technique. We label the variant which relies on random initialization (§3.1) MF^1-RAND. The variant which relies on lexically overlapping tokens (§3.2) can leverage both surface-level similarity and the latent knowledge in the embedding matrix: we simply initialize the embeddings of overlapping lexical tokens (from V'_lex) in F' with their low-dimensional representations from F. The remaining tokens (from V'_rand) are randomly initialized in F'.
Factorizing the embedding matrix has the additional benefit of reducing the number of trainable parameters, and correspondingly the amount of storage space required for each additional language. This is especially true when D' ≪ D.
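To make the savings concrete with the values used in Table 2 (|V'| = 10k, D = 768, D' = 100), the per-language embedding parameters can be compared directly (adapter and up-projection parameters are excluded, as in the table):

```python
V_new, D, D_low = 10_000, 768, 100

full_embeddings = V_new * D   # EL-*: a full-size embedding matrix per language
factorized = V_new * D_low    # MF-*: low-dimensional embeddings per language

print(full_embeddings, factorized)  # 7680000 1000000, i.e. a ~7.7x reduction
```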

Experiments
Data. For pretraining, we leverage the Wikipedia dumps of the target languages. We conduct experiments on named entity recognition (NER) and dependency parsing (DP).

Languages. WikiAnn offers wide language coverage (176 languages) and, consequently, enables a number of language-related comparative analyses. In order to systematically compare against state-of-the-art cross-lingual methods under different evaluation conditions, we identify 1) low-resource languages whose script is covered by mBERT but on which the model has not been specifically trained, as well as 2) low-resource languages with scripts not covered at all in pretraining. In each case, we select four languages, taking into account variance in data availability and typological diversity. We select four high-resource source languages (English, Chinese, Japanese, Arabic) in order to go beyond English-centric transfer, and evaluate the cross-lingual transfer performance from these 4 source languages to the 17 diverse target languages. For DP, we choose the subset of target languages that occurs in UD. We highlight the properties of all 21 languages in Table 1.

Baselines

mBERT (Standard Transfer Setup). We primarily focus on mBERT as it has been shown to work well for low-resource languages (Pfeiffer et al., 2020b). mBERT is trained on the 104 languages with the largest Wikipedias. In the standard cross-lingual transfer setting (see §2), the full model is fine-tuned on the target task in the (high-resource) source language, and is evaluated on the test set of the target (low-resource) language.
MAD-X. We follow Pfeiffer et al. (2020b) and stack task adapters on top of pretrained language adapters (see §2). When training the model on source language task data, only the task adapter is trained while the original model weights and the source language adapter are frozen. At inference, the source language adapter is replaced with the target language adapter.
MAD-X 2.0. The adapter in the last transformer layer is not encapsulated between frozen transformer layers, and can thus be considered an extension of the prediction head. This places no constraints on the representation of the final adapter, possibly decreasing transfer performance when replacing the language adapters for zero-shot transfer.
In this work, we thus propose to drop the adapter in the last transformer layer, and also evaluate this novel variant of the MAD-X framework.

Methods
We experiment with the methods from Table 2 discussed in §3, summarized here for clarity.

EL-*. We randomly initialize embeddings for all tokens (except special tokens) in the new vocabulary (EL-RAND), or initialize the embeddings of lexically overlapping tokens with their pretrained representations (EL-LEX).

MF^1-*. We randomly initialize lower-dimensional embeddings (MF^1-RAND), or initialize lexically overlapping tokens with their corresponding lower-dimensional pretrained representations (MF^1-LEX), while using a single pretrained up-projection matrix.

Experimental Setup
Previous work generally fine-tunes the entire model on the target task (Chau et al., 2020; Wang et al., 2020). To extend the model to a new vocabulary, Artetxe et al. (2020) alternately freeze and fine-tune the embeddings and the transformer weights during pretraining and target-task fine-tuning, respectively. We find that this approach largely underperforms adapter-based transfer as proposed by Pfeiffer et al. (2020b), and we thus primarily focus on adapter-based training in this work.

Adapter-Based Transfer. We largely follow the experimental setup of Pfeiffer et al. (2020b), unless noted otherwise. We obtain language adapters for the high-resource languages from AdapterHub.ml (Pfeiffer et al., 2020a), and we train language adapters and embeddings for the low-resource languages jointly while keeping the rest of the model fixed.
For zero-shot transfer, we replace the source language adapter with the target adapter, and also replace the entire embedding layer with the new embedding layer specialized to the target language. MAD-X 2.0 consistently outperforms MAD-X (see §5); we thus use this setup for all our methods.
Tokenizer. We learn a new WordPiece tokenizer for each target language with a vocabulary size of 10k, using the HuggingFace tokenizers library.

Task Fine-tuning. Our preliminary experiments suggested that fine-tuning the model for a smaller number of epochs generally leads to better transfer performance for low-resource languages. We thus fine-tune all models for 10 epochs, evaluating on the source language dev set after every epoch. We then take the best-performing model according to the dev F1 score, and use it in zero-shot transfer. We train all models with a batch size of 16 on the high-resource languages. For NER, we use learning rates of 2e-5 and 1e-4 for full fine-tuning and adapter-based training, respectively. For DP, we use a transformer-based variant (Glavaš and Vulić, 2021) of the standard deep biaffine attention dependency parser (Dozat and Manning, 2017) and train with learning rates of 2e-5 and 5e-4 for full fine-tuning and adapter-based training, respectively.

9 https://github.com/cthurau/pymf

Results and Discussion
The main results are summarised in Table 3a for NER and in Table 3b for DP. First, our novel MAD-X 2.0 considerably outperforms the MAD-X version of Pfeiffer et al. (2020b). However, while both MAD-X versions improve over mBERT for unseen scripts, the performance remains quite low on average. The original mBERT tokenizer is not able to adequately represent unseen scripts: many tokens are substituted by UNKs (see Table 1), culminating in the observed low performance. For our approaches that learn new embedding matrices, we observe that for languages seen during pretraining, but potentially under-represented by the model (e.g., Georgian (ka), Urdu (ur), and Hindi (hi)), the proposed methods outperform MAD-X 2.0 on all tasks. This is in line with contemporary work (Rust et al., 2021), which emphasizes the importance of tokenizer quality for downstream tasks. Consequently, for unseen languages with under-represented scripts, the performance gains are even larger; e.g., we see large improvements for Min Dong (cdo), Mingrelian (xmf), and Sindhi (sd). For unseen languages with the Latin script, our methods perform competitively (e.g., Maori (mi), Ilokano (ilo), Guarani (gn), and Wolof (wo)): this empirically confirms that the Latin script is adequately represented in the original vocabulary. The largest gains are achieved for languages with unseen scripts (e.g., Amharic (am), Tibetan (bo), Khmer (km), Divehi (dv), Sinhala (si)), as these languages are primarily represented as UNK tokens by the original mBERT tokenizer.
We observe improvements for most languages with lexical overlap initialization. This adds further context to prior studies which found that a shared vocabulary is not necessary for learning multilingual representations (Conneau et al., 2020b; Artetxe et al., 2020): while it is possible to generalize to new languages without lexical overlap, leveraging the overlap still offers additional gains.
The methods based on matrix factorization (MF^*_*-*) improve performance over the full-sized embedding methods (EL-*), especially in the setting without lexical overlap initialization (*-RAND). This indicates that, by factorizing the information encoded in the original embedding matrix, we are able to extract relevant information for unseen languages. When combining matrix factorization with lexical overlap initialization (MF^*_*-LEX), zero-shot performance improves further for unseen languages with covered scripts. This suggests that the two methods complement each other. For 6/9 of these languages, we find that encoding the embeddings in multiple up-projections (MF^10_*-*) achieves the peak score. This in turn verifies that grouping similar tokens improves the robustness of token representations (Chung et al., 2020). For unseen languages with covered scripts, this model variant also outperforms MAD-X 2.0 on average.
For languages with unseen scripts, we find that matrix factorization has a smaller impact. While the encoded information supports languages similar to those the model has seen in pretraining, languages with unseen scripts are too distant to benefit from this latent multilingual knowledge. Surprisingly, lexical overlap is nevertheless helpful for languages with unseen scripts.
Overall, we observe that both MF^10_KMEANS-* and MF^10_NEURAL-* perform well for most languages, where the KMEANS variant performs better for NER and the NEURAL variant performs better for DP.

Further Analysis

Lexically Overlapping (Sub)Words
We perform a quantitative analysis of lexically overlapping tokens, i.e., tokens that occur both in mBERT's and the monolingual in-language vocabularies; see Table 4. For languages with scripts not covered by the mBERT tokenizer, most lexically overlapping tokens are single characters of a non-Latin script. We further present the top 5 longest lexically overlapping (sub)words for four languages with scripts covered during pretraining (Min Dong, Maori, Ilokano, and Guarani) and four languages with unseen scripts (Tibetan, Khmer, Divehi, Sinhala) in Table 5. We observe that frequent lexically overlapping tokens are named entities in the Latin script, indicating that NER may not be the best evaluation task to objectively assess generalization performance on such languages. If the same named entities also occur in the training data of higher-resource languages, the models will be more successful at identifying them in the unseen language, which belies a lack of deeper understanding of the low-resource language. This might also explain why greater performance gains were achieved for NER than for DP.

Sample Efficiency
We have further analyzed the sample efficiency of our approaches. We find that EL-LEX is slightly more sample-efficient than MF^10_KMEANS-LEX, while the latter outperforms the former when more data is available. In Figure 3, we plot the zero-shot transfer performance when the adapters and embeddings are pretrained on different amounts of data. We further find that lower-dimensional embeddings tend to outperform higher-dimensional embeddings for the majority of languages: in Table 6, we compare 300- with 100-dimensional embeddings for the MF^10_KMEANS-LEX approach.

Script Clusters
We analyze the KMeans clusters based on the tokens that consist of characters of a certain script in Figure 4 of the Appendix. We find distinct script-based groups: for instance, 5 clusters consist primarily of Latin-script tokens, and two clusters consist predominantly of Chinese tokens, along with a few Korean tokens. Interestingly, 2 clusters consist of Cyrillic and Arabic scripts, as well as scripts used predominantly in India, varying slightly in their distribution. Lastly, one cluster includes tokens of all scripts except Latin.

Conclusion
We have systematically evaluated strategies for model adaptation to unseen languages with seen and unseen scripts. We have assessed the importance of the information stored within the original embedding matrix by leveraging lexically overlapping tokens and extracting latent semantic concepts. For the latter, we have proposed a new method of encoding the embedding matrix into lower-dimensional embeddings and up-projections. We have demonstrated that our methods outperform previous approaches on NER and dependency parsing in both resource-rich and resource-poor scenarios, reducing the gap between random and lexical-overlap initialization, and enabling more effective model adaptation to unseen scripts.

A.2 Results: Named Entity Recognition
We present non-aggregated NER transfer performance when transferring from English, Chinese, Japanese, and Arabic in Tables 7, 8, 9, and 10, respectively. 12Ad indicates whether (✓) or not (✗) an adapter is placed in the 12th transformer layer. We additionally present the results for full model fine-tuning (FMT-*).

A.3 Results: Dependency Parsing
We present non-aggregated DP transfer performance when transferring from English, Chinese, Japanese, and Arabic in Tables 11a, 11b, 11c, and 11d respectively.

A.4 Script Clusters
We present the groups of scripts within the 10 KMeans clusters in Figure 4. We follow Ács (2019) in grouping the languages by script. Table 12 lists all 104 languages and corresponding scripts on which mBERT was pretrained.

A.6 Hardware Setup
All experiments were conducted on a single NVIDIA V100 GPU with 32 GB of VRAM.

Table 9: Mean F1 NER test results averaged over 5 runs, transferring from the high-resource language Japanese to the low-resource languages. 12Ad indicates whether (✓) or not (✗) an adapter is placed in the 12th transformer layer. The top group (first three rows) includes models which leverage the original tokenizer, which is not specialized for the target language. The second group (last 18 rows) includes models with new tokenizers. Here we separate models with randomly initialized embeddings (*-RANDINIT) from models with lexical initialization (*-LEXINIT) by the dashed line. We additionally present the results for full model fine-tuning (FMT-*).

Table 10: Mean F1 NER test results averaged over 5 runs, transferring from the high-resource language Arabic to the low-resource languages. 12Ad indicates whether (✓) or not (✗) an adapter is placed in the 12th transformer layer. The top group (first three rows) includes models which leverage the original tokenizer, which is not specialized for the target language. The second group (last 18 rows) includes models with new tokenizers. Here we separate models with randomly initialized embeddings (*-RANDINIT) from models with lexical initialization (*-LEXINIT) by the dashed line. We additionally present the results for full model fine-tuning (FMT-*).