How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, whether a gap exists between the multilingual and the corresponding monolingual representation of each language, and subsequently investigate the reasons for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a dedicated monolingual tokenizer plays an equally important role in downstream performance. Our results show that languages that are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with a specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.


Introduction
Following large transformer-based language models (LMs; Vaswani et al., 2017) pretrained on large English corpora (e.g., BERT, RoBERTa, T5; Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2020), similar monolingual language models have been introduced for other languages (Virtanen et al., 2019; Antoun et al., 2020; Martin et al., 2020, inter alia), offering previously unmatched performance in all NLP tasks. Concurrently, massively multilingual models with the same architectures and training procedures, covering more than 100 languages, have been proposed (e.g., mBERT, XLM-R, mT5; Devlin et al., 2019; Conneau et al., 2020; Xue et al., 2021). The "industry" of pretraining and releasing new monolingual BERT models continues its operations despite the fact that the corresponding languages are already covered by multilingual models. The common argument justifying the need for monolingual variants is the assumption that multilingual models, which suffer from the so-called curse of multilinguality (Conneau et al., 2020), i.e., the lack of capacity to represent all languages in an equitable way, underperform monolingual models when applied to monolingual tasks (Virtanen et al., 2019; Antoun et al., 2020; Rönnqvist et al., 2019, inter alia). However, little to no compelling empirical evidence with rigorous experiments and fair comparisons has been presented so far to support or invalidate this strong claim. In this regard, much of the work proposing and releasing new monolingual models is grounded in anecdotal evidence, pointing to the positive results reported for other monolingual BERT models (de Vries et al., 2019; Virtanen et al., 2019; Antoun et al., 2020).

* Both authors contributed equally to this work. † PR is now affiliated with the University of Copenhagen. Our code is available at https://github.com/Adapter-Hub/hgiyt.
Monolingual BERT models are typically evaluated on downstream NLP tasks to demonstrate their effectiveness in comparison to previous monolingual models or mBERT (Virtanen et al., 2019;Antoun et al., 2020;Martin et al., 2020, inter alia). While these results do show that certain monolingual models can outperform mBERT in certain tasks, we hypothesize that this may substantially vary across different languages and language properties, tasks, pretrained models and their pretraining data, domain, and size. We further argue that conclusive evidence, either supporting or refuting the key hypothesis that monolingual models currently outperform multilingual models, necessitates an independent and controlled empirical comparison on a diverse set of languages and tasks.
While recent work has argued and validated that mBERT is under-trained (Rönnqvist et al., 2019;Wu and Dredze, 2020), providing evidence of improved performance when training monolingual models on more data, it is unclear if this is the only factor relevant for the performance of monolingual models. Another so far under-studied factor is the limited vocabulary size of multilingual models compared to the sum of tokens of all corresponding monolingual models. Our analyses investigating dedicated (i.e., language-specific) tokenizers reveal the importance of high-quality tokenizers for the performance of both model variants. We also shed light on the interplay of tokenization with other factors such as pretraining data size.
Contributions. 1) We systematically compare monolingual with multilingual pretrained language models for 9 typologically diverse languages on 5 structurally different tasks. 2) We train new monolingual models on equally sized datasets with different tokenizers (i.e., shared multilingual versus dedicated language-specific tokenizers) to disentangle the impact of pretraining data size from the vocabulary of the tokenizer. 3) We isolate factors that contribute to a performance difference (e.g., tokenizers' "fertility", the number of unseen (sub)words, data size) and provide an in-depth analysis of the impact of these factors on task performance. 4) Our results suggest that monolingually adapted tokenizers can robustly improve monolingual performance of multilingual models.

Background and Related Work
Multilingual LMs. The widespread usage of pretrained multilingual Transformer-based LMs was instigated by the release of multilingual BERT (Devlin et al., 2019), which followed on the success of the monolingual English BERT model. mBERT adopts the same pretraining regime as monolingual BERT and is trained on a concatenation of the 104 largest Wikipedias. Exponential smoothing of the per-language data weights is applied when creating the WordPiece (Wu et al., 2016) subword vocabulary and the pretraining corpus: by oversampling underrepresented languages and undersampling overrepresented ones, this procedure counteracts the imbalance of pretraining data sizes.
The final shared mBERT vocabulary comprises a total of 119,547 subword tokens.
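The exponentially smoothed sampling described above can be sketched in a few lines. The smoothing exponent of 0.7 and the corpus sizes below are illustrative assumptions rather than values taken from this paper; an exponent below 1 flattens the distribution, so small-corpus languages are oversampled and large-corpus languages undersampled:

```python
def smoothed_sampling_probs(corpus_sizes, alpha=0.7):
    """Exponentially smoothed sampling distribution over languages.

    corpus_sizes: dict mapping language -> token count.
    alpha < 1 flattens the raw size distribution: large corpora are
    undersampled and small corpora oversampled.
    """
    total = sum(corpus_sizes.values())
    raw = {lang: n / total for lang, n in corpus_sizes.items()}
    powered = {lang: p ** alpha for lang, p in raw.items()}
    z = sum(powered.values())
    return {lang: p / z for lang, p in powered.items()}

# Hypothetical corpus sizes, for illustration only:
sizes = {"en": 2_500_000_000, "fi": 100_000_000, "tr": 70_000_000}
probs = smoothed_sampling_probs(sizes)
```

With `alpha=1.0` the function reduces to plain proportional sampling, which makes the effect of the smoothing easy to inspect.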
Other multilingual models followed mBERT, such as XLM-R (Conneau et al., 2020). Concurrently, many studies analyzed mBERT's and XLM-R's capabilities and limitations, finding that the multilingual models work surprisingly well for cross-lingual tasks, despite the fact that they do not rely on direct cross-lingual supervision (e.g., parallel or comparable data, translation dictionaries; Pires et al., 2019;Wu and Dredze, 2019;Artetxe et al., 2020;Hu et al., 2020;K et al., 2020).
However, recent work has also pointed to some fundamental limitations of multilingual LMs. Conneau et al. (2020) observe that, for a fixed model capacity, adding new languages increases cross-lingual performance up to a certain point, after which adding more languages results in performance drops. This phenomenon, termed the curse of multilinguality, can be attenuated by increasing the model capacity (Artetxe et al., 2020; Pfeiffer et al., 2020b; Chau et al., 2020) or through additional training for particular language pairs (Pfeiffer et al., 2020b; Ponti et al., 2020). Another observation concerns substantially reduced cross-lingual and monolingual abilities of the models for resource-poor languages with smaller pretraining data (Wu and Dredze, 2020; Hu et al., 2020; Lauscher et al., 2020). These languages remain underrepresented in the subword vocabulary and the model's shared representation space despite oversampling. Despite recent efforts to mitigate this issue (e.g., Chung et al. (2020) propose to cluster and merge the vocabularies of similar languages before defining a joint vocabulary across all languages), multilingual LMs still struggle with balancing their parameters across many languages.
Monolingual versus Multilingual LMs. New monolingual language-specific models have also emerged for many languages, following BERT's architecture and pretraining procedure. There are monolingual BERT variants for Arabic (Antoun et al., 2020), French (Martin et al., 2020), Finnish (Virtanen et al., 2019), and Dutch (de Vries et al., 2019), to name only a few. Pyysalo et al. (2020) released 44 monolingual WikiBERT models trained on Wikipedia. However, only a few studies have thus far, either explicitly or implicitly, attempted to understand how monolingual and multilingual LMs compare across languages. Nozza et al. (2020) extracted task results from the respective papers on monolingual BERTs to facilitate an overview of monolingual models and their comparison to mBERT. However, they have neither verified the scores nor performed a controlled, impartial comparison. Vulić et al. (2020) probed mBERT and monolingual BERT models across six typologically diverse languages for lexical semantics, showing that pretrained monolingual BERT models encode significantly more lexical information than mBERT. Zhang et al. (2020) investigated the role of pretraining data size with RoBERTa, finding that the model learns most syntactic and semantic features on corpora spanning 10M-100M word tokens, but still requires massive datasets to learn higher-level semantic and commonsense knowledge.
Mulcaire et al. (2019) compared monolingual and bilingual ELMo (Peters et al., 2018) LMs across three downstream tasks, finding that contextualized representations from the bilingual models can improve monolingual task performance relative to their monolingual counterparts. However, it is unclear how their findings extend to massively multilingual LMs, which potentially suffer from the curse of multilinguality.
Rönnqvist et al. (2019) compared mBERT to monolingual BERT models for six languages (German, English, Swedish, Danish, Norwegian, Finnish) on three different tasks. They find that mBERT lags behind its monolingual counterparts on cloze and generation tasks, and they identify clear differences among the six languages in terms of this performance gap. They speculate that mBERT is under-trained with respect to individual languages. However, their set of tasks is limited, and their language sample is typologically narrow; it remains unclear whether these findings extend to different language families and to structurally different tasks.
Despite these recent efforts, a careful, systematic study with a controlled experimental setup, a typologically diverse language sample, and a diverse set of tasks is still lacking. We aim to address this gap in this work.

Controlled Experimental Setup
We compare multilingual BERT with its monolingual counterparts in a spectrum of typologically diverse languages and across a variety of downstream tasks. By isolating and analyzing crucial factors contributing to downstream performance, such as tokenizers and pretraining data, we can conduct unbiased and fair comparisons.

Language and Task Selection
Our selection of languages has been guided by several (sometimes competing) criteria: C1) typological diversity; C2) availability of pretrained monolingual BERT models; C3) representation of the languages in standard evaluation benchmarks for a sufficient number of tasks.
Regarding C1, most high-resource languages belong to the same language families, thus sharing a majority of their linguistic features. Neglecting typological diversity inevitably leads to poor generalizability and language-specific biases (Gerz et al., 2018;Ponti et al., 2019;Joshi et al., 2020). Following recent work in multilingual NLP that pays particular attention to typological diversity (Clark et al., 2020;Hu et al., 2020;Ponti et al., 2020, inter alia), we experiment with a language sample covering a broad spectrum of language properties.
Regarding C2, for computational tractability, we only select languages with readily available BERT models. Unlike prior work, which typically lacks either language (Rönnqvist et al., 2019;Zhang et al., 2020) or task diversity (Wu and Dredze, 2020;Vulić et al., 2020), we ensure that our experimental framework takes both into account, thus also satisfying C3. We achieve task diversity and generalizability by selecting a combination of tasks driven by lower-level syntactic and higher-level semantic features (Lauscher et al., 2020).
Finally, we select a set of 9 languages from 8 language families, as listed in Table 1. We evaluate mBERT and monolingual BERT models on five downstream NLP tasks: named entity recognition (NER), sentiment analysis (SA), question answering (QA), universal dependency parsing (UDP), and part-of-speech tagging (POS). For each model, we report the results of the initialization that achieved the highest score on the development set.
Evaluation Measures. We report F1 scores for NER, accuracy scores for SA and POS, unlabeled and labeled attachment scores (UAS and LAS) for UDP, and exact match (EM) and F1 scores for QA.
Hyper-Parameters and Technical Details. We use AdamW (Loshchilov and Hutter, 2019) in all experiments, with a learning rate of 3e-5. 11 We train for 10 epochs with early stopping (Prechelt, 1998). 12 We train with batch size 32 and a maximum sequence length of 256 for all tasks except QA. In QA, the batch size is 24, the maximum sequence length is 384, the maximum query length is 64, and the document stride is set to 128.

11 Preliminary experiments indicated this to be a well-performing learning rate. Due to the large volume of our experiments, we were unable to tune all hyper-parameters for each setting. We found that a higher learning rate of 5e-4 works best for adapter-based fine-tuning (see later), since the task adapter parameters are learned from scratch (i.e., they are randomly initialized).
12 We evaluate a model every 500 gradient steps on the development set, saving the best-performing model based on the respective evaluation measures. We terminate training if no performance gains are observed within five consecutive evaluation runs (= 2,500 steps). For QA and UDP, we use the F1 scores and LAS, respectively. For FI and ID QA, we train for 20 epochs due to slower convergence.
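The early-stopping rule described above (evaluate every 500 steps, stop after five consecutive non-improving evaluations) can be expressed as a small helper. This is a sketch of the described schedule, not the actual training code:

```python
class EarlyStopper:
    """Stop training when the dev metric has not improved for
    `patience` consecutive evaluations (here: 5 evaluations at
    500-step intervals = 2,500 steps)."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_evals = 0

    def update(self, dev_score):
        """Record one dev-set evaluation; return True to stop training."""
        if dev_score > self.best:
            self.best = dev_score      # a new best checkpoint would be saved here
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

In a training loop, `update` would be called once per evaluation run, with the task-appropriate measure (F1 for QA, LAS for UDP, and so on) passed as `dev_score`.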

Initial Results
We report our first set of results in Table 2. 13 We find that a performance gap between monolingual models and mBERT does exist to a large extent, confirming anecdotal evidence from prior work. However, we also notice that the score differences largely depend on the language and task at hand. The largest performance gains of monolingual models over mBERT are found for FI, TR, KO, and AR. In contrast, mBERT outperforms the IndoBERT (ID) model in all tasks except SA, and performs competitively with the JA and ZH monolingual models on most datasets. In general, the gap is particularly narrow for POS tagging, where all models tend to score high (in most cases north of 95% accuracy). ID aside, we also see a clear trend for UDP, with monolingual models outperforming fully fine-tuned mBERT models, most notably for FI and TR. In what follows, we seek to understand the causes of this behavior in relation to different factors such as tokenizers, corpus sizes, and the languages and tasks in consideration.

Pretraining Corpus Size
The size of the pretraining corpora plays an important role in the performance of transformers (Liu et al., 2019; Conneau et al., 2020; Zhang et al., 2020, inter alia). Therefore, we compare how much data each monolingual model was trained on with the amount of data in the respective language that mBERT has seen during training. Given that mBERT was trained on entire Wikipedia dumps, we estimate the latter by the total number of words across all articles listed for each Wiki. 14 For the monolingual LMs, we extract information on pretraining data from the model documentation. If no exact numbers are explicitly stated and the pretraining corpora are unavailable, we make estimates based on the information provided by the authors. 15 The statistics are provided in Figure 1a. For EN, JA, RU, and ZH, both the respective monolingual BERT and mBERT were trained on similar amounts of monolingual data. On the other hand, the monolingual BERTs for AR, ID, FI, KO, and TR were trained on roughly twice (KO) up to more than 40 times (TR) as much data in their language as mBERT.
13 See Appendix Table 8 for the results on the development sets.
14 Based on the numbers from https://meta.m.wikimedia.org/wiki/List_of_Wikipedias
15 We provide further details in Appendix A.2.

Tokenizer
Compared to monolingual models, mBERT is substantially more limited in terms of the parameter budget it can allocate to each of its 104 languages in its vocabulary. In addition, monolingual tokenizers are typically trained by native-speaking experts who are aware of relevant linguistic phenomena exhibited by their target language. We thus inspect how this affects the tokenizations of monolingual data produced by our sample of monolingual models and mBERT. We tokenize examples from Universal Dependencies v2.6 treebanks and compute two metrics (Ács, 2019). First, the subword fertility measures the average number of subwords produced per tokenized word. A minimum fertility of 1 means that the tokenizer's vocabulary contains every single word in the text. We plot the fertility scores in Figure 1b. We find that mBERT has fertility values similar to its monolingual counterparts for EN, ID, JA, and ZH. In contrast, mBERT has a much higher fertility for AR, FI, KO, RU, and TR, indicating that these languages are over-segmented. mBERT's fertility is lowest for EN; this is because mBERT has seen the most data in this language during training, and because English is morphologically poor in contrast to languages such as AR, FI, RU, or TR. The second metric is the proportion of words that the tokenizer splits into at least two subtokens (denoted by the continuation symbol ##). Whereas the fertility is concerned with how aggressively a tokenizer splits, this metric measures how often it splits words. Intuitively, low scores are preferable for both metrics, as they indicate that the tokenizer is well suited to the language. The plots in Figure 1c show similar trends as the fertility statistic. In addition to AR, FI, KO, RU, and TR, which already displayed differences in fertility, mBERT also produces a proportion of continued words more than twice as high as the monolingual model for ID.
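As a concrete illustration of the two metrics, the sketch below computes fertility and the proportion of continued words over a toy WordPiece-style vocabulary. The greedy longest-match segmentation and the tiny vocabulary are illustrative assumptions; the actual experiments apply the pretrained tokenizers to UD v2.6 treebanks:

```python
def wordpiece(word, vocab):
    """Minimal greedy longest-match-first WordPiece segmentation."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation symbol for non-initial pieces
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no matching piece: the whole word is unknown
    return pieces

def tokenizer_stats(words, vocab):
    """Return (fertility, proportion of continued words) for a word list."""
    n_subwords, n_continued = 0, 0
    for word in words:
        pieces = wordpiece(word, vocab)
        n_subwords += len(pieces)
        if len(pieces) >= 2:  # word is continued across >= 2 subtokens
            n_continued += 1
    return n_subwords / len(words), n_continued / len(words)
```

A vocabulary containing every word in the text yields a fertility of 1.0 and a continued-word proportion of 0.0, the ideal values for both metrics.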

New Pretrained Models
The differences in pretraining corpora and tokenizer statistics seem to align with the variations in downstream performance across languages. In particular, it appears that the performance gains of monolingual models over mBERT are larger for languages where the differences between the respective tokenizers and pretraining corpus sizes are also larger (AR, FI, KO, RU, TR vs. EN, JA, ZH). 19 This implies that both the data size and the tokenizer are among the main driving forces of downstream task performance. To disentangle the effects of these two factors, we pretrain new models for AR, FI, ID, KO, and TR (the languages that exhibited the largest discrepancies in tokenization and pretraining data size) on Wikipedia data. We train four model variants for each language. First, we train two new monolingual BERT models on the same data, one with the original monolingual tokenizer (MONOMODEL-MONOTOK) and one with the mBERT tokenizer (MONOMODEL-MBERTTOK). 20 Second, similar to Artetxe et al. (2020), we retrain the embedding layer of mBERT, once with the respective monolingual tokenizer (MBERTMODEL-MONOTOK) and once with the mBERT tokenizer (MBERTMODEL-MBERTTOK). We freeze the transformer and only retrain the embedding weights, thus largely preserving mBERT's multilinguality. The reason we retrain mBERT's embedding layer with its own tokenizer is to further eliminate confounding factors when comparing to the version of mBERT with monolingually retrained embeddings. By comparing models trained on the same amount of data, but with different tokenizers (MONOMODEL-MONOTOK vs. MONOMODEL-MBERTTOK, MBERTMODEL-MBERTTOK vs. MBERTMODEL-MONOTOK), we disentangle the effect of the dataset size from the tokenizer, both for monolingual and multilingual LM variants.

19 The only exception is ID, where the monolingual model has seen significantly more data and also scores lower on the tokenizer metrics, yet underperforms mBERT in most tasks. We suspect this exception arises because IndoBERT is uncased, whereas the remaining models are cased.
20 The only exception is ID; instead of relying on the uncased IndoBERT tokenizer by Wilie et al. (2020), we introduce a new cased tokenizer with an identical vocabulary size (30,521).
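The embedding-only retraining for the MBERTMODEL-* variants boils down to freezing every parameter outside the embedding layer. Below is a minimal, framework-agnostic sketch of that selection rule; the `Param` stand-in, the BERT-style parameter names, and the `"word_embeddings"` name filter are assumptions for illustration. With a real PyTorch model, one would iterate over `model.named_parameters()` instead:

```python
class Param:
    """Minimal stand-in for a framework parameter object."""
    def __init__(self):
        self.requires_grad = True

def freeze_all_but_embeddings(named_params, keyword="word_embeddings"):
    """Mark only embedding parameters as trainable, mimicking the
    MBERTMODEL-* setup: the transformer body keeps its multilingual
    weights, and only the (re-initialized) embedding layer is trained.

    named_params: iterable of (name, param) pairs, where `param` has a
    boolean `requires_grad` attribute (as in PyTorch).
    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in named_params:
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

Which parameter names count as "the embedding layer" depends on the model implementation; the keyword used here matches BERT-style naming and would need adjusting for other architectures.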
Pretraining Setup. We pretrain new BERT models for each language on its respective Wikipedia dump. 21 We apply two preprocessing steps to obtain clean data for pretraining. First, we use WikiExtractor (Attardi, 2015) to extract text passages from the raw dumps. Next, we follow Pyysalo et al. (2020) and utilize UDPipe (Straka et al., 2016) parsers pretrained on UD data to segment the extracted text passages into texts with document, sentence, and word boundaries.
Following Liu et al. (2019) and Wu and Dredze (2020), we only use the masked language modeling (MLM) objective and omit the next sentence prediction task. Beyond that, we largely follow the default pretraining procedure of Devlin et al. (2019). We pretrain the new monolingual LMs (MONOMODEL-*) from scratch for 1M steps. We enable whole word masking (Devlin et al., 2019) for the FI monolingual models, following the pretraining procedure for FinBERT (Virtanen et al., 2019). For the retrained mBERT models (MBERTMODEL-*), we train for 250,000 steps, following Artetxe et al. (2020). We freeze all parameters outside the embedding layer.

Results. We perform the same evaluations on downstream tasks for our new models as described in §3, and report the results in Table 3. 25 The results indicate that the models trained with dedicated monolingual tokenizers outperform their counterparts with multilingual tokenizers in most tasks, with particular consistency for QA, UDP, and SA. In NER, the models trained with multilingual tokenizers score competitively or higher than the monolingual ones in half of the cases. Overall, the performance gap is smallest for POS tagging (at most 0.4% accuracy). We observe the largest gaps for QA (6.1 EM / 4.4 F1 in ID), SA (2.2% accuracy in TR), and NER (1.7 F1 in AR). Although the only language in which the monolingual counterpart always comes out on top is KO, the multilingual counterpart comes out on top at most 3/10 times (for AR and TR) in the other languages. The largest decrease in performance of a monolingual tokenizer relative to its multilingual counterpart is found for SA in TR (0.8% accuracy).

25 Full results including development set scores are available in Table 9 of the Appendix.

Table 3: Performance of our new MONOMODEL-* and MBERTMODEL-* models (see §A.5) fine-tuned for the NER, SA, QA, UDP, and POS tasks (see §3.1), compared to the monolingual models from prior work and fully fine-tuned mBERT. We group model counterparts w.r.t. tokenizer choice to facilitate a direct comparison between respective counterparts. We use development sets only for QA. Bold denotes the best score across all models for a given language and task. Underlined denotes the best score compared to its respective counterpart.
Overall, we find that for 38 out of 48 task, model, and language combinations, the monolingual tokenizer outperforms the mBERT counterpart. We were able to improve the monolingual performance of the original mBERT for 20 out of 24 languages and tasks by only replacing the tokenizer and, thus, leveraging a specialized monolingual version. Similar to how the chosen method of tokenization affects neural machine translation quality (Domingo et al., 2019), these results establish that, in fact, the designated pretrained tokenizer plays a fundamental role in the monolingual downstream task performance of contemporary LMs.
In 18/24 language and task settings, the monolingual model from prior work (trained on more data) outperforms its corresponding MONOMODEL-MONOTOK model. 4/6 of the settings in which our MONOMODEL-MONOTOK model performs better are found for ID, where IndoBERT uses an uncased tokenizer and our model uses a cased one, which may affect the comparison. As expected, these results strongly indicate that data size plays a major role in downstream performance, corroborating prior research findings (Liu et al., 2019; Conneau et al., 2020; Zhang et al., 2020, inter alia).

Adapter-Based Training
Another way to provide more language-specific capacity to a multilingual LM beyond a dedicated tokenizer, thereby potentially making gains in monolingual downstream performance, is to introduce adapters (Pfeiffer et al., 2020b,c; Üstün et al., 2020): a small number of additional parameters at every layer of a pretrained model. To train adapters, all pretrained weights are usually frozen, while only the adapter weights are fine-tuned. Adapter-based approaches thus offer increased efficiency and modularity; it is therefore crucial to verify to what extent our findings carry over to this more efficient and more versatile fine-tuning setup.
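A bottleneck adapter of the kind referenced above consists of a down-projection, a nonlinearity, an up-projection, and a residual connection. The plain-Python sketch below is illustrative only; the dimensions and weights are made up, and the zero-initialized up-projection (under which the adapter starts as an identity function, so inserting it does not perturb the pretrained model) reflects a common initialization choice rather than any specific library:

```python
def adapter_forward(h, W_down, W_up):
    """Bottleneck adapter: down-project, ReLU, up-project, add residual.

    h: hidden vector (list of floats, length d).
    W_down: bottleneck x d matrix (rows are output neurons).
    W_up: d x bottleneck matrix projecting back to the hidden size.
    """
    def matvec(W, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in W]

    z = [max(0.0, v) for v in matvec(W_down, h)]  # ReLU bottleneck
    out = matvec(W_up, z)                         # up-projection
    return [hi + oi for hi, oi in zip(h, out)]    # residual connection
```

Because the adapter output is added to the residual stream, the frozen transformer's computation is recovered exactly whenever the up-projection is zero, which is what makes adapter insertion a safe, modular extension of a pretrained model.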
We evaluate the impact of different adapter components on the downstream task performance and their complementarity with monolingual tokenizers in Table 4. 27 Here, +A_Task and +A_Lang denote adding task and language adapters, respectively, whereas +MONOTOK additionally includes a new embedding layer. As mentioned, we only fine-tune adapter weights on the downstream task, leveraging the adapter architecture proposed by Pfeiffer et al. (2021). For the +A_Task + A_Lang setting, we leverage pretrained language adapter weights available at AdapterHub.ml (Pfeiffer et al., 2020a). Language adapters are added to the model and frozen, while only task adapters are trained on the target task. For the +A_Task + A_Lang + MONOTOK setting, we train language adapters and new embeddings with the corresponding monolingual tokenizer exactly as described in the previous section (cf. MBERTMODEL-MONOTOK); task adapters are trained with a learning rate of 5e-4 for 30 epochs with early stopping.

Results. Similar to previous findings, adapters improve upon mBERT in 18/24 language and task settings, 13 of which can be attributed to retraining the embedding layer with the monolingual tokenizer (MBERTMODEL-MONOTOK). Figure 2 illustrates the average performance of the different adapter components in comparison to the monolingual models. We find that adapters with dedicated tokenizers reduce the performance gap considerably without leveraging more training data, and even outperform the monolingual models in QA. This finding shows that adding additional language-specific capacity to existing multilingual LMs, which can be achieved with adapters in a portable and efficient way, is a viable alternative to monolingual pretraining.

27 See Appendix Table 10 for the results on the development sets.

Figure 2: Task performance averaged over all languages for different models: fully fine-tuned monolingual (Mono), fully fine-tuned mBERT (mBERT), mBERT with task adapter (+A_Task), with task and language adapter (+A_Task + A_Lang), and with task and language adapter and embedding-layer retraining (+A_Task + A_Lang + MONOTOK).

Further Analysis
At first glance, our results in Table 2 seem to confirm the prevailing view that monolingual models are more effective than multilingual models (Rönnqvist et al., 2019; Antoun et al., 2020; de Vries et al., 2019, inter alia). However, the broad scope of our experiments reveals nuances that previous work had not uncovered. Unlike prior work, which primarily attributes performance gaps to mBERT being under-trained (Rönnqvist et al., 2019; Wu and Dredze, 2020), our disentangled results (Table 3) suggest that a large portion of the existing performance gaps can be attributed to the capability of the tokenizer.
Monolingual tokenizers with lower fertility and proportion-of-continued-words values than the mBERT tokenizer (as for AR, FI, ID, KO, TR) yield consistent gains, irrespective of whether the underlying LM is monolingual (the MONOMODEL-* comparison) or multilingual (the comparison of MBERTMODEL-* variants).
Whenever the differences between monolingual models and mBERT with respect to the tokenizer properties and the pretraining corpus size are small (e.g., for EN, JA, and ZH), the performance gap is typically negligible. In QA, we even find mBERT to be favorable for these languages. Therefore, we conclude that monolingual models are not superior to multilingual ones per se, but gain an advantage in direct comparisons by incorporating more pretraining data and using language-adapted tokenizers.

Figure 3: Spearman's ρ correlation of a relative decrease in the proportion of continued words (Cont. Proportion), a relative decrease in fertility, and a relative increase in pretraining corpus size with a relative increase in downstream performance over fully fine-tuned mBERT. For the proportion of continued words and the fertility, we consider fully fine-tuned mBERT, the MONOMODEL-* models, and the MBERTMODEL-* models. For the pretraining corpus size, we consider the original monolingual models and the MONOMODEL-MONOTOK models. We exclude the ID models (see Appendix B.2 for clarification).
Correlation Analysis. To uncover additional patterns in our results (Tables 2, 3, 4), we perform a statistical analysis assessing the correlation between the individual factors (pretraining data size, subword fertility, proportion of continued words) and the downstream performance. Although our framework may not provide enough data points to be statistically representative, we argue that the correlation coefficient can still provide reasonable indications and reveal relations not immediately evident from the tables. Figure 3 shows that both a decrease in the proportion of continued words and a decrease in fertility correlate with an increase in downstream performance relative to fully fine-tuned mBERT across all tasks. The correlation is stronger for UDP and QA, where we find models with monolingual tokenizers to consistently outperform their counterparts with the mBERT tokenizer. The correlation is weaker for NER and POS tagging, which is also expected, considering the inconsistency of those results. 28 Overall, we find that the fertility and the proportion of continued words have an effect on monolingual downstream performance similar to that of the pretraining corpus size. This indicates that the tokenizer's ability to represent a language plays a crucial role; consequently, choosing a suboptimal tokenizer typically results in deteriorated downstream performance.

28 For further information, see Appendix B.2.
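For reference, Spearman's ρ as used in this analysis can be computed as follows. This sketch uses the rank-difference formula without tie correction, which is an assumption about the data (distinct values), not necessarily the exact implementation behind Figure 3:

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation via the rank-difference formula.

    Assumes distinct values in each list (no tie correction).
    Returns a value in [-1, 1]; 1 means perfectly monotone increasing.
    """
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Because the statistic operates on ranks rather than raw values, it captures monotone relationships (e.g., "lower fertility goes with higher task scores") without assuming linearity, which suits the small, heterogeneous set of data points here.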

Conclusion
We have conducted the first comprehensive empirical investigation of the monolingual performance of monolingual and multilingual language models (LMs). While our results support the existence of a performance gap in most, but not all, languages and tasks, further analyses revealed that the gaps are often substantially smaller than previously assumed. Where gaps exist, they stem from discrepancies in 1) pretraining data size and 2) the chosen tokenizers and the degree of their adaptation to the target language.
Further, we have disentangled the impact of pretraining corpus size from the influence of the tokenizer on downstream task performance. We have trained new monolingual LMs on the same data, but with two different tokenizers: one being the dedicated tokenizer of the monolingual LM provided by native speakers, the other being the automatically generated multilingual mBERT tokenizer. We have found that for (almost) every task and language, the monolingual tokenizer outperforms the mBERT tokenizer.
Consequently, in line with recent work by Chung et al. (2020), our results suggest that investing more effort into 1) improving the balance of individual languages' representations in the vocabulary of multilingual LMs, and 2) providing language-specific adaptations and extensions of multilingual tokenizers (Pfeiffer et al., 2020c), can reduce the gap between monolingual and multilingual LMs. Another promising direction for future research is disposing of tokenizers (language-specific or multilingual) entirely during pretraining (Clark et al., 2021).

A.1 Pretrained Models
All of the pretrained language models we use are available on the HuggingFace model hub 29 and compatible with the HuggingFace transformers Python library (Wolf et al., 2020). Table 5 provides an overview of these models.

A.4 Fine-Tuning Datasets
We list the datasets we use, including the number of examples per dataset split, in Table 7.

A.5 Training Procedure of New Models
We pretrain our models on single Nvidia Tesla V100, A100, and Titan RTX GPUs with 32GB, 40GB, and 24GB of video memory, respectively. To support larger batch sizes, we train in mixed-precision (fp16) mode. Following Wu and Dredze (2020), we use only masked language modeling (MLM) as the pretraining objective and omit the next sentence prediction task, as Liu et al. (2019) find that it does not yield performance gains. We otherwise mostly follow the default pretraining procedure of Devlin et al. (2019). We pretrain the new monolingual models (MONOMODEL-*) from scratch for 1M steps with batch size 64. We choose a sequence length of 128 for the first 900,000 steps and 512 for the remaining 100,000 steps. In both phases, we warm up the learning rate to 1e-4 over the first 10,000 steps and then decay it linearly. We use the Adam optimizer with weight decay (AdamW; Loshchilov and Hutter, 2019) with default hyper-parameters and a weight decay of 0.01. We enable whole-word masking (Devlin et al., 2019) for the FI monolingual models, following the pretraining procedure for FinBERT (Virtanen et al., 2019). To lower the computational requirements for the monolingual models with mBERT tokenizers, we remove all tokens from mBERT's vocabulary that do not appear in the pretraining data. We thereby obtain vocabularies of size 78,193 (AR), 60,827 (FI), 72,787 (ID), 66,268 (KO), and 71,007 (TR), which for all languages significantly reduces the number of parameters in the embedding layer compared to mBERT's 119,547 word piece vocabulary.
29 https://huggingface.co/models
30 https://meta.m.wikimedia.org/wiki/List_of_Wikipedias
31 We obtained the numbers for ID and TR on Dec 10, 2020 and for the remaining languages on Sep 10, 2020.
32 For JA, RU, and ZH, the authors do not provide exact word counts. Therefore, we estimate them using other provided information (RU, ZH) or scripts for training corpus reconstruction (JA).
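The vocabulary-reduction step can be sketched as follows (our own illustration, not the released code; the toy vocabulary and token stream stand in for mBERT's WordPiece vocabulary and the tokenized pretraining corpus):

```python
# Illustrative sketch: trim a (toy) WordPiece vocabulary to the entries
# that actually occur in the pretraining data, so the corresponding
# embedding rows can be dropped. Tokens below are hypothetical examples.

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def trim_vocabulary(vocab, corpus_token_stream):
    """Keep special tokens plus every vocabulary entry seen in the corpus."""
    seen = set(corpus_token_stream)
    kept = [t for t in vocab if t in seen or t in SPECIAL_TOKENS]
    # Assign new contiguous ids; unused embedding rows are discarded.
    return {tok: i for i, tok in enumerate(kept)}

# Toy multilingual vocabulary and a Finnish-only pretraining token stream
vocab = SPECIAL_TOKENS + ["hyvä", "##ä", "merhaba", "##ba", "bonjour"]
corpus_tokens = ["hyvä", "##ä", "hyvä"]
new_vocab = trim_vocabulary(vocab, corpus_tokens)
print(len(new_vocab))  # 5 special tokens + 2 observed tokens = 7
```

Shrinking the vocabulary in this way reduces only the embedding matrix; all other model parameters keep their original shapes.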
For the retrained mBERT models (i.e., MBERTMODEL-*), we run MLM for 250,000 steps (similar to Artetxe et al. (2020)) with batch size 64 and sequence length 512, otherwise using the same hyper-parameters as for the monolingual models. In order to retrain the embedding layer, we first resize it to match the vocabulary of the respective tokenizer. For the MBERTMODEL-MBERTTOK models, we use the mBERT tokenizers with reduced vocabulary as outlined above. We initialize the positional embeddings, segment embeddings, and embeddings of special tokens ([CLS], [SEP], [PAD], [UNK], [MASK]) from mBERT, and reinitialize the remaining embeddings randomly. We freeze all parameters outside the embedding layer. For all pretraining runs, we set the random seed to 42.
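The embedding re-initialization described above can be sketched as follows (our own illustration with toy shapes and vocabularies, not the authors' code): special-token rows are copied from mBERT, and all remaining rows are drawn randomly.

```python
import random

# Illustrative sketch of the embedding re-initialization: special-token
# rows are copied from (toy) mBERT embeddings, all other rows are
# randomly initialized. Dimensions and vocabularies are stand-ins.

def init_embeddings(new_vocab, mbert_vocab, mbert_emb, dim, seed=42):
    random.seed(seed)
    special = {"[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"}
    emb = []
    for tok in new_vocab:
        if tok in special and tok in mbert_vocab:
            emb.append(list(mbert_emb[mbert_vocab[tok]]))  # copy from mBERT
        else:
            emb.append([random.gauss(0.0, 0.02) for _ in range(dim)])
    return emb

# Toy mBERT embeddings with dim=2
mbert_vocab = {"[CLS]": 0, "[SEP]": 1, "hello": 2}
mbert_emb = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
new_vocab = ["[CLS]", "[SEP]", "hyvä"]
emb = init_embeddings(new_vocab, mbert_vocab, mbert_emb, dim=2)
print(emb[0], emb[1])  # [1.0, 1.0] [2.0, 2.0] copied; emb[2] is random
```

In an actual transformers setup one would instead resize the model's embedding layer (e.g., via `resize_token_embeddings`) and set `requires_grad = False` on all parameters outside it; the sketch only shows the initialization logic.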

A.6 Code
Our code, with usage instructions for fine-tuning, pretraining, data preprocessing, and calculating the tokenizer statistics, is available at https://github.com/Adapter-Hub/hgiyt. The repository also contains further links to a collection of our new pretrained models and language adapters.

B.1 Tokenization Analysis
In our tokenization analysis in §4.2 of the main text, we only include the fertility and the proportion of continued words as they are sufficient to illustrate and quantify the differences between tokenizers. In support of the findings in §4.2 and for completeness, we provide additional tokenization statistics here.
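To make the two metrics concrete, the following sketch (our own illustration; the segmentations are toy examples) computes fertility, i.e., the average number of subwords per word, and the proportion of continued words, i.e., words split into two or more pieces, from WordPiece-style output where continuation pieces carry a "##" prefix:

```python
# Illustrative sketch: subword fertility and proportion of continued
# words from WordPiece-style output ("##" marks continuation pieces).
# The tokenizations below are toy examples.

def tokenizer_stats(tokenized_words):
    """tokenized_words: one list of subword pieces per whitespace word."""
    total_pieces = sum(len(pieces) for pieces in tokenized_words)
    n_words = len(tokenized_words)
    fertility = total_pieces / n_words
    continued = sum(1 for p in tokenized_words if len(p) > 1) / n_words
    return fertility, continued

words = [
    ["token", "##izer"],     # split into two pieces -> continued word
    ["quality"],             # kept whole
    ["multi", "##lingual"],  # continued word
    ["model"],               # kept whole
]
fert, cont = tokenizer_stats(words)
print(fert, cont)  # 1.5 0.5
```

A fertility of 1.0 and a continued-word proportion of 0.0 would mean the tokenizer keeps every word intact; higher values indicate more aggressive segmentation.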
For each tokenizer, Table 5 lists the respective vocabulary size and the proportion of its vocabulary also contained in mBERT. It shows that the tokenizers scoring lower in fertility (and accordingly performing better) than mBERT are often not adequately covered by mBERT's vocabulary. For instance, only 5.6% of the AraBERT (AR) vocabulary is covered by mBERT. Figure 4 compares the proportion of unknown tokens ([UNK]) in the tokenized data. It shows that the proportion is generally extremely low, i.e., the tokenizers can typically split unknown words into known subwords.
Similar to the work by Ács (2019), Figure 5 compares the tokenizations produced by the monolingual models and mBERT with the reference tokenizations provided by the human dataset annotators with respect to their sentence lengths. We find that the tokenizers scoring low in fertility and the proportion of continued words typically exhibit sentence length distributions much closer to the reference tokenizations by human UD annotators, indicating that they are more capable than the mBERT tokenizer. Likewise, the monolingual models' and mBERT's sentence length distributions are closer for languages with similar fertility and proportion of continued words, such as EN, JA, and ZH.

B.2 Correlation Analysis
To uncover some of the hidden patterns in our results (Tables 2, 3, 4), we perform a statistical analysis assessing the correlation between the individual factors (pretraining data size, subword fertility, proportion of continued words) and the downstream performance. Figure 6b shows that both a decrease in the proportion of continued words and a decrease in fertility correlate with an increase in downstream performance relative to fully fine-tuned mBERT across all tasks. The correlation is stronger for UDP and QA, where we found models with monolingual tokenizers to consistently outperform their counterparts with the mBERT tokenizer. The correlation is weaker for NER and POS tagging, which is expected given the inconsistency of the results.
Somewhat surprisingly, the tokenizer metrics seem to be more indicative of high downstream performance than the size of the pretraining corpus. We believe that this is in part due to the overall poor performance of the uncased IndoBERT model, which we (in this case unfairly) compare to our cased ID-MONOMODEL-MONOTOK model. Therefore, we plot the same correlation matrix excluding ID in Figure 3.
Compared to Figure 6b, the overall correlations for the proportion of continued words and the fertility remain mostly unaffected. In contrast, the correlation for the pretraining corpus size becomes much stronger, confirming that the subpar performance of IndoBERT is indeed an outlier in this scenario. Leaving out Indonesian also strengthens the indication that POS tagging performance correlates more with the data size than with the tokenizer, although we argue that this indication may be misleading: the performance gap in POS tagging is generally very small, so the Spearman correlation coefficient, which takes only the ranks into account and not the absolute score differences, is particularly sensitive to changes in POS tagging performance.
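This rank-only sensitivity is easy to demonstrate (our own illustration with invented scores): swapping two nearly tied POS scores moves Spearman's ρ exactly as much as swapping scores that differ by a large margin would.

```python
# Illustrative sketch: Spearman's rho depends only on ranks, so tiny
# score differences (as in POS tagging) shift it as much as large ones.
# All numbers are hypothetical.

def spearman_no_ties(x, y):
    """rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)); assumes no tied values."""
    def rank(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for pos, i in enumerate(order):
            r[i] = pos + 1
        return r
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

data_size = [1.0, 2.0, 3.0, 4.0]
pos_tiny_gaps = [96.01, 96.02, 96.03, 96.04]  # near-identical scores
pos_flipped = [96.02, 96.01, 96.04, 96.03]    # 0.01-point swaps flip ranks

print(spearman_no_ties(data_size, pos_tiny_gaps))  # 1.0
print(spearman_no_ties(data_size, pos_flipped))    # 0.6
```

A 0.01-point reshuffling of otherwise near-identical scores drops ρ from 1.0 to 0.6, even though the absolute performance differences are negligible.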
Finally, we plot the correlation between the three metrics and the downstream performance under consideration of all languages and models, including the adapter-based fine-tuning settings, to gain an understanding of how pronounced their effects are in a more "noisy" setting.
As Figure 6a shows, the three factors still correlate with the downstream performance in a similar manner even when not isolated. This tells us that even though other factors may also have an influence, these three factors remain highly indicative of the downstream performance.
We also see that the correlation coefficients for the proportion of continued words and the fertility are nearly identical, which is expected based on the visual similarity of the respective plots (seen in Figures 1b and 1c).

C Full Results
For compactness, we have only reported the performance of our models on the respective test datasets in the main text. For completeness, we also include the full tables, including development (dev) dataset performance averaged over three random initializations, as described in §3. Table 8 shows the full results corresponding to Table 2 (initial results), Table 9 shows the full results corresponding to Table 3 (results for our new models), and Table 10 shows the full results corresponding to Table 4.

Figure 6: Spearman's ρ correlation of a relative decrease in the proportion of continued words (Cont. Proportion), a relative decrease in fertility, and a relative increase in pretraining corpus size with a relative increase in downstream performance over fully fine-tuned mBERT. (b) For the proportion of continued words and the fertility, we consider fully fine-tuned mBERT, the MONOMODEL-* models, and the MBERTMODEL-* models. For the pretraining corpus size, we consider the original monolingual models and the MONOMODEL-MONOTOK models.