Larger-Scale Transformers for Multilingual Masked Language Modeling

Recent work has demonstrated the effectiveness of cross-lingual language model pretraining for cross-lingual understanding. In this study, we present the results of two larger multilingual masked language models, with 3.5B and 10.7B parameters. Our two new models, dubbed XLM-R XL and XLM-R XXL, outperform XLM-R by 1.8% and 2.4% average accuracy on XNLI. Our model also outperforms the RoBERTa-Large model on several English tasks of the GLUE benchmark by 0.3% on average while handling 99 more languages. This suggests that larger-capacity models for language understanding may obtain strong performance on high-resource languages while greatly improving low-resource ones. We make our code and models publicly available.


Introduction
The goal of this paper is to present a study of the impact of larger-capacity models on cross-lingual language understanding (XLU). We scale the capacity of XLM-R by almost two orders of magnitude while training on the same CC100 dataset. Our two new multilingual masked language models, dubbed XLM-R XL and XLM-R XXL, with 3.5 and 10.7 billion parameters respectively, significantly outperform the previous XLM-R model on cross-lingual understanding benchmarks and obtain competitive performance with the multilingual T5 models (Raffel et al., 2019; Xue et al., 2020). We show that they can even outperform RoBERTa-Large on the GLUE benchmark (Wang et al., 2018).
Recent multilingual masked language models (MLM) like mBERT (Devlin et al., 2018) or XLM (Lample and Conneau, 2019) improved cross-lingual language understanding by pretraining large Transformer models (Vaswani et al., 2017) on multiple languages at once. The XLM-R model extended that approach by scaling the amount of data by two orders of magnitude, from Wikipedia to CommonCrawl, and by training longer, similar to RoBERTa. These models are particularly effective for low-resource languages, where both labeled and unlabeled data are scarce. They enable supervised cross-lingual transfer, where labeled data in one language can be used to solve the same task in other languages, and unsupervised cross-lingual transfer, where low-resource language self-supervised representations are improved using additional unlabeled data from higher-resource languages. Furthermore, they reduce the need for training one model per language and allow the use of a single, potentially much larger, pretrained model that is then fine-tuned on annotated data from many languages.
The better performance of self-supervised cross-lingual models on low-resource languages, however, comes at the cost of lower performance on higher-resource languages (Arivazhagan et al., 2019). When the number of languages becomes large, prior work even observed an overall decrease of performance across all languages. It was hypothesized that, given more capacity, multilingual models could showcase strong performance on both high-resource and low-resource languages. With only 550M parameters, the XLM-R model is now relatively small compared to new standards. Recent work scaled language models to hundreds of billions (Brown et al., 2020) or even multiple trillions of parameters (Fedus et al., 2021), showing consistent gains in doing so. Recently, multilingual T5 showed an impressive increase in performance by scaling the model capacity to tens of billions of parameters. Our study complements these findings by showing the impact of larger-capacity models on the important pretraining task of multilingual masked language modeling. We show promising results for cross-lingual understanding: XLM-R XXL can both obtain a new state of the art on some cross-lingual understanding benchmarks and outperform the RoBERTa-Large model on the English GLUE benchmark (Wang et al., 2018). This suggests that very large-scale multilingual models may be able to benefit from the best of both worlds: obtaining strong performance on high-resource languages while still allowing for zero-shot transfer and low-resource language understanding. We make the following contributions:
• We scale XLM capacity by two orders of magnitude, and publicly release XLM-R XL and XLM-R XXL with 3.5B and 10.7B parameters.
• We show that those two models obtain very strong performance on cross-lingual benchmarks while outperforming RoBERTa Large on the GLUE benchmark.

Pretraining and evaluation
In this section, we describe the model we use and how we scale it, as well as the data and tasks we use for pretraining and evaluation.

Multilingual masked language models
We use a Transformer model (Vaswani et al., 2017) trained with the multilingual MLM objective (Devlin et al., 2018; Lample and Conneau, 2019) using only monolingual data. We sample streams of text from each language and train the model to predict the masked tokens in the input. We use the same learning procedure as XLM-R. We apply subword tokenization directly on raw text data using SentencePiece (Kudo and Richardson, 2018) with a unigram language model (Kudo, 2018), just like XLM-R. We sample batches from different languages using the same sampling distribution as XLM-R, with α = 0.3, and without language embeddings. We use a large vocabulary size of 250K with a full softmax and train two different models: XLM-R XL (L = 36, H = 2560, A = 32, 3.5B params) and XLM-R XXL (L = 48, H = 4096, A = 32, 10.7B params). We pretrain the models on the CC100 dataset, which corresponds to 167B tokens in 100 languages. We compare our approach to previous results as well as the mT5 baselines, which were pretrained on the larger mC4 corpus of 6.4T tokens.
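The language sampling step above can be sketched as follows; this is a minimal sketch, and the function name and token counts are illustrative rather than taken from the paper:

```python
def language_sampling_probs(token_counts, alpha=0.3):
    # Exponentially smoothed sampling: p_i ∝ q_i ** alpha,
    # where q_i is language i's share of the pretraining corpus.
    total = sum(token_counts)
    q = [c / total for c in token_counts]
    weights = [qi ** alpha for qi in q]
    z = sum(weights)
    return [w / z for w in weights]

# With alpha = 0.3 < 1, low-resource languages are up-sampled:
# a language holding 1% of the tokens receives far more than 1% of the batches.
probs = language_sampling_probs([99_000_000, 1_000_000], alpha=0.3)
```

Smaller values of α flatten the distribution further, trading high-resource coverage for more exposure to low-resource languages.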

Evaluation
We consider three evaluation benchmarks. For cross-lingual understanding, we use cross-lingual natural language inference and question answering, and use the GLUE benchmark to evaluate the English performance.
Cross-lingual Natural Language Inference.
The XNLI dataset (Conneau et al., 2018) comes with ground-truth dev and test sets in 15 languages, and a ground-truth English training set. The training set has been machine-translated to the remaining 14 languages, providing synthetic training data for these languages as well. We evaluate our model on cross-lingual transfer from English to other languages. We also consider two machine translation baselines: (i) translate-test: dev and test sets are machine-translated to English and a single English model is used; and (ii) translate-train-all: the English training set is machine-translated to each language and we fine-tune a multilingual model on all training sets. For translations, we use the original XNLI data for consistency.
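The three evaluation settings can be sketched as follows; `fine_tune`, `evaluate`, and `translate` are hypothetical callables standing in for the actual fine-tuning and machine-translation pipelines, which the paper does not specify at this level:

```python
def evaluate_settings(fine_tune, evaluate, translate, model, en_train, dev_sets):
    # dev_sets: language code -> dev set. All three callables are
    # hypothetical stand-ins, not APIs from the paper.
    results = {}

    # Cross-lingual transfer: fine-tune on English only,
    # then evaluate zero-shot on every language.
    en_model = fine_tune(model, [en_train])
    results["cross-lingual-transfer"] = {
        lang: evaluate(en_model, dev) for lang, dev in dev_sets.items()}

    # (i) translate-test: machine-translate each dev set to English
    # and evaluate the single English model on the translations.
    results["translate-test"] = {
        lang: evaluate(en_model, translate(dev, to="en"))
        for lang, dev in dev_sets.items()}

    # (ii) translate-train-all: machine-translate the English training set
    # into every language and fine-tune one multilingual model on all of it.
    all_train = [en_train] + [translate(en_train, to=lang) for lang in dev_sets]
    multi_model = fine_tune(model, all_train)
    results["translate-train-all"] = {
        lang: evaluate(multi_model, dev) for lang, dev in dev_sets.items()}
    return results
```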
Cross-lingual Question Answering. We use the MLQA (Lewis et al., 2019) and XQuAD (Artetxe et al., 2019) benchmarks, which extend the English SQuAD benchmark to more languages.
We report the F1 score as well as the exact match (EM) score for cross-lingual transfer from English.
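These metrics follow the standard SQuAD evaluation; a simplified sketch is given below (the official evaluation script additionally normalizes punctuation and articles before comparison, which this version omits):

```python
from collections import Counter

def exact_match(prediction, reference):
    # 1.0 iff the normalized answer strings are identical.
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    # Harmonic mean of token-level precision and recall between
    # the predicted answer span and the reference span.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For cross-lingual transfer, both scores are computed per language on the target-language test set and then averaged.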

Training details
We use model parallelism based on tensor parallelism (Shoeybi et al., 2019) to scale the models. XLM-R XL uses a model-parallel size of 2 and XLM-R XXL a size of 8. Compared to previous XLM-R models, we reduce the batch size and number of updates significantly to keep the compute of the new models similar (see Table 5). For both models, we use a batch size of 2048 and train for 500,000 updates. We use the pre-LayerNorm setting for both models, which was more stable during training. For all fine-tuning tasks, we use a batch size of 32 and train for 10 epochs. We do early stopping based on the average valid metric across all languages and report test results.

Table 1: Accuracy on each of the 15 XNLI languages and average accuracy; we specify the pretraining dataset and its size in number of tokens. We report results of XLM-R models with increasing capacity, from 270M (Base) and 550M (Large) to 3.5B (XL) and 10.7B (XXL) parameters.
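The checkpoint selection used during fine-tuning can be sketched as follows (a minimal sketch; the function name and scores are illustrative):

```python
def select_best_epoch(dev_scores_per_epoch):
    # dev_scores_per_epoch: one dict per fine-tuning epoch,
    # mapping language code -> dev metric (e.g. XNLI accuracy).
    # Returns the epoch whose average dev metric across languages is best,
    # i.e. the checkpoint kept by early stopping.
    averages = [sum(s.values()) / len(s) for s in dev_scores_per_epoch]
    return max(range(len(averages)), key=averages.__getitem__)

best = select_best_epoch([
    {"en": 0.89, "fr": 0.80, "sw": 0.70},  # epoch 0
    {"en": 0.90, "fr": 0.82, "sw": 0.73},  # epoch 1 (best average)
    {"en": 0.91, "fr": 0.81, "sw": 0.71},  # epoch 2
])
```

Averaging across languages, rather than selecting per language, yields a single checkpoint whose test results are then reported for all languages.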

Analysis and Results
In this section, we present our results and compare XLM-R XL and XLM-R XXL performance to other methods from previous work.
Cross-lingual understanding results. On XNLI, we observe in Table 1 that XLM-R XL and XLM-R XXL outperform XLM-R by 1.8% and 2.4% average accuracy, respectively (see also Table 3).
Comparison to monolingual English model. For smaller-capacity models like the Base and Large versions of XLM-R, it was shown that the more languages are considered, the lower the performance, in particular on high-resource languages. For instance, XLM-R Large was outperformed by RoBERTa Large by 1% accuracy on average on several downstream tasks from the GLUE benchmark, as illustrated in Table 2. With larger capacity, we now observe that XLM-R XXL is able to outperform RoBERTa Large by 0.3 dev points, going from 92.9 to 93.2 average accuracy, while handling 99 more languages.
While a RoBERTa XXL model may outperform XLM-R XXL, we believe it is interesting to note that, with more capacity, a multilingual model can get strong high-resource performance while not losing its cross-lingual transfer ability for lower-resource languages. Given the compute needed for training such large-scale models, the possibility of training a single very large model on hundreds of languages with state-of-the-art performance on high-resource languages is an encouraging and positive result.

Discussion and comparison to mT5. Both mT5 and XLM-R models obtain strong performance on cross-lingual understanding benchmarks, as well as high performance on English benchmarks (see the score of 91.6 of mT5 XXL on English XNLI). Many hyperparameters are however different between mT5 and XLM-R models, which makes an apples-to-apples comparison difficult. First, as shown in Table 5, the mT5 models are pretrained on the much larger mC4 dataset, which contains around 6.4T tokens, 38 times bigger than CC100 (167B tokens). While XLM-R Large was pretrained with more updates (6T tokens), the XLM-R XL and XLM-R XXL models have seen fewer tokens (0.5T) during pretraining than their mT5 counterparts, although they also use a bigger batch size (2048 versus 1024 for mT5). Another difference is the context sequence length of 512 for XLM-R and 1024 for mT5. The mT5-XXL model also has slightly more parameters (13B versus 10.7B). The larger number of updates combined with the larger dataset size may explain the larger improvement from the XL model to the XXL model in the case of mT5 (+3 average accuracy on XNLI), in which the additional capacity can exploit the large quantity of unlabeled mC4 data. We note however that mT5 XL is outperformed by XLM-R XL on XNLI by 0.6% on average, on XQuAD by 1.3%, and on MLQA by 0.9% when considering average EM score. In comparison, gains of XLM-R from the XL to the XXL architecture are only 0.6 on average.
Another explanation may be that generative models scale better than masked language models. The difference in the nature of the pretraining dataset is particularly striking when looking at the variance of performance across languages. For example, mT5 XXL outperforms XLM-R XXL by 8.4 points on Swahili in zero-shot XNLI, while only outperforming it by 1.4 points in average accuracy. These results may suggest that the CC100 dataset is getting saturated by current larger-capacity models.

Conclusion
In this study, we scaled the model capacity of the XLM-R model up to 10.7B parameters and obtained stronger performance than previous XLM-R models on cross-lingual understanding benchmarks. We also showed that the additional capacity allows a multilingual model to outperform the RoBERTa-Large baseline on English benchmarks. Our technical study thus suggests that larger-capacity multilingual models can obtain state-of-the-art cross-lingual understanding results while maintaining strong performance on high-resource languages.
Our work provides an alternative to mT5 models, with new state-of-the-art performance on some languages. We release our code and models publicly.