nmT5 - Is parallel data still relevant for pre-training massively multilingual language models?

Recently, mT5 - a massively multilingual version of T5 - leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation during pre-training is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled data regime.


Introduction
Recent works have shown that cross-lingual transfer learning in pre-trained multilingual models such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) can be improved further by using parallel data (Conneau and Lample, 2019; Hu et al., 2020a; Ouyang et al., 2020; Luo et al., 2020). In this paper, we continue this line of work, improving the recent mT5 model (Xue et al., 2020) by leveraging parallel corpora. We experiment with several text-to-text objectives that incorporate parallel data (spanning 198 language pairs) into mT5 pre-training. Our key findings are summarized below:
• In the regime of very small fine-tuning datasets, objectives with parallel data improve results significantly.
• The gain from using parallel data decreases as we scale up the size of the pre-trained model.
• Simple objectives based on neural machine translation (NMT) perform better than the traditionally employed "translation language modeling" (TLM) objective.

Method
We focus on the mT5-Large model, a 24-layer encoder-decoder transformer that has shown strong performance on a variety of cross-lingual benchmarks (Xue et al., 2020). Instead of training a new model from scratch, we start from the publicly available mT5-Large checkpoint - which has been trained for over 1 trillion tokens - and do a second stage of pre-training with a mix of monolingual and parallel data.
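As a rough illustration of this setup, the sketch below continues pre-training from the public mT5-Large checkpoint. It assumes the Hugging Face Transformers / PyTorch stack and the hub name "google/mt5-large"; the paper does not specify its training framework, and the optimizer, learning rate, and batch handling here are placeholders rather than the authors' configuration.

```python
# Minimal sketch of second-stage pre-training from the public mT5-Large
# checkpoint (assumed: Hugging Face Transformers + PyTorch; not the
# framework or hyperparameters used in the paper).
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-large")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder optimizer / LR

def train_step(input_texts, target_texts):
    """One update on a batch of (input, target) text pairs drawn from the
    mixed monolingual + parallel pre-training stream."""
    enc = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(target_texts, return_tensors="pt",
                       padding=True, truncation=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```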

Objectives
The mT5 series of models - the multilingual version of T5 (Raffel et al., 2020) - was pre-trained on a multilingual version of the C4 corpus with a masked language modeling "span-corruption" objective (Raffel et al., 2020), where the encoder is fed a chunk of text with random spans replaced by a mask token, and the decoder must reconstruct the masked-out tokens. One of their primary distinctions is the use of a unified "text-to-text" format for all text-based NLP problems.
In keeping with the text-to-text format, we experiment with the following objectives for incorporating parallel data into pre-training:
• TLM - A text-to-text version of translation language modeling, proposed by Conneau and Lample (2019) and subsequently used in several prior works for encoder-only pre-training. We trivially extend it to the encoder-decoder setting.
• NMT - Standard machine translation. The input is the source text and the target is its translation. A language code is prefixed to the input to inform the model of the target language (Johnson et al., 2017).

Figure 1: Example source and targets for different text-to-text style pre-training objectives incorporating parallel data. All objectives except TLM specify the target language in the source sentence.
• Denoised-NMT -Similar to NMT, but we additionally mask spans in the source sentence. The model must now learn to implicitly perform language modeling of the source language while translating into the target language.
• Denoised-NMT+LM -Similar to Denoised-NMT, but instead of implicit language modeling, the model must explicitly predict the source text in addition to the translation. The target is a concatenation of the translation and source sentence, while the input is the masked source sentence.
We refer to the model trained with the standard NMT objective as nmT5.
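To make these formats concrete, the sketch below builds (input, target) text pairs for each objective from one parallel sentence pair. It follows the descriptions above and Figure 1, but the sentinel tokens, masking rate, and language-code format are assumptions for illustration, not the exact preprocessing used in the paper.

```python
import random

def sentinel(i):
    return f"<extra_id_{i}>"  # T5-style sentinel token (assumed format)

def mask_spans(tokens, noise_density=0.15, mean_span_len=3, rng=random):
    """Replace random token spans with sentinels (T5-style span corruption).
    Returns (corrupted_tokens, target_tokens), where the target lists each
    sentinel followed by the tokens it replaced."""
    corrupted, target = [], []
    i, k = 0, 0
    while i < len(tokens):
        if rng.random() < noise_density / mean_span_len:
            span = max(1, round(rng.expovariate(1.0 / mean_span_len)))
            corrupted.append(sentinel(k))
            target.append(sentinel(k))
            target.extend(tokens[i:i + span])
            i += span
            k += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

def make_pair(objective, src, tgt, tgt_lang):
    """Build one (input_text, target_text) example from a parallel sentence pair."""
    if objective == "TLM":
        # Span-corrupt the concatenated sentence pair; no target-language code.
        corrupted, target = mask_spans((src + " " + tgt).split())
        return " ".join(corrupted), " ".join(target)
    prefix = f"<2{tgt_lang}> "  # target-language code (format assumed)
    if objective == "NMT":
        return prefix + src, tgt
    masked_src = " ".join(mask_spans(src.split())[0])
    if objective == "Denoised-NMT":
        # Masked source in, translation out; source LM is implicit.
        return prefix + masked_src, tgt
    if objective == "Denoised-NMT+LM":
        # Target is the translation followed by the reconstructed source.
        return prefix + masked_src, tgt + " " + src
    raise ValueError(f"unknown objective: {objective}")
```

For example, `make_pair("NMT", "Wie geht es dir?", "How are you?", "en")` would yield `("<2en> Wie geht es dir?", "How are you?")` under these assumptions.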

Experiment Setup
Pre-training datasets For pre-training, we use monolingual data from mC4 (Xue et al., 2020) and parallel data from OPUS-100. OPUS-100 is a dataset of 55M translations covering 100 languages (198 language pairs, either into or from English). The mC4 corpus consists of unlabeled web text covering 101 languages, of which 81 overlap with the OPUS-100 languages.
Fine-tuning datasets For downstream evaluation, we use the following four tasks:
• TyDi QA (Clark et al., 2020) - The GoldP subtask, which corresponds to extractive question answering. The input is a passage and a question, with the answer being a span from the passage.
• MTOP - Multilingual Task-Oriented Parsing. The task is one of structured prediction, where user queries must be parsed into a tree capturing the domain, intent, and slots.
• WikiAnn NER (Pan et al., 2019) - A named entity recognition task covering the 40 languages featured in the XTREME benchmark (Hu et al., 2020b). There are four categories of entities: location, person, organization, and miscellaneous.
• WikiLingua (Ladhak et al., 2020) -A recently introduced cross-lingual summarization dataset, where a document from an arbitrary language must be summarized in English.
Since the dataset does not come with training and evaluation splits, we randomly create validation and test sets of 1000 examples each and use the rest of the data for training.

Baselines Our first baseline is the publicly available mT5-Large model (1.3 billion parameters). For a fair comparison, we also experiment with an mT5 model further pre-trained for 100k steps with only monolingual data from mC4 (see row 2: mT5+MLM in Table 2). This lets us assess whether improvements stem from using parallel data or just from pre-training for longer.

Results
We report results in Table 2. Overall, adding parallel data through neural machine translation objectives improves scores on all four tasks, with the NMT objective performing best. Simply pre-training mT5 for longer with just monolingual data (MLM) already leads to improved scores on all tasks. The TLM objective is not able to effectively leverage the parallel data and performs on par with MLM. On the other hand, our three NMT-based objectives show gains over MLM across all tasks. Among these, NMT and Denoised-NMT are the best and perform similarly, while Denoised-NMT+LM fares slightly worse. Averaged across all tasks, NMT and Denoised-NMT outperform MLM by 4 points.

Model size
Xue et al. (2020) find that the cross-lingual performance of language models increases monotonically with model size. To study the impact of model capacity, we also experiment with larger model sizes. Even at the XL size (3.7B parameters, 3× larger than mT5-Large), we observe gains for all tasks with nmT5 (Table 3). However, the magnitude of the gains is largely diminished, hinting that the need for parallel data reduces as model capacity increases. This finding is particularly promising for low-resource languages, where it is difficult to obtain high-quality parallel data.

At the same time, nmT5-Large substantially reduces the performance gap between mT5-Large and mT5-XL, covering 70% of the headroom. Since bigger models are expensive to train and even more expensive to deploy, this opens up avenues for effectively using parallel data to improve the performance of smaller language models. Turc et al. (2019) found that pre-training student models before model distillation is helpful, and using parallel data to improve student pre-training is another interesting avenue for future work.

When fine-tuned on SQuAD, nmT5 performs slightly better than mT5 at both the Large and XL model sizes. However, in the few-shot setting, nmT5-Large improves over mT5-Large by 15 points. Even at the XL size, nmT5 is over 10 points higher than mT5, and nmT5-Large even outperforms the much larger mT5-XL. Our experiments suggest that pre-training with parallel data is particularly useful in the limited labelled data setting.

Mixing ratio
So far, we have mixed parallel data into the monolingual pre-training data at a 10% ratio. To assess how the mixing ratio impacts performance, we also compare against a 50% mix. Average performance with the 50% mix is slightly lower, validating our initial choice of a 10% ratio.
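As an illustration, such a mixture can be implemented by sampling each pre-training example from the parallel stream with the desired probability. The sketch below assumes both streams are infinite iterators of (input, target) pairs; the 10% figure comes from the text above, while the sampling mechanics shown are only one plausible implementation, not necessarily how the mixing was done in the paper.

```python
import random

def mix_streams(monolingual_stream, parallel_stream, parallel_ratio=0.1, rng=random):
    """Yield pre-training examples, drawing from the parallel (e.g. NMT) stream
    with probability `parallel_ratio` and from the monolingual MLM stream otherwise.
    Both arguments are assumed to be infinite iterators of (input, target) pairs."""
    while True:
        stream = parallel_stream if rng.random() < parallel_ratio else monolingual_stream
        yield next(stream)
```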

Performance on unseen languages
We also test downstream performance on languages previously unseen by the models. We randomly pick 30 languages from the WikiAnn NER dataset that are not covered in either mC4 or OPUS, and hence none of our models have seen them during pre-training.

Table 6: Performance on three randomly picked unseen languages. "Avg." is calculated by averaging performance across 30 unseen languages.

Related Work
Pre-trained multilingual models such as mBERT and XLM-R have been shown to be effective at cross-lingual transfer learning (Devlin et al., 2019; Conneau et al., 2020). Subsequently, many works have leveraged parallel data to improve the cross-lingual capability of these models. Conneau and Lample (2019) proposed translation language modeling (TLM) to encourage the model to align representations across languages. Alternating language modeling (Yang et al., 2020) and back-translation masked language modeling (Ouyang et al., 2020) use code-switched sentences and back-translation, respectively, to utilize parallel data. Other works using parallel data in this line of research include FILTER (Fang et al., 2020), AMBER (Hu et al., 2020a), and MMTE (Siddhant et al., 2020). A key factor that differentiates this paper from these works is that our pre-trained models use a text-to-text architecture, having both an encoder and a decoder, while the aforementioned models have only an encoder. Other pre-trained multilingual encoder-decoder models such as mT5 (Xue et al., 2020), mBART and MASS (Song et al., 2019) do not make use of parallel data during pre-training.

Conclusion
In this work, we attempted to improve mT5 pre-training by incorporating parallel data. We experimented with various text-to-text objectives and found that multi-tasking with the standard neural machine translation objective during pre-training leads to improved cross-lingual transfer. The improvements from parallel data are most pronounced in the limited labelled data scenario. Our experiments also indicate that smaller models, with the help of parallel data, can approach the performance of larger ones, while suggesting that the need for parallel data diminishes as model capacity increases.
Appendix A: Per-Language Results on All Tasks