Multilingual Translation from Denoising Pre-Training

Recent work demonstrates the potential of training one model for multilingual machine translation. In parallel, denoising pretraining using unlabeled monolingual data as a starting point for finetuning bitext machine translation systems has demonstrated strong performance gains. However, little has been explored on the potential to combine denoising pretraining with multilingual machine translation in a single model. In this work, we fill this gap by studying how multilingual translation models can be created through multilingual finetuning. Finetuning a multilingual model from a denoising pretrained model incorporates the benefits of large quantities of unlabeled monolingual data, which is particularly important for low resource languages where bitext is rare. Further, we create the ML50 benchmark to facilitate reproducible research by standardizing training and evaluation data. On ML50, we show that multilingual finetuning significantly improves over multilingual models trained from scratch and over bilingual finetuning for translation into English. We also find that multilingual finetuning can significantly improve over multilingual models trained from scratch for zero-shot translation on non-English directions. Finally, we discuss that the pretraining and finetuning paradigm alone is not enough to address the challenges of multilingual models in to-Many directions.


Introduction
A slow but steadily growing focus on languages beyond English has contributed a large wave of models, data, and tasks for non-English languages. Much work has been dedicated to the area of translation, with increasing exploration in massively multilingual models. Despite advances in multilingual natural language processing, resources are highly unbalanced across different languages. This is an obstacle for tasks requiring large quantities of labeled data, such as translation systems, which traditionally leverage hundreds of thousands of professional human translations.

* This work was completed when the first author was at Facebook AI.
A promising avenue of research is to remove the requirement for large quantities of labeled data by leveraging unlabeled monolingual data, often in the form of large-scale pretraining (Lample and Conneau, 2019; Conneau et al., 2020; Liu et al., 2020; Tran et al., 2020; Brown et al., 2020). Monolingual data is far more prevalent for low resource languages, particularly in resources such as Wikipedia or CommonCrawl, a snapshot of the web. Recent work has explored monolingual denoising pretraining (Liu et al., 2020) followed by finetuning bilingual models for individual translation directions (for simplicity, we will refer to monolingual denoising pretraining as pretraining from now on). However, bilingual finetuning alone does not leverage the potential of transfer learning across languages. On the other hand, recent work (Arivazhagan et al., 2019b; Fan et al., 2020) has also demonstrated much potential for performance improvement from multilingual translation in a single model (for simplicity, from now on we will use multilingual translation model or multilingual model to refer to a single model which performs machine translation for multiple languages), but these approaches do not leverage unlabeled monolingual data directly. Little has been explored regarding the combination of the two approaches. Thus, this work studies the effectiveness of combining both large scale pretraining and all-in-one multilingual translation towards universal automatic translation across human languages.
In this work, we finetune pretrained models into multilingual translation models 1 . We analyze the effectiveness of multilingual finetuning - finetuning a single model to perform translation for multiple languages - across low, mid, and high resource translation settings to understand the benefits and limits of both pretraining and transfer learning across languages. First, we demonstrate how to extend pretrained models to support additional languages using only monolingual data via denoising training criteria. Next, we show how to perform effective finetuning to create one-model multilingual translation. Finally, we evaluate the multilingual translation across a variety of settings to understand the strength of starting with pretraining. Ultimately, we demonstrate that finetuning to create one-model multilingual translation provides large BLEU improvements in the Many-to-English setting, but starting with pretraining is not sufficient to achieve strong English-to-Many performance.

1 We release the finetuned models and the ML50 dataset downloading scripts at https://github.com/pytorch/fairseq/tree/master/examples/multilingual.
Related Work

Multilingual Neural Machine Translation
Training a universal translation system between multiple languages (Firat et al., 2016; Johnson et al., 2017) has shown enormous improvement for translating low-resource languages (Gu et al., 2018), even enabling zero-shot translation (Lakew et al., 2018; Gu et al., 2019; Arivazhagan et al., 2019a; Garcia et al., 2020). Previous multilingual translation work began with multitask learning (Dong et al., 2015). Subsequently, work focused on the model capacity bottleneck, leading to exploration of various parameter sharing strategies (Blackwood et al., 2018; Platanios et al., 2018; Lu et al., 2018). Models fully shared across all languages (Ha et al., 2016) have also been explored and extended to incorporate language information (Tan et al., 2019). Bitext data pretraining and finetuning aimed at creating multiple machine translation models for different translation directions has also been explored (Dabre et al., 2019; Lin et al., 2020). Arivazhagan et al. (2019b) and Fan et al. (2020) indicate that it is essential to train gigantic models with enough capacity to fully leverage massive multilingual corpora. A closely related concurrent work, Siddhant et al. (2020), shows it is possible to train a multilingual system jointly with monolingual datasets based on Song et al. (2019). In contrast, in this work we focus on unlabeled data denoising pretraining instead of bitext data pretraining, in order to utilize the nearly unlimited supply of unlabeled text. We aim at creating a single universal translation model across multiple languages by finetuning a multilingual translation system from a pretrained model.

Multilingual Translation Datasets
Working in a multilingual setting remains challenging, as various different datasets, evaluation settings, and preprocessing such as tokenization are used. Benchmarks for sentence embeddings (Hu et al., 2020), natural language inference (Conneau et al., 2018), and question answering (Lewis et al., 2020b) exist, but there is not yet a setting for machine translation data with different resource levels and language families at sufficiently large scale and variety. Zhang et al. (2020) propose OPUS100 with 100 languages, but the training and evaluation data are not human translated. Arivazhagan et al. (2019b) use proprietary data to train and evaluate. In contrast, we contribute the ML50 benchmark, a dataset of 50 languages with publicly available training and evaluation sets, including high, mid, and extremely low resource directions, and open source this benchmark.

Multilingual Translation from Monolingual Denoising Pretraining
Masked language modeling and denoising pretraining have been successful across a wide variety of tasks, including creating bilingual translation models. We describe the pretrained multilingual BART model and present multilingual finetuning, a technique to convert pretrained models into multilingual machine translation systems.
mBART Multilingual BART (mBART) (Liu et al., 2020) is a sequence-to-sequence generative pretraining scheme. The model incorporates N languages by concatenating data: D = {D_1, ..., D_N}, where each D_i is a collection of monolingual documents in language i. mBART is trained as a denoising autoencoder, learning to predict the original text X given g(X), where g is a noising function that corrupts text. We maximize

L_\theta = \sum_{i=1}^{N} \sum_{x \in D_i} \log P(x \mid g(x); \theta),

where x is an instance in language i and the distribution P is defined by the seq-to-seq model. The model is pretrained using two types of noise in g - random span masking and order permutation - as described in Liu et al. (2020).
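The two noising operations in g can be sketched in a few lines of dependency-free Python. This is only an illustration: the function names, the span-start probability of 0.15, and the exponential draw standing in for mBART's Poisson span lengths are assumptions, not the actual pretraining code.

```python
import random

MASK = "<mask>"

def span_mask(tokens, mask_ratio=0.35, mean_span=3.5, seed=0):
    # Replace contiguous spans with a single <mask> until roughly mask_ratio
    # of the tokens are covered. mBART draws span lengths from a Poisson
    # distribution; the exponential draw here is a stand-in approximation.
    rng = random.Random(seed)
    budget = int(mask_ratio * len(tokens))
    out, i, masked = [], 0, 0
    while i < len(tokens):
        if masked < budget and rng.random() < 0.15:  # span-start rate: an assumption
            span = max(1, min(budget - masked, round(rng.expovariate(1.0 / mean_span))))
            out.append(MASK)
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out

def permute_order(sentences, seed=0):
    # The second noise type: shuffle the order of sentences in a document.
    rng = random.Random(seed)
    out = list(sentences)
    rng.shuffle(out)
    return out
```

During pretraining, the model is then asked to reconstruct the original token sequence from the corrupted one.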

Multilingual Finetuning
To leverage pretraining to create translation systems, previous work (Liu et al., 2020) used mBART as a starting point and then performed bilingual finetuning. Concretely, the seq-to-seq model was finetuned on language i to language j translation. However, bilingual finetuning does not leverage the full capacity of multilingual pretraining, as the resulting translation model can only translate between two languages. Recent work on multilingual translation (Aharoni et al., 2019; Arivazhagan et al., 2019b) demonstrates that strong translation models can be created by doing multilingual training. Thus, we propose to perform multilingual finetuning (ML-FT) to retain the benefits of both multilingual translation models and unlabeled data pretraining: multilingual translation models allow languages to transfer learning to one another, while pretraining utilizes large amounts of monolingual data to compensate for the lack of bitext.
To perform multilingual finetuning, we collect bitexts of different language pairs (i, j) into a collection B_{i,j} = {(x_i, y_j)} for each direction (i, j). We augment each bitext pair (x_i, y_j) by adding a source language token at the beginning of x and a target language token at the beginning of y, forming a language-token-augmented pair (x', y'). We then initialize a Transformer-based seq-to-seq model with the pretrained mBART, and provide the combined multilingual bitexts B = \cup_{i,j} B_{i,j} to finetune the pretrained model.
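A minimal sketch of the language-token augmentation and bitext pooling described above; the bracketed "[lang]" token format and the helper names are assumptions for illustration, not the exact data pipeline.

```python
def add_lang_tokens(src_tokens, tgt_tokens, src_lang, tgt_lang):
    # Prepend a source language token to x and a target language token to y,
    # forming the augmented pair (x', y'). The "[lang]" format is an assumption.
    return [f"[{src_lang}]"] + src_tokens, [f"[{tgt_lang}]"] + tgt_tokens

def build_finetuning_data(bitexts):
    # bitexts: {(src_lang, tgt_lang): [(src_tokens, tgt_tokens), ...]}
    # Returns the pooled, language-tagged collection: the union of all B_ij.
    pooled = []
    for (i, j), pairs in bitexts.items():
        for x, y in pairs:
            pooled.append(add_lang_tokens(x, y, i, j))
    return pooled
```

The pooled collection is what the pretrained model is finetuned on, so a single model sees tagged examples from every direction.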

Multilingual Translation Model Variants
We explore 3 configurations to create different versions of multilingual translation models: Many-to-one (M→1), one-to-Many (1→M), and Many-to-Many (M↔M) via a pivot language. Given the prevalence of English in large scale bitext data, we follow Arivazhagan et al. (2019b) in using English as the pivot language to create Many-to-Many models: the Many-to-one model encodes N languages and decodes to English, the one-to-Many model encodes English and decodes into N languages, and the Many-to-Many model encodes and decodes N languages.
Temperature Sampling When training multilingual models with many languages, the training dataset sizes are imbalanced, as different languages have different quantities of bitext. Thus, we train with temperature upsampling, which upsamples lower resource pairs so that the high resource languages do not dominate the training data. We follow Arivazhagan et al. (2019b) and use the following temperature-based sampling function with temperature T to sample data for each direction i:

p_i \propto \left( \frac{n_i}{\sum_j n_j} \right)^{1/T},

where n_i is the number of bitext pairs available for direction i.
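The temperature-based sampling can be computed directly from per-direction bitext counts. This sketch assumes the counts are given as a dict mapping direction names to sizes:

```python
def sampling_probs(sizes, T=1.5):
    # Temperature-based sampling (Arivazhagan et al., 2019b):
    # p_i is proportional to (n_i / sum_j n_j) ** (1 / T).
    # T = 1 keeps the natural data proportions; larger T flattens the
    # distribution, upsampling low-resource directions.
    total = sum(sizes.values())
    weights = {d: (n / total) ** (1.0 / T) for d, n in sizes.items()}
    z = sum(weights.values())
    return {d: w / z for d, w in weights.items()}
```

For example, with 10M en-de pairs and 10K en-xh pairs, raising T from 1 to 5 moves the en-xh sampling probability from roughly 0.1% to roughly 20%.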

Experimental Setting
We examine the impact of multilingual finetuning over pretrained models. First, we create the ML50 benchmark to include 50 different languages of various resource levels and language families, drawn from publicly available, high quality data sources. The ML50 benchmark standardizes training data, evaluation data, and evaluation procedure across different languages. Second, we detail how we obtain the mBART50 pretrained model by extending mBART25. Third, we describe three strong baselines: bilingual translation models trained from scratch, bilingual finetuning from the mBART50 pretrained model, and multilingual translation models trained from scratch. Finally, we describe our evaluation and generation procedure. Section 5 details the results of the experiments.

ML50 Benchmark
To investigate the usefulness of pretraining and multilingual finetuning compared to existing alternatives, we create the ML50 Benchmark. ML50 contains training and evaluation data across 50 different languages, from extremely low resource languages like Xhosa and Gujarati to high resource languages like French and German. The full list of languages is shown in Table 1. We group the languages into five categories based on the amount of available training data: more than 10M pairs (8 languages), 1M to 10M pairs (5 languages), 100K to 1M pairs (17 languages), 10K to 100K pairs (13 languages), and finally, fewer than 10K pairs of training data (5 languages). Beyond resource levels, we also design ML50 to include languages from multiple language families, from Germanic and Romance languages to Indic and African ones. Many of the additional languages we contribute are lower resource than the languages in the original mBART25.
Training Data We gather bitext data between English and 49 other languages to form ML50, to enable the training of machine translation models. We select these 49 languages based on the amount of bitext and monolingual data, to cover languages with different amounts of resources across different language families. All of the data is publicly available, such as WMT (Bojar et al., 2013, 2014, 2016, 2017, 2018; Barrault et al., 2019, 2020). For multilingual training, each language pair can include data from multiple sources. We simply concatenate them together and remove duplicated source-target sentence pairs for each language pair. We use fastText (Joulin et al., 2017) to perform language identification on both source and target sentences, and we remove sentence pairs if either the source or target sentence is not predicted as the expected language. We further filter out training data that matches any source or target side sentences in the evaluation datasets. Compared to other datasets such as OPUS100 (Zhang et al., 2020), the ML50 benchmark contains around 4 times more training data. The full list of languages, data sources, and amounts of resulting data can be found in Table 6.
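The filtering steps above (deduplication, language-identification filtering, and evaluation-set overlap removal) can be sketched as follows. Here `langid` stands in for a fastText language-ID predictor that returns a language code; the function and argument names are illustrative assumptions, not the actual ML50 scripts.

```python
def clean_bitext(pairs, langid, src_lang, tgt_lang, eval_sents):
    # pairs: list of (source sentence, target sentence) strings.
    # eval_sents: set of all source/target sentences from the evaluation sets.
    seen, out = set(), []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue                       # drop duplicated source-target pairs
        seen.add((src, tgt))
        if langid(src) != src_lang or langid(tgt) != tgt_lang:
            continue                       # drop pairs failing language ID
        if src in eval_sents or tgt in eval_sents:
            continue                       # drop overlap with evaluation data
        out.append((src, tgt))
    return out
```

Each language pair's concatenated sources would be passed through this function before training.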
Evaluation Data To ensure high quality evaluation of the languages covered in ML50, we include publicly available, widely used evaluation sets. We source these evaluation datasets from translation workshops such as WMT, IWSLT, and WAT, and from other published research work. We follow the evaluation protocol, including tokenization, used for each of these evaluation sets, to ensure our results are comparable with existing work, and we release these scripts to make reproduction easier for others. Compared to other datasets such as OPUS100, we choose to use high quality existing evaluation datasets rather than using part of the training data as evaluation, because training data, particularly for low resource languages, is often very noisy and unreliable.

Creating mBART50
While multilingual pretrained models have shown strong performance in a variety of tasks (Liu et al., 2020; Conneau et al., 2020), they remain limited in that they are trained on a fixed number of languages. For example, mBART was trained on 25 languages, all fairly high resource. Pretraining fully from scratch is computationally intensive: mBART trained for 2.5 weeks on 256 Nvidia V100 GPUs (Liu et al., 2020). However, there are hundreds of different languages in the world, so restarting pretraining from scratch to add any of them to mBART would be difficult. Instead, we take the existing mBART model, trained on 25 languages, and extend it to more than 50 languages. We take the publicly available mBART25 checkpoint (Liu et al., 2020) in the fairseq library and continue the pretraining process. We extend the mBART25 embedding layers with randomly initialized vectors for an extra set of 25 language tokens. To be consistent with mBART, we reuse its 250K sentencepiece (Kudo and Richardson, 2018) model, which was trained using monolingual data for 100 languages from XLMR (Conneau et al., 2020) and thus already supports languages beyond those the original mBART25 was trained on. To create this extended mBART model, we combine the monolingual data of the original 25 languages and the new 25 languages from XLMR (Conneau et al., 2020). We train mBART50 for an additional 500K updates with a maximum batch size of 9216 tokens per GPU using 64 V100 GPUs. We also release the pretrained mBART50 model, which will be useful for a variety of text generation tasks beyond translation.

Table 1: Languages in the ML50 Benchmark. We display the languages included in the ML50 Benchmark and the quantity of training data in bitext pairs. The full breakdown is provided in Table 6.
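The embedding-extension step described above can be illustrated with plain Python lists standing in for the embedding matrix. In practice this operates on the model's token-embedding tensors; the "[lang]" token format and the initialization scale are assumptions for this sketch.

```python
import random

def extend_embeddings(emb, vocab, new_langs, dim, seed=0):
    # emb: list of embedding rows (lists of floats); vocab: token -> row index.
    # Adds a randomly initialized row for each new language token while keeping
    # every pretrained row untouched, mirroring how mBART25 grew to mBART50.
    rng = random.Random(seed)
    emb = [row[:] for row in emb]   # copy so pretrained rows are preserved
    vocab = dict(vocab)
    for lang in new_langs:
        tok = f"[{lang}]"           # token format is an assumption
        if tok in vocab:
            continue                # language already supported: nothing to do
        vocab[tok] = len(emb)
        emb.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return emb, vocab
```

Because only new rows are appended, continued pretraining starts from exactly the old model's parameters for all previously supported tokens.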

Multilingual Finetuning from mBART50
We finetune the mBART50 model into Many-to-one (M→1), one-to-Many (1→M), and Many-to-Many (M↔M) models with the ML50 training dataset, using English as the pivot as described in Section 3.1. We finetune the models for 300K updates and sweep over batch sizes (4096 and 8000 maximum tokens per GPU), learning rates (1e−4, 2e−4, 5e−4), and upsampling temperatures (1.5, 3, 5), selecting the best performing multilingual models on validation and using 32 GPUs for each training instance.

Baselines
We compare our proposed multilingual finetuning to three strong baselines: bilingual training from scratch, bilingual finetuning, and multilingual models trained from scratch.
Bilingual Trained from Scratch (BL-SC) We train bilingual translation models with standard Transformer (Vaswani et al., 2017) models for translation into and from English for 49 languages. For directions with more than 1 million bitext training pairs (de, cs, fr, ja, es, ru, pl, zh, fi, lv, lt, and hi), we train Transformer Big models, as there is more data to benefit from additional model capacity. For directions with more than 10 million bitext training pairs (de, cs, fr, ja, es, ru, pl, and zh), we also train Transformer Large models, as there is even more data to benefit from additional capacity. The best performing bilingual model is selected as the Bilingual Trained from Scratch baseline. Please refer to Table 5 for details of these architectures.
Bilingual Finetuning (BL-FT) Bilingual finetuning adapts the mBART model into bilingual machine translation models by training for longer on translation bitext. For each language direction, we follow Liu et al. (2020) and finetune for 40K updates to obtain the Bilingual Finetuning baseline.

Evaluation and Generation
We evaluate performance with tokenized BLEU, following the tokenization in mBART (Liu et al., 2020). To generate, we decode using beam search with beam size N = 5 and length penalty 1.0 on the validation set. We do not perform checkpoint averaging. To select the best performing model in a sweep, we compare BLEU on the validation set.
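For illustration, a minimal single-reference corpus BLEU with uniform 4-gram weights can be written as below. The actual evaluations here use the standard tokenized-BLEU tooling for each test set; this sketch only makes the metric concrete and omits smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    # Corpus-level BLEU: clipped n-gram precision aggregated over the corpus,
    # geometric mean over orders 1..max_n, times a brevity penalty.
    match, total = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            match[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += max(0, len(hyp) - n + 1)
    if min(match) == 0:
        return 0.0  # some n-gram order has no matches (no smoothing here)
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)
```

Note that tokenization choices strongly affect BLEU, which is why the evaluation protocol of each test set is followed exactly.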

Multilingual Finetuning Performance
We evaluate the performance of multilingual finetuning on the ML50 Benchmark: we compare multilingual finetuning models with bilingual training from scratch, bilingual finetuning, and multilingual training from scratch. Results of multilingual finetuning compared to all baselines are displayed in Table 2 (per-direction comparison is available in Figure 1). The results demonstrate strong improvement over the baselines on Many-to-English and comparable performance on English-to-Many directions. We also evaluate the zero-shot performance of multilingual finetuning Many-to-Many models on non-English directions without bitext data. Our results demonstrate that multilingual finetuning models strongly improve on zero-shot directions compared to multilingual models trained from scratch.

Table 2: Multilingual finetuning (a) improves over all baselines for translation into English (left), while (b) performing similarly to bilingual finetuning and multilingual from scratch, with significant improvement over bilingual from scratch, for translation from English (right). Numbers are average BLEU difference between multilingual finetuning models and the corresponding baselines. Per-direction comparison is available in Figure 1.

Comparison to Bilingual Finetuning
To understand whether the benefit of transfer learning across languages can be stacked on top of finetuning pretrained models, we analyze the improvement of multilingual finetuning with the same model size as bilingual finetuning in Table 2.
In the Many-to-one setting, every language pair but one is improved by multilingual finetuning. Some low resource languages see substantial improvement of more than 10 BLEU points, with the largest improvement being over 15 BLEU points. On average, multilingual finetuning improves by 3.5 BLEU across all directions into English. In the one-to-Many setting, performance is about the same between multilingual finetuning and bilingual finetuning, with an average gap of −0.5 BLEU. In the Many-to-Many setting, multilingual finetuning improves translation into English by 1.8 BLEU on average, while trailing by 1.0 BLEU for translation from English. We hypothesize that the benefit of pretraining is diminished by the challenge of decoding into many target languages in multilingual compared to bilingual finetuning.

Comparison to Multilingual from Scratch
To understand the impact of the pretraining-finetuning paradigm for multilingual translation, we compare our proposed multilingual finetuning method to multilingual models trained from scratch. As shown in Table 2, in the Many-to-One setting, multilingual finetuning performs consistently better than multilingual models trained from scratch, by 3.1 BLEU on average. For low resource directions (4K-10K bitexts), the improvement is as high as 5.8 BLEU. However, in the One-to-Many and Many-to-Many settings, multilingual finetuning does not perform better than multilingual training from scratch. For translation from English, One-to-Many multilingual finetuning performs 0.1 BLEU worse than multilingual from scratch on average, and the Many-to-Many multilingual finetuning model performs 0.4 BLEU worse than multilingual from scratch on average. For translation into English, we also observe that Many-to-Many multilingual finetuning models perform 0.1 BLEU worse than multilingual from scratch on average. Again, we hypothesize that the benefit of monolingual data pretraining is dominated by the challenge of a large number of decoding tasks for individual target languages. We discuss the challenges of to-Many translation further in Section 6.1.

Comparison to Bilingual from Scratch
To understand the combined benefits of pretraining-finetuning and multilingual transfer learning, we examine the improvement of multilingual finetuning with the same model size over bilingual training from scratch in Table 2. In the Many-to-one setting, every language pair is improved by multilingual finetuning; on average, multilingual finetuning improves over bilingual models by 12.0 BLEU. Some low and mid resource languages see substantial improvement of more than 20 BLEU points (see Figure 1). In the one-to-Many setting, multilingual finetuning outperforms almost all bilingual models, falling short in only 5 directions by minor gaps (mostly less than 1 BLEU). In the Many-to-Many setting, multilingual finetuning improves translation into English by 10.3 BLEU and translation from English by 5.8 BLEU. This shows that multilingual finetuning achieves significant improvement over bilingual baselines across all directions, both into and from English.

Zero-shot on Non-English Directions
We study the impact of multilingual finetuning on zero-shot non-English directions without any bitext training data. We evaluate multilingual Many-to-Many models, trained from scratch and via finetuning, on WMT 13 and WMT 20 test data (fr-de and de-fr test data are from WMT20 (Barrault et al., 2020); the other test data is from WMT13 (Bojar et al., 2013)). As shown in Table 4, the Many-to-Many multilingual finetuning model outperforms the Many-to-Many multilingual from-scratch model by a large margin, with an average improvement of 11.9 BLEU. We hypothesize that the zero-shot non-English translation performance gain comes from two factors: (1) pretrained mBART multilingual encoders and decoders are well-trained with monolingual data; (2) pretrained mBART decoders are not coupled to specific source languages, unlike multilingual models trained from scratch. Note that decoders of multilingual models trained from scratch are always trained with English as the source language in the encoder, while the decoders of multilingual finetuning models are trained with both English and the target languages.

To-Many Directions Remain Challenging

In the Many-to-one setting, large improvements are obtained by using pretrained models as a starting point. Multilingual modeling increases the quantity of target-side English data seen by the model. For example, compared to bilingual finetuning, our multilingual finetuning model is exposed to English target side data from 50 different language pairs. However, in the one-to-Many setting and the Many-to-Many setting, models must decode into 50 different languages in both multilingual paradigms, whether trained from scratch or pretrained-and-finetuned. As shown in Table 2 (and Table 9, Figure 1), multilingual models, either from scratch or multilingual finetuning, perform worse than bilingual finetuning for English to Many. This indicates that the challenge of decoding into many languages is a dominating factor in the multilingual models, even with pretraining.
Note that there are 49 decoding tasks in One-to-Many and 50 in Many-to-Many, but only 1 in Many-to-One. Additional research, for example following the study framework used in Grönroos et al. (2020), is needed to understand (1) the interaction between pretraining-finetuning and multiple decoding tasks, and (2) the difference between multiple encoding tasks and multiple decoding tasks.

Continuous Pretraining is Effective
Pretraining models at large scale is costly. By proposing multilingual finetuning, we introduce a dependency on pretrained models for multilingual translation, which can be a limitation if the pretrained model does not cover the desired languages for translation. Thus, we examine the possibility and effectiveness of incrementally extending pretrained models to support additional languages. We find that for the languages supported by the original pretrained model, bilingual finetuning from the previously pretrained and the continuously pretrained models demonstrates almost exactly the same performance (see Figure 3 for our analysis of the bilingual finetuning performance of both models over the original 25 languages). Thus, extending pretraining does not hurt performance on the originally supported languages, despite doubling the number of languages supported by the pretrained model. This removes a major limitation of using pretrained models: that users are restricted to the choices made during the original pretraining, so unsupported languages could not be used at all.
We also examine the effectiveness of such continued pretraining. We find that mBART50 has stronger bilingual finetuning performance than mBART25 over the newly supported 25 languages on average (see Figure 2), indicating that pretrained models can be extended to support additional languages if model capacity allows.

Conclusion
We demonstrate that multilingual translation models can be created from pretrained models such as mBART using multilingual finetuning. While using pretrained models could theoretically limit the number of languages, we show that mBART can be extended to double the number of original languages without loss of performance. To train and evaluate on 50 languages, we develop and release the ML50 benchmark. We show that by performing multilingual finetuning, strong improvements can be achieved in the Many-to-one setting. However, the pretraining and finetuning paradigm alone is not enough to address the challenges of multilingual models for One-to-Many. Our future work will include analysis of improved strategies for One-to-Many translation, the model capacity and inference latency trade-off, an in-depth study of zero-shot translation, training strategies for better data efficiency, and applications of universal text representation and generation frameworks in other crosslingual tasks.

Table 6: For each language, we list the size of training data after the filtering steps, the source of training/evaluation data, and the size of evaluation data. Part of the available data is missing for a few language pairs due to human error; these languages are marked with an asterisk, and the next version of the ML50 benchmark data will include the missing data.

Table 9: Multilingual finetuning on 50 languages compared to bilingual models. Numbers are average BLEU difference compared to bilingual models trained from scratch.