PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining

Despite the success of multilingual sequence-to-sequence pretraining, most existing approaches rely on monolingual corpora and do not make use of the strong cross-lingual signal contained in parallel data. In this paper, we present PARADISE (PARAllel & Denoising Integration in SEquence-to-sequence models), which extends the conventional denoising objective used to train these models by (i) replacing words in the noised sequence according to a multilingual dictionary, and (ii) predicting the reference translation according to a parallel corpus instead of recovering the original sequence. Our experiments on machine translation and cross-lingual natural language inference show an average improvement of 2.0 BLEU points and 6.7 accuracy points from integrating parallel data into pretraining, respectively, obtaining results that are competitive with several popular models at a fraction of their computational cost.

These models are usually pretrained on combined monolingual corpora in multiple languages using some form of denoising objective. More concretely, given a sequence x, they noise it with a noising function g_φ and maximize the probability of recovering x from g_φ(x):

ℒ_mono(x) = −log P(x | g_φ(x))  (1)

Common noising functions include sentence permutation and span masking (Lewis et al., 2020; Liu et al., 2020). While these methods obtain strong cross-lingual performance without using any parallel data, they are usually trained at a scale that is prohibitive for most NLP practitioners. At the same time, it has been argued that the strictly unsupervised scenario is not realistic, as there is usually some amount of parallel data available (Artetxe et al., 2020), which could potentially provide a stronger training signal and reduce the computational budget required to pretrain these models.
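While the exact noising setup is described in Lewis et al. (2020) and Liu et al. (2020), the noising function g_φ can be sketched in a few lines. The following is an illustrative, simplified version of span masking; the function name, mask token, and hyperparameter values are our own assumptions, not the authors' implementation:

```python
import random

MASK = "<mask>"

def noise(tokens, mask_ratio=0.35, mean_span=3.5, seed=0):
    """Illustrative g_phi: delete random spans of tokens and replace each
    deleted span with a single <mask> token, BART-style, until roughly
    mask_ratio of the sequence has been masked."""
    rng = random.Random(seed)
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)  # how many tokens to mask in total
    while budget > 0 and len(tokens) > 1:
        # Sample a span length (mean ~ mean_span), capped by the remaining budget.
        span = max(1, min(budget, int(rng.expovariate(1 / mean_span))))
        start = rng.randrange(len(tokens))
        del tokens[start:start + span]   # may remove fewer tokens near the end
        tokens.insert(start, MASK)       # one mask token stands in for the span
        budget -= span
    return tokens
```

During pretraining, the model would then be trained to reconstruct the original token sequence from this noised version, per Eq. 1.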
Motivated by this, we propose PARADISE, a pretraining method for sequence-to-sequence models that exploits both word-level and sentence-level parallel data. The core idea of our approach is to augment the conventional denoising objective introduced above by (i) replacing words in the noised sequence according to a bilingual dictionary, and (ii) predicting the reference translation rather than the input sequence. Despite their simplicity, we find that both techniques bring substantial gains over conventional pretraining on monolingual data, as evaluated both in machine translation and zero-shot cross-lingual transfer. Our results are competitive with several popular models, despite using only a fraction of the compute.

Proposed method
As illustrated in Figure 1, we propose two methods for introducing parallel information at both the word-level and the sentence-level: dictionary denoising and bitext denoising.
Dictionary denoising. Our first method encourages learning similar word-level representations by introducing anchor words through multilingual dictionaries (Conneau et al., 2020b). Let D_l(w) denote the translation of word w into language l ∈ L according to the dictionary D. Given the source sentence x = (x_1, x_2, ..., x_n), we define its noised version g_ψ(x) = (x̃_1, x̃_2, ..., x̃_n), where x̃_i = D_l(x_i) with probability p_r/|L| and x̃_i = x_i otherwise (i.e., we replace each word with its translation into a random language with total probability p_r). We set p_r = 0.4. Given the dictionary-noised sentence, we train our model using the denoising auto-encoding objective in Eq. 1:

ℒ_dict(x) = −log P(x | g_ψ(x))  (2)

Bitext denoising. Our second approach encourages learning from both monolingual and parallel data sources by including translation data in the pretraining process. Given a source-target bitext pair (x, y) in the parallel corpus, assumed to be semantically equivalent, we optimize the likelihood of generating the target sentence y conditioned on the noised version of the source sentence g_φ(x):²

ℒ_bitext(x, y) = −log P(y | g_φ(x))  (3)

Combined objective. Our final objective combines ℒ_mono, ℒ_dict and ℒ_bitext.³ Given that our corpus contains languages with varying data sizes, we sample sentences using the exponential sampling technique from Conneau and Lample (2019), with α_mono = 0.5 for the monolingual corpus and α_bitext = 0.3 for the parallel corpus. To prevent over-exposure to English on the decoder side when sampling from the parallel corpus, we halve the probability of to-English directions and renormalize the probabilities. In addition, given that we have less parallel data (used for ℒ_bitext) than monolingual data (used for ℒ_mono and ℒ_dict), we sample between tasks using α_task = 0.3.
² To make our pretraining sequence lengths consistent with ℒ_mono and ℒ_dict, we concatenate randomly sampled sentence pairs from the same language pair to fit the maximum length.
³ We use the same noising function g_φ used by Lewis et al. (2020) and Liu et al. (2020).
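The dictionary replacement and exponential sampling steps above can be sketched as follows. All function names and the dictionary data structure are our own assumptions (the paper does not specify an implementation), and a real system would operate on subword units rather than whitespace tokens:

```python
import random

def dict_noise(tokens, dictionaries, languages, p_r=0.4, seed=0):
    """Illustrative g_psi: replace each word with its translation into a
    uniformly chosen language with total probability p_r (i.e. p_r / |L|
    per language). `dictionaries` maps language -> {word: translation}
    (hypothetical format)."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        lang = rng.choice(languages)
        if rng.random() < p_r and tok in dictionaries.get(lang, {}):
            out.append(dictionaries[lang][tok])
        else:
            out.append(tok)  # keep the original word
    return out

def sampling_probs(sizes, alpha):
    """Exponential (temperature-based) sampling from Conneau and Lample (2019):
    p_i is proportional to (n_i / N) ** alpha, which upweights low-resource
    languages when alpha < 1."""
    total = sum(sizes.values())
    weights = {k: (n / total) ** alpha for k, n in sizes.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}
```

For intuition, with α = 0.5 a language holding 1% of the corpus is sampled roughly 9% of the time, while α = 1 would reproduce the raw corpus proportions.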

Pretraining
Data. We use Wikipedia as our monolingual corpus, and complement it with OSCAR (Ortiz Suárez et al., 2020) and CC100 (Conneau et al., 2020a) for low-resource languages. For a fair comparison with monolingually pretrained baselines, we use the same parallel data as in our downstream machine translation experiments (detailed in §3.2). In addition, we train a separate variant (detailed below) using additional parallel data from ParaCrawl (Esplà et al., 2019), UNPC (Ziemski et al., 2016), CCAligned (El-Kishky et al., 2020), and OpenSubtitles (Lison and Tiedemann, 2016). We tokenize all data using SentencePiece (Kudo and Richardson, 2018) with a joint vocabulary of 125K subwords. We use bilingual dictionaries from FLoRes (Guzmán et al., 2019) for Nepalese and Sinhala, and from MUSE (Lample et al., 2018) for the remaining languages. Refer to Appendix A for more details.
Models. We use the same architecture as BART-base (Lewis et al., 2020), totaling ∼196M parameters, and train for 100k steps with a batch size of ∼520K tokens. This takes around a day on 32 NVIDIA V100 16GB GPUs. As discussed above, we train two variants of our full model: PARADISE, which uses the same parallel data as the machine translation experiments, and PARADISE++, which uses additional parallel data. To better understand the contribution of each objective, we train two additional models without dictionary denoising, which we name PARADISE (w/o dict.) and PARADISE++ (w/o dict.). Finally, we train a baseline system using the monolingual objective alone, which we refer to as mBART (ours). This follows the original mBART work (Liu et al., 2020), but is directly comparable to the rest of our models in terms of data and hyperparameters.

Downstream Settings
Machine translation. Following Liu et al. (2020), we evaluate our models on sentence-level machine translation from and to English using the following datasets: IWSLT (Cettolo et al., 2015, 2017) for Vietnamese, Japanese and Arabic; WMT (Callison-Burch et al., 2009a,b; Bojar et al., 2016, 2017) for Spanish, French, Romanian and Turkish; FLoRes (Guzmán et al., 2019) for Sinhala and Nepalese; and IITB (Kunchukuttan et al., 2018) for Hindi. We report performance in BLEU (Papineni et al., 2002) as detailed in Appendix C. We finetune our models using the same setup as mBART, warming up the learning rate to 3 × 10⁻⁵ over 2,500 iterations and then decaying it with a polynomial schedule. We use a dropout of 0.3 and label smoothing of 0.2.
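The finetuning schedule above (linear warmup to 3 × 10⁻⁵ over 2,500 iterations, then polynomial decay) can be sketched as follows; the total step count and decay power are illustrative assumptions, as the paper does not state them:

```python
def lr_at(step, peak=3e-5, warmup=2500, total=100_000, power=1.0):
    """Learning rate at a given step: linear warmup to `peak`, then
    polynomial decay to zero (assumed end learning rate)."""
    if step < warmup:
        return peak * step / warmup
    remaining = max(total - step, 0) / (total - warmup)
    return peak * remaining ** power
```

With power = 1.0 this reduces to the familiar linear-decay special case of the polynomial schedule.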
Cross-lingual classification.We evaluate our models on zero-shot cross-lingual transfer, where we finetune on English data and test performance on other languages.To that end, we use the XNLI natural language inference dataset (Conneau et al., 2018) and the PAWS-X adversarial paraphrase identification dataset (Yang et al., 2019).Following Hu et al. (2021), we use all the 15 languages in XNLI, and English, German, Spanish, French and Chinese for PAWS-X.We develop a new approach for applying sequence-to-sequence models for classification: feeding the sequence into both the encoder and decoder, and taking the concatenation of the encoder's <s> representation and the decoder's </s> representation as the input of the classification head.We provide an empirical rationale for this in Table 4.We finetune all models with a batch size of 64 and a learning rate of 2 × 10 −5 for a maximum of 100k iterations, performing early stopping on the validation set.
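The feature extraction behind this classification head can be sketched framework-agnostically: the same sequence is fed to the encoder and the decoder, and the encoder state at the <s> position is concatenated with the decoder state at the </s> position. In the sketch below, hidden states are stood in for by plain Python lists, and all names are our own assumptions:

```python
def classifier_input(enc_states, dec_states, tokens, bos="<s>", eos="</s>"):
    """Build the classification-head input: the encoder's representation at
    the <s> position concatenated with the decoder's representation at the
    </s> position. Both state sequences are assumed position-aligned with
    `tokens`, since the same sequence is fed to the encoder and decoder."""
    enc_vec = enc_states[tokens.index(bos)]
    dec_vec = dec_states[tokens.index(eos)]
    return list(enc_vec) + list(dec_vec)  # concatenate along the feature axis
```

A linear classifier would then be applied to this concatenated vector, whose dimensionality is twice the model's hidden size.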

Results
We next report our results on machine translation (§4.1) and cross-lingual classification (§4.2), and compare them to prior work (§4.3).

Machine Translation
We report our main results in Table 1. We observe that PARADISE consistently outperforms our mBART baseline across all language pairs. Note that these two models have seen the exact same corpora, but mBART uses the parallel data for finetuning only, whereas PARADISE also uses it at the pretraining stage. This suggests that incorporating parallel data into pretraining helps learn better representations, which results in better downstream performance.
Table 2 reports additional ablation results on a subset of languages. As can be seen, removing dictionary denoising hurts performance, although the resulting model is still better than our mBART baseline. This shows that both of our proposed approaches, dictionary denoising and bitext denoising, are helpful and complementary. Finally, PARADISE++ improves over PARADISE, indicating that a more balanced corpus with more parallel data is helpful.

Cross-lingual Classification
We report XNLI results in Table 3 and PAWS-X results in Appendix D. Our proposed approach outperforms mBART in all languages by a large margin. To our surprise, we also observe large gains in English. We conjecture that this could be explained by bitext denoising providing a stronger training signal from all tokens, akin to ELECTRA (Clark et al., 2020), whereas monolingual denoising only gets effective signal from predicting the masked portion. In addition, given that we use parallel data between English and other languages, PARADISE ends up seeing much more English text than mBART, yet a similar amount in the other languages, which could also contribute to its better performance in English. Interestingly, our improvements also hold when using the TRANSLATE-TRAIN-ALL approach, which indirectly uses parallel data to train the underlying machine translation system. Finally, we observe that all of our different variants perform similarly in English, but incorporating dictionary denoising and using additional parallel data both reduce the cross-lingual transfer gap. Table 4 compares our proposed finetuning approach, which combines the representations from both the encoder and the decoder (see §3), to using either of them alone. While prior work either minimally used the decoder if at all (Siddhant et al., 2019; Xue et al., 2021), or only added a classification head on top of the decoder (Lewis et al., 2020), we find that taking the best of both worlds performs best.

Comparison with prior work
So as to put our results into perspective, we compare our models with several popular systems from the literature. As shown in Table 5, our proposed approach obtains competitive results despite being trained at a much smaller scale. In line with our previous results, this suggests that incorporating parallel data makes pretraining more efficient. Interestingly, our method also outperforms XLM, MMTE and mT6, which also use parallel data, as well as AMBER, showing evidence contrary to Hu et al. (2021)'s suggestion that using dictionaries may hurt performance. Detailed per-language results for each task can be found in Appendix D.

Related Work
Most prior work on large-scale multilingual pretraining uses monolingual data only (Pires et al., 2019; Conneau et al., 2020a; Song et al., 2019; Liu et al., 2020; Xue et al., 2021). XLM (Lample and Conneau, 2019) was the first to incorporate parallel data through its translation language modeling (TLM) objective, which applies masked language modeling to concatenated parallel sentences. Unicoder (Huang et al., 2019), AMBER (Hu et al., 2021) and InfoXLM (Chi et al., 2021b) introduced additional objectives over parallel corpora. Similar to our dictionary denoising objective, some previous work has also explored replacing words according to a bilingual dictionary (Conneau et al., 2020b; Chaudhary et al., 2020; Dufter and Schütze, 2020). However, all these approaches operate with encoder-only models, while we believe sequence-to-sequence models are more flexible and provide a more natural way of integrating parallel data.
In that spirit, Siddhant et al. (2019) showed that vanilla machine translation models are already competitive in cross-lingual classification. Our approach combines translation with denoising and further incorporates bilingual dictionaries and monolingual corpora, obtaining substantially better results. Closer to our work, Chi et al. (2021a) incorporated parallel corpora into sequence-to-sequence pretraining by feeding concatenated parallel sentences to the encoder and using different masking strategies, similar to TLM. In contrast, our approach feeds a noised sentence into the encoder and tries to recover its translation on the decoder side, obtaining substantially better results with a similar computational budget. Concurrent to our work, Kale et al. (2021) extended T5 to incorporate parallel corpora using an approach similar to our bitext denoising.

Conclusions
In this work, we proposed PARADISE, which introduces two new denoising objectives to integrate bilingual dictionaries and parallel corpora into sequence-to-sequence pretraining. Experimental results on machine translation and cross-lingual sentence classification show that PARADISE provides significant improvements over mBART-style pretraining on monolingual corpora only, obtaining results that are competitive with several popular models at a much smaller scale. In future work, we would like to explore whether these techniques and findings scale with model size.

Figure 1: Our proposed techniques for integrating parallel data into sequence-to-sequence pretraining.

Table 2: Ablation results on machine translation.