Zero-Shot Cross-Lingual Transfer of Neural Machine Translation with Multilingual Pretrained Encoders

Previous work mainly focuses on improving cross-lingual transfer for NLU tasks with a multilingual pretrained encoder (MPE), or on improving performance on supervised machine translation with BERT. However, it remains under-explored whether an MPE can facilitate the cross-lingual transferability of an NMT model. In this paper, we focus on a zero-shot cross-lingual transfer task in NMT: the NMT model is trained with the parallel dataset of only one language pair and an off-the-shelf MPE, and is then directly tested on zero-shot language pairs. We propose SixT, a simple yet effective model for this task. SixT leverages the MPE with a two-stage training schedule and gets further improvement from a position disentangled encoder and a capacity-enhanced decoder. With this method, SixT significantly outperforms mBART, a pretrained multilingual encoder-decoder model explicitly designed for NMT, with an average improvement of 7.1 BLEU on zero-shot any-to-English test sets across 14 source languages. Furthermore, with much less training computation cost and training data, our model achieves better performance on 15 any-to-English test sets than CRISS and m2m-100, two strong multilingual NMT baselines.


Introduction
Multilingual pretrained encoders (MPE) such as mBERT (Wu and Dredze, 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020) have shown remarkably strong results on zero-shot cross-lingual transfer, mainly for natural language understanding (NLU) tasks including named entity recognition, question answering (QA) and natural language inference (NLI). These methods jointly train a Transformer model (Vaswani et al., 2017) on the masked language modeling task in multiple languages. The pretrained model is then finetuned on a downstream task using labeled data in a single language and evaluated on the same task in other languages. With this pretraining and finetuning approach, the multilingual model is able to generalize to other languages even without labeled data in those languages.

* Contribution during internship at Microsoft Research.

Figure 1: In the zero-shot cross-lingual NMT transfer task, the model is trained with only one parallel dataset (such as De-En) and a multilingual pretrained encoder. The model is tested on many-to-one language pairs (like Fi/Hi/Zh-En) in a zero-shot manner. Monolingual text of the to-be-tested source languages is not available in this task.
Neural machine translation (NMT; Bahdanau et al., 2015; Vaswani et al., 2017) aims to translate an input sentence from the source language into the target language. The encoder learns contextualized representations of the input, and the decoder then generates the target sentence from the encoder output. Given that MPEs have achieved great success in other tasks, a question worth researching is how to perform zero-shot cross-lingual transfer in NMT by leveraging an MPE. Some works (Zhu et al., 2020; Yang et al., 2020; Weng et al., 2020) focus on improving NMT performance by incorporating a monolingual pretrained model such as BERT (Devlin et al., 2019). However, simply replacing the monolingual pretrained model in previous works with an MPE hardly works for cross-lingual transfer of NMT (see the BERT-fused model in Table 2). Others propose to fine-tune encoder-decoder based multilingual pretrained models for cross-lingual transfer in NMT (Liu et al., 2020b; Lin et al., 2020; Xue et al., 2020). However, it is still unclear how to conduct cross-lingual transfer for an NMT model with existing multilingual pretrained encoders such as XLM-R.
In this paper, we focus on a zero-shot cross-lingual transfer NMT task (see Figure 1), which aims at translating multiple unseen languages by leveraging an MPE. Different from unsupervised NMT, multilingual NMT and traditional NMT transfer learning, only an MPE and one parallel dataset, such as German-English, are available in this task. No monolingual or parallel data of other languages is accessible during training.
The trained model is directly tested on many-to-one test sets in a zero-shot manner. The challenge of this task is how to leverage an MPE for machine translation while preserving the cross-lingual transfer ability obtained from the MPE.
We propose a SImple cross-lingual (X) Transfer NMT model (SixT), which can directly translate languages unseen during supervised training. We initialize the encoder and decoder embeddings of SixT with XLM-R and propose a two-stage training schedule that trades off between supervised performance and transfer ability: at the first stage, we only train the decoder layers, while at the second stage, all model parameters are jointly optimized except the encoder embedding. We further improve the model with a position disentangled encoder and a capacity-enhanced decoder. The position disentangled encoder enhances cross-lingual transfer by removing the residual connection in one of the encoder layers, making the encoder outputs more language-agnostic. The capacity-enhanced decoder uses a bigger decoder than the vanilla Transformer to fully utilize the labeled dataset. Although it is trained with only one language pair, the SixT model alleviates the effect of 'catastrophic forgetting' and can be transferred to unseen languages. We compare SixT with CRISS (Tran et al., 2020) and m2m-100 (Fan et al., 2020a), two strong multilingual NMT models, on many-to-English test sets. The SixT model gets better translation performance with much smaller training data and less computation cost.

Problem Statement
The zero-shot cross-lingual NMT transfer task (ZeXT; see Figure 1) explores approaches to enhance the cross-lingual transfer ability of NMT models. Given a multilingual pretrained encoder (MPE) and one parallel dataset of the language pair l_s-to-l_t (i.e., the seen language pair), we aim to train an NMT model that can be transferred to unseen language pairs l_z-to-l_t, where l_z ≠ l_s and l_z is supported by the MPE. The learned NMT model is directly tested on the unseen language pairs in a zero-shot manner. The major challenge of this task is to learn an NMT model with a language-agnostic encoder without losing the transfer ability of the MPE.
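The task constraints above can be made concrete with a small helper. This is an illustrative sketch of our own (the function name and language codes are not from the paper): a candidate pair l_z-to-l_t qualifies as a zero-shot test pair only if the target matches the supervised target, the source differs from the supervised source, and the source is covered by the MPE.

```python
def is_zero_shot_pair(src, tgt, seen_pair, mpe_langs):
    """Check whether src-to-tgt qualifies as an unseen (zero-shot) pair.

    seen_pair: the single supervised pair (l_s, l_t), e.g. ("de", "en").
    mpe_langs: set of languages covered by the multilingual pretrained encoder.
    """
    l_s, l_t = seen_pair
    return tgt == l_t and src != l_s and src in mpe_langs

# Example: De-En is the supervised pair; XLM-R covers ~100 languages,
# of which we list a tiny illustrative subset here.
mpe = {"de", "en", "fi", "hi", "zh"}
print(is_zero_shot_pair("fi", "en", ("de", "en"), mpe))  # True: unseen source
print(is_zero_shot_pair("de", "en", ("de", "en"), mpe))  # False: the seen pair
```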
Different from multilingual and unsupervised NMT, neither parallel nor monolingual data in the language l_z is directly accessible in the ZeXT task; the model has to rely on the knowledge about language l_z stored in the off-the-shelf MPE. Some previous works (Yang et al., 2020; Zhu et al., 2020) focus on improving translation quality on the language pair of the given parallel data (l_s-to-l_t). Different from them, we aim to improve the cross-lingual transfer ability of the NMT model, which enables zero-shot translation on unseen language pairs.
The ZeXT task can be used to evaluate and compare the cross-lingual transfer abilities of multilingual pretrained encoders on NMT tasks. In addition, the NMT model trained on the ZeXT task provides a good starting point for training low-resource or multilingual NMT models, a motivation similar to that of model-agnostic meta-learning (Finn et al., 2017; Gu et al., 2018): it can be quickly adapted to unseen language pairs and further improved with parallel or monolingual data of the unseen languages. In this paper, we focus on the ZeXT task with XLM-R, a state-of-the-art multilingual pretrained encoder.
Approach

Preliminary exploration An NMT model consists of four modules: the encoder embedding, the encoder layers, the decoder embedding and the decoder layers. For each module leveraging the MPE, we consider two strategies:
• Fix: initialized from the MPE and fixed;
• FT: initialized from the MPE and trained.
We compare different fine-tuning strategies for these modules in a greedy manner. Starting from the vanilla Transformer where all parts are randomly initialized, we explore the best method for the encoder embedding, the encoder layers, the decoder embedding, and the decoder layers, in that order. The details of the experimental settings are in Section 4. From the results shown in Table 1, we observe that it is best to initialize the encoder embedding, the encoder layers and the decoder embedding with the pretrained encoder and keep their parameters frozen, while randomly initializing the decoder layers (see Figure 2). More comparisons and discussions are in Section 4.2.
Two-stage training Since we freeze the encoder and only fine-tune the decoder layers, the model is able to perform translation while preserving the transfer ability of the encoder. However, freezing most of the parameters limits the capacity of the NMT model, especially when the training data grows large. Therefore, we propose a second training stage that further improves the translation performance by jointly fine-tuning all parameters of the NMT model except the encoder embedding. 1 Since the decoder has been well adapted to the encoder at the first stage, we expect the model can be slightly fine-tuned to improve the translation capacity without losing the transfer ability of the encoder.

1 According to our preliminary experiment, the average BLEU is 0.2 points lower when the encoder embedding is also learned in the second stage. Besides, the frozen encoder embedding leads to higher computational efficiency.

Figure 2: The best strategy for training the NMT model for the ZeXT task: (1) training in the first stage; (2) training in the second stage. The blue iced blocks are initialized with an MPE and frozen, while the red fired blocks are initialized randomly or from the first stage.
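The two-stage schedule reduces to a freezing policy over named parameters. The sketch below uses assumed fairseq-style parameter names (`encoder.embed_tokens`, `decoder.layers.*`, etc.), not the paper's actual code: stage one trains only the decoder layers, stage two unfreezes everything except the encoder embedding.

```python
def trainable_mask(param_names, stage):
    """Return {param_name: requires_grad} for the given training stage.

    Stage 1: only the (randomly initialized) decoder layers are trained;
    all XLM-R initialized parts, including the decoder embedding, stay frozen.
    Stage 2: all parameters are jointly fine-tuned except the encoder
    embedding, which stays frozen throughout.
    """
    mask = {}
    for name in param_names:
        if stage == 1:
            mask[name] = name.startswith("decoder.layers")
        else:
            mask[name] = not name.startswith("encoder.embed_tokens")
    return mask

params = ["encoder.embed_tokens.weight",
          "encoder.layers.0.self_attn.q_proj.weight",
          "decoder.embed_tokens.weight",
          "decoder.layers.0.encoder_attn.q_proj.weight"]
print(trainable_mask(params, stage=1))
```

In a real fairseq model one would apply the mask with `p.requires_grad_(flag)` over `model.named_parameters()`.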
Position disentangled encoder The representations from the XLM-R initialized encoder have a strong positional correspondence to the source sentence. The word order information inside is language-specific and may hinder cross-lingual transfer from the supervised language to unseen languages. Inspired by Liu et al. (2020a), we relax this structural constraint in the second training stage to make the encoder outputs less position- and language-specific: we remove the residual connection after the self-attention sublayer in one encoder layer i during training and inference. 2 The other encoder layers remain the same. The hidden states in this i-th encoder layer are calculated as h = LayerNorm(SelfAttn(h_in)) followed by h_out = LayerNorm(h + FFN(h)), where SelfAttn is the encoder self-attention sublayer, FFN is the feed-forward sublayer and LayerNorm is layer normalization; only the residual term around the self-attention sublayer is dropped. Liu et al. (2020a) aim at training a language-agnostic encoder for NMT from scratch using a parallel corpus. Compared with them, our method shows that it is possible to make a pretrained multilingual encoder more language-agnostic by relaxing the position constraint during fine-tuning.
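Concretely, the modification only changes where the residual is added in the chosen layer. The sketch below abstracts the sublayers as plain callables operating on a scalar "hidden state" (hypothetical stand-ins for the layer's attention, FFN and LayerNorm modules) and assumes post-LN sublayer ordering:

```python
def encoder_layer(h, self_attn, ffn, layer_norm, resdrop=False):
    """One post-LN Transformer encoder layer with optional residual drop.

    With resdrop=True, the residual connection around the self-attention
    sublayer is removed (the position disentangled encoder); the FFN
    sublayer keeps its residual connection in both cases.
    """
    attn_out = self_attn(h)
    h = layer_norm(attn_out if resdrop else h + attn_out)
    return layer_norm(h + ffn(h))

# Toy sublayers to show the data flow through the layer.
double = lambda x: 2 * x      # stand-in for SelfAttn
plus_one = lambda x: x + 1    # stand-in for FFN
identity = lambda x: x        # stand-in for LayerNorm
print(encoder_layer(1.0, double, plus_one, identity, resdrop=False))  # 7.0
print(encoder_layer(1.0, double, plus_one, identity, resdrop=True))   # 5.0
```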
Capacity-enhanced decoder Previous works (Zhu et al., 2020; Yang et al., 2020) that incorporate BERT into NMT follow Vaswani et al. (2017) in configuring the decoder size. For example, to train an NMT model on the Europarl De-En training dataset, the default decoder is configured as transformer_base. However, our model relies more on the decoder to learn from the labeled data, as the encoder is mainly responsible for cross-lingual transfer. This is also reflected in our training strategy: at the first stage only the decoder parameters are optimized, while at the second stage the encoder is only slightly fine-tuned to preserve its transfer ability. Therefore, the effective capacity of SixT is smaller than that of a vanilla Transformer of the same size. We propose to enhance the model capacity with a larger decoder that can digest more labeled data. In the experiments, the performance of SixT improves with the big decoder, whereas the vanilla Transformer fails to improve with a larger decoder (see Section 4.2). This result confirms our assumption.
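To see roughly how much extra capacity the bigger decoder adds, one can count decoder weights. This is our own back-of-the-envelope estimate, not a figure from the paper: it counts only the attention and FFN weight matrices (ignoring biases, layer norms and embeddings) at hidden size 768.

```python
def decoder_layer_params(d_model, ffn_dim):
    # Self-attention + cross-attention: 4 projection matrices each (Q, K, V, O).
    attn = 2 * 4 * d_model * d_model
    # Feed-forward: two matrices, d_model x ffn_dim and ffn_dim x d_model.
    ffn = 2 * d_model * ffn_dim
    return attn + ffn

def decoder_params(d_model, ffn_dim, n_layers):
    return n_layers * decoder_layer_params(d_model, ffn_dim)

small = decoder_params(768, 2048, 6)    # base-style decoder (FFN 2048, 6 layers)
big = decoder_params(768, 3072, 12)     # XLM-R sized decoder (FFN 3072, 12 layers)
print(small, big, big / small)
```

Under these assumptions the big decoder has roughly 2.4 times the weights of the base one, which is the extra capacity the frozen-encoder setup leans on.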

Setup
Dataset We focus on many-to-English translation for the ZeXT task. The Europarl-v7 German-English corpus is used as the training set. We evaluate the cross-lingual transfer abilities of NMT models on a variety of languages from different language groups 3: the Germanic group (Nl, De), Romance group (Ro, Es), Slavic group (Ru, Pl), Uralic group (Fi, Lv) and Turkic group (Tr). A concatenation of Fr-En and Cs-En, which are from different language groups, is used as the validation set for all any-to-English translation tasks. The dataset details are in the appendix. Note that no monolingual data of the tested source languages is used in any experiment.
To be compatible with the XLM-R model, all texts are tokenized with the same XLM-R sentencepiece (Kudo, 2018) model. The <bos> token is added at the beginning of each source sentence when the NMT model initializes its encoder with XLM-R. The source sentence length is limited to 512 tokens. The source vocabulary is the same 250k vocabulary as XLM-R.
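The source-side preprocessing thus reduces to three steps: sentencepiece-tokenize with the XLM-R model, prepend <bos>, and truncate to 512 tokens. The sketch below mocks the tokenizer with a whitespace split over a toy vocabulary (the real pipeline uses the 250k XLM-R sentencepiece model); the token ids are illustrative only.

```python
BOS_ID, UNK_ID, MAX_LEN = 0, 3, 512

def preprocess_source(sentence, vocab, max_len=MAX_LEN):
    """Map a source sentence to ids: <bos> + token ids, truncated to max_len.

    `vocab` stands in for the XLM-R sentencepiece model; a real
    implementation would call spm.encode(sentence) instead of split().
    """
    ids = [vocab.get(tok, UNK_ID) for tok in sentence.split()]
    return ([BOS_ID] + ids)[:max_len]

toy_vocab = {"wie": 10, "geht": 11, "es": 12}   # illustrative ids
print(preprocess_source("wie geht es dir", toy_vocab))  # "dir" maps to UNK
```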
Model settings We use the XLM-R base model as the off-the-shelf MPE, which is trained on 100 languages and has 270M parameters. The Transformer model is implemented with the fairseq toolkit (Ott et al., 2019). We set the Transformer encoder to the same size as the XLM-R base model. The dimension of the decoder hidden states is the same as the encoder's, as required by the encoder-decoder attention module. For the preliminary exploration, we use 6 decoder layers and set the FFN dimension to 2048 (SmallDec; see Strategies (1)-(6) in Table 1) when the decoder is randomly initialized. When initializing the decoder with XLM-R, we use 12 layers, set the FFN dimension to 3072 (BigDec) and randomly initialize the encoder-decoder attention module; the remaining parameters are copied from XLM-R. For all experiments, we tie the decoder input and output embeddings. We remove the residual connection at the 11th (penultimate) encoder layer, which is selected on the validation set.
We use Adam (Kingma and Ba, 2015) and label smoothing for training. The learning rate is 0.0005 with 4000 warmup steps in the first stage. For the second training stage, we set the learning rate to 0.0001 and do not use warmup. All dropout probabilities are set to 0.3. The batch size is 32k tokens. The maximum number of updates is 200k for the first stage and 30k for the second. For all experiments, results are reported on a single model without checkpoint ensembling. We use beam search (beam size 5) and do not tune the length penalty. We evaluate the results with sacrebleu 4 (Post, 2018). Unless otherwise indicated, the best checkpoint is selected by zero-shot cross-lingual transfer performance on the validation set.
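The stage-specific optimization settings above can be collected in one place. A sketch of the schedule as plain dicts (names are our own, and the label smoothing coefficient is assumed to be 0.1, the common fairseq default; the paper trains with fairseq, whose command-line flags differ):

```python
COMMON = dict(optimizer="adam", label_smoothing=0.1, dropout=0.3,
              batch_tokens=32_000, beam_size=5)

STAGES = {
    1: dict(lr=5e-4, warmup_steps=4000, max_updates=200_000,
            train="decoder layers only"),
    2: dict(lr=1e-4, warmup_steps=0, max_updates=30_000,
            train="all parameters except the encoder embedding"),
}

def stage_config(stage):
    """Merge the shared settings with the stage-specific ones."""
    return {**COMMON, **STAGES[stage]}

print(stage_config(1)["lr"], stage_config(2)["max_updates"])
```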
Baselines We compare our model with the vanilla Transformer and four representative methods of applying pretrained language models to downstream tasks.
• Vanilla Transformer. The decoder uses either the base size (F2048L6) or the XLM-R size (F3072L12), where F and L denote the FFN dimension and the number of decoder layers. Other hyper-parameters are the same as SixT.
• +Fine-tune encoder. This method initializes the encoder with XLM-R and directly fine-tunes all parameters.
• +Fine-tune all. All parameters except the cross-attention module are initialized with XLM-R, and all are directly fine-tuned.
• +XLM-R as encoder embedding. This method uses the XLM-R last-layer output as the encoder input of the NMT model and keeps XLM-R frozen.
• BERT-fused model (Zhu et al., 2020). The XLM-R output is fused into the encoder and decoder separately via the attention mechanism. The parameters of XLM-R are frozen.

Results
Preliminary exploration results The results of the preliminary exploration on the Europarl De-En training set are shown in Table 1. Since Strategies (8)-(9) use a larger decoder than the others, we add Strategy (10), whose decoder size is the same as Strategies (8)-(9), for a fair comparison. Overall, we observe that it is best to use a big decoder, initialize the decoder embedding and all encoder parameters with the XLM-R model, and train the decoder layers from scratch (Strategy (10)). First, the results of Strategies (2)-(5) confirm that it is necessary to freeze the parameters of the XLM-R initialized encoder to prevent the pretrained weights from being washed out by supervised training. Second, Strategies (5)-(7) show that freezing the decoder embedding with XLM-R parameters improves transfer performance. However, given that Strategy (10) is better than Strategies (8)-(9), it is best to train the decoder layers from scratch. Third, we notice that Strategy (10) improves over Strategy (7) by 0.9 BLEU. These two strategies use the same initialization and training methods, except for decoders of different sizes. Since the encoder is frozen, the model has to rely more on the decoder for translation. Therefore, we hypothesize that it is necessary to use a larger decoder than the vanilla Transformer base model in our setting.
To verify this, we train the vanilla Transformer with the base and big (XLM-R sized) decoders from scratch using the same training corpus and compare their performance on the supervised task (De-En). 5 The models with the base and big decoders obtain 23.5 and 22.9 BLEU on the De-En test set, respectively. That the BLEU of BigDec even degrades confirms our assumption. Therefore, we follow the settings of Strategy (10) in Table 1 for the remaining experiments unless otherwise stated.

Main results

Table 2 illustrates the performance of the proposed SixT compared with the baselines. SixT gets 19.4 average BLEU, improving over the best baseline by 8.6 BLEU. This shows that SixT successfully trades off between the supervised translation and cross-lingual transfer tasks. Typical methods that incorporate BERT into NMT work poorly when simply replacing BERT with XLM-R; for example, the BERT-fused model gets 1.7 average BLEU. This observation demonstrates the necessity of studying the ZeXT task. XLM-R has achieved success on zero-shot cross-lingual transfer for NLU tasks. However, on natural language generation (NLG) tasks like NMT, its cross-lingual transfer ability is still under-explored. Previous works illustrate that sequence-to-sequence pretrained models such as mBART (Liu et al., 2020b) and mT5 (Xue et al., 2020) are able to generalize to unseen source languages on NLG tasks. In our experiments, we demonstrate that multilingual pretrained encoders also have such ability if fine-tuned properly on downstream NLG tasks. We leave the exploration of cross-lingual transfer using XLM-R on other NLG tasks as future work.

Table 4: Comparison with mBART-FT, CRISS and m2m-100 on zero-shot many-to-English test sets. mBART-FT follows the original paper (Liu et al., 2020b) for fine-tuning. 'Data' is the number of sentences in the NMT training set. 'Avg.' is the average BLEU across all these test directions.
Ablation study We conduct an ablation study of the proposed SixT on the Europarl De-En training dataset, as shown in Table 3. We first remove the position disentangled encoder by keeping all residual connections in the second training stage (w/o Resdrop). Then we gradually remove the second training stage (w/o TwoStage) and replace the big decoder with the Transformer base decoder (w/o BigDec). The results show that all three methods improve performance on the zero-shot translation task. TwoStage brings the largest improvement of 2.0 average BLEU; BigDec and Resdrop improve 0.9 and 0.6 average BLEU, respectively. We note that the supervised task (De-En) improves with TwoStage and BigDec but degrades with Resdrop. This is expected, since Resdrop builds a more language-agnostic encoder. In addition, we see that the zero-shot performance is related to both supervised performance and transfer ability: by enhancing either the supervised direction (with TwoStage and BigDec) or the zero-shot direction (with Resdrop), the overall zero-shot translation performance can be improved.

More Explorations
Comparison with Multilingual NMT We compare our proposed SixT model with mBART (Liu et al., 2020b), CRISS (Tran et al., 2020) and m2m-100 (Fan et al., 2020a). mBART 6 is a strong pretrained multilingual language model for NMT. We follow its original setting and directly fine-tune all model parameters on the WMT19 De-En training set, selecting the model with the cross-lingual validation set. CRISS and m2m-100 are the SOTA unsupervised and supervised multilingual NMT models, respectively. The CRISS model is initialized with mBART and fine-tuned on 180 language directions (90 language pairs) from CCMatrix. m2m-100 is a large Transformer trained on huge parallel data across 2200 language directions, with 7.5B parallel sentences from CCMatrix and CCAligned as well as additional back-translations. Both CRISS and m2m-100 are many-to-many multilingual NMT models.

6 https://github.com/pytorch/fairseq/blob/master/examples/mbart
To compare with these models, we train a many-to-one SixT model with the WMT19 German-English training data, which consists of only 43M sentence pairs. It only requires a pretrained XLM-R large model and does not use any data in other languages. To make the model sizes comparable, the decoder uses the XLM-R base configuration and is trained from scratch. The hyper-parameters are the same as those in Section 4.1. We compare these models on test sets from various language groups, such as the Germanic group (De, Nl), Romance group (Es, Ro, It), Uralic group (Fi, Lv, Et), Indo-Aryan group (Hi, Ne, Si, Gu) and East Asian languages (Zh, Ja, Ko). More details of the datasets are in the appendix. The official models of CRISS 7 (680M parameters) and m2m-100 8 (418M parameters) are used for comparison.
From the results in Table 4, the SixT model gets better results than mBART and slightly better performance than CRISS and m2m-100: its average BLEU across all languages is 7.1, 0.5 and 1.4 higher than that of mBART, CRISS and m2m-100, respectively. First, the results of SixT are surprising given that the SixT model does not use any monolingual or parallel texts except the German-English training data. Second, the performance gain over mBART shows that with a proper fine-tuning strategy, a pretrained multilingual encoder has better cross-lingual transfer ability on NMT tasks. Finally, our model can transfer to resource-poor languages like Ne and Si and serve as a good starting point for multilingual NMT. The performance of our model could be further improved with training data from more languages; we leave this as a future direction.

Table 5: The BLEU results of SixT using training data of different language pairs. The best BLEU of each language pair is bold and underlined. The Avg. column is the average BLEU over all language pairs.

Performance on the supervised language pair Does the SixT model gain cross-lingual transfer ability at the cost of losing translation ability on the given language pair l_s-l_t? We compare the vanilla Transformer large model and the SixT model on the same language pairs as the training set. Different from previous experiments, the best checkpoint of the vanilla Transformer is selected on the validation set of l_s-l_t this time; the validation set for SixT is the same as in Section 5. The performance of SixT is slightly lower than that of the vanilla Transformer for large training sets, but it gets better performance when the training set contains fewer sentences. Hindi-to-English is an exception, where SixT also has lower BLEU. In conclusion, the SixT model gains cross-lingual transfer ability on zero-shot language pairs while maintaining its performance on the language pair that appears in the training corpus.
Performance vs. training corpus size To examine the relationship between cross-lingual transfer ability and training data size, we compare the zero-shot BLEU scores of models trained on different De-En datasets, namely Europarl (2M), WMT16 (4.6M) and WMT19 (43M). The results are shown in Table 7. We conclude that increasing the training data size helps to improve zero-shot translation performance.
Performance on non-English-centric language pairs Can the SixT model be transferred to non-English language pairs? We examine the SixT model on many-to-German zero-shot translation tasks. The model is trained with XLM-R base on WMT16 En-De and WMT19 En-De, respectively. The Fi-De dataset is used as the validation set. From the results shown in Table 8, the proposed SixT model can also be transferred to unseen source languages when the target language is not English. Again, the results confirm that the cross-lingual transfer ability improves with larger training data.

Related Work
Zero-shot cross-lingual transfer learning Multilingual pretrained models (MPM), such as mBERT (Wu and Dredze, 2019), XLM-R (Conneau et al., 2020), mBART (Liu et al., 2020b), and mT5 (Xue et al., 2020), have achieved success on zero-shot cross-lingual transfer for various NLP tasks. The models are pretrained on large multilingual corpora with a shared vocabulary. After pretraining, a model is fine-tuned on labeled data of a downstream task in one language and directly tested on other languages in a zero-shot manner. While multilingual pretrained models with a sequence-to-sequence architecture (Liu et al., 2020b; Xue et al., 2020; Chi et al., 2020) work well on cross-lingual transfer for NLG tasks (Hu et al., 2020), multilingual pretrained encoders (Wu and Dredze, 2019; Conneau and Lample, 2019; Conneau et al., 2020) are mainly applied to cross-lingual NLU tasks (Clark et al., 2020). In this work, we explore how to fine-tune an off-the-shelf multilingual pretrained encoder for zero-shot cross-lingual transfer in neural machine translation, a typical NLG task.
Pretrained models for NMT Some previous works (Yang et al., 2020; Weng et al., 2020; Ma et al., 2020) explore integrating pretrained language models into NMT models with various strategies. Zhu et al. (2020) propose a BERT-fused model that incorporates BERT into both the encoder and decoder via the attention mechanism and a drop-net trick. Recently, Liu et al. (2020b) and Tang et al. (2020) fine-tune the pretrained mBART model for the supervised NMT task. Stickland et al. (2020) also leverage the mBART model for multilingual NMT by fine-tuning on multilingual bitexts; they suggest freezing the decoder except for the cross attention. Different from these works, we focus on zero-shot cross-lingual transfer in the NMT task, where texts of the to-be-transferred source languages are not available.

Conclusion
In this paper, we focus on the zero-shot cross-lingual NMT transfer (ZeXT) task, which aims at leveraging an MPE for machine translation while preserving its cross-lingual transfer ability. In this task, only a multilingual pretrained encoder such as XLM-R and one parallel dataset such as German-English are available. We propose SixT for this task, which enables zero-shot cross-lingual transfer for NMT by making full use of the labeled data and enhancing the transfer ability of XLM-R. Extensive experiments demonstrate the effectiveness of SixT. In particular, SixT outperforms mBART, a pretrained sequence-to-sequence model explicitly designed for NMT, on the ZeXT task. It also gets better performance than CRISS and m2m-100, two strong multilingual NMT models, on many-to-English test sets with less training data and computation cost.

A Dataset
The datasets are from the WMT translation tasks, the CCAligned corpus 9, the WAT21 translation task 10, the FLORES test sets 11 and the Tatoeba test sets 12. More details are in Tables 9 to 11.

Table 11: Dataset used for English-to-German translation in Section 5.