Towards Making the Most of Cross-Lingual Transfer for Zero-Shot Neural Machine Translation

This paper demonstrates that multilingual pretraining and multilingual fine-tuning are both critical for facilitating cross-lingual transfer in zero-shot translation, where the neural machine translation (NMT) model is tested on source languages unseen during supervised training. Following this idea, we present SixT+, a strong many-to-English NMT model that supports 100 source languages but is trained with a parallel dataset in only six source languages. SixT+ initializes the decoder embedding and the full encoder with XLM-R large and then trains the encoder and decoder layers with a simple two-stage training strategy. SixT+ achieves impressive performance on many-to-English translation. It significantly outperforms CRISS and m2m-100, two strong multilingual NMT systems, with average gains of 7.2 and 5.0 BLEU, respectively. Additionally, SixT+ offers a set of model parameters that can be further fine-tuned for other unsupervised tasks. We demonstrate that initializing with SixT+ outperforms state-of-the-art explicitly designed unsupervised NMT models on Si<->En and Ne<->En by over 1.2 average BLEU. When applied to zero-shot cross-lingual abstractive summarization, it produces an average performance gain of 12.3 ROUGE-L over mBART-ft. We conduct detailed analyses to understand the key ingredients of SixT+: the multilinguality of the auxiliary parallel data, the positional disentangled encoder, and the cross-lingual transferability of its encoder.


Introduction
Neural machine translation (NMT) systems (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) have demonstrated superior performance with large amounts of parallel data. However, the performance of most existing NMT systems degrades when the labeled data is limited (Koehn and Knowles, 2017; Goyal et al., 2021). To address this problem, unsupervised NMT, in which no parallel corpora are available, is drawing increasing attention. Some prior work (Johnson et al., 2017; Chen et al., 2017; Gu et al., 2019; Zhang et al., 2020) uses pivot-based methods for zero-shot translation between unseen language pairs. In this setting, both source and target languages have parallel data with a pivot language. However, these approaches are infeasible for rare languages where a parallel dataset of any kind is hard to collect. Another line of work (Guzmán et al., 2019; Ko et al., 2021; Garcia et al., 2021) builds unsupervised NMT through back-translation and further enhances its performance by cross-lingual transfer from auxiliary languages. These methods are usually complicated, with multiple iterations of back-translation and a combination of various training objectives. Moreover, their models can only support one or several pre-specified translation directions. Recently, Chen et al. (2021) propose SixT, a transferability-enhanced fine-tuning method that better adapts XLM-R (Conneau et al., 2020) for translating unseen source languages. SixT is trained once to support all languages involved in XLM-R pretraining as the source language. However, they focus on exploring a proper fine-tuning approach and build SixT with the parallel dataset from one auxiliary language, which heavily limits the model's zero-shot translation performance.
In this paper, we present SixT+, a strong many-to-English NMT model that can support as many as 100 source languages with parallel datasets from only six language pairs. SixT+ is trained by applying SixT to multilingual fine-tuning with large-scale data. We first initialize the encoder and embeddings of SixT+ with XLM-R and then train it with a two-stage training method. At the first stage, we only train the decoder layers, while at the second stage, we disentangle the positional information of the encoder and jointly optimize all parameters except the embeddings. SixT+ improves over SixT by keeping the decoder embeddings frozen during the whole training process, which speeds up model training while reducing the model size. SixT+ is trained once to support all source languages and can be further extended to many-to-many NMT that supports multiple target languages. It is not only a strong multilingual NMT model but can also be fine-tuned for other unsupervised tasks, including unsupervised NMT and zero-shot cross-lingual transfer for natural language understanding (NLU) and natural language generation (NLG) tasks.
Extensive experiments demonstrate that SixT+ works remarkably well. For translating to English, SixT+ significantly outperforms all baselines across 17 languages, including CRISS and m2m-100, two strong unsupervised and supervised multilingual NMT models trained with 1.8B and 7.5B sentence pairs, respectively. The many-to-many SixT+ obtains better performance than m2m-100 in 6 out of 7 target languages on the Flores101 testset. When serving as a pretrained model, SixT+ also performs impressively well. For unsupervised NMT of rare languages, SixT+ initialization achieves better unsupervised performance than various explicitly designed unsupervised NMT models, with an average gain of over 1.2 BLEU. For zero-shot cross-lingual transfer on NLU, it significantly outperforms XLM-R on sentence retrieval tasks while maintaining the performance on most other tasks. On the zero-shot cross-lingual abstractive summarization task, SixT+ improves over mBART-ft by 12.3 average ROUGE-L across 5 zero-shot directions. Finally, we conduct detailed analyses to understand the key ingredients of SixT+, including the multilinguality of the auxiliary parallel data, the positional disentangled encoder, and the cross-lingual transferability of its encoder.

SixT+
SixT+ aims at building a strong many-to-English NMT model, especially for the zero-shot directions. We argue that multilingual pretraining and multilingual fine-tuning are both critical for this goal. Therefore, we initialize SixT+ with XLM-R large and fine-tune it on the multilingual parallel dataset with a simple two-stage training method. The code and pretrained models are available at https://github.com/ghchen18/acl22-sixtp.

Data: AUX6 corpus
We utilize De, Es, Fi, Hi, Ru, and Zh as the auxiliary source languages, which are high-resource languages from different language families. We do not add more auxiliary languages, to limit the computation cost and the training data size. The training data is from the WMT and CCAligned datasets and consists of 120 million sentence pairs. We concatenate the validation sets of the auxiliary languages for model selection. We denote this dataset as AUX6. More dataset details are in the appendix. Following Conneau and Lample (2019), sentences of the i-th language pair are sampled according to the multinomial distribution

q_i = p_i^α / Σ_j p_j^α,

where p_j is the percentage of language j in the training dataset and we set the hyper-parameter α to 0.2. In all experiments, all texts are tokenized with the same sentencepiece (Kudo, 2018) tokenizer as XLM-R.


Model

Stage 1: Decoder Training. To preserve the cross-lingual transferability of XLM-R, we first train the decoder while keeping the encoder frozen:

θ̂_dec = argmax_{θ_dec} Σ_{(x, y) ∈ D} log P(y | x; θ_enc, θ_dec),

where D = {D_1; ...; D_K} is a collection of parallel datasets in K auxiliary languages, (x, y) is a parallel sentence pair with source language i, θ_enc is the (frozen) encoder parameter set, and θ_dec is the parameter set of the decoder layers.
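The temperature-based language sampling described in the data section (α = 0.2) can be sketched in a few lines of plain Python. The corpus sizes below are made-up values for illustration, not the real AUX6 statistics:

```python
def sampling_probs(counts, alpha=0.2):
    """Temperature-based sampling: q_i = p_i^alpha / sum_j p_j^alpha,
    where p_i is the fraction of the corpus in language i.
    Smaller alpha flattens the distribution, up-sampling rarer languages."""
    total = sum(counts.values())
    p = {lang: n / total for lang, n in counts.items()}
    z = sum(v ** alpha for v in p.values())
    return {lang: v ** alpha / z for lang, v in p.items()}

# Hypothetical corpus sizes (millions of sentence pairs), for illustration only.
counts = {"de": 40, "es": 30, "fi": 5, "hi": 2, "ru": 25, "zh": 18}
q = sampling_probs(counts)
# With alpha = 0.2, a low-resource language such as Hi is sampled far more
# often than its raw share of the corpus, while De is down-sampled.
```

With α = 1 the sampler would follow the raw corpus proportions; α = 0.2 pushes the distribution toward uniform, which is why the auxiliary low-resource pairs still get seen often enough during fine-tuning.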
Stage 2: Fine-tuning. Freezing the encoder parameters limits the NMT model capacity, especially with large-scale training data. Therefore, we jointly train the full model in a second stage:

θ̂ = argmax_θ Σ_{(x, y) ∈ D} log P(y | x; θ),

where θ is the parameter set of both encoder and decoder layers. Different from SixT, which fine-tunes the decoder embedding, we keep the embeddings fixed during the whole training process (see Figure 1). Our preliminary experiments find that this strategy leads to higher computational efficiency without degrading performance.
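The two-stage freezing schedule can be illustrated with a toy parameter-group sketch. The group names here are illustrative, not the actual SixT+ module names; in a real implementation one would toggle `requires_grad` on the corresponding PyTorch modules:

```python
def trainable_groups(stage):
    """Return which parameter groups receive gradient updates in each stage.
    Stage 1: only the (randomly initialized) decoder layers are trained,
    so the XLM-R encoder keeps its cross-lingual representations.
    Stage 2: encoder and decoder layers are trained jointly.
    The embeddings stay frozen throughout, which reduces the trainable
    parameter count and speeds up training."""
    if stage == 1:
        return {"decoder_layers"}
    if stage == 2:
        return {"encoder_layers", "decoder_layers"}
    raise ValueError("stage must be 1 or 2")

# The embeddings are never unfrozen, in either stage.
assert all("embeddings" not in trainable_groups(s) for s in (1, 2))
```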
Positional Disentangled Encoder. The positional disentangled encoder (PDE) is reported to improve zero-shot NMT in previous work (Liu et al., 2021; Chen et al., 2021). The positional correspondence between the input tokens and the encoder representations is one of the factors that makes the encoder representations language-specific. PDE relaxes this correspondence by removing a residual connection in an encoder layer. We refer the readers to Liu et al. (2021) and Chen et al. (2021) for more details. In SixT+, we remove the residual connection after the self-attention sublayer of the 23rd (penultimate) encoder layer at the second training stage, as suggested by Chen et al. (2021). For simplicity, we denote the two-stage training method with PDE as TransF in the following sections.
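A minimal sketch of what removing the residual connection means, using stub sublayers in place of real self-attention and feed-forward networks (everything here is illustrative; layer norms and the real 24-layer XLM-R stack are omitted):

```python
def encoder_layer(x, self_attn, ffn, disentangle_position=False):
    """One simplified Transformer encoder layer.
    With disentangle_position=True, the residual connection around the
    self-attention sublayer is dropped, so the output no longer carries a
    direct, position-indexed copy of the input representation."""
    attended = self_attn(x)
    h = attended if disentangle_position else [xi + ai for xi, ai in zip(x, attended)]
    return [hi + ffn(hi) for hi in h]  # the FFN sublayer keeps its residual

# Stub sublayers: "attention" averages the sequence, "FFN" halves each value.
mean_attn = lambda x: [sum(x) / len(x)] * len(x)
half_ffn = lambda v: v / 2

x = [1.0, 2.0, 3.0]
with_residual = encoder_layer(x, mean_attn, half_ffn)
without_residual = encoder_layer(x, mean_attn, half_ffn, disentangle_position=True)
# Without the residual, every position collapses to the same value in this toy
# setting, i.e. the positional correspondence to the input is relaxed.
```

In SixT+ the flag would be enabled only for the penultimate encoder layer and only at the second training stage.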
Zero-Shot Neural Machine Translation

Experiment Settings
For the many-to-English translation task, we evaluate the performance of SixT+ on the test sets of 23 language pairs from 9 language groups: German group (De, Nl), Romance group (Es, Ro, It), Uralic and Baltic group (Fi, Lv, Et), Slavic group (Ru, Pl), Arabic group (Ar, Ps), Indo-Aryan group (Hi, Ne, Si, Gu), Turkic group (Tr, Kk), East Asian group (Zh, Ja, Ko), and Khmer group (My, Km). The dataset details are in the appendix. For decoding, we use beam search with beam size 5 for all translation directions and do not tune the length penalty. We report detokenized BLEU for all directions using sacrebleu.
We compare SixT+ with SixT and four other baselines. Among the four baselines, XLM-R ft-all and mBART-ft use the same training data as SixT+, while CRISS and m2m-100 are trained on 1.8B and 7.5B sentence pairs, respectively. As SixT+, CRISS, and m2m-100 have different model sizes, support different numbers of languages, and are trained on different datasets, the comparisons are not completely fair, but the results can still demonstrate the strong performance of SixT+.
• CRISS (Tran et al., 2020). This model is the state-of-the-art unsupervised many-to-many multilingual NMT model. It is initialized with mBART and fine-tuned on 180 translation directions from CCMatrix. It only supports 25 input languages.
• m2m-100 (Fan et al., 2020). This model is a strong supervised many-to-many multilingual NMT model. It is a large Transformer trained on huge parallel data across 2,200 translation directions, with 7.5B parallel sentences from CCMatrix and CCAligned as well as additional back-translations. The official 1.2B model is evaluated.
• SixT (Chen et al., 2021). This model motivates SixT+. The SixT model trained with XLM-R large on WMT19 De-En is evaluated and compared.
• mBART-ft (Liu et al., 2020; Tang et al., 2020). mBART is a strong pretrained multilingual seq2seq model; we use mBART50 from Tang et al. (2020). We follow their setting and directly fine-tune all model parameters on the AUX6 corpus.
• XLM-R ft-all (Conneau and Lample, 2019). This method is the same as SixT+ but utilizes a different fine-tuning method that directly optimizes all model parameters.

Main Results
As shown in Table 1, SixT+ outperforms all baselines with an average gain of 5.0 to 7.2 BLEU. The performance of SixT+ is impressive given that it does not use any other monolingual or parallel texts beyond the 0.12B parallel sentence pairs. First, the significant improvement over mBART-ft demonstrates that the multilingual pretrained encoder XLM-R can also yield a strong zero-shot many-to-one translation model if fine-tuned properly. Second, SixT+ is significantly better than XLM-R ft-all and SixT+ (1st), proving that a proper fine-tuning method is important for zero-shot translation. Finally, the gain of SixT+ over SixT shows that adding more auxiliary languages and more parallel data benefits performance. SixT+ achieves new state-of-the-art performance on unsupervised many-to-English translation. It is significantly better than CRISS in all 14 unsupervised directions. When comparing with supervised models, SixT+ improves over m2m-100 on 17 out of 23 translation directions. Although CRISS and m2m-100 are many-to-many NMT models that may face the insufficient modeling capacity problem (Zhang et al., 2020), they are strong many-to-English baselines trained with much more data (1.8 billion sentence pairs for CRISS and 7.5 billion for m2m-100) and computation. Moreover, the model size of m2m-100 is much larger than that of SixT+.
Different from previous unsupervised NMT models built with back-translation on monolingual data (Lample et al., 2018a,b) or parallel data mining (Tran et al., 2020), SixT+ illustrates that better unsupervised NMT can be achieved by cross-lingual transfer from auxiliary languages. It improves on the test sets whose languages are in the same family as the auxiliary languages. For languages that are not in the same family as the auxiliary languages, SixT+ also works well. For instance, it improves My→En from 6.7 to 15.3 BLEU, Ps→En from 10.9 to 14.9 BLEU, and Kk→En from 20.7 to 27.3 BLEU.

Analysis
Many-to-Many SixT+. SixT+ can be extended to support other or multiple target languages. Following Zhang et al. (2020), we build a many-to-many SixT+ (SixT+ m2m) model and switch between different target languages with a target-language-aware linear projection layer between the encoder and the decoder. The linear layers are randomly initialized and trained in both training stages. The model is also trained on AUX6, but additionally includes the En→{De,Es,Fi,Hi,Ru,Zh} translation directions during supervised training and validation. All the other training details are the same. We evaluate the performance of SixT+ m2m on the Flores101 testset (Goyal et al., 2021), a multilingual aligned benchmark that covers 101 different languages. Following previous work (Fan et al., 2020), we report tokenized BLEU when Hindi and Chinese are the target language and detokenized BLEU for other target languages. We compare it with the m2m-100 (1.2B) model in Table 2; the results on each source language are in Table 12 of the appendix. Overall, our model outperforms m2m-100 in 6 out of 7 target languages. This is impressive given that our model is unsupervised. SixT+ m2m performs more evenly across different source languages (see Table 12 in the appendix). In contrast, the performance of m2m-100 varies across languages. Our model learns to translate through effective cross-lingual transfer, while m2m-100 relies heavily on the scale and quality of the direct parallel dataset. We also compare SixT+ m2En and SixT+ m2m for translating to English on this testset and get average BLEU scores of 30.5 and 29.8, respectively (see Table 12 in the appendix). The results demonstrate that SixT+ m2m successfully supports seven target languages while keeping most of the performance of SixT+ m2En on the many-to-English testset.
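The target-language-aware linear projection can be sketched as follows. The real SixT+ m2m model would use learned weight matrices between the encoder and decoder; here each "projection" is a stub scaling function, and all names are illustrative:

```python
class TargetAwareProjection:
    """Sketch of a target-language-aware projection between encoder and
    decoder: one map per supported target language, selected at run time
    from the requested target-language tag."""

    def __init__(self, target_langs):
        # One (here: stub) projection per supported target language. In the
        # real model these would be randomly initialized linear layers
        # trained in both training stages.
        self.proj = {lang: (lambda h, k=i: [v * (1.0 + 0.1 * k) for v in h])
                     for i, lang in enumerate(target_langs)}

    def __call__(self, encoder_out, target_lang):
        return self.proj[target_lang](encoder_out)

bridge = TargetAwareProjection(["en", "de", "zh"])
h = [1.0, 2.0]  # toy encoder states
# The same encoder states are routed through a different projection
# depending on the requested target language.
assert bridge(h, "en") != bridge(h, "de")
```

The design keeps a single shared encoder and decoder; only the small per-language bridge differs, which is how one model can switch among seven target languages.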
Effect of the Multilinguality of Auxiliary Languages. Previous studies report that adding more parallel data and more auxiliary languages improves performance for unsupervised NMT (García et al., 2020; Bai et al., 2020; Garcia et al., 2021). In this experiment, we examine whether increasing multilinguality under a fixed data budget improves the zero-shot performance of SixT+. We fix the amount of auxiliary parallel sentence pairs to 8 million and vary the number of auxiliary languages.
We report the results in Table 3. The model trained with four auxiliary languages (De, Es, Fi, Ru, each with the same data size) outperforms that trained with one auxiliary language (De), with an average gain of 3.7 BLEU. Note that in both cases we use auxiliary languages outside the Indo-Aryan group to remove the impact of language similarity. This observation demonstrates the necessity of utilizing multiple auxiliary languages in the training dataset.

Effect of Positional Disentangled Encoder
In this part, we conduct a comprehensive study of the effect of the positional disentangled encoder (PDE) (Liu et al., 2021; Chen et al., 2021). Table 4 presents the results. We find that on the small-scale Europarl dataset, PDE improves the zero-shot performance with an average gain of 1.0 BLEU. However, when the training data becomes larger and/or more multilingual, the gain decreases (see results on WMT19 and AUX6). To confirm this, we also conduct experiments on SixT+ m2m (see Table 12 in the appendix). For translating to English, the models with and without PDE perform comparably well. However, for translating to other languages, PDE improves in 5 out of 6 directions, with an average gain of 0.4 BLEU. This is expected, as these directions include only one source language (En) and much less training data (7M to 41M sentence pairs) than translating to English (120M). In summary, when large-scale multilingual training data are available for all target languages, it is fine to remove PDE; we suspect the model has already learned language-agnostic encoder representations in this case. Otherwise, PDE benefits zero-shot performance.
Performance on Cross-lingual NLU Tasks. To better understand the encoder representations produced by SixT+, we evaluate the zero-shot cross-lingual transfer performance of the SixT+ encoder on the XTREME benchmark (Hu et al., 2020). XTREME includes 9 natural language understanding target tasks. We do not report results on XQuAD and MLQA as they have no held-out test data (Phang et al., 2020). For all other XTREME tasks, we follow the training and evaluation protocol in Hu et al. (2020) and implement with the jiant toolkit (Phang et al., 2020). As NMT training can be regarded as an intermediate task (Pruksachatkun et al., 2020), we include previous results on using English intermediate NLU tasks to improve XLM-R on XTREME as a reference (Phang et al., 2020).
Table 5 provides the average results for each task.
The detailed results are in the appendix. Overall, SixT+ encoders achieve 8.3% and 31.6% performance gains over XLM-R and XLM-R ft-all, respectively, across the seven tasks, which verifies that our model learns more language-agnostic encoder representations. Our encoder may learn better sentence-level representations and capture better semantic alignments among parallel sentences through multilingual NMT training; therefore it generally performs better on sentence-pair tasks (XNLI and PAWS-X) and sentence retrieval tasks (BUCC and Tatoeba). The results show the potential of leveraging an NLG task as the intermediate task for improving performance on XTREME. We leave a more detailed exploration of why NMT training, as well as other NLG intermediate tasks, could be beneficial for a given NLU task as future work.

SixT+ as a Pretrained Model
SixT+ learns language-agnostic encoder representations and performs impressively well on translating various source languages. In this part, we extend SixT+ to two cross-lingual NLG tasks where direct labeled data is scarce, namely unsupervised NMT for low-resource languages and zero-shot cross-lingual abstractive summarization.

Unsupervised NMT for Low-Resource Languages
Given a low-resource language pair where no parallel dataset is available, early work on unsupervised NMT builds the translation model by training denoising autoencoding and back-translation concurrently (Lample et al., 2018b,a; Artetxe et al., 2018). However, these methods may lack robustness when the languages are distant (Kim et al., 2020; Marchisio et al., 2020). For example, Guzmán et al. (2019) report BLEU scores of less than 1.0 on the distant language pair Nepali-English using the method in Lample et al. (2018b). Recent work improves on this by better initializing the unsupervised NMT model, either with a multilingual pretrained language model (MulPLM) (Liu et al., 2020; Song et al., 2019; Ko et al., 2021) or with a multilingual NMT model (Lin et al., 2020). In this part, we follow this line and offer an alternative initialization option for building strong unsupervised NMT models. We first initialize the L_LR→En model with SixT+. As SixT+ only supports En as the target language, we initialize the En→L_LR model with XLM-R, following how SixT+ is initialized. Then we improve the reverse model with back-translation. For simplicity, we do not update the L_LR→En model and only train the reverse model once. We train it with a synthetic back-translation dataset generated from L_LR monolingual data, using the two-stage training method. We do not apply other unsupervised NMT techniques, such as iterative back-translation (Lample et al., 2018b), cross-translation (Garcia et al., 2021), or iterative mining of sentence pairs (Tran et al., 2020). These methods could be complementary to ours; we leave the in-depth exploration as future work.
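The one-round offline back-translation step above can be sketched as follows, where `translate_to_en` stands in for the ready-to-use SixT+ L_LR→En model (a hypothetical callable, not a real API):

```python
def build_backtranslation_data(monolingual_lr, translate_to_en):
    """Create synthetic (En -> low-resource) training pairs by translating
    low-resource monolingual sentences into English with the many-to-English
    model. The synthetic English output becomes the source side; the genuine
    low-resource sentence becomes the target side, so the reverse
    (En -> L_LR) model always learns to produce real text."""
    return [(translate_to_en(sent), sent) for sent in monolingual_lr]

# Stub "model" for illustration only.
fake_translate = lambda s: f"<en translation of: {s}>"
mono = ["nepali sentence 1", "nepali sentence 2"]
synthetic = build_backtranslation_data(mono, fake_translate)
# Each pair: (synthetic English source, genuine low-resource target).
```

The resulting pairs would then be used to train the En→L_LR model with the same two-stage method, with no iteration needed because the forward model is already strong.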
Experimental Settings. We evaluate our method on Ne and Si, two commonly used benchmark languages for evaluating low-resource language translation. The monolingual datasets of Ne and Si consist of 7 million sentences sampled from the CC100 and CCNet datasets. The test sets are from the Flores dataset (Guzmán et al., 2019). We set the beam size to 5 during the offline back-translation and select the model with the unsupervised criterion in Lample et al. (2018a). We compare with state-of-the-art supervised and unsupervised baselines. Please refer to the appendix for more details.

Results
The results are shown in Table 6. Our model outperforms all unsupervised baselines in all translation directions, improving over the best-performing unsupervised baseline by an average gain of 1.2 BLEU. In addition, it even outperforms all supervised baselines and achieves new state-of-the-art performance on Ne→En and En→Ne translations. This is impressive given that the supervised baselines of Guzmán et al. (2019) and Liu et al. (2020) are very strong: both are trained on around 600k parallel sentence pairs and more than 70M monolingual sentences with supervised translation and iterative back-translation. Our method is also computationally efficient and easy to implement. As SixT+ offers a ready-to-use L_LR→En NMT model, we only run back-translation once to build the reverse model. In contrast, the baselines (ID 2-3, 5-7) run iterative back-translation for no less than two rounds and involve cross-translation, denoising autoencoding, or adversarial losses. They are much more complex and computationally costly compared with our method.

Zero-shot Cross-lingual Generation
In zero-shot generation with source-side transfer, the NLG model is directly tested on source languages unseen during supervised training. As cross-lingual labeled data are scarce, such zero-shot generation is useful for cross-lingual generation, where the languages of the input and output text are different. In this experiment, we focus on utilizing SixT+ for zero-shot cross-lingual abstractive summarization (ZS-XSUM). We believe such a framework can be easily extended to other zero-shot cross-lingual generation tasks.

Table 7: ROUGE results for zero-shot cross-lingual abstractive summarization. For the ROUGE score, a higher value is better. 'Avg' is the average score over all zero-shot directions.
The ZS-XSUM task is challenging, as it requires the model to summarize (from document to abstract), translate (from input language to output language), and transfer (from auxiliary input languages to the target input language) at the same time. SixT+ already has the ability to translate and transfer, and thus offers a set of initialization parameters that can ease the learning of the ZS-XSUM model. Specifically, we initialize the ZS-XSUM model with SixT+ (1st) and then train on labeled data of abstractive summarization with the TransF method. The trained model is tested on cross-lingual summarization in a zero-shot manner, where the source language is unseen during training.
Experiment Settings. To build a strong ZS-XSUM model, we collect 1.2 million public document-summary pairs to form the training dataset, where the document is in one of En/De/Es/Fr/It/Pt/Ru and the summary is in En. We evaluate the performance on the Wikilingua dataset with Hi/Zh/Cs/Nl/Tr as source languages and English as the target language. All the test languages are unseen during training and validation. The dataset details are in the appendix. We compare the proposed method with mBART-ft, which directly fine-tunes all mBART parameters, and with applying our training method without NMT pretraining, denoted 'Ours w/o NMT pretraining'.

Results
As shown in Table 7, both of our methods outperform mBART-ft on all zero-shot directions, with average gains of 8.1 and 12.3 ROUGE-L. This is impressive given that mBART is a widely used MulPLM for cross-lingual generation. We also observe that initializing with SixT+ is much better than initializing with XLM-R under the same TransF training method, demonstrating that the NMT pretraining is beneficial for the ZS-XSUM task. To build a cross-lingual generation model without labeled data, previous work usually resorts to translate-and-train or translate-and-test approaches or their extensions (Shen et al., 2018; Duan et al., 2019). For these approaches, an NMT system is required to translate either at training or testing time. However, translate-and-train can only develop models for a few pre-specified source languages, while the decoding speed of translate-and-test is slow, especially for summarization, where the input text is long. Besides, both approaches rely heavily on the performance of the NMT system. SixT+ shows that it is possible to build a strong universal cross-lingual NLG model that supports 100 source languages. This is promising, especially for low-resource languages that NMT systems translate poorly. Our model can also serve as a starting point which can be further improved by fine-tuning on genuine or synthesized (produced by an NMT system) cross-lingual corpora. We leave more in-depth exploration as future work.
Related Work

Multilingual Neural Machine Translation
Early work on multilingual NMT shows its zero-shot translation capability, where the tested translation direction is unseen during supervised training (Johnson et al., 2017; Ha et al., 2016). To further improve the zero-shot performance, one direction is to learn language-agnostic encoder representations and make the most of cross-lingual transfer. Some approaches modify the encoder architecture to facilitate language-independent representations. Lu et al. (2018) incorporate an explicit neural interlingua after the encoder. Liu et al. (2021) and Chen et al. (2021) remove the residual connection at an encoder layer to relax the positional correspondence. Other works introduce auxiliary training objectives to encourage similarity between the representations of different languages (Arivazhagan et al., 2019; Al-Shedivat and Parikh, 2019; Pham et al., 2019; Pan et al., 2021). For example, Pan et al. (2021) utilize a contrastive loss to explicitly align the representations of a bilingual sentence pair. Recently, multilingual pretraining has been shown to implicitly learn language-agnostic representations (Liu et al., 2020; Conneau et al., 2020; Hu et al., 2020). Inspired by this, some studies initialize multilingual NMT with a MulPLM or introduce the training objectives of MulPLM into multilingual NMT (Gu et al., 2019; Ji et al., 2020; Liu et al., 2020; Chen et al., 2021; Garcia et al., 2021). Our work follows the last line but improves over them by making the most of the MulPLM with a simple yet effective fine-tuning method and a large-scale multilingual parallel dataset.

Zero-shot Translation with Multilingual Pretrained Language Model
For NLG tasks like neural machine translation, most work leverages multilingual pretrained seq2seq language models such as mBART (Liu et al., 2020), mT5 (Xue et al., 2021), and ProphetNet-X (Qi et al., 2021) for cross-lingual transfer. For example, Liu et al. (2020) fine-tune mBART with the parallel dataset of one language pair and test on unseen source languages. Despite the great success of multilingual pretrained encoders (MulPE) such as XLM-R (Conneau et al., 2020) and mBERT (Wu and Dredze, 2019) in zero-shot cross-lingual transfer for NLU tasks (Hu et al., 2020), their use for cross-lingual transfer in NLG tasks is still underexplored. Wei et al. (2021) fine-tune their proposed MulPE to conduct zero-shot translation but use the [CLS] representation as the encoder output.
Our work is most similar to SixT (Chen et al., 2021), as indicated by the name itself. However, since SixT focuses on designing a novel fine-tuning method, it conducts experiments with one auxiliary language, which heavily limits the model's performance. In addition, SixT only works on NMT, while SixT+ can not only perform translation but also serve as a pretrained model for various zero-shot cross-lingual generation tasks, such as low-resource NMT and cross-lingual abstractive summarization.

Conclusion
In this paper, we introduce SixT+, a strong many-to-English NMT model that supports 100 source languages but is trained once with parallel datasets from only six source languages. Our model makes the most of cross-lingual transfer by initializing with XLM-R and conducting multilingual fine-tuning on a large-scale dataset with a simple yet effective two-stage training method. Extensive experiments demonstrate that SixT+ outperforms all baselines on many-to-English translation. When serving as a pretrained model, SixT+ initialization achieves new state-of-the-art performance for unsupervised NMT of low-resource languages and significantly outperforms mBART and XLM-R on zero-shot cross-lingual summarization.

Table 12: BLEU comparison of our many-to-many NMT model (SixT+ m2m) with m2m-100 on zero-shot translations. We use a target-language-aware linear projection layer to generate different target languages for the SixT+ m2m model. Ours (m2En) is the many-to-English SixT+ model trained with the AUX6 dataset. We include the result of SixT+ m2m w/o PDE to help study the effect of PDE. The best average BLEU for each target language is bold and underlined.
• Unsupervised baselines. We include the results of three unsupervised methods. Guzmán et al. (2019) utilize Hi as the auxiliary language and train with auxiliary supervised translation and iterative back-translation. Garcia et al. (2021) utilize six auxiliary languages and present a three-stage method with various loss functions, including auxiliary supervised translation, iterative back-translation, denoising autoencoding, and cross-translation. Ko et al. (2021) fine-tune mBART on the parallel dataset from Hi and monolingual data in an iterative manner with auxiliary supervised translation, back-translation, denoising autoencoding, and an adversarial objective. Note that these methods utilize much more monolingual data than ours.

G XTREME benchmark results
All models are evaluated on the XTREME benchmark (Hu et al., 2020) with the jiant toolkit. We follow the same settings as Phang et al. (2020) for fine-tuning and testing. The detailed results for each language on each task are shown in Table 14 to Table 20.

Figure 1: Our proposed two-stage training framework (TransF) for building a cross-lingual NLG model with XLM-R. The blue icy blocks are initialized with XLM-R and frozen, while the red fiery blocks are initialized randomly or from the first stage. 'SA' denotes the self-attention sublayer. We remove the residual connection at the 23rd (penultimate) encoder layer at the second stage, namely i = 23 in the figure.
for reference. These two methods are very strong: both are trained on around 600k parallel sentence pairs and more than 70M monolingual sentences with supervised translation and iterative back-translation. Liu et al. (2020) initialize the model with mBART, while Guzmán et al. (2019) use an auxiliary parallel corpus from a related language for the Ne↔En translations.

Table 2: Averaged BLEU comparison of SixT+ m2m and m2m-100 on zero-shot translations. The detailed results are in Table 12 of the appendix.

Table 3: BLEU comparison of SixT+ trained with the same amount of training data consisting of different numbers of auxiliary languages. '4 Aux. Langs' is a combination of {De,Es,Fi,Ru}-En parallel datasets.

Table 4: The average BLEU of SixT+ with and without the positional disentangled encoder (PDE). Note that AUX6 includes more source languages. The detailed scores are in Table 13 of the appendix.

Table 5: XTREME benchmark results of our models and baselines. The results for individual languages can be found in Table 14 to Table 20 in the appendix.

Table 6: BLEU comparison of different models on low-resource language translation. Results with '†' are quoted from the original paper. The best unsupervised method for each translation direction is bold, while the best supervised method is underlined.