VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation

Existing work in multilingual pretraining has demonstrated the potential of cross-lingual transferability by training a unified Transformer encoder for multiple languages. However, much of this work relies only on a shared vocabulary and bilingual contexts to encourage correlation across languages, which is loose and implicit for aligning the contextual representations between languages. In this paper, we plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages. It effectively avoids the degeneration of predicting masked words conditioned only on the context of their own language. More importantly, when fine-tuning on downstream tasks, the cross-attention module can be plugged in or out on demand, thus naturally benefiting a wider range of cross-lingual tasks, from language understanding to generation. As a result, the proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark, covering text classification, sequence labeling, question answering, and sentence retrieval. For cross-lingual generation tasks, it also outperforms all existing cross-lingual models and state-of-the-art Transformer variants on the WMT14 English-to-German and English-to-French translation datasets, with gains of up to 1∼2 BLEU.


Introduction
Cross-lingual pre-trained models like mBERT (Devlin et al., 2019) have shown great potential on a variety of cross-lingual understanding and generation tasks. Behind this success, two major factors play the role of aligning the contextual representations between languages: 1) building a shared vocabulary across languages through subword tokenization, which supports the simple extension of masked language modeling (MLM) from English corpora to multilingual corpora; 2) capturing the alignment in parallel data by concatenating two sentences as input, called translation language modeling (TLM). However, both of these mechanisms rely on the self-attention module (query=key/value) of the Transformer encoder to implicitly enhance the interdependence between languages, which may lead to few attention patterns across languages. Taking Figure 1 as an example, even when fed a pair of parallel sentences, both models only attend to the English context to build the representations of English tokens, while ignoring the semantically related Chinese tokens. That is, the self-attention module captures little communication across languages, which is crucial for learning universal cross-lingual representations.

Figure 2: A schematic comparison of cross-lingual pre-training tasks and their attention matrices. When predicting the masked words of different languages: a) MLM can only attend to the context in its own language; b) TLM implicitly attends to a part of the words across languages (as shown in Figure 1). However, c) the proposed CA-MLM can (1) not only attend to the context in its own language to predict words x_2 and y_3, (2) but also first attend to its own context and then explicitly attend to all words across languages to predict words x_3 and y_2 via a plug-in cross-attention module.
Based on the above observation, we propose to plug a cross-attention module (query!=key/value) into the Transformer encoder and design a cross-attention MLM task to explicitly capture the interdependence between languages. As illustrated in Figure 2 (c), the cross-attention module takes the representation of x as the query and y as the key/value (purple lines) to build the representations of x in the next layer, thus explicitly aligning the representations across languages (purple attention matrices). It effectively avoids the degeneration of predicting masked words conditioned only on the context of their own language. Moreover, what distinguishes our work from pre-training an encoder-decoder model (Liu et al., 2020b) is that we also keep the good nature (i.e., bidirectional contextual modeling) of the original encoder by unplugging the cross-attention module from the model when predicting the masked words (e.g., x_2 and y_3).
Furthermore, when fine-tuning on various downstream tasks, we can choose to either plug in or plug out the cross-attention module on demand, making the model suitable for both cross-lingual language understanding (NLU) and generation (NLG) tasks. For cross-lingual NLU tasks, if the cross-attention module is plugged out, we can adopt the same fine-tuning methods as an encoder-only model like XLM. However, we find that plugging in the cross-attention module during fine-tuning can better utilize the bilingual context and boost performance. For cross-lingual NLG tasks like machine translation (MT), the cross-attention is already jointly pre-trained with the whole network. Therefore, the parameters of the decoder do not need to be re-adjusted substantially in the subsequent fine-tuning process, thus fundamentally solving the main drawback of initializing encoder-decoder models with pre-trained encoders like XLM.
We call our approach VECO for "Variable and Flexible Cross-lingual Pre-training". We validate VECO on a variety of representative cross-lingual understanding and generation benchmarks. Regarding cross-lingual understanding tasks, we conduct experiments on the XTREME benchmark, which consists of 9 cross-lingual tasks, including text classification, sequence labeling, question answering, and sentence retrieval. VECO ranks first on the XTREME leaderboard at the submission deadline. Regarding cross-lingual generation tasks, we validate VECO on the widely used WMT14 English-German and English-French machine translation benchmarks. VECO obtains 44.5 and 31.7 BLEU scores, consistently outperforming existing cross-lingual pre-training approaches and state-of-the-art Transformer variants by around 1∼2 BLEU.

Pre-training of VECO

Overview of VECO

VECO extends a multi-layer Transformer encoder and plugs a cross-attention module into each layer. Given a pair of inputs (x, y) and their corrupted versions (x̂, ŷ) obtained by randomly masking part of the tokens, the model builds two types of contextualized vector representations for each token:
• One suite of contextual representations H, denoted as green blocks and yellow blocks in Figure 2 (c), is built on the self-attention module only (i.e., unplugging the cross-attention module) in each layer.
• Another suite of contextual representations S, denoted as mixed-color blocks in Figure 2 (c), is built on both the self-attention and cross-attention modules.
The model is trained to predict the masked tokens via the two corresponding representations, conditioning on its own context and on the paired context, respectively. Taking the masked words in sequence x as an example, the training objective is the cross-entropy between the gold distribution and the predicted distributions P(x | x̂) and P(x | ŷ, x̂), computed via the two suites of contextual representations above. Thus, the training objective of cross-attention masked language modeling (CA-MLM) can be formulated as

\[
\mathcal{L}_{\text{CA-MLM}} = -\log P(x \mid \hat{x}; \theta_s) - \log P(y \mid \hat{y}; \theta_s) - \log P(x \mid \hat{y}, \hat{x}; \theta_s, \theta_c) - \log P(y \mid \hat{x}, \hat{y}; \theta_s, \theta_c)
\]

where θ_s and θ_c are the parameters of the self-attention and cross-attention modules, respectively.

Architecture
The backbone network of VECO is composed of a stack of N Transformer layers. Each layer has three modules: a required self-attention module, a plug-and-play cross-attention module, and a required feed-forward module. Both the self-attention and cross-attention modules are based on multi-head attention (Vaswani et al., 2017). An attention function can be described as mapping a query (Q) and a set of key-value (K-V) pairs to an output.
For the self-attention module, all the queries, keys, and values are the same representations from the previous layer. Specifically, for the l-th Transformer layer, the output of a self-attention head A^s_l is computed via:

\[
A^s_l = \mathrm{softmax}\!\left(\frac{(H^{l-1} W^Q_l)(H^{l-1} W^K_l)^{\top}}{\sqrt{d_k}}\right) H^{l-1} W^V_l
\]

where H^{l-1} is the previous layer's output and W^Q_l, W^K_l, W^V_l are the parameter matrices of the self-attention module.
For the cross-attention module, the queries come from the previous layer, and the keys and values come from the last layer's representations of the paired input. Specifically, for the l-th layer, the output of a cross-attention head A^c_l is computed via:

\[
A^c_l = \mathrm{softmax}\!\left(\frac{(S^{l-1} U^Q_l)(H^{L} U^K_l)^{\top}}{\sqrt{d_k}}\right) H^{L} U^V_l
\]

where S^{l-1} is the previous layer's output, H^L is the last-layer representation of the paired input, and U^Q_l, U^K_l, U^V_l are the parameter matrices of the cross-attention module.
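To make the layer structure concrete, below is a minimal PyTorch sketch of one VECO layer with a plug-and-play cross-attention module. It assumes post-layernorm residual connections; all class, argument, and module names are illustrative rather than the released implementation.

```python
# A minimal sketch of one VECO-style layer (assumption: post-layernorm residuals;
# names are illustrative, not the authors' released code).
import torch.nn as nn

class VecoLayer(nn.Module):
    def __init__(self, d_model=1024, n_heads=16, d_ff=4096):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Plug-and-play module: only used when a paired sequence is available.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.ln_self = nn.LayerNorm(d_model)
        self.ln_cross = nn.LayerNorm(d_model)
        self.ln_ffn = nn.LayerNorm(d_model)

    def forward(self, h, paired_memory=None):
        # Self-attention: queries, keys, and values all come from the previous layer.
        a, _ = self.self_attn(h, h, h)
        h = self.ln_self(h + a)
        # Cross-attention (plug-in mode): queries come from this stream, while keys
        # and values come from the last-layer representation H^L of the paired input.
        if paired_memory is not None:
            c, _ = self.cross_attn(h, paired_memory, paired_memory)
            h = self.ln_cross(h + c)
        return self.ln_ffn(h + self.ffn(h))
```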
Finally, the output H^L of the last layer is used to recover the masked tokens of x and y, conditioning on their own contexts:

\[
P(x \mid \hat{x}) = f(H^L_x), \qquad P(y \mid \hat{y}) = f(H^L_y)
\]

where f is the feed-forward network that maps the output vectors into the dictionary. H^L_x and H^L_y are computed via Eq 2∼5 when H^0_x and H^0_y are the word embeddings of x and y, respectively.
Meanwhile, S^L, conditioning on the context of the paired sequences x̂ and ŷ, is used to predict the masked tokens given the cross-lingual context:

\[
P(y \mid \hat{x}, \hat{y}) = f(S^L_y), \qquad P(x \mid \hat{y}, \hat{x}) = f(S^L_x)
\]

where S^L_x and S^L_y are computed via Eq 6∼9 with the corresponding word embeddings and H^L.
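Building on the hypothetical VecoLayer sketch above, the following snippet illustrates how the two suites of representations (H and S) and the CA-MLM loss could be computed. The `emb` and `vocab_proj` modules and the label convention (ignore index -100 at unmasked positions) are assumptions, not the authors' code.

```python
# A sketch of computing suites H and S and the CA-MLM objective described above.
import torch.nn.functional as F

def ca_mlm_loss(layers, emb, vocab_proj, x_masked, y_masked, x_labels, y_labels):
    # Suite H: self-attention only (cross-attention unplugged).
    h_x, h_y = emb(x_masked), emb(y_masked)
    for layer in layers:
        h_x, h_y = layer(h_x), layer(h_y)
    # Suite S: self- plus cross-attention; keys/values are the paired sequence's
    # last-layer H^L, detached here (cf. the stop-gradient operation noted below).
    s_x, s_y = emb(x_masked), emb(y_masked)
    for layer in layers:
        s_x = layer(s_x, paired_memory=h_y.detach())
        s_y = layer(s_y, paired_memory=h_x.detach())
    def mlm(states, labels):
        # Cross-entropy over masked positions only (labels are -100 elsewhere).
        return F.cross_entropy(vocab_proj(states).transpose(1, 2), labels,
                               ignore_index=-100)
    # -log P(x|x̂) - log P(y|ŷ) from H, plus -log P(x|ŷ,x̂) - log P(y|x̂,ŷ) from S.
    return mlm(h_x, x_labels) + mlm(h_y, y_labels) + mlm(s_x, x_labels) + mlm(s_y, y_labels)
```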

Figure 3: The overview of VECO (fine-tuning is flexible for NLU and NLG tasks). During pre-training, a plug-and-play cross-attention module is jointly pre-trained along with the self-attention module. When fine-tuning on natural language understanding (NLU) tasks, the cross-attention module can be either plugged in or plugged out on demand. When fine-tuning on natural language generation (NLG) tasks, VECO can initialize an encoder-decoder model (the mainstream backbone model of generation tasks) since all the necessary modules of the encoder and decoder are already pre-trained.
Note that when optimizing the objectives based on Eq 12 and Eq 13, we apply a stop-gradient operation (Chen and He, 2020) to H^L (i.e., H^L is treated as a constant in this term). This operation largely speeds up training by avoiding backpropagation through a 2L-layer network. Moreover, it even stabilizes the training of deep post-layernorm Transformers, which otherwise requires non-trivial effort in carefully designing learning-rate schedulers and cutting-edge optimizers (Liu et al., 2020a; Bachlechner et al., 2020).

Fine-tuning VECO for Downstream Cross-lingual Understanding and Generation Tasks

As Figure 3 illustrates, when fine-tuning on various downstream tasks, one advantage of VECO is its flexibility in initializing both the encoder-only Transformer for understanding tasks and the encoder-decoder Transformer for generation tasks. Beyond that, we also explore a fine-tuning approach combined with the characteristics of VECO.

VECO for Cross-lingual Understanding
Due to the plug-and-play cross-attention module, we explore two fine-tuning approaches:
• Plug-Out fine-tuning unplugs the cross-attention module from the pre-trained model. In other words, the architecture of the fine-tuned model is almost the same as mBERT or XLM. Specifically, the contextual representation from the last layer H^L_x is used to predict the label of input x.
• Plug-In fine-tuning plugs the cross-attention module into the fine-tuned model when bilingual or automatically translated training data y is available for the downstream task. Specifically, we concatenate the two representations [H^L_x : S^L_x] to predict the label of x, and [H^L_y : S^L_y] to predict the label of y (a minimal sketch follows below).
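Below is a minimal sketch of such a Plug-In classification head, assuming the two sentence-level vectors H^L_x and S^L_x have already been computed by the encoder; the module and argument names are illustrative.

```python
# A minimal sketch of the Plug-In classification head: the label of x is predicted
# from the concatenation [H^L_x : S^L_x] (and symmetrically for y).
import torch
import torch.nn as nn

class PlugInClassifier(nn.Module):
    def __init__(self, d_model=1024, n_labels=3):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, n_labels)

    def forward(self, h_cls, s_cls):
        # h_cls: sentence vector from the self-attention-only pass (H^L_x)
        # s_cls: sentence vector from the pass with cross-attention over y (S^L_x)
        return self.proj(torch.cat([h_cls, s_cls], dim=-1))
```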

VECO for Cross-lingual Generation
For pre-trained encoders like XLM, it is not a trivial problem to incorporate them into the sequence-to-sequence architecture, the mainstream backbone of generation tasks (Zhu et al., 2020). One of the drawbacks or challenges is that the encoder-to-decoder attention is not pre-trained; therefore, the parameters of the decoder need to be re-adjusted along with the encoder in the subsequent fine-tuning process (Ren et al., 2019). Under the framework of VECO, however, the cross-attention is jointly pre-trained along with the whole network, making it easy to provide a full initialization for sequence-to-sequence models. Specifically, the self-attention module is used to initialize the corresponding modules in both the encoder and the decoder for contextual modeling, while the cross-attention module is used to initialize the encoder-to-decoder attention. The self-attention parameters of the encoder and decoder can optionally remain tied during fine-tuning. Directly pre-training a sequence-to-sequence model like mBART (Liu et al., 2020b) could be another solution for NLG tasks, but we found mBART to be less effective on cross-lingual NLU tasks. We refer the reader to Section 7 for detailed experiments and analysis.

Pre-training Details

Optimization Settings For each iteration, we alternately sample a batch of adjacent segments from the monolingual corpus and a batch of parallel sentences from the bilingual datasets to construct a pair of masked inputs (x̂, ŷ). We adopt translation language modeling (TLM) when the inputs are parallel bilingual sentences, so the overall training objective is the sum of the TLM and the proposed CA-MLM objectives. During training, the model parameters except for the cross-attention are initialized from XLM-R. We first freeze the parameters of XLM-R and only update the cross-attention parameters for faster convergence; then we jointly train the whole model. We pre-train our model with mixed-precision training using 64 Nvidia Tesla V100 32GB GPUs. Appendix A gives additional details.
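The following sketch illustrates the alternating sampling scheme and the combined objective described above; the data iterators and the ca_mlm_loss / tlm_loss helpers are hypothetical placeholders, not the authors' training code.

```python
# A rough sketch of alternating monolingual and bilingual batches, with TLM added
# only on parallel data (helpers and batch format are hypothetical).
def training_stream(mono_batches, para_batches):
    # Alternate between monolingual (adjacent segments) and bilingual (parallel
    # sentence pairs) batches, each yielding a pair of masked inputs (x̂, ŷ).
    for mono, para in zip(mono_batches, para_batches):
        yield "mono", mono
        yield "para", para

def train_step(model, kind, batch, ca_mlm_loss, tlm_loss):
    loss = ca_mlm_loss(model, batch)          # CA-MLM is applied to every batch
    if kind == "para":
        loss = loss + tlm_loss(model, batch)  # TLM only on parallel sentences
    return loss
```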
Experiments on Cross-lingual Understanding Tasks

Experimental Setup
Downstream Tasks We conduct cross-lingual NLU evaluations on XTREME (Hu et al., 2020), a representative massively multilingual benchmark that consists of 9 understanding tasks over 40 languages. XTREME tasks can be classified into four different categories: text classification, sequence labeling, question answering, and sentence retrieval. We consider two typical fine-tuning settings: (1) Cross-lingual Transfer, which fine-tunes the pre-trained model using English gold data only and directly performs inference on the test data of different target languages; (2) Translate-Train-All, which fine-tunes a multilingual model on the concatenation of all data (the gold training corpus in English and the translated training corpora in other languages). Note that for the two sequence-labeling tasks (POS, NER), the positions of token labels in the translated text generally differ from those in the source text. For a fair comparison under the translate-train-all setting, we also show the results of XLM-R using the same fine-tuning hyperparameters as VECO.

Experimental Results
The detailed test results on the nine tasks of the XTREME benchmark are shown in Table 2. The proposed VECO outperforms previous cross-lingual models on all datasets. Compared to XLM-R, it scores 5.0 and 6.6 points higher on average under the cross-lingual transfer and translate-train-all settings, respectively. In the cross-lingual transfer setting, VECO delivers a large improvement over XLM-R, especially on the zero-shot sentence retrieval tasks (BUCC, Tatoeba). This reflects that our model better builds the interdependence between languages and can thus better mine parallel sentences in a multilingual corpus.
Under the translate-train-all setting, VECO with Plug-In fine-tuning (VECO_in) is better than with Plug-Out fine-tuning (VECO_out). We attribute this to two factors. On the input side, Plug-Out fine-tuning takes multilingual instances individually as input, while Plug-In fine-tuning considers bilingual instances (an English instance paired with its translation) at each run. On the model side, Plug-In fine-tuning can encourage correspondence across languages via the cross-attention module. Note that the Plug-In fine-tuning method also outperforms FILTER (Fang et al., 2020), an enhanced cross-lingual fine-tuning method that also takes the bilingual instance as the input of XLM-R. This further demonstrates the effectiveness of VECO and its specialized fine-tuning method.
We attribute the above performance improvements to two factors: 1) the introduction of bilingual data during pre-training, which is a direct way to enhance the cross-lingual ability of the model; 2) a stronger ability to enhance the interdependence and fusion among languages via the proposed CA-MLM pre-training task. To analyze which plays the leading role, we conduct a set of more controlled experiments in Section 7.

Experiments on Cross-lingual Generation Tasks

For evaluation, we report tokenized SacreBLEU to avoid the influence of different tokenization and normalization between models (Post, 2018).
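For reference, a minimal SacreBLEU computation with the sacrebleu Python package looks like the following; the hypothesis and reference strings are purely illustrative.

```python
# A minimal SacreBLEU computation (Post, 2018) with illustrative strings.
import sacrebleu

hypotheses = ["Das Haus ist klein ."]
references = [["Das Haus ist klein ."]]  # one reference stream, one ref per hypothesis
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```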

Fine-tuning Setting
We fine-tune our model using the fairseq toolkit and adopt training settings comparable with the baselines. We run the WMT14 En-De and En-Fr MT experiments on 16 and 32 V100 GPUs, respectively. The batch size is 64k for En-De and 256k for En-Fr. The total number of training updates is set to 100k. The learning rate is 1e-4/2e-4, with linear warm-up over the first 16k steps and linear decay. We average the last 10 checkpoints and use beam search with a beam size of 5.
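Checkpoint averaging can be done in a few lines of PyTorch (fairseq ships an equivalent scripts/average_checkpoints.py); the sketch below assumes fairseq-style checkpoints that store the weights under a "model" key.

```python
# A small sketch of averaging the last checkpoints before decoding.
import torch

def average_checkpoints(paths):
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}
```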
Baselines We consider two types of Transformer baselines: randomly initialized models and models initialized from cross-lingual pre-trained models. For the former, we reproduce a Transformer baseline that adopts the same architecture and fine-tuning hyperparameters as VECO but is randomly initialized.

Table 3 (right) displays the BLEU scores of same-sized models during training. We find that the VECO-initialized model reaches a SacreBLEU score of more than 28 after only 10 epochs, which is better than the final score of the randomly initialized model at 35 epochs. This reveals that VECO provides a fairly good initialization for the machine translation model, which converges quickly and further boosts the results.
One might suspect that the main reason for the performance improvement is leveraging the parallel corpus during pre-training. To investigate this, we conduct a more controlled experiment. We first train an out-of-domain Transformer model using the whole En-De parallel data (∼68M) used in VECO pre-training, and then continue to train the model on the in-domain WMT14 En-De training dataset. Results are shown in Table 3 (left), marked with *. Under this fully comparable setting, VECO still maintains a lead of 1.1 BLEU. This directly confirms that the improvement in MT is not only due to the use of bilingual data; more importantly, CA-MLM makes better use of the bilingual and large-scale unlabeled multilingual corpora.

Potential of Initializing Shallow Decoder
Online translation applications usually have restrictions on inference time. The most direct remedy is to reduce the number of decoder layers, since previous MT work (Liu et al., 2020a) has shown that deeper encoders are more worthwhile than deeper decoders. Based on this, we also explore the potential of VECO to initialize deep-encoder, shallow-decoder Transformers, which has not been covered by prior cross-lingual pre-training work. Table 4 contrasts two ways of initializing a Transformer with n decoder layers (n < 24) by selecting (1) the first n layers or (2) the last n layers of the 24-layer pre-trained VECO model. We consider n = {3, 6}. We find that selecting the last n layers performs better than selecting the first n layers, which reveals that the last several layers play a more important role in making predictions over the whole vocabulary. Moreover, there is a 0.2∼0.3 BLEU gain when increasing the decoder layers from 3 to 6. However, only marginal improvement is gained when further increasing the decoder layers from 6 to 24, which is also in line with the findings of Liu et al. (2020a). Regardless of the initialization method, the VECO-initialized model gains a consistent 1∼2 BLEU improvement over the randomly initialized model.
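The two initialization strategies compared in Table 4 can be sketched as follows; the flat "layers.{index}." key naming is an assumption about the checkpoint format, not the released code.

```python
# A sketch of initializing an n-layer decoder from either the first n or the last n
# of the 24 pre-trained layers.
def select_decoder_layers(veco_state, n, mode="last", total=24):
    chosen = range(total - n, total) if mode == "last" else range(n)
    out = {}
    for new_idx, old_idx in enumerate(chosen):
        prefix = f"layers.{old_idx}."
        for key, value in veco_state.items():
            if key.startswith(prefix):
                out[key.replace(prefix, f"decoder.layers.{new_idx}.", 1)] = value.clone()
    return out
```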

Analysis and Ablation Study
We perform an ablation study to investigate where the improvements on cross-lingual NLU and NLG tasks mainly come from. Specifically, we study three main questions:
1. How much of the performance improvement comes from the parallel translation corpus used in pre-training?
2. How effective is the CA-MLM pre-training task?
3. How does pre-training a sequence-to-sequence model like mBART perform on NLU and NLG tasks?
To answer these questions, we train XLM, mBART, and VECO models from scratch using the same datasets and parameter settings (see Appendix A for more details). All of them are pre-trained via the MLM and TLM tasks. Note that the MLM task generally refers to predicting the masked words of the source language, while the TLM task generally refers to predicting the words of the target language. Specifically for mBART, which uses the encoder-decoder framework, the input of the encoder is the masked sequence x̂, and the target of the decoder is the masked words of the source input x (for the MLM task) or the parallel sentence y (for the TLM task).

Table 5 shows the results on two representative datasets for cross-lingual NLU and NLG. We observe that, when using the monolingual corpus only, VECO outperforms XLM by 0.8 points on the XNLI dataset and 0.3 BLEU on the IWSLT14 De-En translation dataset. This suggests that CA-MLM can still benefit from adjacent sentences in a monolingual corpus and is thus equipped with a stronger ability of contextual modeling. Moreover, when pre-training on both the monolingual and bilingual corpora, VECO achieves an even larger improvement over XLM, with gains of 3.2 points and 2.1 BLEU on the two datasets, respectively. This reveals that the CA-MLM objective of VECO makes better use of the bilingual corpus than XLM, which is optimized only by TLM and MLM.
Moreover, we find that pre-training a sequence-to-sequence model like mBART (Liu et al., 2020b) performs worst on NLU tasks like XNLI, almost 6 points worse than VECO and nearly 2 points worse than XLM. (We follow BART (Lewis et al., 2019) by utilizing the final representation from the decoder for classification tasks.) One possible explanation is that the unidirectional language modeling in the decoder is sub-optimal for NLU tasks. Even on the machine translation task, mBART still performs worse than VECO when pre-trained on the same bilingual datasets. We attribute this to VECO's better contextual modeling of the source input x via the explicit masked language modeling objective in Eq 10 (applied to x_2 in Figure 2).

Related Work

Recent multilingual pre-training works propose new pre-training tasks to utilize the bilingual data better. However, there are two main drawbacks to these works. First, they mainly rely on the self-attention module in the Transformer encoder to implicitly build the interdependence between languages, leading to few attention patterns across languages due to the "lazy" network. Second, even though they show impressive performance improvements on cross-lingual understanding tasks like XNLI, only marginal improvement has been gained on cross-lingual generation tasks like machine translation, especially on high-resource languages.
A feasible solution for cross-lingual generation is to pre-train a denoising auto-encoder like mBART (Liu et al., 2020b). It extends BART (Lewis et al., 2019) to the multilingual setting, demonstrating significant gains in low/medium-resource machine translation but a decrease for high-resource languages. Unlike mBART, Chi et al. (2020a) first train an encoder via MLM and then freeze the encoder to train only the decoder via two generative tasks. A similar approach is also proposed in Liang et al. (2020) and Lin et al. (2020), with the main difference being the joint training of the encoder-decoder with code-switch tricks. However, all these cross-lingual models emphasize training a dedicated model for NLG; thus they may hurt the NLU capabilities of the model. The ablation study in Section 7 also validates that it is sub-optimal to train an encoder-decoder network for NLU tasks.
This paper endeavors to build a unified cross-lingual model for NLU and NLG tasks via a plug-and-play cross-attention module. More importantly, the cross-attention module plays a role in the explicit alignment of the encoded representations of different languages, thus largely contributing to building a unified cross-lingual model.

Conclusion
We present VECO, a variable and flexible cross-lingual pre-training model that explicitly captures the interdependence between languages via a plug-and-play cross-attention module. Thanks to this flexibility, VECO can initialize both the NLU-preferred encoder-only Transformer and the NLG-specialized encoder-decoder Transformer. Moreover, we introduce a Plug-In fine-tuning approach to encourage the fusion between languages, combining the characteristics of VECO with those of cross-lingual downstream tasks.
Taken together, VECO achieves consistent improvements on various language understanding and generation tasks, broadening the way of thinking about pre-trained backbone architectures and fine-tuning methods in the cross-lingual scenario.

A Pre-training Details

There are 1.36TB of monolingual data in 50 languages before up/down-sampling. Table 6 reports the language codes and statistics of the pre-training data. We collect bilingual corpora in 50 languages from the OPUS website, including MultiUN, UNPC, Bombay, EU-bookshop, OpenSubtitles2018, Tanzil, GlobalVoices, ParaCrawl, MultiParaCrawl, DGT, Tilde, Europarl, Wikipedia, ECB, TED2013, News-Commentary, Ubuntu, Books, UN, infopankki-v1, EUconst, and Bianet. In total, there are 1TB of bilingual training data before pre-processing, covering 879 language pairs. Table 7 lists the statistics for each language pair. We then apply subword tokenization directly on the raw text data using the SentencePiece model (Kudo and Richardson, 2018) without any additional preprocessing.
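For illustration, subword tokenization with SentencePiece can be applied directly to raw text as sketched below; the model file name is hypothetical.

```python
# A small sketch of SentencePiece subword tokenization on raw text.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="veco.spm.model")  # hypothetical model file
pieces = sp.encode("VECO aligns representations across languages.", out_type=str)
ids = sp.encode("VECO aligns representations across languages.", out_type=int)
```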
We use the whole corpus to train VECO and a subset (∼1/4) containing 33 languages to train the small-sized XLM, mBART, and VECO models. The full sets of pre-training hyperparameters for the small-sized and large-sized VECO (default) are listed in Table 8.

B More details about Illustrated Attention
The models illustrated with attention patterns in Figure 1 of the main paper (not the appendix) are the base-sized XLM and XLM-R. We show the attention scores averaged over all heads in the middle layer.
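A rough recipe for extracting such attention maps with the Hugging Face Transformers library is sketched below, using the public xlm-roberta-base checkpoint as an example; the input sentence pair and the choice of the middle layer are illustrative.

```python
# A sketch of extracting head-averaged attention from a middle layer.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_attentions=True)
inputs = tok("The weather is nice today. 今天天气很好。", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # one (1, heads, T, T) tensor per layer
mid = len(attentions) // 2
avg_attn = attentions[mid].mean(dim=1)[0]     # average over heads -> (T, T)
```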

D Detailed Results on XTREME
The detailed results of each XTREME task under the cross-lingual transfer and translate-train-all settings on all languages are listed in the following tables.