Universal Conditional Masked Language Pre-training for Neural Machine Translation

Pre-trained sequence-to-sequence models have significantly improved Neural Machine Translation (NMT). Unlike prior work, where pre-trained models usually adopt a unidirectional decoder, this paper demonstrates that pre-training a sequence-to-sequence model with a bidirectional decoder produces notable performance gains for both Autoregressive and Non-autoregressive NMT. Specifically, we propose CeMAT, a conditional masked language model pre-trained on large-scale bilingual and monolingual corpora in many languages. We also introduce two simple but effective methods to enhance CeMAT: aligned code-switching & masking and dynamic dual-masking. Extensive experiments show that CeMAT achieves significant performance improvements in all scenarios from low- to extremely high-resource languages, i.e., up to +14.4 BLEU in low-resource settings and +7.9 BLEU on average for Autoregressive NMT. For Non-autoregressive NMT, it also produces consistent gains of up to +5.3 BLEU. To the best of our knowledge, this is the first work to pre-train a unified model for fine-tuning on both NMT tasks. Code, data, and pre-trained models are available at https://github.com/huawei-noah/Pretrained-Language-Model/CeMAT

fine-tuned (Guo et al., 2020; Zhu et al., 2020). Recently, pre-training standard sequence-to-sequence (Seq2Seq) models has shown significant improvements and become a popular paradigm for NMT tasks (Song et al., 2019; Lin et al., 2020). However, experimental results from XLM (Conneau and Lample, 2019) show that a decoder initialized by a pre-trained bidirectional masked language model (MLM; Devlin et al., 2019), rather than by a unidirectional causal language model (CLM; Radford and Narasimhan, 2018), achieves better results on Autoregressive NMT (AT). In particular, compared to random initialization, initializing with GPT (Radford and Narasimhan, 2018) can sometimes degrade performance. We conjecture that when fine-tuning on generation tasks (e.g., NMT), the representation capability of the pre-trained model matters more than its generation capability. Therefore, during pre-training, we should explicitly train the representation capability not only of the encoder, but also of the decoder.
Inspired by this, we present CeMAT, a multilingual Conditional masked language prE-training model for MAchine Translation, which consists of a bidirectional encoder, a bidirectional decoder, and a cross-attention module bridging them. Specifically, the model is jointly trained with MLM on the encoder and Conditional MLM (CMLM) on the decoder, using large-scale monolingual and bilingual texts in many languages. Table 1 compares our model with prior works. Benefiting from this structure, CeMAT can directly provide unified initialization parameters not only for AT, but also for Non-autoregressive NMT (NAT). NAT has been attracting increasing attention because its parallel decoding greatly reduces translation latency.

Figure 1: The framework of CeMAT, which consists of an encoder and a bidirectional decoder. "Mono" denotes monolingual data; "Para" denotes bilingual data. During pre-training (left), the original monolingual and bilingual inputs in many languages are augmented (words are replaced with new words of the same meaning or with "[mask]"; see Figure 2 for details) and fed into the model. We then predict all "[mask]" words on the source side and the target side, respectively. For fine-tuning (right), CeMAT provides a unified set of initial parameters for AT and NAT.
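The structural difference between the two decoder types can be made concrete with a small sketch (illustrative only, not the authors' code): the decoder's self-attention mask is the only change needed to turn a unidirectional decoder into a bidirectional one.

```python
# Toy sketch: a causal (unidirectional) decoder uses a lower-triangular
# self-attention mask, while a bidirectional decoder (as in CeMAT) lets
# every position attend to every other position.

def causal_mask(n):
    # position i may attend only to positions j <= i
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    # every position may attend everywhere (as in BERT-style models)
    return [[1] * n for _ in range(n)]

print(causal_mask(3))        # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(bidirectional_mask(3)) # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```

In practice this mask multiplies (or offsets) the attention logits; all other encoder, decoder, and cross-attention components stay unchanged.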
To better train the representation capability of the model, masking is applied in two steps. First, some source words that are aligned with target words are randomly selected and substituted with words of similar meaning in other languages, and their corresponding target words are masked; we call this method aligned code-switching & masking. Then, the remaining words on both the source and target sides may be masked by dynamic dual-masking.
Extensive experiments on downstream AT and NAT tasks show significant gains over prior works. Specifically, under low-resource conditions (< 1M bitext pairs), our system gains up to +14.4 BLEU points over baselines. Even in extremely high-resource settings (> 25M), CeMAT still achieves significant improvements. In addition, experiments on the WMT16 Romanian→English task demonstrate that our system can be further improved (+2.1 BLEU) by Back-Translation (BT; Sennrich et al., 2016a).
The main contributions of our work can be summarized as follows: • We propose a multilingual pre-trained model, CeMAT, which consists of a bidirectional encoder and a bidirectional decoder. The model is pre-trained on both monolingual and bilingual corpora and then used to initialize downstream AT and NAT tasks. To the best of our knowledge, this is the first work to pre-train a unified model suitable for both AT and NAT.
• We introduce a two-step masking strategy to enhance model training under the bidirectional-decoder setting. Based on a multilingual translation dictionary and word alignments between source and target sentences, aligned code-switching & masking is applied first, followed by dynamic dual-masking.
• We carry out extensive experiments on AT and NAT tasks with data of varied sizes. Consistent improvements over strong competitors demonstrate the effectiveness of CeMAT.

Figure 2: Details of our two-step masking. We first obtain the aligned pair set Λ = {("dance", "tanzen"), ...} from the original inputs by looking up the cross-lingual dictionary (denoted "1. Aligned"), and then randomly select a subset from it (the pair "dance"/"tanzen", marked in red, in the lower left of the figure). For each element in the subset, we select a new word via F_m(x_m^i), and perform CSR to replace the source fragment ("danse", marked in red) and CSM on the target ("[mask]", marked in red). Finally, we apply the DM process to mask parts of the source and target respectively ("[mask]", marked in light blue).

Pre-training Approach
Our CeMAT is jointly trained by MLM and CMLM on the source side and the target side, respectively. The overall framework is illustrated in Figure 1. In this section, we first introduce the multilingual CMLM task (Section 2.1). Then, we describe the two-step masking, including aligned code-switching & masking (Section 2.2) and dynamic dual-masking (Section 2.3). Finally, we present the training objectives of CeMAT (Section 2.4).
Formally, our training data consists of M language pairs D = {D_1, D_2, ..., D_M}, where D_k(m, n) is a collection of sentence pairs in languages L_m and L_n. In the description below, we denote a sentence pair as (X_m, Y_n) ∈ D_k(m, n), where X_m is the source text in language L_m and Y_n is the corresponding target text in language L_n. For monolingual corpora, we create pseudo-bilingual text by copying the sentence, i.e., X_m = Y_n.
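The data convention above can be sketched in a few lines (names are ours, for illustration): bilingual corpora yield (source, target) pairs directly, while monolingual sentences become pseudo-bilingual pairs by copying.

```python
# Minimal sketch of pair construction: for monolingual data, X_m == Y_n
# (the sentence serves as its own "translation").

def make_pair(sentence, translation=None):
    return (sentence, translation if translation is not None else sentence)

assert make_pair("wir tanzen", "we dance") == ("wir tanzen", "we dance")
assert make_pair("we dance") == ("we dance", "we dance")
```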

Conditional Masked Language Model
CMLM predicts the masked tokens y_n^mask given a source sentence X_m and the remaining target sentence Y_n \ y_n^mask. The probability of each y_n^j ∈ y_n^mask is calculated independently:

P(y_n^j | X_m, Y_n \ y_n^mask)   (1)

CMLM can be used directly to train a standard Seq2Seq model with a bidirectional encoder, a unidirectional decoder, and cross attention. However, because the masked words are predicted independently of one another, CMLM is not restricted to an autoregressive decoder. Therefore, following practices in NAT, we use CMLM to pre-train a Seq2Seq model with a bidirectional decoder, as shown in Figure 1.
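Because masked positions are predicted independently, the log-probability of the whole masked set factorizes into a per-position sum. A tiny sketch (with hypothetical per-position distributions standing in for decoder outputs):

```python
import math

# cmlm_log_prob: score a set of masked target tokens under independent
# per-position distributions, as a CMLM decoder would produce them.

def cmlm_log_prob(position_dists, masked_targets):
    # position_dists: {position: {token: prob}}; masked_targets: {position: token}
    return sum(math.log(position_dists[j][tok]) for j, tok in masked_targets.items())

dists = {1: {"dance": 0.9, "sing": 0.1}, 3: {"grass": 0.8, "snow": 0.2}}
lp = cmlm_log_prob(dists, {1: "dance", 3: "grass"})
assert abs(lp - (math.log(0.9) + math.log(0.8))) < 1e-12
```

This independence assumption is exactly what allows the bidirectional decoder to predict all masked tokens in parallel.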
Although bilingual sentence pairs can be used directly to train the model with the conventional CMLM (Ghazvininejad et al., 2019), sentence pairs created from monolingual corpora are challenging because the source and target sentences are identical. Therefore, we introduce a two-step masking strategy to enhance model training on both bilingual and monolingual corpora.

Aligned Code-Switching & Masking
We use the aligned code-switching & masking strategy to replace a source word or phrase with a new word in another language, and then mask the corresponding target word. Different from previous code-switching methods (Yang et al., 2020; Lin et al., 2020), where source words are randomly selected and replaced directly, our method consists of three steps:

1. Aligning: We use a multilingual translation dictionary to obtain a set of aligned words Λ = {..., (x_m^i, y_n^j), ...} between the source X_m and target Y_n. A word pair (x_m^i, y_n^j) denotes that the i-th word in X_m and the j-th word in Y_n are translations of each other. For sentence pairs created from monolingual corpora, the two words in an aligned pair are identical.
2. Code-Switching Replace (CSR): Given an aligned word pair (x_m^i, y_n^j) ∈ Λ, we first select a new word x̃_k^i in language L_k to replace x_m^i in the source sentence:

x̃_k^i ∈ F_m(x_m^i)

where F_m(x) is a multilingual dictionary lookup function for a word x in language L_m, and x̃_k^i is a randomly selected translation of x_m^i in language L_k.

3. Code-Switching Masking (CSM): If the source word x_m^i in the aligned pair (x_m^i, y_n^j) is replaced by x̃_k^i, we also mask y_n^j in Y_n by replacing it with a universal mask token. CeMAT is then trained to predict it in the output layer of the bidirectional decoder.
For aligning and CSR, we only use the multilingual translation dictionaries provided by MUSE (Lample et al., 2018). Figure 2 shows the process of aligned code-switching & masking: according to the given dictionary, "dance" and "tanzen" are aligned; a new French word "danse" is then selected to replace "dance", and "tanzen" is replaced by "[mask]" (marked in red).
During training, at most 15% of the words in a sentence undergo CSR and CSM. For monolingual data, we raise this ratio to 30%. The resulting sentence pair after aligned code-switching & masking is then further dual-masked dynamically at random.
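The three steps above (Aligning, CSR, CSM) can be sketched with a toy MUSE-style dictionary. All names and data here are illustrative assumptions, not the authors' implementation:

```python
import random

# Toy bilingual dictionary keyed by (source_lang, target_lang).
DICT = {("en", "de"): {"dance": ["tanzen"]},
        ("en", "fr"): {"dance": ["danse"]}}

def align(src, tgt, src_lang, tgt_lang):
    # Step 1: collect (i, j) index pairs that are dictionary translations.
    table = DICT.get((src_lang, tgt_lang), {})
    return [(i, j) for i, w in enumerate(src)
                   for j, v in enumerate(tgt) if v in table.get(w, [])]

def csr_csm(src, tgt, pairs, new_lang, src_lang, rng):
    # Steps 2-3: replace the source word with a translation in another
    # language (CSR) and mask the aligned target word (CSM).
    src, tgt = list(src), list(tgt)
    for i, j in pairs:
        candidates = DICT.get((src_lang, new_lang), {}).get(src[i])
        if candidates:
            src[i] = rng.choice(candidates)
            tgt[j] = "[mask]"
    return src, tgt

src, tgt = ["we", "dance"], ["wir", "tanzen"]
pairs = align(src, tgt, "en", "de")
assert pairs == [(1, 1)]
new_src, new_tgt = csr_csm(src, tgt, pairs, "fr", "en", random.Random(0))
assert new_src == ["we", "danse"] and new_tgt == ["wir", "[mask]"]
```

This reproduces the Figure 2 example: "dance" becomes the French "danse" on the source side while the aligned German "tanzen" is masked on the target side.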

Dynamic Dual-Masking
Limited by the dictionary, the ratio of aligned word pairs is usually small. In fact, we can only match aligned pairs for 6% of the tokens on average in the bilingual corpora. To further increase the training efficiency, we perform dynamic dual-masking (DM) on both bilingual and monolingual data.
• Bilingual data: We first sample a masking ratio υ from a uniform distribution between [0.2, 0.5], then randomly select a subset of target words which are replaced by "[mask]". Similarly, we select a subset on the source texts and mask them with a ratio of µ in a range of [0.1, 0.2]. Figure 2 shows an example of dynamic dual-masking on bilingual data. We set υ ≥ µ to force the bidirectional decoder to obtain more information from the encoder.
• Monolingual data: Since the source and target are identical before masking, we sample υ = µ from a range [0.3, 0.4] and mask the same subset of words on both sides. This will avoid the decoder directly copying the token from the source.
Following practices of pre-trained language models, 10% of the words selected for masking remain unchanged, and another 10% are replaced with a random token. Words already replaced by aligned code-switching & masking are never selected, to prevent the loss of cross-lingual information. The resulting sentence pair after dynamic dual-masking is used for pre-training.
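The sampling ranges and the 80/10/10 corruption scheme described above can be sketched as follows (an illustrative sketch, not the authors' code; the 80% figure is implied by the stated 10%/10% exceptions):

```python
import random

def sample_ratios(rng, monolingual):
    # Bilingual: target ratio v in [0.2, 0.5], source ratio u in [0.1, 0.2],
    # so v >= u holds by construction. Monolingual: one shared ratio in
    # [0.3, 0.4], applied to the same positions on both sides.
    if monolingual:
        v = rng.uniform(0.3, 0.4)
        return v, v
    return rng.uniform(0.2, 0.5), rng.uniform(0.1, 0.2)

def corrupt(token, rng, vocab):
    r = rng.random()
    if r < 0.8:
        return "[mask]"          # 80%: replace with the mask token
    if r < 0.9:
        return token             # 10%: keep unchanged
    return rng.choice(vocab)     # 10%: replace with a random token

rng = random.Random(0)
v, u = sample_ratios(rng, monolingual=False)
assert 0.2 <= v <= 0.5 and 0.1 <= u <= 0.2
```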

Multilingual Pre-training Objectives
We jointly train the encoder and decoder on the MLM and CMLM tasks. Given a sentence pair (X̂_m, Ŷ_n) from the masked corpus D̂, the final training objective is formulated as follows:

L = - Σ_{y_n^j ∈ y_n^mask} log P(y_n^j | X̂_m, Ŷ_n) - λ Σ_{x_m^i ∈ x_m^mask} log P(x_m^i | X̂_m)   (2)

where y_n^mask is the set of masked target words, x_m^mask is the set of masked source words, and λ is a hyper-parameter balancing the influence of the two tasks. In our experiments, we set λ = 0.7.
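As a minimal illustration (assuming, as one plausible form, that λ weights the encoder-side MLM term), the combined objective reduces to a weighted sum of the two per-batch losses:

```python
# Sketch of combining the decoder-side CMLM loss and the encoder-side MLM
# loss with the balancing weight lambda = 0.7 from the text.

LAMBDA = 0.7

def joint_loss(cmlm_nll, mlm_nll, lam=LAMBDA):
    # Both inputs are per-batch negative log-likelihoods.
    return cmlm_nll + lam * mlm_nll

assert abs(joint_loss(2.0, 1.0) - 2.7) < 1e-12
```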

Pre-training Settings
Pre-training Data We use the English-centric multilingual parallel corpora of PC32, and collect monolingual corpora in 21 languages from Common Crawl. We use ISO language codes to identify each language. A "[language code]" token is prepended to the beginning of both the source and target sentences, as shown in Figure 2; this token helps the model distinguish sentences from different languages. A detailed summary of our pre-training corpora is given in Appendix A.
Data pre-processing We learn a shared BPE (Sennrich et al., 2016b) model directly on the entire data set after tokenization. We apply Moses tokenization for most languages, a dedicated segmenter for Japanese, jieba (https://github.com/fxsjy/jieba) for Chinese, and a special normalization for Romanian (Sennrich et al., 2016a). Following prior work, we balance the vocabulary sizes of languages by up/down-sampling text based on their data sizes when learning BPE.
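The text does not give the exact balancing formula; a common scheme in XLM-style multilingual training samples language i with probability proportional to (n_i / N)^α, which up-weights low-resource languages. A sketch under that assumption:

```python
# sampling_probs: temperature-based up/down-sampling of languages by corpus
# size (alpha < 1 flattens the distribution toward low-resource languages).

def sampling_probs(sizes, alpha=0.7):
    total = sum(sizes.values())
    weights = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

probs = sampling_probs({"en": 1_000_000, "kk": 10_000})
# The low-resource language gets a larger share than its raw proportion.
assert probs["kk"] > 10_000 / 1_010_000
assert abs(sum(probs.values()) - 1.0) < 1e-12
```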
Model and Settings As shown in Figure 1, we apply a bidirectional decoder so that it can use both left and right context to predict each token. We use a 6-layer encoder and a 6-layer bidirectional decoder with a model dimension of 1024 and 16 attention heads. Following Vaswani et al. (2017), we use sinusoidal positional embeddings, and apply layer normalization to the word embeddings and pre-norm residual connections following Wang et al. (2019a). Our model is trained on 32 Nvidia V100 GPUs for 300K steps; the batch size on each GPU is 4096 tokens, with an update frequency of 8. Following the Transformer training settings, we use the Adam optimizer (ε = 1e-6, β1 = 0.9, β2 = 0.98) and polynomial decay scheduling with 10,000 warm-up steps.
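The schedule above (warm-up followed by polynomial decay) can be sketched as follows; the peak learning rate of 5e-4 and the decay power of 1 (linear) are our assumptions, since the text does not state them for pre-training:

```python
# lr_at: linear warm-up to max_lr over `warmup` steps, then polynomial
# decay to zero at `total` steps.

def lr_at(step, max_lr=5e-4, warmup=10_000, total=300_000, power=1.0):
    if step < warmup:
        return max_lr * step / warmup                 # linear warm-up
    frac = (total - step) / (total - warmup)          # remaining fraction
    return max_lr * max(frac, 0.0) ** power           # polynomial decay

assert lr_at(5_000) == 2.5e-4     # halfway through warm-up
assert lr_at(10_000) == 5e-4      # peak
assert lr_at(300_000) == 0.0      # fully decayed
```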

Autoregressive Neural Machine Translation
In this section, we verify that CeMAT provides consistent performance gains in low- to extremely high-resource scenarios. We also compare our method with existing pre-training methods and present further analysis to better understand the contribution of each component.

Fine-Tuning Objective
The AT model consists of an encoder and a unidirectional decoder. The encoder maps a source sentence X_m into hidden representations, which are fed into the decoder. The unidirectional decoder predicts the t-th token in target language L_n conditioned on X_m and the previous target tokens y_n^{<t}. The training objective of AT is to minimize the negative log-likelihood:

L_AT = - Σ_t log P(y_n^t | X_m, y_n^{<t})
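The objective amounts to summing the negative log-probability of each gold token given the source and the gold prefix (teacher forcing). A toy sketch, with fixed probabilities standing in for decoder outputs:

```python
import math

# at_nll: negative log-likelihood of a target sequence under teacher
# forcing; token_probs[t] plays the role of P(y_t | X, y_<t).

def at_nll(token_probs):
    return -sum(math.log(p) for p in token_probs)

nll = at_nll([0.5, 0.25])
assert abs(nll - (math.log(2) + math.log(4))) < 1e-12
```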

Experimental Settings
Benchmarks We select 9 different language pairs and fine-tune CeMAT on them.
Configuration We adopt a dropout rate of 0.1 for the extremely high-resource En→Fr and En→De (WMT19) pairs; for all other language pairs, we use 0.3. We fine-tune AT with a maximum learning rate of 5e-4, 4000 warm-up steps, and label smoothing of 0.2. For inference, we use beam search with a beam size of 5 for all translation directions. For a fair comparison with previous works, all results are reported as case-sensitive tokenized BLEU scores.
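Label smoothing with ε = 0.2 gives the gold token 1 - ε of the target probability mass and spreads the rest over the remaining vocabulary. A sketch (one common variant; implementations differ in whether the gold token shares the smoothed mass):

```python
import math

# smoothed_ce: cross-entropy against a smoothed target distribution.

def smoothed_ce(log_probs, gold, eps=0.2):
    v = len(log_probs)
    target = {i: (1 - eps) if i == gold else eps / (v - 1) for i in range(v)}
    return -sum(target[i] * log_probs[i] for i in range(v))

# With a uniform model distribution, the loss equals log(vocab_size)
# because the target weights sum to one.
uniform = [math.log(0.25)] * 4
assert abs(smoothed_ce(uniform, gold=0) - math.log(4)) < 1e-12
```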

Results and Analysis
Main Results We fine-tune AT systems initialized by CeMAT on 8 popular language pairs, namely the language pairs that overlap with the experiments of mBART (Liu et al., 2020) and mRASP (Lin et al., 2020). Table 2 shows the results. Compared to directly training AT models, our systems initialized with CeMAT obtain significant improvements in all four scenarios. We observe gains of up to +14.4 BLEU, and over +11.4 BLEU on three of the four low-resource tasks, e.g., En↔Tr. In general, as the dataset scale increases, the benefit of pre-training shrinks; however, we still obtain significant gains when the data size is very large (extremely high-resource: > 25M), i.e., +8.3 and +2.3 BLEU for En→De and En→Fr, respectively. This notable improvement shows that our model can further enhance extremely high-resource translation. Overall, we obtain gains of more than +8.0 BLEU in most directions, and +7.9 BLEU on average over all language pairs. We further compare CeMAT with mBART and mRASP, two state-of-the-art pre-training methods. As illustrated in Table 2, CeMAT outperforms mBART on all language pairs by a large margin (+3.8 BLEU on average); in the extremely high-resource setting, we obtain significant improvements where mBART hurts performance. Compared to mRASP, we achieve better performance on 11 of the 13 translation directions, outperforming this strong competitor by +1.2 BLEU on average over all directions.

Comparison with Existing Pre-training Models
We further compare CeMAT with more existing multilingual pre-trained models on three popular translation directions: WMT14 En→De and WMT16 En↔Ro. Results are shown in Table 3. CeMAT obtains competitive results on these language pairs on average, and achieves the best performance on En→Ro.
Our model also outperforms BT (Sennrich et al., 2016a), a universal and stable approach to augmenting bilingual data with monolingual data. In addition, when combining back-translation with CeMAT on Ro→En, we obtain a significant improvement from 36.8 to 39.0 BLEU, as shown in Table 3. This indicates that our method is complementary to BT.

The Effectiveness of Aligned Code-Switching and Masking
We investigate the effectiveness of aligned code-switching & masking in Table 4. Utilizing aligned code-switching & masking helps CeMAT improve performance in all scenarios, with gains of +0.5 BLEU on average, even though we can only match aligned word pairs for 6% of the tokens on average in the bilingual corpora. We expect larger improvements with more sophisticated word alignment methods.

The Effectiveness of Dynamic Masking
In the pre-training phase, we use a dynamic strategy when dual-masking the encoder and decoder inputs. We verify the effectiveness of this dynamic masking strategy. As illustrated in Table 4 and Appendix C, we achieve significant gains, with margins from +0.4 to +4.5 BLEU, when we change the masking ratio from a static value to a dynamically and randomly sampled one. The average improvement over all language pairs is +2.1 BLEU, which underlines the importance of dynamic masking.

Non-autoregressive Neural Machine Translation
In this section, we verify the performance of CeMAT on NAT, which generates translations in parallel, on widely used translation tasks.

Fine-Tuning Objective
As illustrated in Figure 1, NAT also adopts a Seq2Seq framework, but consists of an encoder and a bidirectional decoder that predicts the target sequence in parallel. The training objective of NAT is formulated as follows:

L_NAT = - Σ_t log P(y_n^t | X_m)

In this work, we follow Ghazvininejad et al. (2019), who randomly sample tokens y_n^mask for masking from the target sentence and train the model to predict them given the source sentence and the remaining target. The training objective is:

L = - Σ_{y_n^j ∈ y_n^mask} log P(y_n^j | X_m, Y_n \ y_n^mask)

During decoding, given an input sequence to translate, the initial decoder input is a sequence of "[mask]" tokens. The fine-tuned model generates translations by iteratively predicting target tokens and re-masking low-quality predictions. This process lets the model re-predict the more challenging cases conditioned on previous high-confidence predictions.
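The iterative decoding loop can be sketched as follows. This is a hedged, simplified Mask-Predict sketch: the "model" is a toy stub returning a fixed (token, confidence) per position, and the linear re-masking schedule is one common choice:

```python
# mask_predict: start from all "[mask]" tokens, predict every masked
# position in parallel, then re-mask the lowest-confidence predictions
# for the next iteration.

def mask_predict(model, length, iterations):
    tokens = ["[mask]"] * length
    scores = [0.0] * length
    for it in range(iterations):
        # Predict all currently masked positions in parallel.
        for i, tok in enumerate(tokens):
            if tok == "[mask]":
                tokens[i], scores[i] = model(i)
        # Linearly decay the number of re-masked tokens per iteration.
        n_mask = int(length * (iterations - it - 1) / iterations)
        if n_mask == 0:
            break
        worst = sorted(range(length), key=lambda i: scores[i])[:n_mask]
        for i in worst:
            tokens[i] = "[mask]"
    return tokens

# Toy model: position i -> (token, confidence).
vocab = [("we", 0.9), ("dance", 0.6), ("on", 0.95), ("grass", 0.7)]
out = mask_predict(lambda i: vocab[i], length=4, iterations=3)
assert out == ["we", "dance", "on", "grass"]
```

In a real system the target length is predicted first and the confidences come from the decoder's per-token probabilities; only the re-masked positions change between iterations.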

Experimental Settings
NAT Benchmark Data We evaluate on three popular datasets: WMT14 En↔De, WMT16 En↔Ro and IWSLT14 En↔De. For a fair comparison with baselines, we only use the bilingual PC32 corpora to pre-train our CeMAT. We only use knowledge distillation (Gu et al., 2018) on WMT14 En↔De tasks.
Baselines We use our CeMAT for initialization and fine-tune a Mask-Predict model (Ghazvininejad et al., 2019) as in Section 4. To better quantify the effects of the proposed pre-training models, we build two strong baselines.
Direct. We directly train a Mask-Predict model with randomly initialized parameters.
mRASP. To verify that our pre-trained model is more suitable for NAT, we use the recent pre-trained model mRASP (Lin et al., 2020) to fine-tune on the downstream language pairs.

Configuration We use almost the same configuration as in pre-training and AT fine-tuning, with the following differences: we use learned positional embeddings (Ghazvininejad et al., 2019) and set the maximum positions to 10,000.

Main Results
The main results on three language pairs are presented in Table 5. When using CeMAT to initialize the Mask-Predict model, we observe significant improvements (from +0.9 to +5.3 BLEU) on all tasks, for an average gain of +2.5 BLEU. We also achieve higher results than the AT model in both the En→De (+2.8 BLEU) and De→En (+0.9 BLEU) directions on the IWSLT14 dataset, an extremely low-resource scenario where training from scratch is harder and pre-training is more effective.
As illustrated in Table 5, CeMAT outperforms mRASP by a significant margin on all tasks, with an average gain of +1.4 BLEU. In particular, under the low-resource IWSLT14 De→En setting, we achieve a large gain of +3.4 BLEU over mRASP. Overall, mRASP shows only limited improvements (+0.4 to +1.9 BLEU) compared to CeMAT. This suggests that although a traditional pre-training method can be fine-tuned on the NAT task, it does not bring improvements as significant as on the AT task, because of the gap between the pre-training and fine-tuning tasks.
We further compare the decoding dynamics on the three language pairs during iterative decoding, as shown in Appendix D. We need only 3 to 6 iterations to achieve the best score, and maintain rapid improvements throughout the iterations. In contrast, mRASP needs 6 to 9 iterations to reach its best result. We also observe that performance across iterations is unstable for both mRASP and Mask-Predict, whereas CeMAT is more stable. We conjecture that our pre-trained model learns more information about the relations between words, both within and across languages. This ability alleviates the drawback of the NAT assumption that individual token predictions are conditionally independent of each other.

Related Work
Multilingual Pre-training Task Conneau and Lample (2019) and Devlin et al. (2019) proposed to pre-train a cross-lingual language model on multilingual corpora, after which the encoder or decoder of a model is initialized independently for fine-tuning. Song et al. (2019), Yang et al. (2020), and others directly pre-trained a Seq2Seq model by reconstructing part or all of the inputs, achieving significant performance gains. Recently, mRASP (Lin et al., 2020) and CSP (Yang et al., 2020) applied code-switching to simply perform random substitution on the source side. Another similar work, DICT-MLM (Chaudhary et al., 2020), introduces a multilingual dictionary and pre-trains an MLM by masking words and predicting their cross-lingual synonyms. mRASP2 (Pan et al., 2021) also uses code-switching on monolingual and bilingual data to improve effectiveness, but it is essentially a multilingual AT model. Compared to previous works: 1) CeMAT is the first pre-trained Seq2Seq model with a bidirectional decoder; 2) we introduce aligned code-switching & masking which, unlike traditional code-switching, adds two steps: aligning between source and target, and CSM; 3) we also introduce a dynamic dual-masking method.

Autoregressive Neural Machine Translation
Our work is also related to AT, which adopts an encoder-decoder framework (Sutskever et al., 2014). To improve performance, back-translation, forward-translation, and related techniques have been proposed to exploit monolingual corpora (Sennrich et al., 2016a; Zhang and Zong, 2016; Edunov et al., 2018; Hoang et al., 2018). Prior works also attempted to jointly train a single multilingual translation model that translates in many directions at the same time (Firat et al., 2016; Johnson et al., 2017; Aharoni et al., 2019). In this work, we focus on pre-training a multilingual language model that provides initialization parameters for all language pairs; moreover, our method can use other languages to further improve high-resource tasks.

Non-autoregressive Neural Machine Translation Gu et al. (2018) first introduced a Transformer-based method to predict the complete target sequence in parallel. To reduce the gap with AT models, Lee et al. (2018) and Ghazvininejad et al. (2019) proposed to decode the target sentence with iterative refinement. Wang et al. (2019b) and Sun et al. (2019) used auxiliary information to enhance the performance of NAT. The work most related to ours is Guo et al. (2020), which uses BERT to initialize NAT. In this work, CeMAT is the first attempt to pre-train a multilingual Seq2Seq language model for the NAT task.

Conclusion
In this paper, we demonstrate that multilingually pre-training a sequence-to-sequence model with a bidirectional decoder produces significant performance gains for both Autoregressive and Non-autoregressive Neural Machine Translation. Benefiting from conditional masking, the decoder modules, especially the cross-attention, can learn word representations and cross-lingual representations more easily. We further introduce aligned code-switching & masking to align the representation spaces of words with similar semantics in different languages, and a dynamic dual-masking strategy to induce the bidirectional decoder to actively obtain information from the source side; we verified the effectiveness of both methods. In the future, we will investigate more effective word alignment methods for aligned code-switching & masking.

B Statistics of Five Different Scenarios
We present dataset statistics for fine-tuning corpora in Table 7.

C Detailed Ablation Experiments
We show more detailed results of the ablation experiments on two language pairs in Table 8.

D Performance with Iterations for NAT
We present the dynamic performance on three language-pair datasets during iterative decoding in Figures 3, 4, 5, 6, 7, and 8.

Table 8: Verification of the effectiveness of different techniques on two language pairs: Kk-En and Et-En. "w/ Bilingual" denotes that we use only bilingual data when pre-training CeMAT; "w/ Monolingual" denotes that we use only monolingual data; "w/ Bi- & Monolingual" denotes that we use both bilingual and monolingual data; "w/o Aligned CS masking" denotes pre-training CeMAT without the aligned code-switching & masking algorithm; "w/o Dynamic (masking: 0.15)" means we use a fixed masking ratio of 0.15 for dual-masking; "w/o Dynamic (masking: 0.35)" means we use a fixed masking ratio of 0.35 for a fairer comparison with dynamic masking. To save computational resources, we use Transformer-base to obtain all results of this experiment.