Improving Neural Machine Translation by Bidirectional Training

We present a simple and effective pretraining strategy, bidirectional training (BiT), for neural machine translation. Specifically, we bidirectionally update the model parameters at the early stage of training and then tune the model normally. To achieve bidirectional updating, we simply reconstruct the training samples from "src→tgt" to "src+tgt→tgt+src" without any complicated model modification. Notably, our approach does not increase any parameters or training steps, requiring merely the parallel data. Experimental results show that BiT pushes SOTA neural machine translation performance significantly higher across 15 translation tasks on 8 language pairs (data sizes ranging from 160K to 38M). Encouragingly, our proposed model can complement existing data manipulation strategies, i.e. back translation, data distillation, and data diversification. Extensive analyses show that our approach functions as a novel bilingual code-switcher, obtaining better bilingual alignment.


Introduction
Recent years have seen a surge of interest in neural machine translation (NMT; Luong et al., 2015; Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017), which benefits from a massive amount of training data. However, obtaining such large amounts of parallel data is non-trivial in most machine translation scenarios. For example, many low-resource language pairs (e.g. English-to-Tamil) lack adequate parallel data for training.
Although many approaches to fully exploiting parallel and monolingual data have been proposed, e.g. back translation (Sennrich et al., 2016a), knowledge distillation (Kim and Rush, 2016) and data diversification (Nguyen et al., 2020), the prerequisite of these approaches is a well-performing baseline model built on the parallel data. However, Koehn and Knowles (2017), Lample et al. (2018) and Sennrich and Zhang (2019) empirically reveal that NMT performs worse than its statistical or even unsupervised counterparts in low-resource conditions. A question naturally arises: can we find a strategy to consistently improve NMT performance given merely the parallel data?
We seek a solution in human learning behavior. Pavlenko and Jarvis (2002), Dworin (2003) and Chen et al. (2015) show that bidirectional language learning helps master bilingualism. In the context of machine translation, both the source→target and target→source language mappings may benefit bilingual modeling, which motivates many recent studies, e.g. dual learning (He et al., 2016) and symmetric training (Cohn et al., 2016; Liang et al., 2007). However, these approaches rely on external resources (e.g. word alignment or monolingual data) or complicated model modifications, which limits their applicability to a broader range of languages and model structures. Accordingly, we propose a simple data manipulation strategy and transfer the bidirectional relationship through bidirectional training (§2.2). The core idea is to use a bidirectional system as an initialization for a unidirectional system. Specifically, to make the most of the parallel data, we first reconstruct the training samples from "B→: source→target" to "B↔: source+target→target+source", which doubles the training data. We then update the model parameters with B↔ in the early stage, and tune the model in the normal "B→: source→target" direction. We validate our approach on several benchmarks across different language families and data sizes, including IWSLT14 En↔De, WMT16 En↔Ro, WMT19 En↔Gu, IWSLT21 En↔Sw, WMT14 En↔De, WMT19 En↔De, WMT17 Zh↔En and WAT17 Ja↔En. Experimental results show that the proposed bidirectional training (BiT) consistently and significantly improves translation performance over the strong Transformer (Vaswani et al., 2017). We also show that BiT can complement existing data manipulation strategies, i.e. back translation, knowledge distillation and data diversification.
Extensive analyses in §3.3 confirm that the performance improvement indeed comes from better cross-lingual modeling, and that our method works like a novel code-switching approach.

Preliminary
Given a source sentence x, an NMT model generates each target word y_t conditioned on the previously generated ones y_{<t}. Accordingly, the probability of generating y is computed as:

p(y|x; θ) = ∏_{t=1}^{T} p(y_t | y_{<t}, x; θ),   (1)

where T is the length of the target sequence, and the parameters θ are trained to maximize the likelihood of a set of training examples according to L(θ) = arg max_θ log p(y|x; θ). Typically, we choose the Transformer (Vaswani et al., 2017) for its SOTA performance. The training examples can be formally defined as:

B→ = {(x^i, y^i)}_{i=1}^{N},   (2)

where N is the total number of sentence pairs in the training data. Note that in standard MT training, x is fed into the encoder and y_{<t} into the decoder to finish the conditional estimation of y_t; thus the utilization of B→ is directional, i.e. x^i→y^i.
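As a concrete illustration, the factorized likelihood in Equation 1 can be sketched in a few lines. Here `step_probs` is a hypothetical stand-in for the per-step model probabilities p(y_t | y_{<t}, x; θ), i.e. the probability the model assigns to each gold token; the function name is illustrative, not from any codebase.

```python
import math

def sequence_log_prob(step_probs):
    """Log-likelihood of a target sequence under an autoregressive model.

    step_probs[t] stands in for p(y_t | y_{<t}, x; theta): the model
    probability of the gold token at step t. The sequence log-probability
    is the sum of the per-step log-probabilities (Equation 1 in log space).
    """
    return sum(math.log(p) for p in step_probs)

# Toy example: a 3-token target whose gold tokens receive these probabilities.
probs = [0.9, 0.5, 0.8]
log_p = sequence_log_prob(probs)  # log p(y|x) = sum_t log p(y_t | y_<t, x)
```

Training maximizes this quantity (equivalently, minimizes its negation as cross-entropy) averaged over all pairs in B→.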

Pretraining with Bidirectional Data
Motivation When humans learn a foreign language from translation examples, e.g. x^i and y^i, both directions of the example, i.e. x^i→y^i and y^i→x^i, may help them master the bilingual knowledge more easily. Motivated by this, Levinboim et al. (2015) and Liang et al. (2007) propose to model the invertibility between bilingual languages. Cohn et al. (2016) introduce an extra bidirectional prior regularization to achieve symmetric training from the point of view of the training objective.
Our Approach Many studies have shown that pretraining can transfer knowledge and data distributions, hence improving generalization (Hendrycks et al., 2019; Mathis et al., 2021).
Here we want to transfer the bidirectional knowledge contained in the corpus. Specifically, we propose to first pretrain MT models on the bidirectional corpus, which can be defined as:

B↔ = {(x^i, y^i)}_{i=1}^{N} ∪ {(y^i, x^i)}_{i=1}^{N},   (3)

such that the θ in Equation 1 can be updated in both directions. The bidirectional pretraining objective can then be formulated as:

L(θ) = L→(θ) + L←(θ),   (4)

where the forward L→(θ) and backward L←(θ) are optimized iteratively.
From the data perspective, we achieve bidirectional updating as follows: 1) swap the source and target sentences of the parallel corpus, and 2) append the swapped data to the original. The training data is thus doubled, making better and fuller use of the costly bilingual corpus. Pretraining can acquire general knowledge from the bidirectional data, which may help the model learn further tasks better and faster. Thus, we early-stop BiT at 1/3 of the total training steps (we discuss the reasonability of this choice in §3.1). To ensure the proper translation direction, we further train the pretrained model on the required direction B→ for the remaining 2/3 of the training steps. Considering the effectiveness of pretraining (Mathis et al., 2021) and clean finetuning (Wu et al., 2019b), we adopt this combined pipeline (pretrain on B↔, then finetune on B→) as our best training strategy. There are many possible ways to implement the general idea of bidirectional pretraining. The aim of this paper is not to explore the whole space but simply to show that one fairly straightforward implementation works well and that the idea is reasonable.
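The data-side recipe above (swap, append, early-stop at 1/3 of the budget) is simple enough to sketch directly. This is a minimal illustration; the function names are ours, not from the paper's codebase.

```python
def make_bidirectional(pairs):
    """Build the bidirectional corpus B<-> from B->: the original (src, tgt)
    pairs plus their swapped copies, doubling the training data."""
    return pairs + [(tgt, src) for (src, tgt) in pairs]

def training_schedule(total_steps):
    """Step budget split used in the paper: pretrain on the bidirectional
    corpus for 1/3 of the steps, then finetune on the required direction
    for the remaining 2/3. Returns (pretrain_steps, finetune_steps)."""
    pretrain = total_steps // 3
    return pretrain, total_steps - pretrain

corpus = [("Bush held a talk with Sharon", "布什 与 沙龙 举行 了 会谈")]
bi = make_bidirectional(corpus)
# bi now contains both the src->tgt pair and its tgt->src reversal.
```

Note that no model modification is involved: the same Transformer consumes the doubled corpus during pretraining and the original B→ during finetuning.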

Setup
Data The datasets for the main experiments are listed in Table 1. BLEU scores are computed with SacreBLEU (Post, 2018). The sign-test (Collins et al., 2005) is used for statistical significance testing.
Training For Transformer-BIG models, we empirically adopt the large-batch strategy (Edunov et al., 2018) (i.e. 458K tokens/batch) to optimize performance. The learning rate warms up for 10K steps (from an initial 1×10⁻⁷), and then decays for 30K steps (data volumes from 2M to 10M) or 50K steps (data volumes larger than 10M) with a cosine schedule. For Transformer-BASE models, we empirically adopt 65K tokens per batch for small data sizes, e.g. IWSLT14 En→De and WMT16 En→Ro; the learning rate warms up for 4K steps and then decays for 26K steps. For regularization, we tune the dropout rate over [0.1, 0.2, 0.3] based on validation performance, and apply weight decay of 0.01 and label smoothing with ε = 0.1. We use the Adam optimizer (Kingma and Ba, 2015) to train the models. We evaluate performance on an ensemble of the last 10 checkpoints to avoid stochasticity. One may suspect that BiT depends heavily on how the early-stop step is set. To dispel this doubt, we investigate whether our approach is robust to different early-stop steps. In preliminary experiments, we tried several fixed early-stop steps chosen according to the size of the training data (e.g. training En-De for 40K steps and early-stopping at 10K/15K/20K, respectively). We found that all strategies achieve similar performance. Thus, we chose a simple and effective rule (i.e. 1/3 of the total training steps) for better reproducibility.
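The warmup-then-cosine schedule above can be sketched as follows. The peak and final learning rates are illustrative assumptions (the paper specifies only the warmup/decay step counts and the 1×10⁻⁷ initial rate), so treat the numbers as placeholders rather than the exact configuration.

```python
import math

def lr_at(step, warmup_steps, decay_steps, peak_lr, init_lr=1e-7, final_lr=1e-9):
    """Warmup-then-cosine learning-rate schedule sketch.

    Linearly warms up from init_lr to peak_lr over warmup_steps, then
    follows a cosine decay from peak_lr down to final_lr over decay_steps.
    peak_lr and final_lr are assumptions, not values from the paper.
    """
    if step < warmup_steps:
        # Linear warmup from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Cosine decay from peak_lr to final_lr over decay_steps.
    t = min(step - warmup_steps, decay_steps) / decay_steps
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * t))
```

For the Transformer-BIG setting this would be called with `warmup_steps=10000` and `decay_steps=30000` or `50000` depending on data volume.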

Results
Results on Different Data Scales We experimented on 10 language directions, including IWSLT14 En↔De, WMT16 En↔Ro, IWSLT21 En↔Sw, WMT14 En↔De and WMT19 En↔De. The smallest dataset contains merely 160K sentence pairs, while the largest includes 38M. Table 1 reports the results: BiT achieves significant improvements over the strong Transformer baseline in 7 out of 10 directions under the significance test p < 0.01, and the remaining 3 directions also show promising performance under the significance test p < 0.05, demonstrating the effectiveness and universality of our proposed bidirectional pretraining strategy. Notably, one advantage of BiT is that it saves 1/3 of the training time for the reverse direction. For example, the pretrained BiT checkpoint for En→De can be used to tune the reverse direction De→En. This advantage suggests that BiT could be an efficient training strategy when multiple directions are trained, e.g. in multilingual MT tasks (Ha et al., 2016).

Results on Distant Language Pairs Inspired by Ding et al. (2021b), to dispel the doubt that BiT can only be applied to languages within the same language family, e.g. English and German, we report the results of BiT on the Zh↔En and Ja→En language pairs, which belong to different language families (i.e. Indo-European, Sino-Tibetan and Japonic). Table 2 lists the results: compared with the baselines, our method significantly and incrementally improves translation quality in all cases. In particular, BiT achieves an average +0.9 BLEU improvement over the baselines, showing the effectiveness and universality of our method across language pairs.
Complementary to Related Work Recent studies have started to combine pretraining with traditional data manipulation approaches for better model performance (Conneau and Lample, 2019; Liu et al., 2020b). Regarding related data manipulation works, we list three representative approaches for NMT: a) Tagged Back-Translation (BT, Caswell et al. 2019) combines synthetic data generated from target-side monolingual data with the parallel data; b) Knowledge Distillation (KD, Kim and Rush 2016) trains the model with sequence-level distilled parallel data; c) Data Diversification (DD, Nguyen et al. 2020) diversifies the data by applying KD and BT to the parallel data. As seen in Table 3, BiT can be applied on top of existing data manipulation approaches and yields further significant improvements.

Table 3: Complementary to other works. "/+BiT" means combining BiT with the corresponding work, and BLEU scores of BiT follow their counterparts with "/". Experiments are conducted on WMT14 En-De.
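To make the three strategies concrete, the data diversification pipeline (which subsumes the KD- and BT-style augmentations) can be sketched as below. The function names and the `translate(model, sents)` decoding helper are hypothetical; this is a sketch of the general recipe, not the implementation of Nguyen et al. (2020).

```python
def diversify(parallel, forward_models, backward_models, translate):
    """Data diversification sketch: augment the parallel data with
    forward-translated targets (KD-style) and back-translated sources
    (BT-style) produced by several trained models.

    `parallel` is a list of (src, tgt) pairs; `translate(model, sents)`
    is a hypothetical decoding helper returning one output per input.
    """
    augmented = list(parallel)
    srcs = [s for s, _ in parallel]
    tgts = [t for _, t in parallel]
    for fm in forward_models:
        # KD-style: pair real sources with synthetic targets.
        augmented += list(zip(srcs, translate(fm, srcs)))
    for bm in backward_models:
        # BT-style: pair synthetic sources with real targets.
        augmented += list(zip(translate(bm, tgts), tgts))
    return augmented
```

BiT plugs in orthogonally: the final (augmented) corpus can itself be doubled with swapped pairs for bidirectional pretraining before the usual unidirectional training.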

Analysis
We conducted analyses to better understand BiT. Unless otherwise stated, all results are reported on the WMT14 En-De dataset.
BiT works as a simple bilingual code-switcher Lin et al. (2020) and Yang et al. (2020) employ third-party tools to obtain alignment information for code-switching pretraining, where part of the source tokens are replaced with the aligned target ones. However, training such an alignment model is time-consuming, and alignment errors may be propagated. In fact, BiT can be viewed as a novel yet simple bilingual code-switcher, where the switch span is the whole sentence and the source- and target-side sentences are both replaced with probability 0.5. Take the sentence pair {"Bush held a talk with Sharon" → "布什 与 沙龙 举行 了 会谈"} in an English→Chinese dataset as an example: during the pretraining phase, the reconstructed corpus simultaneously contains {"Bush held a talk with Sharon" → "布什 与 沙龙 举行 了 会谈"} and its reversed version {"布什 与 沙龙 举行 了 会谈" → "Bush held a talk with Sharon"}. For the English→Chinese direction, the reversed sentence pair is exactly a sentence-level switch with probability 0.5. For a fair comparison, we implement the approaches of Lin et al. (2020) and Yang et al. (2020) in the bilingual-data-only scenario. Table 4 shows the superiority of BiT, indicating that BiT is a good alternative to code-switching in the bilingual scenario.

Table 6: Results for En↔Gu on WMT2019 test sets. "Ave. ∆" shows the averaged improvement of "Base+BiT" vs. "Base" and their corresponding "+BT" comparisons.
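The probability-0.5 view can be verified with a small simulation: sampling uniformly from the doubled corpus means that, for a fixed direction such as English→Chinese, the drawn pair is the fully reversed version about half the time. This is an illustrative sketch, not the paper's training code.

```python
import random

def sample_pair(bidirectional_corpus):
    """Uniform sampling from the doubled corpus. For a fixed direction,
    the sample is the whole-sentence 'code-switched' (reversed) pair
    with probability 0.5."""
    return random.choice(bidirectional_corpus)

pair = ("Bush held a talk with Sharon", "布什 与 沙龙 举行 了 会谈")
corpus = [pair, (pair[1], pair[0])]  # original + swapped version

reversed_draws = sum(sample_pair(corpus) == corpus[1] for _ in range(10000))
# reversed_draws / 10000 ≈ 0.5: the sentence-level switch occurs half the time.
```

Unlike token-level code-switching, no alignment model is needed, so there are no alignment errors to propagate.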
BiT improves alignment quality Our proposed BiT intuitively encourages self-attention to learn bilingual agreement, and thus has the potential to induce better attention matrices. We explore this hypothesis on the widely used Gold Alignment dataset and follow Tang et al. (2019) to perform the alignment, the only difference being that we average the attention matrices across all heads of the penultimate layer (Garg et al., 2019). Alignment error rate (AER, Och and Ney 2003), precision (P) and recall (R) are the evaluation metrics. Table 5 shows that BiT allows the Transformer to learn better attention matrices, thereby improving alignment performance (24.3 vs. 27.1 AER).
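For reference, the AER metric of Och and Ney (2003) compares a hypothesized alignment against gold "sure" and "possible" link sets. A minimal sketch (our own helper, with alignments represented as sets of index pairs):

```python
def aer(hypothesis, sure, possible):
    """Alignment error rate (Och and Ney, 2003), plus precision and recall.

    hypothesis, sure, possible: sets of (src_idx, tgt_idx) links, where
    sure is a subset of possible. Lower AER is better.
    """
    a_s = len(hypothesis & sure)       # hypothesis links matching sure links
    a_p = len(hypothesis & possible)   # hypothesis links matching possible links
    precision = a_p / len(hypothesis)
    recall = a_s / len(sure)
    error = 1.0 - (a_s + a_p) / (len(hypothesis) + len(sure))
    return error, precision, recall
```

A perfect hypothesis (all sure links recovered, no spurious links) yields AER = 0.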

BiT works for extremely low-resource settings
Researchers may worry that BiT fails in extremely low-resource settings where even back-translation does not work. To dispel this concern, we conduct experiments on WMT19 English↔Gujarati in Table 6. Specifically, we collect and preprocess the parallel data to build the base model "Base" and our "Base+BiT" model. For a fair comparison, we sample the monolingual English and Gujarati sentences to ensure Parallel:Monolingual = 1:1 when generating the synthetic data. As seen, when directly applying back-translation (BT) to the En↔Gu Base model, there is indeed a slight performance drop (-0.4 BLEU). However, our BiT significantly improves the initial Base model by an average of +1.0 BLEU, making BiT-equipped BT more effective than vanilla BT (+2.8 BLEU). These findings in extremely low-resource settings demonstrate that 1) our BiT consistently works well; and 2) BiT provides a better initial model, thus rejuvenating the effects of back-translation.

Conclusion and Future Works
In this study, we propose a pretraining strategy for NMT that requires merely parallel data. Experiments show that our approach significantly improves translation performance and can complement existing data manipulation strategies. Extensive analyses reveal that our method can be viewed as a simple yet effective bilingual code-switching approach that improves bilingual alignment quality. Encouragingly, with BiT, our system (Ding et al., 2021d) achieved first place in terms of BLEU in the IWSLT2021 low-resource track. It will be interesting to integrate BiT into our previous systems (Ding and Tao, 2019) and validate its effectiveness in industrial-level competitions, e.g. WMT. It is also worthwhile to explore the effectiveness of our proposed bidirectional pretraining strategy on multilingual NMT tasks (Ha et al., 2016).