Alternated Training with Synthetic and Authentic Data for Neural Machine Translation

While synthetic bilingual corpora have demonstrated their effectiveness in low-resource neural machine translation (NMT), adding more synthetic data often deteriorates translation performance. In this work, we propose alternated training with synthetic and authentic data for NMT. The basic idea is to alternate synthetic and authentic corpora iteratively during training. Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data. Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines. We visualize the BLEU landscape to further investigate the role of authentic and synthetic data during alternated training. From the visualization, we find that authentic data helps to direct the NMT model parameters towards points with higher BLEU scores and leads to consistent translation performance improvement.

Existing approaches to synthesizing data in NMT focus on leveraging monolingual data in the training process. Among them, back-translation (BT) (Sennrich et al., 2016a) has been widely used to generate synthetic bilingual corpora by using a trained target-to-source NMT model to translate large-scale target-side monolingual corpora. Such synthetic data can be used to improve source-to-target NMT models. Despite the effectiveness of back-translation, the synthetic data inevitably contains noise and erroneous translations. As a matter of fact, it has been widely observed that while BT is capable of benefiting NMT models with relatively small-scale synthetic data, further increasing the quantity often deteriorates translation performance (Edunov et al., 2018; Wu et al., 2019; Caswell et al., 2019).
This problem has attracted increasing attention in the NMT community (Edunov et al., 2018; Wang et al., 2019). One direction to alleviate the problem is to add noise or a special tag on the source side of synthetic data, which enables NMT models to distinguish between authentic and synthetic data (Edunov et al., 2018; Caswell et al., 2019). Another direction is to filter or evaluate the synthetic data by calculating confidence over corpora, enabling NMT models to better exploit synthetic data (Imamura et al., 2018; Wang et al., 2019). While these methods outperform the conventional BT approach, NMT models still suffer from performance degradation as the size of the synthetic data keeps increasing. Hence, how to better take advantage of limited authentic data and abundant synthetic data remains a grand challenge.
In this work, we propose alternated training with synthetic and authentic data for neural machine translation. The basic idea is to alternate between synthetic and authentic corpora iteratively during training. Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data. Our approach is inspired by the characterization of synthetic and authentic corpora as two different approximations of the distribution of infinite authentic data. We visualize the BLEU landscape to further investigate the roles of authentic and synthetic data during alternated training, and find that the authentic data helps to direct the NMT model parameters towards points with higher BLEU scores. Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over strong baselines.

Alternated Training
Let x be a source sentence and y be a target sentence. We use P(y|x; θ) to denote an NMT model parameterized by θ. Let D_a = {⟨x_n, y_n⟩}_{n=1}^N be an authentic parallel corpus containing N sentence pairs. Traditional NMT aims to obtain the parameters θ̂_a that maximize the log-likelihood on D_a, as given in Eq. (1). Suppose that there exists infinite authentic parallel data, which can be characterized as a distribution p(x, y). Synthesizing the large-scale corpus D_s is a way to better approach this authentic data distribution. Furthermore, the finite corpora D_a and D_s can be viewed as empirical approximations of p(x, y):

  p_a(x, y) = (1/N) Σ_{n=1}^N δ(x − x_n, y − y_n),
  p_s(x, y) = (1/M) Σ_{m=1}^M δ(x − x̂_m, y − y_m),

where δ represents the Dirac distribution. On the one hand, D_a is considered to be of higher quality, as lim_{N→∞} p_a(x, y) = p(x, y) exactly recovers the authentic data distribution. On the other hand, although D_s contains a certain amount of noise (so that lim_{M→∞} p_s(x, y) ≠ p(x, y) in general), it provides more diversified data samples that enable the NMT model to reconstruct the global distribution. As the two corpora are complementary to each other, we introduce authentic data periodically during the training process with synthetic data. Intuitively, alternated training with authentic corpora helps to rectify the deviation of the training direction caused by the noisy synthetic data and enhances model performance.

Algorithm 1: Alternated Training for NMT
Our proposed alternated training approach is shown in Algorithm 1. Starting from random initialization, each alternation cycle during training consists of two steps. For the t-th cycle, the first step is to finetune the model θ̂_a^(t) with Eq. (4) on D_s ∪ D_a until convergence to obtain θ̂_s^(t+1), which is referred to as the S-Step (line 4). The second step is to switch the training data back to D_a and finetune θ̂_s^(t+1) with Eq. (1) until convergence to obtain θ̂_a^(t+1), which is referred to as the A-Step (line 5). We alternate the two steps until the whole training process converges. Note that conventional back-translation is equivalent to a single S-Step of our approach.
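The alternation described above can be sketched as a toy control-flow example. Here a single 1-D parameter stands in for the full model and a simple score for validation BLEU; `finetune_until_convergence`, `alternated_training`, and the optimum values are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of Algorithm 1 (alternated training). A 1-D parameter replaces
# the NMT model and a toy score replaces BLEU; only the control flow is real.

def finetune_until_convergence(theta, target, lr=0.5, patience=3):
    """Move `theta` toward `target` until the toy score stops improving."""
    score = lambda th: -abs(th - target)   # stand-in for validation BLEU
    best, stale = score(theta), 0
    while stale < patience:
        theta += lr * (target - theta)     # stand-in for a training step
        if score(theta) > best:
            best, stale = score(theta), 0
        else:
            stale += 1
    return theta

def alternated_training(theta, opt_synthetic, opt_authentic, cycles=3):
    for _ in range(cycles):
        # S-Step: finetune on synthetic + authentic data (a noisier optimum).
        theta = finetune_until_convergence(theta, opt_synthetic)
        # A-Step: switch back to authentic data to rectify the deviation.
        theta = finetune_until_convergence(theta, opt_authentic)
    return theta

theta_final = alternated_training(0.0, opt_synthetic=1.2, opt_authentic=1.0)
```

Because every cycle ends with an A-Step on the (noise-free) authentic optimum, the toy parameter settles near it, mirroring the intuition that authentic data has the final say in each cycle.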

Setup
We evaluated our training strategy on Chinese-English and German-English translation tasks. We reported the tokenized BLEU score as calculated by multi-bleu.perl.
For the Chinese-English task, we extracted 1.25M parallel sentence pairs from LDC as our authentic bilingual corpus and 10M English-side sentences from the WMT17 Chinese-English training set as our monolingual corpus for back-translation. NIST06 was used as the validation set, and NIST02, 03, 04, 05, and 08 as test sets. For the German-English task, we selected the IWSLT14 German-English dataset, which contains 160k parallel sentence pairs for training. We further extracted 4.5M English-side sentences from the WMT14 German-English training set as the monolingual dataset. We segmented Chinese sentences with THULAC (Sun et al., 2016) and tokenized English and German sentences with Moses (Koehn et al., 2007). The vocabulary was built by Byte Pair Encoding (BPE) (Sennrich et al., 2016b) with 32k merge operations. We used the Transformer (Vaswani et al., 2017) implemented in THUMT (Tan et al., 2020) with standard hyperparameters as the base model. We used the Adam optimizer (Kingma and Ba, 2015) with β_1 = 0.9, β_2 = 0.98, and ε = 10^-9, and a maximum learning rate of 7 × 10^-4.
We applied early stopping to determine the convergence of each individual S/A-Step. If the validation BLEU failed to exceed the highest score of the current S/A-Step within 10k training iterations, we considered the model converged and alternated the training set. For the whole training process, we set the maximum number of training iterations to 250k for the Chinese-English task and 150k for the German-English task.
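The per-step convergence criterion above can be expressed as a small tracker. The class name and interface here are illustrative assumptions; only the rule (no new best validation BLEU within a 10k-iteration window) comes from the text.

```python
# Minimal early-stopping tracker for one S/A-Step: the step is considered
# converged once `patience_iters` training iterations pass without the
# validation BLEU exceeding the best score seen so far in this step.

class EarlyStopper:
    def __init__(self, patience_iters=10_000):
        self.patience = patience_iters
        self.best_bleu = float("-inf")
        self.best_iter = 0

    def update(self, iteration, bleu):
        """Record a validation result; return True if the step has converged."""
        if bleu > self.best_bleu:
            self.best_bleu, self.best_iter = bleu, iteration
            return False
        return iteration - self.best_iter >= self.patience

stopper = EarlyStopper(patience_iters=10_000)
assert stopper.update(1_000, 20.0) is False   # new best, keep training
assert stopper.update(5_000, 19.5) is False   # still within the patience window
assert stopper.update(11_000, 19.8) is True   # 10k iterations without a new best
```

A fresh tracker would be created at the start of each S-Step or A-Step, since the "highest score" is reset when the training set is alternated.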

Results
Figure 1 shows the comparison among several approaches at different training set scales on the Chinese-English task. The leftmost point is trained on the authentic data only, and the other points are trained on the combination of authentic and synthetic corpora. The X-axis shows the synthetic data scale, ranging from 1.25M (the size of the authentic data) to 10M (the full size of the monolingual corpus). The Y-axis shows the BLEU scores on the combined test set. We find that the performance of BT first rises but then decreases as more synthetic data is added, which confirms the findings of Wu et al. (2019). In contrast, our approach achieves consistent improvement as the synthetic data scale grows.
Table 1 shows the detailed translation performance on the Chinese-English task when the synthetic data scale is set to 10M. Our alternated training strategy outperforms conventional back-translation and tagged back-translation on all test sets. We find that during training, the S-Steps account for about 73% of the total training time and the A-Steps for 27%. This suggests that our training procedure consists mainly of S-Steps, and that moderate A-Steps suffice to guide the NMT model towards better points, which leads to the improvement in BLEU.
Table 2 shows the results of the German-English task. Similar to the Chinese-English task, we vary the synthetic data scale from 1M to 4.5M. We find that the performance degradation also occurs when utilizing large-scale synthetic data, and that alternated training alleviates the problem and performs better than the corresponding baselines.

Table 1: BLEU scores on the NIST Chinese-English task with 10M additional synthetic corpus. "Base" means only authentic data is used. "BT" corresponds to the back-translation method (Sennrich et al., 2016a). "BT-tagged" corresponds to the tagged BT technique proposed by Caswell et al. (2019). "AlterBT" means alternated training on authentic data and synthetic data using "BT" in each alternation. "AlterBT-tagged" means alternated training on authentic data and synthetic data using "BT-tagged" in each alternation. "+" means significantly better than BT (p < 0.01). "†" means significantly better than BT-tagged (p < 0.01).

Table 2: BLEU scores on the IWSLT14 German-English task with 1M and 4.5M additional synthetic corpus. "+" means significantly better than BT (p < 0.01). "†" means significantly better than BT-tagged (p < 0.01).

BLEU Landscape Visualization
To validate the assumption that the authentic data helps to rectify the deviation in synthetic data and redirect the NMT model parameters to a better optimization path, we further investigate the BLEU landscape to compare our method with the BT approach during the same training steps.
The visualization of the BLEU landscape is shown in Figure 2. Checkpoints during alternated training are projected onto the 2D plane defined by θ̂_s^(t), θ̂_a^(t), and θ̂_s^(t+1) (we select t = 2 for this visualization; similar behavior can be observed for other values of t). Our projection method considers both the model parameters and their translation performance (see Appendix A for details). For the conventional BT approach, the model parameters are stuck on an inefficient optimization path (highlighted in blue dashed lines). In our approach, we find that the authentic data effectively guides the model in a better direction in the A-Step (highlighted in red solid lines). In the S-Step (highlighted in red dashed lines), although training with synthetic data deteriorates the BLEU performance, it pushes the model away from the original route and enables the authentic data to further redirect the model to a better point with a higher BLEU score.
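The geometric part of such a projection can be sketched as follows. Tiny lists stand in for full parameter vectors, and the plane basis is built by standard Gram–Schmidt; the actual projection used for Figure 2 also accounts for BLEU and is described in the paper's Appendix A, which this sketch does not reproduce.

```python
# Project a checkpoint onto the 2D plane spanned by three parameter vectors.
from math import sqrt

def sub(u, v): return [a - b for a, b in zip(u, v)]
def dot(u, v): return sum(a * b for a, b in zip(u, v))

def plane_coords(theta, origin, p1, p2):
    """2D coordinates of `theta` in the plane through origin, p1, and p2."""
    e1 = sub(p1, origin)
    n1 = sqrt(dot(e1, e1))
    e1 = [a / n1 for a in e1]                      # first basis vector
    e2 = sub(p2, origin)
    e2 = sub(e2, [dot(e2, e1) * a for a in e1])    # Gram-Schmidt step
    n2 = sqrt(dot(e2, e2))
    e2 = [a / n2 for a in e2]                      # second basis vector
    d = sub(theta, origin)
    return dot(d, e1), dot(d, e2)
```

With the three anchor checkpoints as origin and axes, every intermediate checkpoint maps to a point in the plane, which is what makes the blue and red optimization paths in Figure 2 directly comparable.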

Related Work
Our work is based on back-translation (BT), an approach that leverages monolingual data through an additional target-to-source system. BT has proved effective in neural machine translation (NMT) systems (Sennrich et al., 2016a). Despite its effectiveness, BT is limited by the accuracy of the synthetic data: noise and translation errors hinder improvements in model performance (Hoang et al., 2018). The negative effects become more evident as more synthetic data is mixed into the training data (Caswell et al., 2019; Wu et al., 2019).
Considerable research has focused on the accuracy problem in synthetic data and further extended back-translation. Imamura et al. (2018) demonstrate that generating source sentences via sampling increases the diversity of synthetic data and benefits the BT system. Edunov et al. (2018) further propose a noisy beam search method to generate more diversified source-side data. Caswell et al. (2019) add a reserved token to synthetic source-side sentences to help NMT models distinguish between authentic and synthetic data. Another line of work measures the translation quality of synthetic data: Imamura et al. (2018) filter sentence pairs with low likelihood or low confidence, and Wang et al. (2019) use uncertainty-based confidence to evaluate words and sentences in synthetic data. Different from the aforementioned works, our approach introduces neither data modification (e.g., noising or tagging) nor additional models for evaluation; we simply alternate the training set between the original authentic and synthetic data.
The work closest to ours is Iterative Back-Translation (Hoang et al., 2018), which refines the forward and backward models with back-translation data and regenerates more accurate synthetic data from monolingual corpora. Our work differs from Iterative BT in that we require neither source-side monolingual corpora nor repeated finetuning of the backward model.

Conclusion
In this work, we propose alternated training with synthetic and authentic data for neural machine translation. Experiments have shown the superiority of our approach over strong back-translation baselines. Visualization of the BLEU landscape indicates that alternated training guides the NMT model towards better points.
Traditional NMT maximizes the log-likelihood on D_a:

  θ̂_a = argmax_θ Σ_{n=1}^N log P(y_n | x_n; θ).  (1)

Back-translation generates additional synthetic parallel data from the monolingual corpus. Let D_m = {y_m}_{m=1}^M be a monolingual corpus containing M target-side sentences. Back-translation first trains a target-to-source model θ̂_BT on D_a:

  θ̂_BT = argmax_θ Σ_{n=1}^N log P(x_n | y_n; θ),  (2)

which is then used to translate each sentence in the target-side monolingual corpus D_m:

  x̂_m = argmax_x P(x | y_m; θ̂_BT),  (3)

where m = 1, ..., M. The synthetic corpus D_s is generated by pairing the translations {x̂_m}_{m=1}^M with D_m, i.e., D_s = {⟨x̂_m, y_m⟩}_{m=1}^M. The required source-to-target model is finally trained on the combination of authentic and synthetic data:

  θ̂_s = argmax_θ { Σ_{n=1}^N log P(y_n | x_n; θ) + Σ_{m=1}^M log P(y_m | x̂_m; θ) }.  (4)
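The data flow of Eqs. (1)-(4) can be sketched schematically. The `train` and `translate` callables below are placeholder stubs standing in for a real NMT system; they are illustrative assumptions, not the paper's implementation.

```python
# Schematic back-translation pipeline: build D_s from monolingual targets,
# then return the combined training set D_a ∪ D_s for the forward model.

def back_translate(authentic_pairs, monolingual_targets, train, translate):
    """`authentic_pairs` is D_a as (source, target) tuples; returns D_a ∪ D_s."""
    # Eq. (2): train a target-to-source model on the reversed authentic pairs.
    reverse_pairs = [(y, x) for x, y in authentic_pairs]
    theta_bt = train(reverse_pairs)
    # Eq. (3): translate each monolingual target sentence back to the source.
    synthetic_pairs = [(translate(theta_bt, y), y) for y in monolingual_targets]
    # Eq. (4): the forward model is then trained on this combined corpus.
    return authentic_pairs + synthetic_pairs

# Toy stubs: "training" memorizes the pairs, "translation" looks them up.
train = lambda pairs: dict(pairs)
translate = lambda model, y: model.get(y, "<unk>")
d_mix = back_translate([("hallo", "hello")], ["hello", "world"], train, translate)
```

The stub translator makes the noise problem concrete: any target sentence the backward model cannot handle yields a degenerate synthetic source, which is exactly the kind of pair the A-Step is meant to counteract.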

Figure 1: Comparison with several baselines at different data scales. Our alternated approach outperforms the conventional back-translation method and improves the performance of Tagged BT. Moreover, as the synthetic data scale grows, the BLEU score rises steadily with alternated training.