Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Back-translation is a critical component of Unsupervised Neural Machine Translation (UNMT), which generates pseudo parallel data from target monolingual data. A UNMT model is trained on the pseudo parallel data with translated source, and translates natural source sentences at inference. This source discrepancy between training and inference hinders the translation performance of UNMT models. By carefully designed experiments, we identify two representative characteristics of the data gap in the source: (1) style gap (i.e., translated vs. natural text style) that leads to poor generalization capability; (2) content gap that induces the model to produce hallucinated content biased towards the target language. To narrow the data gap, we propose an online self-training approach, which simultaneously uses the pseudo parallel data {natural source, translated target} to mimic the inference scenario. Experimental results on several widely-used language pairs show that our approach outperforms two strong baselines (XLM and MASS) by remedying the style and content gaps.


Introduction
In recent years, there has been a growing interest in unsupervised neural machine translation (UNMT), which requires only monolingual corpora to accomplish the translation task (Lample et al., 2018a,b; Artetxe et al., 2018b; Yang et al., 2018; Ren et al., 2019). The key idea of UNMT is to use back-translation (BT) (Sennrich et al., 2016) to construct

Source Target
           Source   Target
Train      X*       Y
Inference  X        Y*

Table 1: {X*, Y} is the translated pseudo parallel data which is used for UNMT training on X ⇒ Y translation. The input discrepancy between training and inference: 1) Style gap: X* is in the translated style, while X is in the natural style; 2) Content gap: the content of X* biases towards the target language Y due to the back-translation manipulation, while the content of X biases towards the source language X.
the pseudo parallel data for translation modeling. Typically, UNMT back-translates the natural target sentence into a synthetic source sentence (translated source) to form the training data. A BT loss is calculated on the pseudo parallel data {translated source, natural target} to update the parameters of UNMT models. In Supervised Neural Machine Translation (SNMT), Edunov et al. (2020) found that BT suffers from the translationese problem (Zhang and Toral, 2019; Graham et al., 2020), in which BT improves the BLEU score on the target-original test set with limited gains on the source-original test set. Unlike SNMT, which has authentic parallel data available for training, the UNMT training data entirely comes from pseudo parallel data generated by back-translation. Therefore, in this work, we first revisit the problem in the UNMT setting and start our research from an observation ( §2): with comparable translation performance on the full test set, the BT based UNMT models achieve better translation performance than the SNMT model on the target-original (i.e., translationese) test set, while achieving worse performance on the source-original ones.
In addition, the pseudo parallel data {translated source, natural target} generated by BT poses great challenges for UNMT, as shown in Table 1. First, there exists an input discrepancy between the translated source (translated style) in the UNMT training data and the natural source (natural style) in the inference data. We find that the poor generalization capability caused by the style gap (i.e., translated style vs. natural style) limits the UNMT translation performance ( §3.1). Second, the translated pseudo parallel data suffers from the language coverage bias problem, in which the content of the UNMT training data biases towards the target language while the content of the inference data biases towards the source language. The content gap results in hallucinated translations (Lee et al., 2018; Wang and Sennrich, 2020) biased towards the target language ( §3.2).
To alleviate the data gap between training and inference, we propose an online self-training (ST) approach to improve the UNMT performance. Specifically, besides the BT loss, the proposed approach also synchronously calculates the ST loss on the pseudo parallel data {natural source, translated target} generated by self-training to update the parameters of UNMT models. The pseudo parallel data {natural source, translated target} is used to mimic the inference scenario, in which the model takes natural source sentences as input, thereby bridging the data gap for UNMT. It is worth noting that the proposed approach does not incur extra computation to generate the pseudo parallel data {natural source, translated target} 2 , which makes the proposed method efficient and easy to implement.
We conduct experiments on the XLM (Lample and Conneau, 2019) and MASS (Song et al., 2019) UNMT models on multiple language pairs with varying corpus sizes (WMT14 En-Fr / WMT16 En-De / WMT16 En-Ro / WMT20 En-De / WMT21 En-De). Experimental results show that the proposed approach achieves consistent improvement over the baseline models. Moreover, we conduct extensive analyses to understand the proposed approach better, and the quantitative evidence reveals that the proposed approach narrows the style and content gaps to achieve the improvements.
2 The vanilla UNMT model adopts the dual structure to train both translation directions together, and the pseudo parallel data {natural source, translated target} has already been generated and is used to update the parameters of UNMT model in the reverse direction.
In summary, the contributions of this work are as follows:
• Our empirical study demonstrates that the back-translation based UNMT framework suffers from the translationese problem, causing inaccurate evaluation of UNMT models on standard benchmarks.
• We empirically analyze the data gap between training and inference for UNMT and identify two critical factors: style gap and content gap.
• We propose a simple and effective approach for incorporating the self-training method into the UNMT framework to remedy the data gap between the training and inference.
Let X and Y represent the collections of monolingual sentences of the source and target languages, where M and N are the sizes of the corresponding sets. Generally, UNMT methods based on BT adopt a dual structure to train a bidirectional translation model (Artetxe et al., 2018b, 2019; Lample et al., 2018a,b). For the sake of simplicity, we only consider the translation direction X → Y unless otherwise stated.
Online BT. The current mainstream of UNMT methods turns the unsupervised task into a synthetic supervised task through BT, which is the most critical component of UNMT training. Given the translation task X → Y where the target corpus Y is available, for each batch, the target sentence y ∈ Y is used to generate its synthetic source sentence with the backward model MT_Y→X:

x* = MT_Y→X(y; θ̄),   (1)

where θ̄ is a fixed copy of the current parameters θ, indicating that the gradient is not propagated through θ̄. In this way, the synthetic parallel sentence pair {x*, y} is obtained and used to train the forward model MT_X→Y in a supervised manner by minimizing:

L_BT(θ) = −log P(y | x*; θ).   (2)

It is worth noting that the synthetic sentence pairs generated by BT are the only supervision signal of UNMT training.
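As a toy illustration of the online BT step above, the sketch below builds the pseudo pair {x*, y} from a frozen backward model and scores the forward model on it. The dictionary-based translators and the token-mismatch loss are hypothetical stand-ins, not the paper's actual NMT models:

```python
# Minimal sketch of one online back-translation (BT) step for X -> Y.
# The translator stubs and the loss are toy stand-ins; in a real system
# both directions are the same NMT model with (frozen) parameters.

def backward_translate(y_sentence):
    """Stub for MT_{Y->X} with frozen parameters (no gradient flows here)."""
    lexicon = {"house": "Haus", "the": "das", "big": "gross"}
    return [lexicon.get(tok, tok) for tok in y_sentence]

def forward_model(x_sentence):
    """Stub for MT_{X->Y}, the model being trained."""
    lexicon = {"Haus": "house", "das": "the", "gross": "big"}
    return [lexicon.get(tok, tok) for tok in x_sentence]

def bt_loss(x_star, y):
    """Toy BT 'loss': fraction of target tokens the forward model misses."""
    y_hat = forward_model(x_star)
    misses = sum(1 for a, b in zip(y_hat, y) if a != b)
    return misses / max(len(y), 1)

# One BT step: natural target y -> synthetic source x*, train on {x*, y}.
y = ["the", "big", "house"]
x_star = backward_translate(y)   # translated source, generated without gradient
loss = bt_loss(x_star, y)
```

The pair {x_star, y} corresponds to {translated source, natural target} in Table 1; the real loss would be the token-level negative log-likelihood of Eq. (2).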
Objective function. In addition to BT, denoising auto-encoding (DAE) is an additional loss term of UNMT training, which is denoted by L_D and is not the main topic discussed in this work.
In all, the final objective function of UNMT is:

L(θ) = L_BT(θ) + λ_D L_D(θ),   (3)

where λ_D is the hyper-parameter weighting the DAE loss term. Generally, λ_D starts from one and decreases as the training procedure continues 3 .
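For illustration, a decaying DAE weight can be implemented as a simple schedule. The linear decay below is only an assumed example; the actual schedule used by XLM/MASS may differ:

```python
def dae_weight(step, total_steps, lambda_start=1.0, lambda_end=0.0):
    """Linearly decay the DAE weight lambda_D from lambda_start to lambda_end.

    Illustrative schedule only: starts at 1.0 and decreases as training
    continues, as described for lambda_D in the text.
    """
    frac = min(step / total_steps, 1.0)  # clamp after total_steps
    return lambda_start + frac * (lambda_end - lambda_start)

# Usage in a hypothetical training loop:
#   loss = bt_loss + dae_weight(step, total_steps) * dae_loss
```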

Translationese Problem
To verify whether the UNMT model suffers from the input gap between training and inference, and is thus biased towards translated inputs and against natural inputs, we conduct comparative experiments between SNMT and UNMT models.
Setup We evaluate the UNMT and SNMT models on the WMT14 En-Fr, WMT16 En-De and WMT16 En-Ro test sets, following Lample and Conneau (2019) and Song et al. (2019). We first train the UNMT models on the above language pairs with model parameters initialized by the XLM and MASS models. Then, we train the corresponding SNMT models, whose performance on the full test sets is controlled to be comparable to UNMT by undersampling the training data. Finally, we evaluate the UNMT and SNMT models on the target-original and source-original test sets, whose inputs are translated and natural, respectively. Unless otherwise stated, we follow previous work (Lample and Conneau, 2019; Song et al., 2019) and use the case-sensitive BLEU score (Papineni et al., 2002) with the multi-bleu.perl 4 script as the evaluation metric. Please refer to Appendix B for the SacreBLEU results, and to Appendix A for the training details of the SNMT and UNMT models.

Results
We present the translation performance in terms of the BLEU score in Table 2. Our observations are:
• UNMT models perform close to the SNMT models on the full test sets, with at most a 0.3 BLEU difference on average (33.5/33.9 vs. 33.6).
• UNMT models outperform SNMT models on the target-original test sets (translated input).
• UNMT models underperform the SNMT models on the source-original test sets (natural input), with an average performance degradation of 4.4 and 4.2 BLEU points (28.7/28.9 vs. 33.1).
The above observations hold regardless of the pre-trained model and translation direction. In particular, the unsatisfactory performance of UNMT under natural input indicates that UNMT is overestimated on the previous benchmark. We attribute this phenomenon to the data gap between training and inference for UNMT: there is a mismatch between the natural inputs of the source-original test data and the back-translated inputs that UNMT employs for training. This work focuses on experiments with the source-original test sets (i.e., the input of an NMT system is generally natural), which is closer to the practical scenario. 5

Inference Input   PPL
Natural           242
Translated        219

Table 3: Perplexity (PPL) of natural and translated inference inputs under a language model trained on the UNMT translated source sentences.

In this section, we analyze the two representative characteristics of the data gap between training and inference for UNMT: the style gap and the content gap. We divide the test sets into two portions: the natural input portion, with source sentences originally written in the source language, and the translated input portion, with source sentences translated from the target language. Due to the limited space, we conduct the experiments with pre-trained XLM initialization and perform analysis with different kinds of inputs (i.e., natural and translated inputs) on De⇒En newstest2013-2018 unless otherwise stated.

Style Gap
To perform a quantitative analysis of the style gap, we adopt KenLM 6 to train a 4-gram language model on the UNMT translated source sentences 7 and use the language model to calculate the perplexity (PPL) of the natural and translated input sentences in the test sets. The experimental results are shown in Table 3. The lower perplexity value (219 < 242) indicates that, compared with the natural inputs, the UNMT translated training inputs have a style more similar to the translated inputs in the test sets.
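The style-gap measurement can be reproduced in miniature without KenLM. The bigram model with add-one smoothing below is a simplified stand-in for the paper's 4-gram LM: train it on "translated-style" text, then compare PPL on different inputs; stylistically closer inputs score lower PPL.

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Train an add-one-smoothed bigram LM (toy stand-in for KenLM's 4-gram)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])          # history counts
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams, len(vocab)

def perplexity(lm, sentence):
    """Per-token perplexity of a sentence under the smoothed bigram LM."""
    unigrams, bigrams, v = lm
    toks = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for a, b in zip(toks, toks[1:]):
        logp += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
    return math.exp(-logp / (len(toks) - 1))

# Toy "translated-style" training corpus.
lm = train_bigram_lm(["a b", "a b"])
```

A sentence matching the training style ("a b") receives lower PPL than a reordered one ("b a"), mirroring the 219-vs-242 comparison in Table 3.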
In order to further reveal the influence of the style gap on UNMT, we manually eliminate it and re-evaluate the models on the natural input portion of WMT16 De⇒En. Concretely, we first use the third-party Google Translate to translate the target English sentences of the test sets into the source language German, eliminating the style gap. We then conduct translation experiments on the natural input portion and its Google-translated counterpart to evaluate the impact of the style gap on translation performance. We list the experimental results in Table 4. We find that by converting from the natural inputs (natural De) to the translated inputs (translated De*), the UNMT model achieves a larger improvement than the SNMT model (-2.8 > -6.3), demonstrating that the style gap inhibits the quality of UNMT translation outputs.

Content Gap
In this section, we show the existence of the content gap by (1) listing the most frequent named entities and (2) calculating the content similarity between the training and inference data using term frequency-inverse document frequency (TF-IDF).
We use spaCy 8 to recognize German named entities in the UNMT translated source sentences and in the natural and translated inputs of the test sets, and show the ten most frequent named entities in Table 5. From the table, we observe that the UNMT translated source sentences have few named entities biased towards the source language German (words in red color), while having more named entities biased towards the target language English, e.g., USA, Obama. This indicates that the content of the UNMT translated source sentences is biased towards the target language English.
Meanwhile, the natural input portion of the inference data has more named entities biased towards source language German (words in red color), demonstrating that the content gap exists between the natural input portion of the inference data and the UNMT translated training data.
Next, we remove the stop words and use the term frequency-inverse document frequency (TF-IDF) approach to calculate the content similarity between the training and inference data. Similarity scores are presented in Table 6. We observe that the UNMT translated source data has a higher similarity score with the translated inputs, which are generated from the target English sentences. This result indicates that the content of the UNMT translated source data is biased towards the target language, consistent with the findings in Table 5.
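A minimal TF-IDF similarity computation of this kind can be sketched in pure Python. This is an illustrative variant (with smoothed idf), not the exact formulation used for Table 6; in practice a library such as scikit-learn's TfidfVectorizer would be the natural choice:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """TF-IDF vectors (smoothed idf) for documents given as token lists."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: (c / len(d)) * (math.log((1 + n) / (1 + df[t])) + 1)
                     for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpora: BT training data content vs. two kinds of inference input.
docs = [["obama", "usa", "usa"],            # BT train data (English-biased)
        ["obama", "usa", "election"],       # translated inference input
        ["merkel", "berlin", "bundestag"]]  # natural inference input
vecs = tfidf_vectors(docs)
```

Here the English-biased training document is more similar to the translated input than to the natural one, mirroring the asymmetry reported in Table 6 (0.93 vs. 0.84).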
Table 5: Ten most frequent entities in the source sentences (i.e., German) of the back-translated training data ("BT Train Data"). For reference, we also list the most frequent entities in the natural and translated inference inputs. The BT training data has more entities biased towards the target language English (blue words) rather than the expected source language German (red words).

                 Inference Input
Train         Natural    Translated
Natural       0.95       0.85
Translated    0.84       0.93

Table 6: Content similarity (TF-IDF) between the training data and the inference inputs.

As it is difficult to measure named entity translation accuracy in terms of the BLEU evaluation metric, we provide a translation example in Table 7 to show the effect of the content gap on UNMT translations (more examples in Appendix C). We observe that the UNMT model outputs the hallucinated translation "U.S.", which is biased towards the target language English. We present a quantitative analysis of the impact of the content gap on UNMT translation performance in Section 6.2.

Online Self-training for UNMT
To bridge the data gap between training and inference of UNMT, we propose a simple and effective method based on self-training. For the translation task X → Y, we generate source-original training samples from the source corpus X to improve the model's translation performance on natural inputs. For each batch, we apply the forward model MT_X→Y to the natural source sentence x to generate its translation:

y* = MT_X→Y(x; θ̄).   (4)

In this way, we build a sample {x, y*} with natural input, on which the model can be trained by minimizing:

L_ST(θ) = −log P(y* | x; θ).   (5)

Under the framework of UNMT training, the final objective function can be formulated as:

L(θ) = L_BT(θ) + λ_D L_D(θ) + λ_S L_ST(θ),   (6)

where λ_S is the hyper-parameter weighting the self-training loss term. It is worth noting that the generation step of Eq. (4) has already been performed by the BT step of the Y → X direction. Thus, the proposed method does not increase the training cost significantly but makes the most of the data generated by BT.

Table 8: Translation performance on WMT14 En-Fr, WMT16 En-De, WMT16 En-Ro and their corresponding source-original (natural input) and target-original (translated input) subsets. "↑ / ⇑": significant over the corresponding baseline model (p < 0.05/0.01), tested by bootstrap resampling (Koehn, 2004).

Main Result

Table 8 shows the translation performance of the XLM and MASS baselines and our proposed models. We have the following observations:
• Our re-implemented baseline models achieve comparable or even better performance than reported in previous works. The reproduced XLM+UNMT model has an average improvement of 1.4 BLEU points over the original report in Lample and Conneau (2019).
• Our approach with online self-training significantly improves overall translation performance (+0.8 BLEU on average). This demonstrates the universality of the proposed approach on both large-scale (En-Fr, En-De) and data-imbalanced (En-Ro) corpora.
• In the translated input scenario, our approach achieves comparable performance to the baselines. This demonstrates that although the self-training samples are source-original in style, our approach does not sacrifice performance on the target-original side.
• In the natural input scenario, our proposed approach achieves more significant improvements, with +1.1 and +1.3 average BLEU over the two baselines. The reason is that the source-original style samples introduced by self-training alleviate the model's bias between natural and translated inputs.
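To make the training-cost argument of §4 concrete, the dual-direction step can be sketched as follows. The translator stubs are toy stand-ins for the frozen NMT model during generation; the point is that each synthetic sentence decoded for a BT pair is reused, at no extra decoding cost, as the translated target of the reverse-direction ST pair:

```python
# Sketch of one dual-direction UNMT step with online self-training (ST).
# translate_* are hypothetical stubs; in the real system both are the
# current NMT model with frozen parameters during generation.

def translate_x_to_y(x):            # stub MT_{X->Y}
    return [tok.upper() for tok in x]

def translate_y_to_x(y):            # stub MT_{Y->X}
    return [tok.lower() for tok in y]

def training_step(x_batch, y_batch):
    """Return the pseudo pairs consumed by the BT and ST losses in one step."""
    y_star = translate_x_to_y(x_batch)  # decoded once, for BT of Y->X ...
    x_star = translate_y_to_x(y_batch)  # decoded once, for BT of X->Y ...
    return {
        "bt_x2y": (x_star, y_batch),    # {translated source, natural target}
        "bt_y2x": (y_star, x_batch),
        # ... and reused here: ST pairs need no extra decoding pass.
        "st_x2y": (x_batch, y_star),    # {natural source, translated target}
        "st_y2x": (y_batch, x_star),
    }

pairs = training_step(["haus"], ["HOUSE"])
```

Each generated sentence appears in exactly two pairs (one BT, one ST), which is why online ST adds loss computation but no generation cost.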

Comparison with Offline Self-training and CBD
We compare online self-training with the following two related methods, which also incorporate natural inputs in training:
• Offline Self-training: a model distilled from the forward- and backward-translated data generated by a trained UNMT model.
• CBD (Nguyen et al., 2021): a model distilled from the data generated by two trained UNMT models through cross-translation, which embraces data diversity.

Table 9: Comparison with offline self-training and CBD 11 . "↑ / ⇑": significant over the corresponding baseline model (p < 0.05/0.01), tested by bootstrap resampling (Koehn, 2004). The training cost is estimated by the time required for training one epoch, where the cost of data generation is also considered.
Dataset Previous studies have recommended restricting test sets to natural input sentences, a methodology adopted by the 2019-2020 edition of the WMT news translation shared task (Edunov et al., 2020). In order to further verify the effectiveness of the proposed approach, we also conduct the evaluation on WMT19 and WMT20 En-De test sets. Both test sets contain only natural input samples.
Results Experimental results are presented in Table 9, together with the training costs of these methods. We find that:
• Unexpectedly, offline self-training yields no significant improvement over the baseline UNMT. Previous work has demonstrated the effectiveness of offline self-training in UNMT under low-resource and data-imbalanced scenarios. However, in our data-sufficient scenarios, offline self-training may suffer from a data diversity problem, while online self-training can alleviate this problem through the dynamic model parameters during the training process. We leave the complete analysis to future work.
• CBD achieves a significant improvement compared to baseline UNMT, but the training cost is about six times that of online self-training.
• The proposed online self-training achieves the best translation performance in terms of BLEU score, which further demonstrates the superiority of the proposed method under natural input.

11 Our re-implemented CBD model cannot achieve performance comparable with Nguyen et al. (2021), with 28.4 and 35.2 BLEU scores on the WMT16 En-De and De-En test sets.

Translationese Output
Since the self-training samples have translated sentences on the target side, there is a concern that the improvement achieved by self-training comes only from making the model outputs better match the translated references, rather than from enhancing the model's ability on natural inputs. To dispel this concern, we conduct the following experiments: (1) evaluating the fluency of model outputs in terms of language model PPL and (2) evaluating the translation performance on the Google Paraphrased WMT19 En⇒De test sets (Freitag et al., 2020).
Output fluency We exploit the monolingual corpora of the target languages to train 4-gram language models. Table 10 shows the language models' PPL on model outputs for the test sets mentioned in §5.2. We find that online self-training has only a slight impact on the fluency of model outputs, with the average PPL of the XLM and MASS models increasing by only +3 and +6, respectively. We ascribe this phenomenon to the translated target of the self-training samples, which is model-generated and thus less fluent than natural sentences. However, since the target of the BT data is natural and the BT loss term is the primary training objective, the output fluency does not decrease significantly.
Translation performance on paraphrased references Freitag et al. (2020) collected additional human translations for newstest2019 with the ultimate aim of generating a natural-to-natural test set. We adopt HQ(R) and HQ(all 4), which have higher human adequacy rating scores, to re-evaluate the translation performance.

Data Gap
Style Gap From Table 8, our proposed approach achieves significant improvements on the natural input portion while not gaining on the translated input portion over the baselines. It indicates our approach has better generalization capability on the natural input portion of test sets than the baselines.

Content Gap
To verify that our proposed approach bridges the content gap between training and inference, we calculate the accuracy of NER translation for the different models. Specifically, we adopt spaCy to recognize the named entities in the reference and translation outputs, and treat the named entities in the reference as the ground truth to calculate the accuracy of NER translation. We show the results in Table 12. Our proposed method achieves a significant improvement in the translation accuracy of named entities compared to the baseline. The result demonstrates that online self-training can help the model pay more attention to the input content rather than being affected by the content of the target language training corpus.
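The NER translation accuracy metric can be sketched as follows. Entity extraction itself is assumed to be done beforehand (e.g., with spaCy's `Doc.ents`); the function below only scores the extracted entity lists, and its exact matching criterion is our assumption rather than the paper's specification:

```python
def ner_translation_accuracy(ref_entities, hyp_entities):
    """Fraction of (unique) reference entities preserved in the hypothesis.

    ref_entities / hyp_entities: lists of entity strings extracted from the
    reference and the system translation, respectively. The reference
    entities are treated as ground truth.
    """
    ref = set(ref_entities)
    if not ref:
        return 1.0  # nothing to translate correctly
    hyp = set(hyp_entities)
    return sum(1 for e in ref if e in hyp) / len(ref)
```

For the example in Table 7, the hallucinated output containing "U.S." instead of "Germany" would score 0.0 under this metric.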

Target Quality
Next, we investigate the impact of target quality on ST. We use the SNMT model from §2.2 to generate the ST data rather than the current model itself, and keep the BT process unchanged. As shown in Table 2, the SNMT models perform well on the source-original test sets and thus yield higher-quality targets in the ST data. We denote this variant as "knowledge distillation (KD)" and report the performance on WMT19/20 En⇔De in Table 13. When the target quality gets better, model performance improves significantly, as expected. Therefore, reducing the noise on the target side of the ST data may further improve performance. Implementing this in an unsupervised manner is left to future work.

Unsupervised Neural Machine Translation
Before attempts to build NMT models using monolingual corpora only, unsupervised cross-lingual embedding mappings had been well studied by Zhang et al. (2017), Artetxe et al. (2017, 2018a) and Conneau et al. (2018). These methods try to align the word embedding spaces of two languages without parallel data and thus can be exploited for unsupervised word-by-word translation. Initialized by the cross-lingual word embeddings, Artetxe et al. (2018b) and Lample et al. (2018a) concurrently proposed UNMT, which achieved remarkable performance for the first time using monolingual corpora only. Both of them rely on online back-translation and denoising auto-encoding. After that, Lample et al. (2018b) proposed joint BPE for related languages and combined the neural and phrase-based methods. Artetxe et al. (2019) warmed up the UNMT model with an improved statistical machine translation model. Lample and Conneau (2019) proposed cross-lingual language model pretraining, which obtained large improvements over previous works. Song et al. (2019) extended the pretraining framework to sequence-to-sequence. Tran et al. (2020) introduced data diversification in UNMT via cross-model back-translated distillation.
Data Augmentation Back-translation (Sennrich et al., 2016; Edunov et al., 2018; Marie et al., 2020) and self-training (Zhang and Zong, 2016; He et al., 2020; Jiao et al., 2021) have been well studied in supervised NMT. In the unsupervised scenario, Tran et al. (2020) have shown that multilingual pretrained language models can be used to retrieve pseudo parallel data from large monolingual data. Han et al. (2021) use generative pre-trained language models, e.g., GPT-3, to perform zero-shot translations and use the translations as few-shot prompts to sample a larger synthetic translation dataset. The most related work to ours is the offline self-training technique used to enhance low-resource UNMT. In this paper, the proposed online self-training method for UNMT can be applied to both high-resource and low-resource scenarios without extra computation to generate the pseudo parallel data.
Translationese Problem The translationese problem has been investigated in machine translation evaluation (Lembersky et al., 2012; Zhang and Toral, 2019; Edunov et al., 2020; Graham et al., 2020). These works aim to analyze the effect of translationese in bidirectional test sets. In this work, we revisit the translationese problem in UNMT and find that it causes inaccurate evaluation of UNMT performance, since the training data entirely comes from translated pseudo-parallel data.

Conclusion
Pseudo parallel corpora generated by back-translation are the foundation of UNMT. However, back-translation also causes the translationese problem and results in inaccurate evaluation of UNMT performance. We attribute the problem to the data gap between training and inference and identify two such gaps, i.e., the style gap and the content gap. We conduct experiments to evaluate the impact of the data gap on translation performance and propose an online self-training method to alleviate the data gap problem. Our experimental results on multiple language pairs show that the proposed method achieves consistent and significant improvements over the strong baseline XLM and MASS models on test sets with natural input.

A Training Details

A.1 Training Details of SNMT Model

Model We initialize the model parameters with the XLM pre-trained model and adopt 2500 tokens/batch to train the SNMT model for 40 epochs. We select the best model by BLEU score on the validation set mentioned in §5.1. Note that in order to avoid introducing other factors, our SNMT models are bidirectional, which is consistent with the UNMT models.

A.2 Training Details of UNMT Model
Training data Table 14 lists the monolingual data used in this study to train the UNMT models 12 . We filter the training corpus based on language and remove sentences containing URLs.
Model We adopt the pre-trained XLM models released by Lample and Conneau (2019) and MASS models released by Song et al. (2019) for all language pairs. In order to better reproduce the results for MASS on En-De, we use monolingual data to continue pre-training the MASS pre-trained model for 300 epochs and select the best model by perplexity (PPL) on the validation set. We adopt 2500 tokens/batch to train the UNMT model for 70 epochs and select the best model by BLEU score on the validation set.
Hyper-parameter The target of the self-training samples is the model's own translation, which may be noisy in comparison with the reference. Therefore, we adopt the strategy of linearly increasing λ_S and keeping it at a small value to avoid negatively affecting the online back-translation training. We denote the initial and final values of λ_S by λ_S^0 and λ_S^1, respectively. We tune λ_S^0 within {0, 1e−3, 1e−2, 2e−2} and λ_S^1 within {5e−3, 5e−2, 1e−1, 1.5e−1} based on the BLEU score on the validation sets.

12 All the data is available at http://www.statmt.org/wmt20/translation-task.html except for En-De, which we will release in our GitHub repo.
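The linear ramp-up of λ_S can be sketched as follows. The warmup length and the default endpoint values are placeholders chosen from the tuning grids above; the actual values are selected on the validation sets:

```python
def st_weight(step, warmup_steps, lambda_s_0=0.0, lambda_s_1=5e-2):
    """Linearly increase lambda_S from lambda_s_0 to lambda_s_1.

    After warmup_steps the weight stays at lambda_s_1, keeping the ST loss
    small so it does not overwhelm the online back-translation training.
    The endpoints here are illustrative picks from the tuning grids.
    """
    frac = min(step / warmup_steps, 1.0)  # clamp once warmup is done
    return lambda_s_0 + frac * (lambda_s_1 - lambda_s_0)
```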

B Sacrebleu Results
To be consistent with previous works (Lample and Conneau, 2019; Song et al., 2019; Nguyen et al., 2021), we use the multi-bleu.perl script in the main text to measure translation performance. However, Post (2018) has pointed out that multi-bleu.perl requires user-supplied preprocessing, so scores cannot be directly compared, and provides the sacreBLEU 13 tool to facilitate this. Although we adopted the same preprocessing steps for all models, we additionally report BLEU scores calculated with sacreBLEU 14 in this section. Tables 15 to 19 show the sacreBLEU results of Tables 2, 4, 8, 9 and 13, respectively.

C Translation Examples

Source: Deutschland schiebe ein Wohnungsdefizit vor sich her, das von Jahr zu Jahr größer wird.
Reference: Germany has a housing deficit which increases every year.
UNMT: The U.S. was shooting ahead of a housing deficit that is expected to grow from year to year.