Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation

Self-training has proven effective for improving NMT performance by augmenting model training with synthetic parallel data. The common practice is to construct synthetic data based on a randomly sampled subset of large-scale monolingual data, which we empirically show is sub-optimal. In this work, we propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data. To this end, we compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data. Intuitively, monolingual sentences with lower uncertainty generally correspond to easy-to-translate patterns which may not provide additional gains. Accordingly, we design an uncertainty-based sampling strategy to efficiently exploit the monolingual data for self-training, in which monolingual sentences with higher uncertainty would be sampled with higher probability. Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach. Extensive analyses suggest that emphasizing the learning on uncertain monolingual sentences by our approach does improve the translation quality of high-uncertainty sentences and also benefits the prediction of low-frequency words at the target side.


Introduction
Leveraging large-scale unlabeled data has become an effective approach for improving the performance of natural language processing (NLP) models (Devlin et al., 2019; Brown et al., 2020; Jiao et al., 2020a). As for neural machine translation (NMT), monolingual data is available in far larger quantities than parallel data for many languages. Several approaches for boosting NMT performance with monolingual data have been proposed, e.g., data augmentation (Sennrich et al., 2016a; Zhang and Zong, 2016), semi-supervised training (Cheng et al., 2016; Zhang et al., 2018; Cai et al., 2021), and pre-training (Siddhant et al., 2020). Among them, data augmentation with synthetic parallel data (Sennrich et al., 2016a; Edunov et al., 2018) is the most widely used approach due to its simple and effective implementation. It has become a de-facto standard in developing large-scale NMT systems (Hassan et al., 2018; Huang et al., 2021).
Self-training (Zhang and Zong, 2016) is one of the most commonly used approaches for data augmentation. Generally, self-training is performed in three steps: (1) randomly sample a subset from the large-scale monolingual data; (2) use a "teacher" NMT model to translate the subset into the target language to construct synthetic parallel data; (3) combine the synthetic and authentic parallel data to train a "student" NMT model. Recent studies have shown that synthetic data manipulation (Edunov et al., 2018; Caswell et al., 2019) and training strategy optimization (Wu et al., 2019b) in the last two steps can boost self-training performance significantly. However, how to efficiently and effectively sample the subset from the large-scale monolingual data in the first step has not been well studied. Intuitively, self-training simplifies the complexity of generated target sentences (Kim and Rush, 2016; Zhou et al., 2019; Jiao et al., 2020b), and easy patterns in monolingual sentences with deterministic translations may not provide additional gains over the self-training "teacher" model (Shrivastava et al., 2016). Related work in computer vision similarly reveals that easy patterns in unlabeled data with deterministic predictions may not provide additional gains (Mukherjee and Awadallah, 2020). In this work, we investigate and identify the uncertain monolingual sentences which implicitly hold difficult patterns, and exploit them to boost self-training performance. Specifically, we measure the uncertainty of monolingual sentences by using a bilingual dictionary extracted from the authentic parallel data (§2.1). Experimental results show that NMT models benefit more from monolingual sentences with higher uncertainty, except those with excessively high uncertainty (§2.3).
By conducting the linguistic property analysis, we find that extremely uncertain sentences contain relatively poor translation outputs, which may hinder the training of NMT models ( §2.4).
Inspired by the above finding, we propose an uncertainty-based sampling strategy for selftraining, in which monolingual sentences with higher uncertainty would be selected with higher probability ( §3.1). Large-scale experiments on WMT English⇒German and English⇒Chinese datasets show that self-training with the proposed uncertainty-based sampling strategy significantly outperforms that with random sampling ( §3.3). Extensive analyses on the generated outputs confirm our claim by showing that our approach improves the translation of uncertain sentences and the prediction of low-frequency target words ( §3.4).
Contributions. Our main contributions are: • We demonstrate the necessity of distinguishing monolingual sentences for self-training.
• We propose an uncertainty-based sampling strategy for self-training, which selects more complementary sentences for the authentic parallel data.
• We show that NMT models benefit more from uncertain monolingual sentences in selftraining, which improves the translation quality of uncertain sentences and the prediction accuracy of low-frequency words.

Observing Monolingual Uncertainty
In this section, we aim to understand the effect of uncertain monolingual data on self-training. We first introduce the metric for identifying uncertain monolingual sentences, then describe the experimental setup, and finally present our preliminary results.
Notations. Let X and Y denote the source and target languages, and let $\mathcal{X}$ and $\mathcal{Y}$ represent the sentence domains of the corresponding languages. Let $\mathcal{B} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the authentic parallel data, where $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$ and $N$ is the number of sentence pairs. Let $\mathcal{M}_x = \{x_j\}_{j=1}^{M_x}$ denote the collection of monolingual sentences in the source language, where $x_j \in \mathcal{X}$ and $M_x$ is the size of the set. Our objective is to obtain a translation model $f: \mathcal{X} \rightarrow \mathcal{Y}$ that can translate sentences from language X to language Y.

Identification of Uncertain Data
Data Complexity. According to Zhou et al. (2019), the complexity of a parallel corpus can be measured by summing the translation uncertainty of all source sentences. Formally, the translation uncertainty of a source sentence x with its translation candidates can be operationalized as conditional entropy:

$$H(\mathbf{y} \mid \mathbf{x} = x) = -\sum_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x} = x) \log p(\mathbf{y} \mid \mathbf{x} = x), \quad (1)$$

which is approximated at the word level as

$$H(\mathbf{y} \mid \mathbf{x} = x) \approx \frac{1}{T_x} \sum_{t=1}^{T_x} H(y \mid x = x_t), \quad (2)$$

where $T_x$ denotes the length of the source sentence, and $x$ and $y$ represent a word in the source and target vocabularies, respectively. Generally, a high $H(\mathbf{y} \mid \mathbf{x} = x)$ indicates that a source sentence x has more possible translation candidates. Equation (2) estimates the translation uncertainty of a source sentence with all possible translation candidates in the parallel corpus. It cannot be directly applied to the sentences in monolingual data due to the lack of corresponding translation candidates. One potential solution is to use a trained model to generate multiple translation candidates. However, generation may lead to biased estimation due to the generation diversity issue (Shu et al., 2019). More importantly, generation is extremely time-consuming for large-scale monolingual data.
Monolingual Uncertainty. To address the problem, we modify Equation (2) to reflect the uncertainty of monolingual sentences. We estimate the target word distribution conditioned on each source word from the authentic parallel corpus, and then use this distribution to measure the translation uncertainty of a monolingual example. Specifically, we measure the uncertainty of monolingual sentences based on a bilingual dictionary. For a given monolingual sentence $x_j \in \mathcal{M}_x$, its uncertainty U is calculated as:

$$U(x_j) = \frac{1}{T_x} \sum_{t=1}^{T_x} H(y \mid A_b, x = x_t), \quad (3)$$

which is normalized by $T_x$ to avoid length bias. A higher value of U indicates a higher translation uncertainty of the monolingual sentence. In Equation (3), the word-level entropy $H(y \mid A_b, x = x_t)$ captures the translation modalities of each source word by using the bilingual dictionary $A_b$. The bilingual dictionary records all the possible target words for each source word, as well as their translation probabilities. It can be built from word alignments produced by external alignment toolkits on the authentic parallel corpus. For example, given a source word x with three word translations $y_1$, $y_2$ and $y_3$ and translation probabilities $p(y_1|x)$, $p(y_2|x)$ and $p(y_3|x)$, respectively, the word-level entropy is:

$$H(y \mid A_b, x) = -\sum_{k=1}^{3} p(y_k|x) \log p(y_k|x). \quad (4)$$
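As a concrete illustration, the word-level entropy of Equation (4) and the sentence-level uncertainty of Equation (3) can be sketched in a few lines of Python (the toy dictionary and probabilities below are illustrative, not taken from the paper):

```python
import math

def word_entropy(translations):
    """Entropy over a source word's target translation distribution,
    as in Equation (4). `translations` maps target words to p(y|x)."""
    return -sum(p * math.log(p) for p in translations.values() if p > 0)

def sentence_uncertainty(tokens, dictionary):
    """Average per-word translation entropy, normalized by sentence
    length as in Equation (3). Unknown words contribute zero entropy."""
    if not tokens:
        return 0.0
    total = sum(word_entropy(dictionary.get(t, {})) for t in tokens)
    return total / len(tokens)

# Toy bilingual dictionary: an ambiguous word with three equiprobable
# translations has entropy log(3); a near-deterministic word is close to 0.
dictionary = {
    "bank": {"Bank": 1/3, "Ufer": 1/3, "Damm": 1/3},
    "the":  {"die": 0.9, "der": 0.1},
}
u = sentence_uncertainty(["the", "bank"], dictionary)
```

A sentence dominated by ambiguous words like "bank" would thus receive a higher sampling score than one made of near-deterministic function words.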

Experimental Setup
Data. We conducted experiments on two large-scale benchmark translation datasets, i.e., WMT English⇒German (En⇒De) and WMT English⇒Chinese (En⇒Zh).
The authentic parallel data for the two tasks consists of about 36.8M and 22.1M sentence pairs, respectively. The monolingual data we used is from newscrawl released by WMT2020.
We combined the newscrawl data from year 2011 to 2019 for the English monolingual corpus, consisting of about 200M sentences. We randomly sampled 40M monolingual data for En⇒De and 20M for En⇒Zh unless otherwise stated. We adopted newstest2018 as the validation set and used newstest2019/2020 as the test sets. For each language pair, we applied Byte Pair Encoding (BPE, Sennrich et al., 2016b) with 32K merge operations.
Model. We chose the state-of-the-art TRANSFORMER (Vaswani et al., 2017) network as our model, which consists of a 6-layer encoder and a 6-layer decoder. We adopted the open-source toolkit Fairseq (Wu et al., 2019a) for implementation. We used 16 Nvidia V100 GPUs to conduct the experiments and selected the final model by the best perplexity on the validation set.

Effect of Uncertain Data
First of all, we investigated the effect of monolingual data uncertainty on self-training performance in NMT. We conducted preliminary experiments on the WMT En⇒De dataset with the TRANSFORMER-BASE model. We sampled 8M bilingual sentence pairs from the authentic parallel data and randomly sampled 40M monolingual sentences for self-training. To ensure the quality of the synthetic parallel data, we trained a TRANSFORMER-BIG model for translating the source monolingual data into the target language. We generated translations using beam search with beam width 5, and followed Edunov et al. (2018) to filter the generated sentence pairs (see Appendix A.1).
Self-training vs. Data Size. We first examined the performance of standard self-training and its relationship with data size. Figure 1 shows the results. Self-training with 8M synthetic data already improves NMT performance by a significant margin (36.2 averaged BLEU score on WMT En⇒De newstest2019 and newstest2020). However, increasing the size of added monolingual data brings little additional benefit: with all 40M monolingual sentences, the final performance reaches only 36.5 BLEU points. This indicates that simply adding more monolingual data is not a promising way to improve self-training, and more sophisticated approaches for exploiting the monolingual data are desired.
Self-training vs. Uncertainty. In this experiment, we first adopted fast-align to establish word alignments between source and target words in the authentic parallel corpus and used the alignments to build the bilingual dictionary $A_b$. Then we used the bilingual dictionary to compute the data uncertainty in Equation (3) for the sentences in the monolingual data. After that, we ranked all 40M monolingual sentences and grouped them into 5 equally sized bins (i.e., 8M sentences per bin) according to their uncertainty scores. Finally, we performed self-training with each bin of monolingual data.
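The ranking-and-binning step can be sketched as follows (a simplified stand-in for the actual preprocessing; the function name is ours):

```python
def uncertainty_bins(sentences, scores, n_bins=5):
    """Rank sentences by uncertainty score and split them into equally
    sized bins: bin 1 holds the most certain sentences, bin n_bins the
    most uncertain. Any remainder after integer division is dropped,
    mirroring the equally-sized 8M-per-bin setup."""
    order = sorted(range(len(sentences)), key=lambda i: scores[i])
    size = len(sentences) // n_bins
    return [[sentences[i] for i in order[b * size:(b + 1) * size]]
            for b in range(n_bins)]
```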
We report the translation performance in Figure 2. As seen, performance improves with increasing monolingual data uncertainty (e.g., bins 1 to 4) until the last bin. The last bin consists of sentences with excessively high uncertainty, which may contain erroneous synthetic target sentences. Training on these sentences forces the models to over-fit on the incorrect synthetic data, resulting in the confirmation bias issue (Arazo et al., 2020). These results corroborate prior studies (Chang et al., 2017; Mukherjee and Awadallah, 2020) showing that learning on overly certain examples brings little gain, while learning on excessively uncertain examples may even hurt model training.

Linguistic Properties of Uncertain Data
We further analyzed the differences between the monolingual sentences with varied uncertainty to gain a deeper understanding of the uncertain data. Specifically, we performed linguistic analysis on the five data bins in terms of three properties: 1) sentence length that counts the tokens in the sentence, 2) word rarity (Platanios et al., 2019) that measures the frequency of words in a sentence with a higher value indicating a more rare sentence, and 3) translation coverage (Khadivi and Ney, 2005) that measures the ratio of source words being aligned with any target words. The first two reflect the properties of monolingual sentences while the last one reflects the quality of synthetic sentence pairs. We also presented the results of the synthetic target sentences for reference. Details of the linguistic properties are in Appendix A.2.
The results are reported in Figure 3. For the length property, we find that monolingual sentences with higher uncertainty are usually longer, except for those with excessively high uncertainty (e.g., bin 5). As shown in Figure 3(b), the monolingual sentences in the last data bin contain noticeably more rare words than the other bins, and such rare words pose a great challenge to NMT training. In Figure 3(c), the overall coverage in bin 5 is the lowest among the self-training bins; in contrast, bin 1, with the lowest uncertainty, has the highest coverage. These observations suggest that monolingual sentences in bin 1 indeed contain the easiest patterns, while those in bin 5 contain rare, poorly covered patterns whose synthetic translations are of relatively low quality.

Exploiting Monolingual Uncertainty
By analyzing the effect of monolingual data uncertainty on self-training in Section 2, we observed that monolingual sentences with relatively high uncertainty are more informative while still yielding high-quality synthetic pairs, which motivates us to emphasize training on these sentences. In this section, we introduce the uncertainty-based sampling strategy for self-training and the overall framework.

Uncertainty-based Sampling Strategy
With the aforementioned measure of monolingual data uncertainty in Section 2.1, we propose an uncertainty-based sampling strategy for self-training, which prefers to sample monolingual sentences with relatively high uncertainty.
To ensure data diversity and avoid the risk of being dominated by excessively uncertain sentences, we sample monolingual sentences according to the uncertainty distribution with the highest uncertainty penalized. Specifically, given a budget of $N_s$ sentences to sample, we set two hyper-parameters to control the sampling probability as follows:

$$P(x_j) \propto \hat{U}(x_j)^{\beta}, \quad \hat{U}(x_j) = \begin{cases} U(x_j), & U(x_j) \le U_{max} \\ \alpha \cdot U_{max}, & U(x_j) > U_{max} \end{cases} \quad (5)$$

where $\alpha \in (0, 1)$ penalizes excessively high uncertainty over a maximum uncertainty threshold $U_{max}$ (see Figure 4(a)), and the power rate $\beta$ adjusts the distribution such that a larger $\beta$ gives more probability mass to sentences with high uncertainty (see Figure 4(b)). The maximum uncertainty threshold $U_{max}$ is set to the uncertainty value below which R% of the sentences in the authentic parallel corpus fall. R is assumed to be as high as 80 to 100, because monolingual sentences with uncertainty above this threshold may not be translated correctly by the "teacher" model, as the authentic parallel data contains too few such sentences for the model to learn from. As a result, monolingual sentences with uncertainty higher than $U_{max}$ are penalized in terms of sampling probability.
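A minimal sketch of the sampling step under one consistent reading of the description above (the exact penalty form, the default α value, and the helper names are our assumptions, not the paper's released implementation):

```python
import random

def sample_by_uncertainty(sentences, scores, n_sample, u_max,
                          alpha=0.5, beta=2.0, seed=0):
    """Sample without replacement with probability proportional to a
    penalized, power-adjusted uncertainty score:
      - uncertainty above u_max is clipped down to alpha * u_max
        (the penalty on excessively uncertain sentences),
      - beta > 1 shifts probability mass toward high-uncertainty sentences.
    """
    weight = [(u if u <= u_max else alpha * u_max) ** beta for u in scores]
    rng = random.Random(seed)
    chosen, pool = [], list(range(len(sentences)))
    for _ in range(min(n_sample, len(pool))):
        # weighted draw over the remaining pool (O(n) per draw; fine for a sketch)
        idx = rng.choices(range(len(pool)),
                          weights=[weight[i] for i in pool], k=1)[0]
        chosen.append(sentences[pool.pop(idx)])
    return chosen
```

Sentences with zero uncertainty receive zero weight and are never drawn, while a larger β sharpens the preference for uncertain sentences.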
Overall Framework. Figure 5 presents the framework of our uncertainty-based sampling for self-training, which includes four steps: 1) train a "teacher" NMT model and an alignment model on the authentic parallel data simultaneously; 2) extract the bilingual dictionary from the alignment model and perform uncertainty-based sampling over the monolingual sentences; 3) use the "teacher" NMT model to translate the sampled monolingual sentences to construct the synthetic parallel data; 4) train a "student" NMT model on the combination of synthetic and authentic parallel data.
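The four steps above can be sketched as a single pipeline, with the teacher model, dictionary extraction, and sampler passed in as callables so that each stage stays explicit (all names here are illustrative, not from the paper; step 1, training the teacher and alignment models, is assumed to have produced the callables):

```python
def self_training_pipeline(bitext, mono, teacher_translate,
                           extract_dictionary, score_fn, sample_fn, n_sample):
    """High-level sketch of the uncertainty-based self-training framework."""
    # Step 2: extract the bilingual dictionary, score every monolingual
    # sentence, and sample the most uncertain ones.
    dictionary = extract_dictionary(bitext)
    scores = [score_fn(x, dictionary) for x in mono]
    sampled = sample_fn(mono, scores, n_sample)
    # Step 3: the teacher translates the sampled sentences into
    # synthetic (source, target) pairs.
    synthetic = [(x, teacher_translate(x)) for x in sampled]
    # Step 4: the student trains on authentic + synthetic data.
    return bitext + synthetic
```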

Constrained Scenario
We first validated the proposed sampling approach in a constrained scenario, where we followed the experimental configuration in Section 2.3 with the TRANSFORMER-BASE model, the 8M bitext, and the 40M monolingual data. This allows efficient evaluation of our approach with varied combinations of hyper-parameters, as well as comparison with related methods. Specifically, we performed our approach by sampling 8M sentences from the 40M monolingual data and then combining the corresponding 8M synthetic data with the 8M bitext to train the TRANSFORMER-BASE model. Table 1 reports the impact of β and R on the BLEU score. As shown, sampling high-uncertainty sentences while penalizing those with excessively high uncertainty improves translation performance from 36.6 to 36.9. In these experiments, the uncertainty thresholds U_max for penalizing are 2.90 and 2.74, determined by the 90% and 80% most certain sentences in the authentic parallel data, respectively (R=90 and 80 in Table 1). The proposed uncertainty-based sampling strategy achieves the best performance with R at 90 and β at 2. In the following experiments, we use R = 90 and β = 2 as the default setting for our sampling strategy unless otherwise stated.
Effect of Sampling. One may be concerned that the final translation quality is affected by the quality of the teacher model: translations of high-uncertainty sentences could contain many errors, so results with oracle translations are needed to separate the effect of sampling from the quality of the pseudo-sentences. To address this concern, we still used the aforementioned 8M bitext as the bilingual data, and used the rest of the WMT19 En-De data (28.8M) as the held-out data (with oracle translations) for sampling. The results are listed in Table 2. Clearly, our uncertainty-based sampling strategy (UNCSAMP) outperforms the random sampling strategy (RANDSAMP) when manual translations are used (Rows 2 vs. 3), demonstrating the effectiveness of our uncertainty-based sampling. Another interesting finding is that using the pseudo-sentences outperforms using the manual translations (Rows 4 vs. 2, 5 vs. 3). One possible reason is that the TRANSFORMER-BIG model used to construct the pseudo-sentences was trained on the whole WMT19 En-De data, which contains the held-out data; this serves as self-training and decently improves the supervised baseline (He et al., 2019).
Comparison with Related Work. We compared our sampling approach with two related works, i.e., difficult words by frequency (DWF, Fadaee and Monz, 2018) and source language model (SRCLM, Lewis, 2010). The former was proposed for monolingual data selection for back-translation, in which sentences with low-frequency words are selected to boost the performance of back-translation. The latter was proposed for in-domain data selection with in-domain language models. Details of the implementation of related work are in Appendix A.3. As seen, DWF performs worse than RANDSAMP, suggesting that a data selection technique developed for back-translation may not work for self-training. As for SRCLM, it achieves a marginal improvement over RANDSAMP. The proposed UNCSAMP approach outperforms the baseline RANDSAMP by +0.7 BLEU points, which demonstrates the effectiveness of our approach. In addition to our UNCSAMP approach, we also utilized an N-gram language model at the target side to further filter out synthetic data with potentially erroneous target sentences. By filtering out 20% of the sampled 8M sentences, our UNCSAMP approach achieves a further improvement of up to +0.9 BLEU points.

Unconstrained Scenario
We extended our sampling approach to the unconstrained scenario, where the scale of data and the capacity of the NMT models for self-training are increased significantly. We conducted experiments on the high-resource En⇒De and En⇒Zh translation tasks with all the authentic parallel data, i.e., 36.8M sentence pairs for En⇒De and 22.1M for En⇒Zh. For monolingual data, we considered all 200M English newscrawl sentences when sampling. We trained the TRANSFORMER-BIG model for these experiments. Table 4 lists the main results of large-scale self-training on high-resource language pairs. As shown, our TRANSFORMER-BIG models trained on the authentic parallel data achieve performance competitive with or even better than the submissions to the WMT competitions. On top of such strong baselines, self-training with RANDSAMP improves the performance by +2.0 and +0.9 BLEU points on the En⇒De and En⇒Zh tasks respectively, demonstrating the effectiveness of large-scale self-training for NMT models. With our uncertainty-based sampling strategy UNCSAMP, self-training achieves a further significant improvement of +1.1 and +0.6 BLEU points over random sampling, which demonstrates the effectiveness of exploiting uncertain monolingual sentences.

Analysis
In this section, we conducted analyses to understand how the proposed uncertainty-based sampling approach improved the translation performance. Concretely, we analyzed the translation outputs of WMT En⇒De newstest2019 from the TRANSFORMER-BIG model in Table 4.
Uncertain Sentences. As we propose to enhance high uncertainty sentences in self-training, one remaining question is whether our UNCSAMP approach improves the translation quality of high uncertainty sentences. Specifically, we ranked the source sentences in the newstest2019 by the monolingual uncertainty, and divided them into three equally sized groups, namely Low, Medium and High uncertainty.
The translation performance on these three groups is reported in Table 5. The first observation is that sentences with high uncertainty have relatively low BLEU scores (i.e., 31.0), indicating that source sentences with higher uncertainty are more difficult for NMT models to decode correctly. Our UNCSAMP approach improves the translation performance on all sentences, especially on the sentences with high uncertainty (+10.9%), which confirms our motivation of emphasizing the learning on uncertain sentences for self-training.
Low-Frequency Words. Partially motivated by Fadaee and Monz (2018), we hypothesized that the addition of monolingual data in self-training has the potential to improve the prediction of low-frequency words at the target side. Therefore, we investigated whether our approach further boosts the prediction of low-frequency words. We calculated the word accuracy of the translation outputs with respect to the reference in newstest2019 using compare-mt. Following prior work, we divided words into three categories based on their frequency: High, the 3,000 most frequent words; Medium, the 3,001st to 12,000th most frequent words; Low, all other words. Table 6 lists the word accuracy on these three groups evaluated by F-measure. First, we observe that low-frequency words in BITEXT are more difficult to predict than medium- and high-frequency words (i.e., 52.3 vs. 65.2 and 70.3), which is consistent with Fadaee and Monz (2018). Second, adding monolingual data by self-training improves the prediction of low-frequency words, and our UNCSAMP approach outperforms RANDSAMP significantly on them. These results suggest that emphasizing the learning on uncertain monolingual sentences also brings additional benefits for the learning of low-frequency words at the target side.
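A simplified stand-in for this bucketed word-accuracy evaluation (compare-mt computes it more carefully; the bag-of-words F-measure below only illustrates the frequency-bucketing logic, and all names are ours):

```python
from collections import Counter

def bucket(word, freq_rank):
    """Assign a word to a High/Medium/Low frequency bucket by its
    0-based frequency rank in the training data; unseen words are Low."""
    r = freq_rank.get(word)
    if r is None:
        return "Low"
    if r < 3000:
        return "High"
    if r < 12000:
        return "Medium"
    return "Low"

def fmeasure_by_bucket(hyps, refs, freq_rank):
    """Per-bucket F1 of hypothesis words against reference words,
    counting bag-of-words overlap per sentence pair."""
    tp, hyp_n, ref_n = Counter(), Counter(), Counter()
    for hyp, ref in zip(hyps, refs):
        h, r = Counter(hyp), Counter(ref)
        for w in h | r:
            b = bucket(w, freq_rank)
            tp[b] += min(h[w], r[w])
            hyp_n[b] += h[w]
            ref_n[b] += r[w]
    out = {}
    for b in ("High", "Medium", "Low"):
        p = tp[b] / hyp_n[b] if hyp_n[b] else 0.0
        rec = tp[b] / ref_n[b] if ref_n[b] else 0.0
        out[b] = 2 * p * rec / (p + rec) if p + rec else 0.0
    return out
```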

Related Work
Synthetic Parallel Data. Data augmentation with synthetic parallel data has been the most simple and effective way to utilize monolingual data for NMT, which can be achieved by self-training (He et al., 2019) and back-translation (Sennrich et al., 2016a). While back-translation has dominated the NMT area for years (Fadaee and Monz, 2018; Edunov et al., 2018; Caswell et al., 2019), recent works on translationese (Marie et al., 2020; Graham et al., 2019) suggest that NMT models trained with back-translation may lead to distortions in automatic and human evaluation. To address the problem, starting from WMT2019 (Barrault et al., 2019), the test sets only include naturally occurring text at the source side, which is a more realistic scenario for practical translation usage. In this new testing setup, forward-translation (Zhang and Zong, 2016), i.e., self-training in NMT, becomes a more promising method as it also introduces naturally occurring text at the source side. Therefore, we focus on the data sampling strategy in the self-training scenario, which differs from these prior studies.
Data Uncertainty in NMT. Data uncertainty in NMT has been investigated in recent years. Prior work analyzed NMT models under data uncertainty by observing its effect on model fitting and beam search. Other studies computed the data uncertainty of back-translation data and of the authentic parallel data, and proposed uncertainty-aware training strategies to improve model performance. Wei et al. (2020) proposed an uncertainty-aware semantic augmentation method to bridge the discrepancy of the data distribution between the training and inference phases. In this work, we explore monolingual data uncertainty to perform data sampling for self-training in NMT.

Conclusion
In this work, we demonstrate the necessity of distinguishing monolingual sentences for self-training in NMT, and propose an uncertainty-based sampling strategy for monolingual data. By sampling monolingual data with relatively high uncertainty, our method significantly outperforms random sampling on the large-scale WMT English⇒German and English⇒Chinese datasets. Further analyses demonstrate that our uncertainty-based sampling approach does improve the translation quality of high-uncertainty sentences and also benefits the prediction of low-frequency words at the target side. The proposed technology has been applied to TranSmart (Huang et al., 2021), an interactive machine translation system at Tencent, to improve the performance of its core translation engine. Future work includes investigating the confirmation bias issue of self-training and the effect of decoding strategies on self-training sampling.
The "teacher" NMT model for self-training is the TRANSFORMER-BIG model, to ensure the quality of the synthetic data.

A.2 Linguistic Properties
Word Rarity. Word rarity measures the frequency of words in a sentence, with a higher value indicating a rarer sentence (Platanios et al., 2019). The word rarity of a sentence is calculated as follows:

$$\text{Rarity}(x) = -\frac{1}{T_x} \sum_{t=1}^{T_x} \log p(x_t),$$

where $p(x_t)$ denotes the normalized frequency of word $x_t$ in the authentic parallel data, and $T_x$ is the sentence length.
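In code, word rarity amounts to an average negative log frequency (the add-one fallback for words unseen in the frequency table is our assumption, used only to keep the sketch total):

```python
import math

def word_rarity(tokens, freq):
    """Average negative log relative frequency of the sentence's words;
    a higher value indicates a rarer sentence. `freq` maps words to raw
    counts from the authentic parallel data."""
    total = sum(freq.values())
    return -sum(math.log(freq.get(t, 1) / total) for t in tokens) / max(len(tokens), 1)
```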
Coverage. Coverage measures the ratio of source words being aligned with any target words (Tu et al., 2016). First, we trained an alignment model on the authentic parallel data with fast-align. Then we used the alignment model to force-align the monolingual sentences and the synthetic target sentences. Next, we calculated the coverage of each source sentence and reported the averaged coverage of each data bin. The lower coverage of monolingual sentences in bin 5 indicates that they are not aligned as well as the other bins.

A.3 Comparison with Related Work
We compared our sampling approach with two related works, i.e., difficult words by frequency (DWF, Fadaee and Monz, 2018) and source language model (SRCLM, Lewis, 2010). The former was proposed for monolingual data selection for back-translation, in which sentences with low-frequency words are selected to boost the performance of back-translation. The latter was proposed for in-domain data selection with in-domain language models. For DWF, we ranked the monolingual data by word rarity (Platanios et al., 2019) and likewise selected the top 8M monolingual sentences for self-training. For SRCLM, we trained an N-gram language model (Heafield, 2011) on the source sentences in the bitext and measured the distance between each monolingual sentence and the bitext source sentences by cross-entropy. Similarly, we selected the 8M monolingual sentences with the lowest cross-entropy for self-training.
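A sketch of the SRCLM selection criterion, using an add-one-smoothed unigram model as a stand-in for the actual KenLM N-gram model (the function names and the unigram simplification are ours):

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Add-one-smoothed unigram LM over tokenized sentences; returns a
    function mapping a word to its smoothed probability."""
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(sent, lm):
    """Per-word cross-entropy of a sentence under the LM."""
    return -sum(math.log(lm(w)) for w in sent) / max(len(sent), 1)

def select_in_domain(mono, bitext_src, k):
    """Keep the k monolingual sentences closest to the bitext source
    side, i.e. with the lowest LM cross-entropy."""
    lm = train_unigram(bitext_src)
    return sorted(mono, key=lambda s: cross_entropy(s, lm))[:k]
```

Sentences sharing vocabulary with the bitext source side score a lower cross-entropy and are therefore selected first.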