Confidence-Aware Scheduled Sampling for Neural Machine Translation

Scheduled sampling is an effective method to alleviate the exposure bias problem of neural machine translation. It simulates the inference scene by randomly replacing ground-truth target input tokens with predicted ones during training. Despite its success, its critical schedule strategies are merely based on training steps, ignoring the real-time model competence, which limits its potential performance and convergence speed. To address this issue, we propose confidence-aware scheduled sampling. Specifically, we quantify real-time model competence by the confidence of model predictions, based on which we design fine-grained schedule strategies. In this way, the model is exactly exposed to predicted tokens for high-confidence positions and still ground-truth tokens for low-confidence positions. Moreover, we observe vanilla scheduled sampling suffers from degenerating into the original teacher forcing mode since most predicted tokens are the same as ground-truth tokens. Therefore, under the above confidence-aware strategy, we further expose more noisy tokens (e.g., wordy and incorrect word order) instead of predicted ones for high-confidence token positions. We evaluate our approach on the Transformer and conduct experiments on large-scale WMT 2014 English-German, WMT 2014 English-French, and WMT 2019 Chinese-English. Results show that our approach significantly outperforms the Transformer and vanilla scheduled sampling on both translation quality and convergence speed.

Generally, NMT models are trained to maximize the likelihood of the next token given previous golden tokens as inputs, i.e., teacher forcing (Goodfellow et al., 2016). However, at the inference stage, golden tokens are unavailable, and the model is exposed to an unseen data distribution generated by itself. This discrepancy between training and inference is known as the exposure bias problem (Ranzato et al., 2016).
Many techniques have been proposed to alleviate the exposure bias problem. To our knowledge, they mainly fall into two categories. The first is sentence-level training, which treats a sentence-level metric (e.g., BLEU) as a reward and directly maximizes the expected rewards of generated sequences (Ranzato et al., 2016; Shen et al., 2016; Rennie et al., 2017). Although intuitive, these methods generally suffer from slow and unstable training due to the high variance of policy gradients and the credit assignment problem (Sutton, 1984; Liu et al., 2018; Wang et al., 2018). The other category is sampling-based approaches, which aim to simulate the inference-time data distribution during training. Scheduled sampling (Bengio et al., 2015) is a representative method, which samples tokens between golden references and model predictions with a scheduled probability. Zhang et al. (2019) further refine the sampling space of scheduled sampling with predictions from beam search. Mihaylova and Martins (2019) and Duckworth et al. (2019) extend scheduled sampling to the Transformer with a novel two-pass decoding architecture.
Although these sampling-based approaches have been shown effective, most of them schedule the sampling probability based on training steps. We argue this schedule strategy has the following two limitations: 1) it is far from exactly reflecting the real-time model competence; 2) it is only based on training steps and treats all token positions equally, which is too coarse-grained to guide the sampling selection for each target token. These two limitations yield an inadequate and inefficient schedule strategy, which hinders the potential performance and convergence speed of vanilla scheduled sampling-based approaches.
To address these issues, we propose confidence-aware scheduled sampling. Specifically, we take the model prediction confidence as an assessment of real-time model competence, based on which we design fine-grained schedule strategies. Namely, we sample predicted tokens as target inputs for high-confidence positions and still use ground-truth tokens for low-confidence positions. In this way, the NMT model is exactly exposed to corresponding tokens according to its real-time competence rather than coarse-grained predefined patterns. Additionally, we observe that most predicted tokens are the same as ground-truth tokens due to teacher forcing 1, degenerating scheduled sampling to the original teacher forcing mode. Therefore, we further expose more noisy tokens (e.g., wordy and incorrect word order) (Meng et al., 2020) instead of predicted ones for high-confidence token positions. Experimentally, we evaluate our approach on the Transformer (Vaswani et al., 2017) and conduct experiments on large-scale WMT 2014 English-German (EN-DE), WMT 2014 English-French (EN-FR), and WMT 2019 Chinese-English (ZH-EN).
The main contributions of this paper can be summarized as follows 2: • To the best of our knowledge, we are the first to propose confidence-aware scheduled sampling for NMT, which exactly samples corresponding tokens according to the real-time model competence rather than coarse-grained predefined patterns.
• We further explore to sample more noisy tokens for high-confidence token positions, preventing scheduled sampling from degenerating into the original teacher forcing mode.
• Our approach significantly outperforms the Transformer by 1.01, 1.03, and 0.98 BLEU and outperforms the stronger scheduled sampling by 0.51, 0.41, and 0.58 BLEU on EN-DE, EN-FR, and ZH-EN, respectively. Our approach speeds up model convergence about 3.0× faster than the Transformer and about 1.8× faster than vanilla scheduled sampling.

1 We observe that about 70% of tokens are correctly predicted in WMT14 EN-DE.
2 Code is available at https://github.com/Adaxry/conf_aware_ss4nmt.

Figure 1: The two-pass decoding architecture. The encoder (self-attention and feed-forward layers) reads the source input; the first-pass decoder (self-attention, cross-attention, and feed-forward layers) takes golden target inputs and produces output probabilities; the second-pass decoder takes mixed target inputs (golden + predictions) and produces the output probabilities used for training.
• Extensive analyses indicate the effectiveness and superiority of our approach on longer sentences. Moreover, our approach can facilitate the training of the Transformer model with deeper decoder layers.

Neural Machine Translation
Given a source sentence X = {x_1, x_2, ..., x_m} with m tokens and a target sentence Y = {y_1, y_2, ..., y_n} with n tokens, neural machine translation models the following translation probability:

P(Y | X, θ) = ∏_{t=1}^{n} P(y_t | y_{<t}, X, θ),   (1)

where t is the index of target tokens, y_{<t} is the partial translation before y_t, and θ denotes the model parameters. In the training stage, y_{<t} are ground-truth tokens, and this procedure is also known as teacher forcing. The translation model is generally trained with maximum likelihood estimation (MLE).
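As a minimal numeric sketch of the MLE/teacher-forcing objective above, the loss is just the negative log-likelihood of each gold token given the gold prefix; the per-step distributions below are invented for the example:

```python
import numpy as np

def teacher_forcing_nll(probs, target):
    """Negative log-likelihood of a target sequence under teacher forcing.

    probs: (n, V) array, probs[t] being the model's distribution over a
    V-token vocabulary at step t, conditioned on the gold prefix y_<t.
    target: length-n sequence of gold token ids.
    """
    return -sum(np.log(probs[t, y]) for t, y in enumerate(target))

# Toy example: 3 decoding steps, vocabulary of size 4.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.10, 0.80, 0.05, 0.05],
                  [0.25, 0.25, 0.25, 0.25]])
loss = teacher_forcing_nll(probs, [0, 1, 2])
```

Minimizing this quantity over the training corpus is exactly maximizing the product in Equation (1), since the log of the product decomposes into a sum over positions.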

Scheduled Sampling for the Transformer
Scheduled sampling was initially designed for Recurrent Neural Networks (Bengio et al., 2015), and further modifications are needed when it is applied to the Transformer (Mihaylova and Martins, 2019; Duckworth et al., 2019). As shown in Figure 1, we follow the two-pass decoding architecture. In the first pass, the model behaves exactly as a standard NMT model, and its predictions are used to simulate the inference scene. In the second pass, the decoder inputs ỹ_{<t} are sampled from the predictions of the first pass and the ground-truth tokens with a certain probability. Finally, the predictions of the second pass are used to calculate the cross-entropy loss, and Equation (1) is modified as follows:

P(Y | X, θ) = ∏_{t=1}^{n} P(y_t | ỹ_{<t}, X, θ),   (2)

where ỹ_{<t} denotes the mixed target inputs. Note that the two decoders are identical and share the same parameters. At inference, only the first decoder is used, which is just the standard Transformer. How to schedule the above probability of sampling tokens is the key point, and it is exactly what we aim to improve in this paper.
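The second-pass input mixing can be sketched as follows, assuming per-position Bernoulli sampling with probability p_golden of keeping the ground-truth token (the token lists are hypothetical):

```python
import random

def scheduled_sampling_inputs(golden, predicted, p_golden, rng):
    """Build second-pass decoder inputs: at each position, keep the golden
    token with probability p_golden, otherwise take the first-pass prediction."""
    return [g if rng.random() < p_golden else p
            for g, p in zip(golden, predicted)]

rng = random.Random(0)
golden = ["the", "cat", "sat", "down"]
predicted = ["the", "dog", "sat", "up"]  # hypothetical first-pass output
mixed = scheduled_sampling_inputs(golden, predicted, 0.5, rng)
```

With p_golden = 1.0 this reduces to plain teacher forcing; with p_golden = 0.0 the second pass runs entirely on the model's own first-pass predictions.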

Decay Strategies on Training Steps
Existing schedule strategies are based on training steps (Bengio et al., 2015). As the number of training steps i increases, the model should be exposed to its own predictions more frequently. At the i-th training step, the probability f(i) of sampling golden tokens is calculated as follows:
• Linear Decay: f(i) = max(ε, k·i + b), where ε is the minimum sampling probability, and k < 0 and b are respectively the slope and offset of the decay.
• Exponential Decay: f(i) = k^i, where k < 1 is the radix that adjusts the sharpness of the decay.
• Inverse Sigmoid Decay: f(i) = k / (k + e^{i/k}), where e is the mathematical constant, and k ≥ 1 is a hyperparameter that adjusts the sharpness of the decay.
We provide visual examples of the different decay strategies in Figure 2.
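For concreteness, the three decay strategies can be sketched as simple Python functions; the hyperparameter values (k, b, ε) below are illustrative defaults, not the ones used in the paper:

```python
import math

def linear_decay(i, k=-1e-5, b=1.0, eps=0.1):
    """Linear decay: f(i) = max(eps, k*i + b), with slope k < 0 and offset b."""
    return max(eps, k * i + b)

def exponential_decay(i, k=0.9999):
    """Exponential decay: f(i) = k**i, with radix k < 1."""
    return k ** i

def inverse_sigmoid_decay(i, k=10000.0):
    """Inverse sigmoid decay: f(i) = k / (k + exp(i / k)), with k >= 1."""
    return k / (k + math.exp(i / k))
```

All three start near 1.0 (mostly golden tokens) and decay toward 0 (or the floor eps) as the training step i grows, so the model is exposed to its own predictions more often later in training.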

Approaches
In this section, we first describe how to estimate model confidence at each token position. Second, we elaborate the fine-grained schedule strategy based on model confidence. Finally, we explore sampling more noisy tokens instead of predicted tokens for high-confidence positions.

Model Confidence Estimation
We explore two approaches to estimate model confidence at each token position.
Predicted Translation Probability (PTP). Current NMT models are well-calibrated in the training setting thanks to regularization techniques (Ott et al., 2018; Müller et al., 2019). Namely, the predicted translation probability can directly serve as the model confidence. At the t-th target token position, we calculate the model confidence conf(t) as follows:

conf(t) = P(y_t | y_{<t}, X, θ).   (3)

Since we base our approach on the Transformer with two-pass decoding (Mihaylova and Martins, 2019; Duckworth et al., 2019), the above predicted translation probability can be directly obtained in the first-pass decoding (shown in Figure 1), incurring no additional computation cost.
Monte Carlo Dropout Sampling. Model confidence can also be quantified by Bayesian neural networks (Buntine and Weigend, 1991; Neal, 2012), which place distributions over the weights of neural networks. We adopt the widely used Monte Carlo dropout sampling (Gal and Ghahramani, 2016; Wang et al., 2019b) to approximate Bayesian inference. Given a batch of training data and the current NMT model parameterized by θ, we repeatedly conduct forward propagation K times. On the k-th propagation, part of the neurons θ̂^(k) in the network θ are randomly deactivated. Eventually, we obtain K sets of model parameters {θ̂^(k)}_{k=1}^{K} and the corresponding translation probabilities. We use the expectation or variance of these translation probabilities to estimate the model confidence (Wang et al., 2019b). Intuitively, a higher expectation or a lower variance of translation probabilities reflects higher model confidence. Formally, at the t-th token position, we estimate the model confidence conf(t) by the expectation of translation probabilities:

conf(t) = (1/K) ∑_{k=1}^{K} P(y_t | y_{<t}, X, θ̂^(k)).   (4)

We also use the variance of translation probabilities to estimate the model confidence as an alternative:

conf(t) = 1 − Var[{P(y_t | y_{<t}, X, θ̂^(k))}_{k=1}^{K}],   (5)

where Var[·] denotes the variance of a distribution, calculated following the setting in Wang et al. (2019b). We will further analyze the effect of different confidence estimations in Section 4.2.
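A small sketch of how the two Monte Carlo dropout estimates could be computed once the K per-position probabilities of the gold tokens have been collected; the `1 - variance` mapping for Equation (5) and the toy numbers are illustrative assumptions:

```python
import numpy as np

def mc_dropout_confidence(sampled_probs, mode="expectation"):
    """Per-position confidence from K stochastic forward passes.

    sampled_probs: (K, n) array; sampled_probs[k, t] is the probability that
    the k-th dropout-perturbed model assigns to the gold token y_t.
    """
    if mode == "expectation":
        # Eq. (4): higher mean probability -> higher confidence.
        return sampled_probs.mean(axis=0)
    if mode == "variance":
        # Eq. (5) (one plausible mapping): lower variance -> higher confidence.
        return 1.0 - sampled_probs.var(axis=0)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical probabilities from K = 3 passes over a 2-token target.
probs = np.array([[0.9, 0.2],
                  [0.8, 0.6],
                  [0.7, 0.1]])
conf_exp = mc_dropout_confidence(probs, "expectation")  # [0.8, 0.3]
conf_var = mc_dropout_confidence(probs, "variance")
```

In an actual NMT model, each row of `sampled_probs` would come from one forward pass with dropout left active at training time.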

Confidence-Aware Scheduled Sampling
The confidence score conf(t) quantifies whether the current NMT model is confident or hesitant in predicting the t-th target token. We take conf(t) as exact, real-time information to conduct a fine-grained schedule strategy in each training iteration. Specifically, a lower conf(t) indicates that the current model θ still struggles with the teacher forcing mode for the t-th target token, namely underfitting the conditional probability P(y_t | y_{<t}, X, θ). Thus we should keep feeding ground-truth tokens for learning to predict the t-th target token. Conversely, a higher conf(t) indicates that the current model θ has learned the basic conditional probability well under teacher forcing. Thus we should empower the model with the ability to cope with the exposure bias problem; namely, we take the inevitably erroneous model predictions as target inputs for learning to predict the t-th target token.
Formally, in the second-pass decoding, the above fine-grained schedule strategy is conducted at all decoding steps simultaneously:

ỹ_{t−1} = ŷ_{t−1},  if conf(t−1) > t_golden;
ỹ_{t−1} = y_{t−1},  otherwise,   (6)

where t_golden is a threshold that measures whether conf(t−1) is high enough (e.g., 0.9) to sample the predicted token ŷ_{t−1}.
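Per position, the schedule in Equation (6) reduces to a simple threshold test; a minimal sketch, with illustrative token strings and threshold values:

```python
def confidence_aware_input(golden_tok, predicted_tok, conf, t_golden=0.9):
    """Eq. (6) sketch: feed the model's own prediction only at positions
    where the model is already confident; keep the ground truth elsewhere."""
    return predicted_tok if conf > t_golden else golden_tok

mixed_hi = confidence_aware_input("cat", "dog", 0.95)  # confident -> "dog"
mixed_lo = confidence_aware_input("cat", "dog", 0.40)  # hesitant -> "cat"
```

In the two-pass setting, this test is applied to every target position at once using the confidences obtained from the first pass.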

Confidence-Aware Scheduled Sampling with Target Denoising
Considering that predicted tokens are obtained from the teacher forcing model, most predicted tokens (e.g., about 70% of tokens in WMT14 EN-DE) are the same as ground-truth tokens, which degenerates scheduled sampling to the original teacher forcing. Although Zhang et al. (2019) proposed to address this issue by using predictions from beam search, their method runs very slowly (about 4× slower than ours) due to the autoregressive property of beam search decoding. To avoid the above degeneration problem while preserving computational efficiency, we add more noisy tokens instead of predicted tokens for high-confidence positions. Inspired by Meng et al. (2020), we replace the ground-truth y_{t−1} with a random token y_rand from the current target sentence, which simulates the wordy and incorrect word order phenomena that occur at inference. Considering that y_rand is more difficult to learn than ŷ_{t−1}, we only adopt the noisy y_rand for higher-confidence positions. Therefore, the fine-grained schedule strategy in Equation (6) is extended to:

ỹ_{t−1} = y_rand,   if conf(t−1) > t_rand;
ỹ_{t−1} = ŷ_{t−1},  if t_golden < conf(t−1) ≤ t_rand;
ỹ_{t−1} = y_{t−1},  otherwise,   (7)

where t_rand is a threshold that measures whether conf(t−1) is high enough (e.g., 0.95) to sample the random target token y_rand. We provide detailed selections of t_golden and t_rand in Section 4.2.
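A sketch of the extended schedule in Equation (7); the three-way threshold test and the uniform choice of y_rand from the current target sentence follow the description above, while the concrete tokens are invented for illustration:

```python
import random

def denoising_scheduled_input(golden, predicted, conf, target_sentence,
                              t_golden=0.9, t_rand=0.95, rng=random):
    """Eq. (7) sketch, a three-way choice driven by the confidence score:
      conf > t_rand             -> y_rand, a random token of the target sentence
      t_golden < conf <= t_rand -> the model's own prediction y_hat
      otherwise                 -> the ground-truth token y
    """
    if conf > t_rand:
        return rng.choice(target_sentence)  # noisy token (wordy / reordered)
    if conf > t_golden:
        return predicted
    return golden
```

Because y_rand is drawn from the target sentence itself rather than produced by beam search, this keeps the extra noise essentially free in terms of computation.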

Experiments
We conduct experiments on three large-scale translation tasks: WMT 2014 English-German (EN-DE), WMT 2014 English-French (EN-FR), and WMT 2019 Chinese-English (ZH-EN). We build a shared source-target vocabulary for the EN-DE and EN-FR datasets, and unshared vocabularies for the ZH-EN dataset. We apply byte-pair encoding (Sennrich et al., 2016) with 32k merge operations for all datasets. More dataset statistics are listed in Table 1.

Implementation Details
Training Setup. We train the Transformer base and Transformer big models (Vaswani et al., 2017) with the open-source THUMT (Zhang et al., 2017). All Transformer models are first trained by teacher forcing for 100k steps, and then trained with the different training objectives or scheduled sampling approaches for 300k steps. All experiments are conducted on 8 NVIDIA Tesla V100 GPUs, each with a batch size of approximately 4096 tokens. We use the Adam optimizer (Kingma and Ba, 2014) with 4000 warmup steps. During training and the Monte Carlo dropout process, we set the dropout (Srivastava et al., 2014) rate to 0.1 for the Transformer base and 0.3 for the Transformer big.
Evaluation. We set the beam size to 4 and the length penalty to 0.6 during inference. We use multi-bleu.perl to calculate case-sensitive BLEU scores for WMT14 EN-DE and EN-FR, and mteval-v13a.pl for WMT19 ZH-EN. We use the paired bootstrap resampling method (Koehn, 2004) to compute the statistical significance of test results.

Hyperparameter Experiments
In this section, we elaborate the hyperparameter settings involved in our approaches according to performance on the validation set of WMT14 EN-DE, and share these settings across all WMT tasks.

Different Confidence Estimations.
In this section, we analyze the effects of the different model confidence estimations described in Section 3.1. As shown in Table 2, Monte Carlo dropout sampling based approaches (i.e., the expectation and variance of translation probabilities) achieve comparable or marginally better translation quality than PTP. However, since Monte Carlo dropout sampling requires K additional forward passes, we adopt the more efficient PTP in the following experiments. (In Table 2, 'PTP' refers to the estimation in Equation (3); 'Expectation' and 'Variance' refer to the Monte Carlo dropout sampling based estimations in Equations (4) and (5).)

Thresholds Settings. There are two important hyperparameters in our approaches, namely the two thresholds t_golden and t_rand that determine token selection in Equation (7). In our preliminary experiments, we observe that our approach is relatively insensitive to t_golden; thus we first fix t_golden to a modest value (0.5) and analyze the effect of t_rand ranging from 0.5 to 0.95. As shown by the red line in Figure 3, performance improves rapidly with the growth of t_rand, so we set t_rand to 0.95 and then analyze the effect of t_golden ranging from 0.5 to 0.95. As shown by the blue line in Figure 3, performance rises gently with the growth of t_golden and peaks at t_golden = 0.9. Thus we finally set t_golden to 0.9.

Systems
We compare against the following baselines. Note: Goodman et al. (2020) report SacreBLEU scores; for fair comparison, we re-implement their method and report BLEU scores. '*/**': significantly (Koehn, 2004) better than 'Vanilla Scheduled Sampling' with p < 0.05 and p < 0.01.
TeaForN. Teacher forcing with n-grams (Goodman et al., 2020) extends standard teacher forcing with a broader view via n-gram optimization.
Self-paced learning. Wan et al. (2020) assign confidence scores to each input to weight its loss.
Vanilla scheduled sampling. Scheduled sampling on training steps with the inverse sigmoid decay (Bengio et al., 2015).
Sampling with sentence oracles. Zhang et al. (2019) refine the sampling space of scheduled sampling with sentence oracles, i.e., predictions from beam search. Note that its sampling strategy is still based on training steps with a sigmoid decay.
Target denoising. Meng et al. (2020) add noisy perturbations into decoder inputs when training, which yields a more robust translation model against prediction errors by target denoising.
Confidence-aware scheduled sampling. Our fine-grained schedule strategy described in Equation (6) with t_golden = 0.9.
Confidence-aware scheduled sampling with target denoising. Our fine-grained schedule strategy described in Equation (7) with t_golden = 0.9 and t_rand = 0.95.

Table 4: BLEU scores (%) on the validation set of WMT14 EN-DE with different schedule strategies. 'Confidence' refers to the confidence-aware strategy in Equation (6). 'ref.' is short for the reference baseline. '*/**': significantly (Koehn, 2004) better than the Transformer base with p < 0.05 and p < 0.01.

Main Results
We list translation qualities in Table 3. Over the Transformer base baseline, our 'Confidence-aware scheduled sampling' shows consistent improvements of 0.90, 0.98, and 0.89 BLEU points on EN-DE, ZH-EN, and EN-FR, respectively. Moreover, after applying the more fine-grained strategy with target denoising, our 'Confidence-aware scheduled sampling with target denoising' achieves further improvements of 1.01, 1.03, and 0.98 BLEU points on EN-DE, ZH-EN, and EN-FR, respectively. Compared with the stronger vanilla scheduled sampling method, 'Confidence-aware scheduled sampling with target denoising' still yields improvements of 0.51, 0.57, and 0.41 BLEU points on the above three tasks, respectively. For the more powerful Transformer big, we observe similar conclusions: 'Confidence-aware scheduled sampling with target denoising' outperforms vanilla scheduled sampling by 0.47, 0.67, and 0.42 BLEU points, respectively. In summary, experiments on strong baselines and various tasks verify the effectiveness and superiority of our approaches.

Analysis and Discussion
We analyze our proposals on WMT 2014 EN-DE with the Transformer base model.

Effects of Confidence-Aware Strategies
In this section, we rigorously validate the effectiveness of confidence-aware strategies via univariate experiments whose only difference lies in the schedule strategy. As shown in Table 4, existing heuristic functions, i.e., linear, exponential, and inverse sigmoid decay, moderately improve over the Transformer base baseline by 0.46, 0.50, and 0.55 BLEU points, respectively, while our confidence-aware strategy described in Equation (6) significantly outperforms the baseline by 1.05 BLEU points. We attribute the effectiveness of the confidence-aware strategy to its exact and suitable token assignments according to real-time model competence rather than predefined patterns.

Table 5: BLEU scores (%) on the validation set of WMT14 EN-DE for ablation experiments. 'Our approach' is 'confidence-aware scheduled sampling with target denoising' in Equation (7). 'Confidence' refers to the confidence-aware strategy in Equation (7). 'Denoising' refers to the target random noise y_rand in Equation (7). 'ref.' is short for the reference baseline.

Ablation Experiments
We conduct ablation experiments to investigate the impacts of the various components of our 'Confidence-aware scheduled sampling with target denoising' (described in Equation (7)) and list the results in Table 5. Removing the confidence-aware strategy degenerates our approach into vanilla target denoising with a uniform strategy (Meng et al., 2020), which causes a noticeable drop (0.4 BLEU), indicating that the confidence-aware strategy plays the leading role in performance. On the other hand, we observe only a small drop (0.15 BLEU) when removing 'Target denoising', revealing that the additional noise plays a secondary role. Finally, ablating both the confidence-aware strategy and 'Target denoising' degenerates our approach into vanilla scheduled sampling and yields a further decrease (0.51 BLEU), suggesting that the confidence-aware strategy and 'Target denoising' are complementary to each other.

Different Numbers of Decoder Layers
As noted in existing studies (Domhan, 2018; Wang et al., 2019a), there exists a performance bottleneck at the decoder side of NMT models; namely, increasing the number of decoder layers does not bring corresponding improvements in performance. He et al. (2019) attribute this bottleneck to the fact that decoders learn an easier task than encoders. Our fine-grained schedule strategy in Equation (7) assigns a more difficult task to the decoder, which raises the question of whether it can alleviate this bottleneck. First, we fix the number of encoder layers to 6 (i.e., Encoder-6) and apply our confidence-aware schedule strategy to the Encoder-6 Transformer base with the number of decoder layers ranging from 1 to 6. As shown in Figure 4, our approach (solid red line) consistently outperforms the Encoder-6 Transformer base (dashed red line). More importantly, the improvement of the Encoder-6 Transformer base stops (i.e., the performance bottleneck) once the number of decoder layers exceeds 4, whereas our approach keeps improving with the growth of decoder layers. Moreover, we repeat the above experiments with more powerful deep encoders (Encoder-20). The performance bottleneck for the Encoder-20 Transformer base becomes more evident (dashed blue line); despite this, our approach (solid blue line) still keeps improving performance with the growth of decoder layers on the stronger Encoder-20 Transformer base.
In summary, our confidence-aware schedule strategy brings a meaningful increase in the difficulty of the decoding task, and the bottleneck at the decoder side is alleviated to a certain extent.

Effects on Different Sequence Lengths
Due to error accumulation, the exposure bias problem becomes more severe as sequence length grows (Zhou et al., 2019; 2020). Thus it is intuitive to verify the effectiveness of our approach over different sequence lengths.
Considering that the validation set of WMT14 EN-DE (3k sentences) is too small to cover various sentence lengths, we randomly select 10k training sentences with lengths from 10 to 100. As shown in Figure 5, our approach consistently outperforms the Transformer base model at all sequence lengths. Moreover, the improvement of our approach over the Transformer base grows with sentence length; specifically, we observe more than 1.0 BLEU improvement for sentence lengths in [80, 100].

Model Convergence
As aforementioned, our confidence-aware scheduled sampling learns to deal with the exposure bias problem in an efficient manner, thus speeding up model convergence. As shown in Figure 6, it costs the Transformer base 245k steps to converge to a local optimum (about 27.1 BLEU). To achieve the same performance, our confidence-aware scheduled sampling only needs 80k steps, namely about a 3.0× speed-up over the Transformer base and a 1.8× speed-up over vanilla scheduled sampling. Since vanilla scheduled sampling randomly exposes more difficult predicted tokens at each token position, regardless of the actual model competence, its convergence speed is restricted to a certain extent. On the contrary, our approach samples predicted tokens only if the current model is capable of dealing with these more difficult inputs, mimicking the learning process of humans. Therefore, our approach trains more efficiently.

Related Work
Confidence-aware Learning for NMT. As for confidence estimation in NMT, Zoph et al. (2015) frame translation as a compression game and measure the amount of information added by translators. Wang et al. (2019b) propose to quantify the confidence of NMT model predictions based on model uncertainty, which has been widely extended to select training samples (Jiao et al., 2020; Dou et al., 2020), to design confidence-aware curriculum learning, and to augment synthetic corpora (Wei et al., 2020). Model confidence also serves as a useful metric for analyzing NMT models from the perspectives of fitting and search (Ott et al., 2018), visualization (Rikters et al., 2017), and calibration (Kumar and Sarawagi, 2019). Different from existing studies, we are the first to propose confidence-aware scheduled sampling for alleviating the exposure bias problem in NMT.

Conclusion
In this paper, we propose confidence-aware scheduled sampling for NMT, which exactly samples corresponding tokens according to real-time model competence rather than human intuitions. We further explore sampling more noisy tokens for high-confidence token positions, preventing scheduled sampling from degenerating into the original teacher forcing mode. Experiments on three large-scale WMT translation tasks suggest that our approach improves over vanilla scheduled sampling in both translation quality and convergence speed. We elaborately analyze the effectiveness and efficiency of our approach from multiple aspects. As a result, we further observe that our approach 1) alleviates the performance bottleneck of decoders for NMT to a certain extent and 2) improves the translation quality of long sequences.