Scheduled Sampling Based on Decoding Steps for Neural Machine Translation

Scheduled sampling is widely used to mitigate the exposure bias problem for neural machine translation. Its core motivation is to simulate the inference scene during training by replacing ground-truth tokens with predicted tokens, thus bridging the gap between training and inference. However, vanilla scheduled sampling is merely based on training steps and equally treats all decoding steps. Namely, it simulates an inference scene with uniform error rates, which disobeys the real inference scene, where larger decoding steps usually have higher error rates due to error accumulations. To alleviate the above discrepancy, we propose scheduled sampling methods based on decoding steps, increasing the selection chance of predicted tokens with the growth of decoding steps. Consequently, we can more realistically simulate the inference scene during training, thus better bridging the gap between training and inference. Moreover, we investigate scheduled sampling based on both training steps and decoding steps for further improvements. Experimentally, our approaches significantly outperform the Transformer baseline and vanilla scheduled sampling on three large-scale WMT tasks. Additionally, our approaches also generalize well to the text summarization task on two popular benchmarks.


Introduction
Neural Machine Translation (NMT) has made promising progress in recent years (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017). Generally, NMT models are trained to maximize the likelihood of the next token given previous golden tokens as inputs, i.e., teacher forcing (Salakhutdinov, 2014). However, at the inference stage, golden tokens are unavailable, and the model is exposed to an unseen data distribution generated by itself. This discrepancy between training and inference is known as the exposure bias problem (Ranzato et al., 2016). As the number of decoding steps grows, the discrepancy becomes more problematic due to error accumulation (Zhang et al., 2020a) (shown in Figure 1).
Figure 1: The translation precision for training (blue line) and inference (red line) at each decoding step. The gap between training and inference (black line) increases rapidly with the growth of decoding steps. We randomly sample 100k training data from WMT 2014 EN-DE and report the average precision of 1k tokens for each decoding step (see footnote 1).
Footnote 1: To calculate the precision for training, we strictly match predicted tokens with ground-truth tokens word by word. At inference, we relax the strict matching to fuzzy matching within a local window of size 3, and truncate or pad hypotheses to the same length as the golden references. We also explored n-gram matching in preliminary experiments and observed analogous results with different n. For simplicity, we use the above unigram matching to calculate the translation precision (similarly for the error rate) in all experiments.
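The per-step precision behind Figure 1 can be illustrated with a minimal sketch. It assumes that a window of size 3 covers reference positions t-1, t, and t+1, and that padded positions never count as matches; the function names and the `<pad>` symbol are hypothetical and not taken from the paper.

```python
def strict_matches(hyp, ref):
    """Training-time matching: step t is correct only if hyp[t] == ref[t]."""
    return [h == r for h, r in zip(hyp, ref)]


def fuzzy_matches(hyp, ref, window=3, pad="<pad>"):
    """Inference-time matching within a local window centred on step t.

    Assumption: window=3 means reference positions t-1, t, t+1.
    The hypothesis is first truncated or padded to the reference length.
    """
    hyp = (list(hyp) + [pad] * len(ref))[:len(ref)]
    half = window // 2
    matches = []
    for t, tok in enumerate(hyp):
        lo, hi = max(0, t - half), min(len(ref), t + half + 1)
        # A padded position is never counted as a correct token.
        matches.append(tok != pad and tok in ref[lo:hi])
    return matches


def precision_per_step(match_lists):
    """Average the match indicators at each decoding step over sentences."""
    max_len = max(len(m) for m in match_lists)
    precisions = []
    for t in range(max_len):
        hits = [m[t] for m in match_lists if t < len(m)]
        precisions.append(sum(hits) / len(hits))
    return precisions


if __name__ == "__main__":
    ref = "the cat sat on the mat".split()
    hyp = "cat the sat on a mat today".split()
    print(strict_matches(hyp, ref))
    print(fuzzy_matches(hyp, ref))
```

The error rate at each step is then one minus the corresponding precision.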
Many techniques have been proposed to alleviate the exposure bias problem. To our knowledge, they mainly fall into two categories. The first is sentence-level training, which treats a sentence-level metric (e.g., BLEU) as a reward and directly maximizes the expected rewards of generated sequences (Ranzato et al., 2016; Shen et al., 2016; Rennie et al., 2017; Pang and He, 2021). Although intuitive, these methods generally suffer from slow and unstable training due to the high variance of policy gradients and the credit assignment problem (Sutton, 1984; Wiseman and Rush, 2016; Liu et al., 2018; Wang et al., 2018). The second category is sampling-based approaches, which aim to simulate the data distribution of the inference scene during training. Scheduled sampling is a representative method, which samples tokens between golden references and model predictions with a scheduled probability. Zhang et al. (2019) further refine the sampling candidates by beam search. Mihaylova and Martins (2019) and Duckworth et al. (2019) extend scheduled sampling to the Transformer with a novel two-pass decoder architecture. Liu et al. (2021) develop a more fine-grained sampling strategy according to the model confidence.
Although these sampling-based approaches have been shown to be effective and efficient to train, an essential issue remains in their sampling strategies. In the real inference scene, sequential prediction quickly accumulates errors along decoding steps, which yields higher error rates for larger decoding steps (Zhang et al., 2020a) (Figure 1). However, most sampling-based approaches are merely based on training steps and treat all decoding steps equally (see footnote 2). Namely, they simulate an inference scene with uniform error rates across decoding steps, which is inconsistent with the real inference scene.
To alleviate this inconsistency, we propose scheduled sampling methods based on decoding steps, which increase the selection chance of predicted tokens with the growth of decoding steps. In this way, we can more realistically simulate the inference scene during training, thus better bridging the gap between training and inference. Furthermore, we investigate scheduled sampling based on both training steps and decoding steps, which yields further improvements. This indicates that our proposals are complementary to existing studies. Additionally, we provide in-depth analyses on the necessity of our proposals from the perspective of translation error rates and accumulated errors. Experimentally, our approaches significantly outperform the Transformer baseline by 1.08, 1.08, and 1.27 BLEU points on WMT 2014 English-German, WMT 2014 English-French, and WMT 2019 Chinese-English, respectively. When compared with the stronger vanilla scheduled sampling method, our approaches bring further improvements of 0.58, 0.62, and 0.55 BLEU points on these WMT tasks, respectively. Moreover, our approaches generalize well to the text summarization task and achieve consistently better performance on two popular benchmarks, i.e., CNN/DailyMail (See et al., 2017) and Gigaword (Rush et al., 2015).
Footnote 2: For clarity in this paper, 'training steps' refers to the number of parameter updates and 'decoding steps' refers to the index of decoded tokens on the decoder side.
The main contributions of this paper can be summarized as follows:
• To the best of our knowledge, we are the first to propose scheduled sampling methods based on decoding steps from the perspective of simulating the distribution of real translation errors, and we provide in-depth analyses on the necessity of our proposals.
• We investigate scheduled sampling based on both training steps and decoding steps, which yields further improvements, suggesting that our proposals complement existing studies.
• Experiments on three large-scale WMT tasks and two popular text summarization tasks confirm the effectiveness and generalizability of our approaches.
• Analyses indicate our approaches can better simulate the inference scene during training and significantly outperform existing studies.

Neural Machine Translation
Given a source sentence $X = \{x_1, x_2, \cdots, x_m\}$ and a target sentence $Y = \{y_1, y_2, \cdots, y_n\}$, neural machine translation aims to model the following translation probability:
$$P(Y \mid X) = \prod_{t=1}^{n} P(y_t \mid y_{<t}, X, \theta) \quad (1)$$
where $t$ is the index of target tokens, $y_{<t}$ is the partial translation before $y_t$, and $\theta$ denotes the model parameters. In the training stage, $y_{<t}$ are ground-truth tokens, and this procedure is also known as teacher forcing. The translation model is generally trained with maximum likelihood estimation (MLE).

Scheduled Sampling for the Transformer
Scheduled sampling was initially designed for Recurrent Neural Networks, and further modifications are needed when it is applied to the Transformer (Mihaylova and Martins, 2019; Duckworth et al., 2019). As shown in Figure 2, we follow the two-pass decoder architecture for the training of Transformers. In the first pass, the model behaves the same as a standard NMT model, and its predictions are used to simulate the inference scene. In the second pass, the decoder's inputs, denoted $\tilde{y}_{<t}$, are sampled from predictions of the first pass and ground-truth tokens with a certain probability. Finally, predictions of the second pass are used to calculate the cross-entropy loss, and Equation (1) is modified as follows:
$$P(Y \mid X) = \prod_{t=1}^{n} P(y_t \mid \tilde{y}_{<t}, X, \theta)$$
Note that the two decoders are identical and share the same parameters during training. At inference, only the first decoder is used, which is just the standard Transformer. How to schedule the above probability of sampling tokens during training is the key point, and it is what we aim to improve in this paper.
Figure 2: Scheduled sampling for the Transformer with a two-pass decoder at training. The first-pass decoder takes the golden target input, the second-pass decoder takes the mixed target input (golden + predictions), and both attend to the encoded source input and produce output probabilities.
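The construction of the second-pass decoder input can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the first-pass predictions are already aligned with the golden tokens position by position, and the function name `mix_decoder_inputs` is hypothetical.

```python
import random


def mix_decoder_inputs(golden, predicted, p_golden, rng=None):
    """Build second-pass decoder inputs token by token.

    golden:    ground-truth target tokens
    predicted: first-pass predictions, assumed aligned with `golden`
    p_golden:  probability of keeping the golden token, either a single
               float or a list with one probability per position
    """
    if len(golden) != len(predicted):
        raise ValueError("golden and predicted must have the same length")
    rng = rng or random.Random()
    if isinstance(p_golden, (int, float)):
        p_golden = [float(p_golden)] * len(golden)
    mixed = []
    for gold_tok, pred_tok, p in zip(golden, predicted, p_golden):
        # Keep the golden token with probability p, otherwise the prediction.
        mixed.append(gold_tok if rng.random() < p else pred_tok)
    return mixed


if __name__ == "__main__":
    golden = ["wir", "gehen", "nach", "hause"]
    predicted = ["wir", "gehen", "zu", "haus"]
    print(mix_decoder_inputs(golden, predicted, 0.5, random.Random(0)))
```

Passing a single float gives the same probability to every position, while passing a list allows the probability to vary by position.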

Decay Strategies Based on Training Steps
Existing schedule strategies are based on training steps. At the i-th training step, the probability of sampling golden tokens $f(i)$ is calculated as follows:
• Linear Decay: $f(i) = \max(\epsilon, ki + b)$, where $\epsilon$ is the minimum value, and $k < 0$ and $b$ are respectively the slope and offset of the decay.
• Exponential Decay: $f(i) = k^{i}$, where $k < 1$ is the radix to adjust the decay.
• Sigmoid Decay: $f(i) = \frac{k}{k + e^{i/k}}$, where $e$ is the mathematical constant, and $k \geq 1$ is a hyperparameter to adjust the decay.
We draw some examples for different decay strategies based on training steps in Figure 3.
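The three decay strategies can be written as small functions. The sketch below assumes the common forms f(i) = max(ε, ki + b), f(i) = k^i, and f(i) = k / (k + e^(i/k)); the values of k in the usage lines are illustrative only, since the tuned values are listed in the paper's Table 2.

```python
import math


def linear_decay(i, k, b=1.0, eps=0.2):
    """f(i) = max(eps, k*i + b), with slope k < 0 and offset b."""
    return max(eps, k * i + b)


def exponential_decay(i, k):
    """f(i) = k**i, with radix k < 1."""
    return k ** i


def sigmoid_decay(i, k):
    """f(i) = k / (k + e^(i/k)), with k >= 1.

    Rewritten as 1 / (1 + e^x) with x = i/k - ln(k) to avoid overflow
    when i/k is large.
    """
    x = i / k - math.log(k)
    if x > 700:  # e^x would overflow a float; the limit is 0
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


if __name__ == "__main__":
    # Illustrative hyperparameters (assumptions, not the tuned values).
    for step in (0, 50_000, 100_000, 300_000):
        print(step,
              round(linear_decay(step, k=-1e-5), 3),
              round(exponential_decay(step, k=0.99999), 3),
              round(sigmoid_decay(step, k=20_000), 3))
```

All three functions return the probability of feeding the golden token, so the probability of feeding a predicted token is one minus the returned value.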

Definitions and the Overview
At the training stage, each token in the input of the second-pass decoder is either the golden token or the token predicted by the first-pass decoder. For clarity, we only define the probability of sampling golden tokens, e.g., f(i), and use 1 − f(i) to represent the probability of sampling predicted tokens. Specifically, we define the probability of sampling golden tokens as f(i) when sampling based on the training step i, as g(t) when sampling based on the decoding step t, and as h(i, t) when sampling based on both training steps and decoding steps. In this paper, whenever we mention a scheduled strategy, it refers to the probability of sampling golden tokens at the model training stage.
In this section, we firstly point out the drawback of merely sampling based on training steps. Secondly, we describe how to appropriately sample based on decoding steps. Finally, we explore whether sampling based on both training steps and decoding steps can complement each other.

Sampling Based on Training Steps
As the training step i increases, the model should be exposed to its own predictions more frequently. Thus a decay strategy for sampling golden tokens f(i) (Section 2.3) is generally used in existing studies. At a specific training step i, given a target sentence, f(i) depends only on i and applies the same sampling probability to all decoding steps. Therefore, f(i) simulates an inference scene with uniform error rates, which still leaves a gap with the real inference scene.

Sampling Based on Decoding Steps
We take a further step to bridge the gap left by f(i). Specifically, we propose sampling based on decoding steps and schedule the sampling probability g(t) under the guidance of real translation errors. As mentioned earlier (Figure 1), translation error rates grow rapidly along decoding steps in the real inference stage. To more realistically simulate such error distributions during training, we expose more model predictions at larger decoding steps and more golden tokens at smaller decoding steps. Thus it is natural to apply a decay strategy for sampling golden tokens based on the decoding step t. Specifically, we directly inherit the above decay strategies (Section 2.3) from f(i) for training steps to g(t), with a different set of hyperparameters (listed in Table 2). To rigorously validate the necessity and effectiveness of our proposals, we further include the following method variants for comparison:
• Always Sampling: This model always samples from its own predictions.
• Uniform Sampling: This model randomly samples golden tokens with a uniform probability (0.5 in our experiments).
We draw some representative strategies in Figure 4. Both 'Always Sampling' (blue line) and 'Uniform Sampling' (green line) are parallel to the x-axis, i.e., independent of t. They serve as baseline models to verify whether a scheduled strategy is necessary on the dimension of t. The exponential decay (solid red line) shows a similar trend to the real error rate (black line): the larger the decoding step, the higher the error rate. On the other hand, the exponential increase (dashed red line) runs entirely contrary to the real error rate. However, we cannot take it for granted that the exponential increase is inappropriate, as it can still simulate the error accumulation phenomenon. Therefore, merely comparing error rates is not enough, and we need to look deeper into error accumulations for further comparisons.
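To make the comparison between these strategies concrete, the sketch below evaluates g(t) and the simulated error rate 1 − g(t) at each decoding step. It assumes the exponential form g(t) = k^t with decoding steps indexed from 1; the radix k = 0.99 and the function names are illustrative assumptions.

```python
def g_exponential_decay(t, k=0.99):
    """Probability of feeding the golden token at decoding step t (k < 1)."""
    return k ** t


def g_always_sampling(t):
    """'Always Sampling': the golden token is never fed."""
    return 0.0


def g_uniform_sampling(t, p=0.5):
    """'Uniform Sampling': the same probability at every decoding step."""
    return p


def simulated_error_rates(g, length):
    """1 - g(t) for t = 1..length: the chance of feeding a predicted token."""
    return [1.0 - g(t) for t in range(1, length + 1)]


if __name__ == "__main__":
    for name, g in [("decay", g_exponential_decay),
                    ("always", g_always_sampling),
                    ("uniform", g_uniform_sampling)]:
        rates = simulated_error_rates(g, 50)
        print(name, [round(r, 3) for r in rates[::10]])
```

Only the decay schedule yields a simulated error rate that grows with t, while the two baselines stay flat.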
Error Accumulations. At decoding step t, the number of accumulated errors accum(t) is the definite integral of the probability of sampling model predictions 1 − g(t):
$$\mathrm{accum}(t) = \int_{0}^{t} \left(1 - g(x)\right) \, dx$$
As shown in Figure 5, accum(t) is a monotonically increasing function, which can simulate the error accumulation phenomenon regardless of which scheduled strategy g(t) is used during training. Nevertheless, we observe that different strategies show different speeds and distributions when simulating error accumulations. For instance, decay strategies (solid lines) accumulate errors slowly at the beginning of decoding and then rapidly as decoding steps grow, which is analogous to the real inference scene (black line). In contrast, increase strategies (dashed lines) do just the opposite: they simulate a distribution with many errors at the beginning and an almost fixed number of errors in the following decoding steps. Moreover, although different decay strategies show similar trends for simulating error accumulations in the training stage, how closely they approximate the real error numbers still differs. We further validate whether this proximity is closely related to the final performance in Section 5.1.
Figure 6: Examples for different h(i, t). The wavelengths of colors represent the probability of sampling golden tokens. Namely, the closer the color is to red, the greater the probability. Red circles are for highlighting.
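The accumulated-error curve can be computed numerically for any schedule. The sketch below assumes accum(t) is the integral of 1 − g(x) from 0 to t and approximates it with the trapezoidal rule; for the exponential decay g(x) = k^x it also gives the closed form t − (k^t − 1) / ln k as a cross-check. The radix 0.95 in the usage lines is illustrative.

```python
import math


def accum(t, g, n=1000):
    """Trapezoidal approximation of the integral of 1 - g(x) over [0, t]."""
    if t <= 0:
        return 0.0
    h = t / n
    total = 0.5 * ((1 - g(0.0)) + (1 - g(t)))
    for j in range(1, n):
        total += 1 - g(j * h)
    return total * h


def accum_exponential(t, k):
    """Closed form for g(x) = k**x with 0 < k < 1."""
    return t - (k ** t - 1) / math.log(k)


if __name__ == "__main__":
    decay = lambda x: 0.95 ** x        # more golden tokens early
    increase = lambda x: 1 - 0.95 ** x  # more golden tokens late
    for t in (5, 20, 50):
        print(t, round(accum(t, decay), 2), round(accum(t, increase), 2))
```

With these settings the increase schedule accumulates more errors than the decay schedule at small t, which matches the behavior described for the dashed and solid lines.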

Sampling Based on Both Training Steps and Decoding Steps
When comparing the above two types of approaches, i.e., f(i) and g(t), our approach g(t) focuses on simulating the distribution of real translation errors, while the vanilla f(i) emphasizes the competence of the current model. Thus it is natural to verify whether f(i) and g(t) complement each other. How to combine them is the critical point. At training step i and decoding step t, we define the probability of sampling golden tokens h(i, t) by the following joint distribution function:
$$h(i, t) = \begin{cases} f(i) \cdot g(t) & \text{Product} \\ \frac{f(i) + g(t)}{2} & \text{Arithmetic Mean} \\ g\left(t \cdot (1 - f(i))\right) & \text{Composite} \end{cases}$$
One simple solution ('Product') is to directly multiply f(i) and g(t). However, both f(i) and g(t) are less than or equal to 1, so their product quickly shrinks to a tiny value close to 0. Consequently, it exposes too few golden tokens and too many predicted tokens to the model (Figure 6 (a)), which increases the difficulty of training. 'Arithmetic Mean' is another possible solution with a relatively gentle combination. However, it still inappropriately exposes too few golden tokens to the model at the beginning of training (Figure 6 (b)). Finally, we propose to apply function composition to f(i) and g(t) (i.e., 'Composite'; see footnote 8). It guarantees enough golden tokens at the beginning of training, and gradually exposes more predicted tokens to the model as both i and t increase (Figure 6 (c)). We analyze the effects of different h(i, t) in Section 5.2.
Footnote 8: We also tried f(i · (1 − g(t))) in preliminary experiments, but it slightly underperformed the above g(t · (1 − f(i))).
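The three combinations can be compared directly in code. The sketch below assumes 'Product' is f(i)·g(t), 'Arithmetic Mean' is (f(i) + g(t)) / 2, and 'Composite' is g(t · (1 − f(i))); the sigmoid and exponential schedules and their hyperparameters in the usage lines are illustrative assumptions.

```python
import math


def h_product(i, t, f, g):
    """'Product': multiply the two probabilities."""
    return f(i) * g(t)


def h_arithmetic_mean(i, t, f, g):
    """'Arithmetic Mean': average the two probabilities."""
    return 0.5 * (f(i) + g(t))


def h_composite(i, t, f, g):
    """'Composite': scale the decoding step by 1 - f(i) before applying g."""
    return g(t * (1.0 - f(i)))


def sigmoid_decay(i, k):
    """k / (k + e^(i/k)), written in an overflow-safe form."""
    x = i / k - math.log(k)
    return 0.0 if x > 700 else 1.0 / (1.0 + math.exp(x))


if __name__ == "__main__":
    f = lambda i: sigmoid_decay(i, k=20_000)  # illustrative k
    g = lambda t: 0.99 ** t                   # illustrative radix
    for i in (0, 100_000, 300_000):
        for t in (1, 50):
            print(i, t,
                  round(h_product(i, t, f, g), 3),
                  round(h_arithmetic_mean(i, t, f, g), 3),
                  round(h_composite(i, t, f, g), 3))
```

When f(i) is close to 1 early in training, the composite form evaluates g near 0 and therefore keeps mostly golden tokens at every decoding step; as f(i) decays toward 0, it approaches g(t).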

Experiments
We validate our proposals on two important sequence generation tasks, i.e., machine translation and text summarization.

Tasks and Datasets
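Machine Translation. We evaluate on three large-scale WMT tasks: WMT 2014 English-German (EN-DE), WMT 2014 English-French (EN-FR), and WMT 2019 Chinese-English (ZH-EN). Text Summarization. We evaluate on two popular benchmarks: (a) the CNN/DailyMail corpus (See et al., 2017), and (b) the Gigaword corpus (Rush et al., 2015). We list statistics for all datasets in Table 1.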

Implementation Details
Training Setup. For the translation task, we follow the default setup of the Transformer base and Transformer big models (Vaswani et al., 2017), and provide detailed setups in Appendix A (Table 7). All Transformer models are first trained by teacher forcing for 100k steps, and then trained with different training objectives or scheduled sampling approaches for 300k steps. All experiments are conducted on 8 NVIDIA V100 GPUs, each allocated a batch size of approximately 4096 tokens. For the text summarization task, we build on ProphetNet (Qi et al., 2020) and follow its training setups. We set the hyperparameters involved in the various scheduled sampling strategies (i.e., f(i) and g(t)) according to the performance on the validation set of each task and list k in Table 2. For the linear decay, we set ε and b to 0.2 and 1, respectively. Please note that scheduled sampling is only used during training, not at the inference stage.
Evaluation. For the machine translation task, we set the beam size to 4 and the length penalty to 0.6 during inference. We use multi-bleu.perl to calculate case-sensitive BLEU scores for EN-DE and EN-FR, and use the mteval-v13a.pl script to calculate case-sensitive BLEU scores for ZH-EN. We use the paired bootstrap resampling method (Koehn, 2004) to compute the statistical significance of translation results. We report the mean and standard error of BLEU scores over three runs. For the text summarization task, we respectively set the beam size to 4/5 and the length penalty to 1.0/1.2 for the Gigaword and CNN/DailyMail datasets, following previous studies (Song et al., 2019; Qi et al., 2020). We report the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L for both datasets.
TeaForN. Teacher forcing with n-grams (Goodman et al., 2020) gives standard teacher forcing a broader view through n-gram optimization.
Sampling based on training steps. For distinction, we refer to vanilla scheduled sampling as 'Sampling based on training steps'. By default, we adopt the sigmoid decay.
Sampling with sentence oracles. Zhang et al. (2019) refine the sampling candidates of scheduled sampling with sentence oracles, i.e., predictions from beam search. Note that its sampling strategy is based on training steps with the sigmoid decay.
Sampling based on decoding steps. Our sampling based on decoding steps with the exponential decay.
Sampling based on training and decoding steps. Our sampling based on both training steps and decoding steps with the 'Composite' method.
show better translation quality while preserving efficient training. TeaForN also yields competitive translation quality due to its long-term optimization. Among all these methods, our 'Sampling based on decoding steps' shows consistent improvements on various datasets. Moreover, 'Sampling based on training and decoding steps' combines the advantages of both existing methods and our proposals, and achieves better performance. Specifically, for the Transformer base, it brings significant improvements of 1.08, 1.08, and 1.27 BLEU points on EN-DE, ZH-EN, and EN-FR, respectively. Moreover, it significantly outperforms vanilla scheduled sampling by 0.58, 0.62, and 0.55 BLEU points on these tasks, respectively. For the more powerful Transformer big, we reach similar conclusions. Specifically, 'Sampling based on training and decoding steps' significantly outperforms the Transformer big by 1.26, 0.88, and 1.24 BLEU points on EN-DE, ZH-EN, and EN-FR, respectively.
Text Summarization. In Table 4, we list F1 scores of ROUGE-1 / ROUGE-2 / ROUGE-L on the test sets of both text summarization datasets. We take the powerful ProphetNet large as our primary baseline and apply different sampling-based approaches. For vanilla scheduled sampling (second-to-last row of Table 4), we observe marginal improvements on Gigaword and even degradations on CNN/DailyMail. We speculate that the poor performance comes from its uniform sampling rate across decoding steps, which violates the distribution of the real inference scene. Namely, the model is overexposed to golden tokens and underexposed to predicted tokens at larger decoding steps. This is especially true for CNN/DailyMail, where the average target sequence length exceeds 64 and more than 90% of sentences are longer than 50, which exacerbates the above issue in existing sampling-based approaches. We further analyze the effects of different sampling approaches on different sequence lengths in Section 5.3. In contrast, our approaches are not affected by the above issue and show consistent improvements in all criteria on both datasets. Specifically, our approaches achieve consistently better performance than the baseline system on both datasets, and significantly improve the previous SOTA ROUGE-L score on Gigaword to 37.24 (+0.5). In conclusion, the strong performance on the text summarization task indicates that our approaches generalize well across different tasks.

Analysis and Discussion
In this section, we provide in-depth analyses on the necessity of our proposals and conduct experiments on the validation set of WMT14 EN-DE with the Transformer base model.

Effects of Scheduled Strategies
In this section, we focus on the effects of different scheduled strategies based on the decoding step t, and aim to answer the following two questions:
(a) Is a Scheduled Strategy Necessary?
We take the Transformer without sampling as the baseline, then respectively apply 'Always Sampling', 'Uniform Sampling', and our 'Exponential Decay'. Results are listed in part (a) of Table 5. We observe a noticeable drop with 'Always Sampling', as the model is entirely exposed to its own predictions and fails to converge fully. 'Uniform Sampling' is essentially a simulation of the vanilla 'Sampling based on training steps'. Although 'Uniform Sampling' uses an inappropriate sampling strategy, it can still simulate the data distribution of the inference scene to some extent and brings modest BLEU improvements. In contrast, our 'Exponential Decay' uses a sampling strategy that follows real translation errors. It significantly outperforms 'No Sampling' and 'Uniform Sampling' by 1.64 and 0.68 BLEU, respectively.
In short, we conclude that an appropriate scheduled strategy based on decoding steps is necessary.

(b) Why Decay Instead of Increase Strategies?
Considering that errors naturally accumulate along decoding steps, both decay strategies and increase strategies can simulate error accumulations. We respectively apply both kinds of sampling strategies on top of the 'Uniform Sampling' baseline model, and list results in part (b) and part (c) of Table 5. Surprisingly, all increase strategies consistently decrease performance by considerable margins. We conjecture that these increase strategies simulate an unreasonably high error rate at the beginning of decoding. Too many translation errors are propagated to subsequent decoding steps, which hinders the final performance. On the contrary, all decay strategies bring consistent improvements of different degrees. Moreover, we observe that the more closely a decay strategy approximates the real error numbers (Figure 5), the larger the performance improvement. In summary, from the perspective of simulating real error accumulations, decay strategies rather than increase strategies should be applied based on decoding steps.

Effects of Different h(i, t) Strategies
We take our strong 'Sampling based on decoding steps' as the baseline and then apply different combination methods h(i, t). As shown in Table 6, the performance drop of 'Product' and 'Arithmetic Mean' confirms our speculation in Section 3.4. Namely, the model is overexposed to its predictions at the beginning of training steps and decoding steps, and thus fails to converge well. In contrast, 'Composite' brings certain improvements over the strong baseline model, since it stabilizes model training and successfully combines the advantages of both the training-step and decoding-step dimensions. In summary, a well-designed strategy is necessary when combining f(i) and g(t), and we provide an effective alternative (i.e., 'Composite').

Effects on Different Sequence Lengths
According to our earlier findings, the exposure bias problem gets worse as the sentence length grows. Thus it is natural to verify whether our approaches improve translations of long sentences. Since the WMT14 EN-DE validation set (3k) is too small to cover scenarios with various sentence lengths, we randomly select training data with different sequence lengths. Specifically, we divide the WMT14 EN-DE training data into ten bins according to the source-side sentence length. The maximal length is 100, and the interval size is 10. Then we randomly select 1000 sentence pairs from each bin and calculate BLEU scores for different approaches. We take the Transformer as the baseline, and plot the absolute BLEU gains of scheduled sampling based on training steps and on decoding steps. As shown in Figure 7, the BLEU gains of vanilla scheduled sampling are relatively uniform over different sentence lengths. In contrast, the BLEU gains of our scheduled sampling based on decoding steps gradually increase with sentence length. Moreover, our approach outperforms the vanilla one in most sentence length intervals. Specifically, we observe improvements of more than 1.0 BLEU when sentence lengths are in [80, 100].
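The length-binned sampling procedure can be sketched as follows. This is an illustration under stated assumptions: sentence length is counted as whitespace-separated tokens, bins are (0, 10], (10, 20], ..., (90, 100], and the function names are hypothetical.

```python
import random


def bin_by_source_length(pairs, bin_width=10, max_len=100):
    """Group (source, target) pairs into bins by source token count.

    Bin b holds sources with length in (b*bin_width, (b+1)*bin_width].
    Pairs longer than max_len (or empty) are dropped.
    """
    bins = {b: [] for b in range(max_len // bin_width)}
    for src, tgt in pairs:
        n = len(src.split())
        if 1 <= n <= max_len:
            bins[(n - 1) // bin_width].append((src, tgt))
    return bins


def sample_per_bin(bins, k=1000, seed=0):
    """Randomly draw up to k pairs from each bin without replacement."""
    rng = random.Random(seed)
    return {b: rng.sample(items, min(k, len(items)))
            for b, items in bins.items()}


if __name__ == "__main__":
    toy = [(" ".join(["w"] * n), "t") for n in (3, 10, 11, 57, 100, 101)]
    binned = bin_by_source_length(toy)
    print({b: len(v) for b, v in binned.items() if v})
    print({b: len(v) for b, v in sample_per_bin(binned, k=1).items() if v})
```

BLEU would then be computed separately on the sample drawn from each bin.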

Conclusion
In this paper, we propose scheduled sampling methods based on decoding steps from the perspective of simulating real translation error rates, and provide in-depth analyses on the necessity of our proposals. We also confirm that our proposals are complementary to existing studies (based on training steps). Experiments on three large-scale WMT translation tasks and two text summarization tasks confirm the effectiveness of our approaches. In the future, we will investigate low-resource settings, which may suffer from a more serious error accumulation problem. In addition, we will explore more autoregressive generation tasks.

A Training Details
We list detailed parameters for training Transformer models in Table 7.

B Real Error Rates as Sampling Priors
In the main body of this paper, we aim to better simulate the inference scene under the guidance of real error rates. A natural question is what happens if we directly take these error rates as sampling priors. Disappointingly, this fails to outperform our exponential decay strategy, although the gap is within 0.1 BLEU. We conjecture that the metric we use to measure translation errors at each decoding step may not be good enough. Considering that the optimal metric is currently unknown, our unigram matching can still be regarded as a simple and effective alternative. It succeeds in reflecting the trend of real error rates and brings significant improvements by simulating the error distribution estimated by unigram matching. We believe a better metric would bring further improvements and leave this exploration for future work.