Self-Guided Curriculum Learning for Neural Machine Translation

In supervised learning, a well-trained model should be able to recover ground truth accurately, i.e. the predicted labels are expected to resemble the ground truth labels as much as possible. Inspired by this, we formulate a difficulty criterion based on the recovery degrees of training examples. Motivated by the intuition that after skimming through the training corpus, the neural machine translation (NMT) model “knows” how to schedule a suitable curriculum according to learning difficulty, we propose a self-guided curriculum learning strategy that encourages the NMT model to learn from easy to hard on the basis of recovery degrees. Specifically, we adopt sentence-level BLEU score as the proxy of recovery degree. Experimental results on translation benchmarks including WMT14 English-German and WMT17 Chinese-English demonstrate that our proposed method considerably improves the recovery degree, thus consistently improving the translation performance.


Introduction
Inspired by the learning behavior of humans, Curriculum Learning (CL) for neural network training starts from the basic idea of "starting small": it is better to start from easier aspects of a task and then progress towards aspects with an increasing level of difficulty (Elman, 1993). Bengio et al. (2009) achieve significant performance boosts on several tasks by forcing models to learn training examples following an order from "easy" to "difficult". They further decompose a CL method into two important constituents: how to rank training examples by learning difficulty, and how to schedule the presentation of training examples based on that rank.

Figure 1: The NMT model is well-trained on a parallel corpus D, {(x_1, y_1), (x_2, y_2)} ∈ D. ŷ_i is translated from x_i. The distance between the ground truth y_i and the NMT-generated hypothesis ŷ_i represents the recovery degree (dashed arrows), which is computed by sentence-level BLEU in our case. Blue- and green-colored examples represent the NMT learned distribution and the empirical distribution, respectively. Taking x_1 and x_2 as input, the training example (x_1, y_1) shows a better recovery degree, which means it is easier to master than (x_2, y_2).
In the field of neural machine translation (NMT), empirical studies have shown that CL strategies contribute to both convergence speed and model performance (Zhang et al., 2018; Platanios et al., 2019; Zhang et al., 2019; Liu et al., 2020; Zhan et al., 2021; Ruiter et al., 2020). These CL strategies vary by difficulty criteria and curriculum schedules. Early difficulty criteria depend on manually crafted features and prior knowledge such as sentence length and word rarity (Kocmi and Bojar, 2017). The drawback lies in the fact that humans understand learning difficulty differently from NMT models. Recent works choose to derive difficulty criteria based on the probability distribution of training examples to approximate the perspective of NMT models. For instance, Platanios et al. (2019) turn discrete numerical difficulty scores into relative probabilities to construct their difficulty criterion, while others derive difficulty criteria from independently trained language models (Zhang et al., 2019; Dou et al., 2020; Liu et al., 2020) or word embedding models (Zhou et al., 2020b). Xu et al. (2020) derive their difficulty criterion from the NMT model itself during training. These difficulty criteria are applied with either a fixed curriculum schedule (Cirik et al., 2016) or a dynamic one (Platanios et al., 2019; Liu et al., 2020; Xu et al., 2020; Zhou et al., 2020b).
A well-trained NMT model estimates the optimal probability distribution mapping from the source language to the target language, and is assumed to be able to recover the ground truth translations accurately. However, if we perform inference on the training set, many of the predictions are inconsistent with the references. This reflects the distribution shift between the NMT model's learned distribution and the empirical distribution of the training corpus, as illustrated in Figure 1. For a training example, a high recovery degree between the prediction and the ground-truth target sentence means it is easier for the NMT model to master, while a lower recovery degree means it is more difficult (Ding and Tao, 2019). Accordingly, we employ this recovery degree as the difficulty criterion, where the recovery degree is computed by sentence-level BLEU. We put forward an analogy for this method: humans can schedule a personal and effective curriculum after skimming over a textbook, hence a self-guided curriculum.
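To make the recovery degree concrete, the sketch below computes a smoothed sentence-level BLEU between a model hypothesis and its reference. The smoothing here is a simplified stand-in for the NIST smoothing used by fairseq-score, and all function names are our illustrative choices:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Smoothed sentence-level BLEU in [0, 100]. A simplified stand-in for
    fairseq-score's NIST-smoothed BLEU, for illustration only."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    smooth = 1.0  # doubles for each successive zero-match order (NIST-style)
    for n in range(1, max_n + 1):
        total = len(hyp) - n + 1
        if total <= 0:
            break  # hypothesis too short for this n-gram order
        h_counts, r_counts = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, r_counts[g]) for g, c in h_counts.items())
        if matches == 0:
            smooth *= 2.0
            p = 1.0 / (smooth * total)
        else:
            p = matches / total
        log_precisions.append(math.log(p))
    brevity = min(1.0, math.exp(1.0 - len(ref) / len(hyp)))
    return 100.0 * brevity * math.exp(sum(log_precisions) / len(log_precisions))

# Recovery degree of a training example (x, y): BLEU between the vanilla
# model's hypothesis for x and the ground truth y.
```

A hypothesis that recovers the reference exactly scores 100, and partial overlaps score proportionally lower, which is exactly the ordering the difficulty criterion needs.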
In this work, we cast the recovery degree of each training example as its learning difficulty, enforcing the NMT model to learn from examples with higher recovery degrees to those with lower degrees. Also, we implement our proposed recovery-based difficulty criterion with fixed and dynamic curriculum schedules. Experimental results on two machine translation benchmarks, i.e., WMT14 En-De and WMT17 Zh-En, demonstrate that our proposed self-guided CL can alleviate the distribution shift problem in vanilla NMT models, thus consistently boosting the performance.

Problem Definition
For a better interpretation of curriculum learning for neural machine translation, we put the discussion of various CL strategies into a probabilistic perspective. Such a perspective also motivates us to derive the recovery-based difficulty criterion.

Neural Machine Translation
Let S and T represent the probability distributions over all possible sequences of tokens in the source and target languages, respectively. We denote the distributions of a random source sentence x and a random target sentence y as P_S(x) and P_T(y). The NMT model learns the conditional distribution P_{S,T}(y|x) with a probabilistic model P(y|x; θ) parameterized by θ, where θ is estimated by minimizing the objective:

J(θ) = E_{(x,y)∼P̂_D}[−log P(y|x; θ)],   (1)

where P̂_D is the empirical distribution of the training corpus D.

Curriculum Learning for Neural Machine Translation
CL methods decompose NMT model training into K phases, enforcing the optimization trajectory in parameter space to visit a series of points θ_1, . . . , θ_K. Each training phase can be viewed as a sub-optimal process, optimized on a subset D_k of the training corpus D:

J(θ_k) = E_{(x,y)∼P̂_{D_k}}[−log P(y|x; θ_k)],   (2)

where P̂_{D_k} is the empirical distribution of D_k. According to the definition of curriculum learning, the optimization difficulty increases from J(θ_1) to J(θ_K) (Bengio et al., 2009). In practice, this is achieved by grouping training examples into subsets in ascending order of learning difficulty. The process of splitting D into K subsets can be formulated as:

{D_1, . . . , D_K} = g(D, d(·)),   (3)

where g is a splitting function driven by a difficulty criterion d(·). With these notations, we review the DIFFICULTY CRITERIA in existing CL methods from a probabilistic perspective, as these methods generally derive difficulty criteria from a probability distribution. For example:

Sentence-Level Features d(x^n) = CDF(Feature(x^n)), where Feature(·) denotes handcrafted features and linguistic prior knowledge such as sentence length and word rarity. With the cumulative density function (CDF), numerical scores are mapped into a relative probability distribution over all training examples (Platanios et al., 2019). Only features of source sentences are taken into consideration in their practice.
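As an illustration of this CDF mapping, the sketch below converts raw feature scores (hypothetical sentence lengths) into a relative scale in (0, 1]; the function name and toy data are ours:

```python
import bisect

def empirical_cdf_scores(raw_scores):
    """Map raw difficulty scores into (0, 1] via the empirical CDF: each
    score becomes the fraction of training examples whose raw score is
    less than or equal to it."""
    ordered = sorted(raw_scores)
    n = len(raw_scores)
    return [bisect.bisect_right(ordered, s) / n for s in raw_scores]

lengths = [3, 7, 5, 7, 12]  # Feature(x) = sentence length, toy values
difficulties = empirical_cdf_scores(lengths)
# difficulties now form a relative probability scale over all examples
```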

Language Model d(x^n) = −(1/I) log P_LM(w^n_1, . . . , w^n_I), where a language model is adopted to estimate the perplexity of each sentence x = w_1, . . . , w_I. Language models trained on the source and target side can be used jointly, e.g., d(x^n) + d(y^n) (Zhou et al., 2020b). In other works (Zhang et al., 2019; Dou et al., 2020), language models in different domains are adopted to compute the cross-entropy difference of each sentence, indicating its difficulty for domain adaptation.

Word Embedding, where w_1, . . . , w_I is a distributed representation of the source sentence x mapped through an independent word embedding model. In the case of Liu et al. (2020), the norm of the word vectors on the source side is used as the difficulty criterion. They also use the CDF function to ensure the difficulty scores form a relative probability distribution.

NMT Model l(z^n; θ_k) = −log P(y^n|x^n; θ_k), where θ_k represents the NMT model parameters at the k-th training phase. The decline of this loss is defined as the difficulty criterion in Xu et al. (2020). Besides, the score of cross-lingual patterns may also be a proper difficulty criterion for NMT (Zhou et al., 2020a; Wu et al., 2021), which we leave as future work.
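As a toy illustration of the language-model criterion d(x) = −(1/I) log P_LM(w_1, . . . , w_I), the sketch below uses an add-one-smoothed unigram LM; the cited works use full neural or n-gram LMs, and all names here are our illustrative stand-ins:

```python
import math
from collections import Counter

def train_unigram_lm(corpus):
    """Toy add-one-smoothed unigram LM (a stand-in for the neural or
    n-gram LMs used in the works cited above)."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def lm_difficulty(sentence, lm_prob):
    """d(x) = -(1/I) log P_LM(w_1, ..., w_I) under a unigram factorization."""
    words = sentence.split()
    return -sum(math.log(lm_prob(w)) for w in words) / len(words)

corpus = ["the cat sat", "the dog ran", "the cat ran"]  # toy corpus
lm = train_unigram_lm(corpus)
# Sentences containing rarer words receive higher difficulty scores.
```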
We now turn to CURRICULUM SCHEDULING. There are two controlling factors: extraction of the training set and training phase duration. In other words, how to split the training corpus into subsets and when to load them. Given K mutually exclusive subsets {D_1, . . . , D_K} of D, there are two general regimens for loading them as training progresses: one pass and baby steps. In the one pass regimen, the subsets D_k are loaded as the training set one by one, while in the baby steps regimen, these subsets are merged into the current training set one by one (Cirik et al., 2016). According to Cirik et al. (2016), baby steps outperforms one pass. Later approaches generally take the idea of baby steps, in that easy examples are not cast aside while the probability of difficult examples being batched increases.
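The two loading regimens can be sketched as follows (the subset contents are hypothetical):

```python
def one_pass(subsets):
    """One pass: each training phase k uses only its own subset D_k."""
    for k, d_k in enumerate(subsets, 1):
        yield k, list(d_k)

def baby_steps(subsets):
    """Baby steps: subset D_k is merged into the current training set at
    phase k, so easier examples are never cast aside."""
    current = []
    for k, d_k in enumerate(subsets, 1):
        current.extend(d_k)
        yield k, list(current)

# Hypothetical pre-split subsets, easiest first
subsets = [["easy1", "easy2"], ["mid1"], ["hard1"]]
```

Under baby steps, the final phase trains on the full corpus, which matches the observation that it outperforms one pass.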
On top of baby steps, we can summarize existing works into two schedule settings: fixed schedule and dynamic schedule. In the fixed schedule, both training set extraction and training phase duration are fixed (Cirik et al., 2016; Zhang et al., 2019). The size of the training set scales up by a certain proportion of the total training examples, usually |D_k| = N/K, at the beginning of a new training phase, and each training phase spends a fixed number of training steps. In the dynamic schedule, either training set extraction or training phase duration is dynamic. Depending on which controlling factor is dynamic, we group existing dynamic schedules into two types: the competence type and the self-paced type. The competence-based CL method is proposed by Platanios et al. (2019). In the competence type of dynamic schedule, training set extraction is dynamic while the training phase duration is fixed. At the beginning of a training phase, the CL algorithm computes the model competence c at that moment, then extracts examples with difficulty scores lower than c as the training set for the current phase, {z^n | d(z^n) ≤ c, z^n ∈ D}. For K training phases, the competence-based schedule is to determine (K − 1) upper limits with a scale factor within the range of d(z^n), which is [0, 1]. Platanios et al. (2019) take training steps 1, . . . , t, . . . , T as the scale factor. Recent works develop model competence by introducing different scale factors, such as the norm of the source embedding of the NMT model (Liu et al., 2020) and the BLEU score on the validation set (Xu et al., 2020). The other type of dynamic schedule is the self-paced one (Jiang et al., 2015; Zhou et al., 2020b), in which training set extraction is fixed while the training phase duration is dynamic. After a training phase begins, it continues until convergence or until certain conditions are met. For example, in Zhou et al. (2020b), model training progresses to the next phase if the model uncertainty stops declining.
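A minimal sketch of the competence-type schedule, using the square-root competence function of Platanios et al. (2019) with the training step t as the scale factor; `extract_training_set` and the example data are our illustrative assumptions:

```python
import math

def competence(t, T, c0=0.01):
    """Square-root competence c(t) = min(1, sqrt(t*(1 - c0^2)/T + c0^2)):
    t is the current training step, T the duration of the competence ramp,
    and c0 the initial competence (0.01 follows Platanios et al., 2019)."""
    return min(1.0, math.sqrt(t * (1.0 - c0 ** 2) / T + c0 ** 2))

def extract_training_set(examples, difficulties, t, T):
    """Competence-type extraction: keep examples with d(z) <= c(t),
    assuming difficulty scores already lie in [0, 1] (e.g. via a CDF)."""
    c = competence(t, T)
    return [z for z, d in zip(examples, difficulties) if d <= c]
```

Since c(t) grows monotonically to 1, every example eventually becomes eligible for batching, mirroring the baby-steps idea.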

Methodology
As mentioned above, due to the distribution shift problem, predictions made by a well-trained vanilla NMT model can be inconsistent with the references when performing inference on the training set. Training examples with higher recovery degrees are easier for the NMT model to master, while those with lower recovery degrees are likely to be more difficult.
In this section, we first introduce our recovery-based difficulty criterion and then propose to implement this criterion with fixed and dynamic curriculum schedules. The workflow of our proposed self-guided curriculum learning strategy is illustrated in Figure 2.

Difficulty Criterion
The objective function of the vanilla model can be written as an expectation over the empirical distribution of the training corpus D:

J(ϕ) = E_{(x^n,y^n)∼P̂_D}[L(f(x^n; ϕ), y^n)],   (4)

where f(x^n; ϕ) represents the model's prediction and L is the loss function. As noted in Section 2, curriculum learning minimizes the objective J(θ) with a set of sub-optimal processes ordered from easy to difficult. Examples that better fit the average distribution learned by the vanilla model with parameters ϕ get higher recovery degrees. Starting curriculum learning on a set of examples with higher recovery degrees means starting the optimization of J(θ) from a smaller parameter space in the neighborhood of ϕ. In the machine translation scenario, we care most about model performance in terms of translation quality, so we choose the BLEU score, the de facto automatic metric for MT, to measure the recovery degree. The difficulty criterion based on the sentence-level BLEU score is as follows:

d(z^n) = −BLEU(f(x^n; ϕ), y^n),   (5)

so that a higher recovery degree corresponds to a lower learning difficulty. Other reference-based automatic metrics for MT are applicable in this difficulty criterion as well.
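Given precomputed recovery degrees, a corpus split driven by this criterion can be sketched as follows; the example IDs and BLEU values are toy data, and taking difficulty as negated BLEU is one natural formalization of the criterion described above:

```python
import math

def split_by_recovery(examples, recovery_degrees, K):
    """Rank examples by descending recovery degree (i.e. ascending
    difficulty, with difficulty = -BLEU), then split the ranked list into
    K roughly equal subsets D_1 (easiest) ... D_K (hardest)."""
    order = sorted(range(len(examples)), key=lambda i: -recovery_degrees[i])
    ranked = [examples[i] for i in order]
    size = math.ceil(len(ranked) / K)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

pairs = ["z1", "z2", "z3", "z4", "z5", "z6"]   # training examples
bleu = [10.0, 90.0, 50.0, 70.0, 20.0, 30.0]    # toy recovery degrees
subsets = split_by_recovery(pairs, bleu, K=3)
```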

Curriculum Scheduling
Following the basic operations of the baby steps regimen, we first split the training corpus D into K mutually exclusive subsets {D_1, . . . , D_K}, corresponding to K training phases. With the difficulty criterion d(·), we define the corpus splitting function g:

{D_1, . . . , D_K} = g(D, d(·)),   (6)

which ranks the examples in D by d(·) and splits them into K subsets in ascending order of difficulty. Then we explore both fixed and dynamic schedules.

Fixed In the fixed schedule, the training duration of each training phase is predefined. At the beginning of the k-th training phase, subset D_k is merged into the current training set. After finishing T steps, the training progresses to the next phase k + 1; see Algorithm 1.

Dynamic We follow the self-paced type of dynamic schedule as described in Section 2, in which the training duration is dynamic while training set extraction is done before training starts. We define the condition for training phase progression by the model recovery degree. In training phase k, if the CL model constantly demonstrates recovery degrees higher than the vanilla model on the newly merged subset D_k, the CL model training advances to training phase k + 1. For easier operation, we randomly sub-sample D̃_k from D_k for model recovery validation. Based on the performance on {x^n, y^n} ∈ D̃_k, measured by the corpus-level BLEU score, we compute the model recovery degree of the CL model at the current training phase k by:

o_c(k) = BLEU(f(x^n; θ_k), y^n).   (7)

Similarly, with the same additional validation set D̃_k, we compute the model recovery degree of the vanilla model by:

o_v(k) = BLEU(f(x^n; ϕ), y^n).   (8)

If o_c > o_v, the training progresses to the next phase. Otherwise, the current training phase continues until it reaches the predefined maximum number of steps T, and then moves to the next phase. The training process is described in Algorithm 2.
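The dynamic progression rule can be sketched as below (Algorithm 2 itself is not reproduced in this excerpt; the callback signatures, the stub training routine, and the toy values are our illustrative assumptions):

```python
def dynamic_schedule(subsets, train_step, recovery_degrees, max_steps):
    """Baby steps with dynamic phase duration: after merging subset D_k,
    keep training until the CL model's recovery degree o_c on a subsample
    of D_k exceeds the vanilla model's o_v, or until max_steps is reached."""
    current = []
    for k, d_k in enumerate(subsets, 1):
        current.extend(d_k)  # merge D_k into the current training set
        for _ in range(max_steps):
            train_step(current)
            o_c, o_v = recovery_degrees(k)  # corpus-level BLEU pair
            if o_c > o_v:
                break  # phase condition met: advance to phase k + 1
    return current

# Stub demonstration: o_c grows with the number of steps trained, while
# the frozen vanilla model's o_v stays fixed.
state = {"steps": 0}

def train_step(_batch):
    state["steps"] += 1

def recovery_degrees(_k):
    return state["steps"], 2  # (o_c, o_v), toy values

final_set = dynamic_schedule([["d1"], ["d2"]], train_step, recovery_degrees, max_steps=10)
```

With these stubs, phase 1 runs three steps before o_c exceeds o_v, and phase 2 runs one more, illustrating how harder phases can consume more steps.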

Datasets
We conduct experiments on two machine translation benchmarks: WMT'14 English⇒German (En-De) and WMT'17 Chinese⇒English (Zh-En). For En-De, the training set consists of 4.5 million sentence pairs. We use newstest2012 as the validation set and report test results on both newstest2014 and newstest2016 for fair comparison with existing approaches. For Zh-En, we follow Hassan et al. (2018) to extract 20 million sentence pairs as the training set. We use newsdev2017 as the validation set and newstest2017 as the test set. Chinese sentences are segmented with the word segmentation toolkit Jieba 1 . Sentences in other languages are tokenized with Moses 2 . We learn Byte-Pair Encoding (BPE) (Sennrich et al., 2016) with 32k merge operations, using a shared vocabulary for En-De. We use BLEU (Papineni et al., 2002) as the automatic metric both for computing recovery degrees and for evaluating model performance, with a statistical significance test (Collins et al., 2005).

Model Settings
We implement the proposed CL method with the FAIRSEQ 3 (Ott et al., 2019) implementation of the Transformer BASE (Vaswani et al., 2017). For regularization, we use dropout of 0.3 and 0.1 for En-De and Zh-En respectively, with label smoothing of 0.1. We train the models with a batch size of approximately 128K tokens, using the Adam (Kingma and Ba, 2015) optimizer. The learning rate warms up to 5×10^-4 in the first 16K steps and then decays with the inverse square-root schedule. We evaluate translation performance on an ensemble of the top 5 checkpoints to avoid stochasticity. We use shared embeddings for the En-De experiments. All our experiments are conducted with 4 NVIDIA Quadro GV100 GPUs.

Table 2 caption (fragment): … Liu et al. (2020) and Zhou et al. (2020b) instead. For the results of our proposed methods, "⇑/↑" indicates a significant difference (p < 0.01/0.05) from Transformer BASE.

Curriculum Learning Settings
The vanilla model and the CL model share the same Transformer BASE setting. For the recovery degree, we let the trained vanilla model make predictions for the source sentences in the training corpus with the beam size set to 1, as we only need to reveal the model's recovery behavior at this point. Then we evaluate the predictions with the sentence-level BLEU score. Specifically, we use fairseq-score to obtain sentence-level BLEU, which by default implements smoothing method 3, i.e., the NIST smoothing method (Chen and Cherry, 2014). According to Zhou et al. (2020b), a curriculum with 4 baby steps is superior to those with larger numbers of baby steps, so we decompose the CL training into 4 training phases. Implementing the proposed difficulty criterion, we investigate the performance of two curriculum schedules: • SGCL Fixed represents self-guided curriculum learning with the fixed schedule.
• SGCL Dynamic represents self-guided curriculum learning with dynamic schedule.

Results
Table 2 summarises our experimental results together with those of existing CL methods. Row 1 shows the results of the standard Transformer BASE on these benchmarks. Rows 2-4 show results from existing curriculum learning approaches. Row 5 shows the results of our Transformer BASE implementation, and rows 6-7 are the results of our proposed CL models. For En-De, if existing works report results on only one of newstest2014 and newstest2016, then only the reported one is shown. We report results on both for fair comparison.
We train our implemented Transformer BASE baseline and the proposed CL models for 300k steps. For both SGCL Fixed and SGCL Dynamic, we observe superior performance over the strong baseline on all three test sets of the two benchmarks, agreeing with existing approaches that curriculum learning can facilitate NMT. Comparing the two scheduling methods, SGCL Dynamic outperforms SGCL Fixed. A possible reason is that the dynamic schedule encourages the CL model to spend more steps on the more difficult subsets. Encouragingly, we observe considerable gains over other curriculum learning counterparts.

Analysis

We conduct experiments on En-De for further analysis of the proposed CL methods. As described in Section 3, we adopt the sentence-level BLEU score to measure the recovery degrees of all examples in the training corpus with a vanilla NMT model. When making predictions with the vanilla model, we set the beam size to 1 for simplicity, so the recovery degrees can be lower than the test results of a strong baseline. Looking at the distribution of BLEU scores over all training examples, as illustrated in Figure 3, the distribution is very dense in the region with lower scores. Specifically, more than 53.9% of training examples get a recovery degree lower than 10. This reflects the distribution shift problem of a well-trained vanilla NMT model: the model's learned distribution and the empirical distribution of the training corpus are inconsistent.
In our case, the training corpus is split into 4 subsets of about equal size, {D_1, D_2, D_3, D_4}. Table 3 shows the range and average of the recovery degrees of each subset, revealing the learning difficulty of each subset merged into the training set as the training phases progress. We also look at the average lengths of source sentences in these 4 subsets, which are 22.40, 23.84, 25.33, and 29.35, respectively.

Figure 4 (excerpt, with sentence-level BLEU scores): Vanilla (8.61): However, even as most internet healthcare companies struggle to raise money in a or b rounds, a few of the leading segments still enjoy the capital boom. SGCL (27.45): However, even as most internet health companies struggle with a round or b round of financing, several segments leading business still enjoy the capital boom.

Case Study
Figure 4 presents a case study in Zh-En. It indicates that our approach achieves a performance boost through better lexical choice. To better understand how our approach alleviates the low-recovery problem, we conduct a statistical analysis of the sentence-level BLEU scores of predictions made by the vanilla model and the CL model on the test set. The proportion of predictions with a BLEU score under 10 is 10.0% with the vanilla model, and is down to 8.1% with the CL one.

Conclusion
In this work, we propose a self-guided CL strategy for neural machine translation. The intuition behind it is that after skimming through all training examples, the NMT model naturally learns how to schedule a curriculum for itself. We discuss existing difficulty criteria for curriculum learning from a probabilistic perspective, which also explains our motivation for deriving a difficulty criterion based on recovery degree. Moreover, we incorporate this recovery-based difficulty criterion into both fixed and dynamic curriculum schedules. Empirical results show that with the self-guided CL strategy, the NMT model achieves better performance over a strong baseline on translation benchmarks. In the future, we will incorporate the recovery-based difficulty criterion into other dynamic scheduling methods. It will also be interesting to apply our proposed CL strategy to different scenarios, e.g., non-autoregressive generation (Gu et al., 2018; Wu et al., 2020a; Ding et al., 2020b).