Token-wise Curriculum Learning for Neural Machine Translation

Existing curriculum learning approaches to Neural Machine Translation (NMT) require sampling sufficient amounts of "easy" samples from the training data at the early training stage. This is not always achievable for low-resource languages, where the amount of training data is limited. To address this limitation, we propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples. Specifically, the model learns to predict a short sub-sequence from the beginning part of each target sentence at the early stage of training; the sub-sequence is then gradually expanded as training progresses. This curriculum design is inspired by the cumulative effect of translation errors, which makes the latter tokens more difficult to predict than the beginning ones. Extensive experiments show that our approach consistently outperforms baselines on 5 language pairs, especially for low-resource languages. Combining our approach with sentence-level methods further improves performance on high-resource languages.


Introduction
Neural Machine Translation (NMT) has achieved significant progress in recent years (Sutskever et al., 2014; Bahdanau et al., 2014; Vaswani et al., 2017), mainly in scenarios where the parallel training corpora are abundant. However, training corpora can be limited in some domains (e.g., spoken language (Cettolo et al., 2015)) and languages (e.g., African languages) due to the high cost of data acquisition. Koehn and Knowles (2017); Lample et al. (2018) show that NMT models do not perform well in such data-limited settings.
To improve NMT with limited data, researchers resort to large amounts of auxiliary data. One line of research leverages the knowledge from high-resource parallel corpora. For example, some works pre-train NMT models on high-resource data and then fine-tune them on low-resource data (Zoph et al., 2016; Chen et al., 2017; Kocmi and Bojar, 2018; Neubig and Hu, 2018; Nguyen and Chiang, 2017); others train multilingual or multi-task NMT models jointly on both high-resource and low-resource datasets (Gu et al., 2018a,b; Aharoni et al., 2019; Jiang et al., 2019; Siddhant et al., 2020). The other line exploits high-resource monolingual data as auxiliary data to train NMT models in a semi-supervised manner (Sennrich et al., 2015; Currey et al., 2017; Cheng, 2019).

* Work was done at Microsoft Azure AI.
Aside from previous approaches, curriculum learning (Bengio et al., 2009) has been proposed to address the data insufficiency issue by utilizing the limited data more efficiently (Zhang et al., 2019b). The idea of curriculum learning is to sample training data in an order of increasing difficulty. The "easy" samples of such a curriculum can be beneficial to the training of the models at the early stage. There have been multiple designs of curricula for NMT in the recent literature (Zhou et al., 2020; Liu et al., 2020; Ruiter et al., 2020; Platanios et al., 2019; Wang et al., 2019a,b; Kumar et al., 2019; Zhang et al., 2018). All these methods sample complete sentence pairs for training from a selected subset, which expands as training progresses. We refer to them as "sentence-level" curricula.
However, such a sentence-level design is not necessarily effective for NMT when data is limited. In the early stage of training, the selected subset is usually limited to a small portion of the total training samples. In the low-resource setting, this subset contains even fewer samples. To better measure this effect, we use Figure 1 to show the diversity of the samples selected in the early training stage under low-resource and high-resource settings.¹
Specifically, we count the number of unique trigrams in the sentence pairs used for training up till a certain training iteration. We observe that the selected samples in the low-resource setting are less diverse than in the high-resource setting, especially in the early curriculum (i.e., up till 25% of total updates in the curriculum). Consequently, the sentence-level curriculum slows down the learning progress in the low-resource setting, although this is not an issue in the high-resource setting, as shown in Figure 2. This observation, that insufficient diversity in the low-resource setting can affect learning efficiency, motivates us to design a token-wise curriculum. During the curriculum, the model learns to predict only a short sub-sequence from each target sentence at the early stage of training, and the sub-sequence is then gradually expanded as training progresses. Compared with the sentence-level curriculum, which only focuses on "easy" sentence pairs, the token-wise curriculum can purposely create much more partial and diverse samples to address the data insufficiency challenge.

¹The dataset we use in the low-resource setting is IWSLT14 De-En, and in the high-resource setting WMT16 En-De. We adopt Transformer-base (Vaswani et al., 2017) as the baseline model.
The next question is how to design an effective sub-sequence selection scheme, such that the difficulty of the selected sub-sequences follows an "easy-to-hard" schedule. Specifically, we consider the sub-sequence difficulty in the context of machine translation generation. In a left-to-right autoregressive generation setting, the generation of the next word depends on the previous generations to its left. In other words, wrong predictions in the early tokens affect the accuracy of the latter ones during inference. This results in prediction error accumulation (Zhang et al., 2019a), which indicates that the beginning tokens are easier to predict than the latter ones. Therefore, we design a scheduler that selects sub-sequences from the beginning part of target sentences and gradually expands them until the end of the sentences as training progresses.
Our experiments on several low-resource NMT datasets collected from IWSLT (Cettolo et al., 2015) show that the proposed curriculum outperforms existing baselines in both standard training and transfer learning settings. In addition, experiments on the high-resource dataset WMT'16 En-De (Bojar et al., 2016) show that the proposed curriculum benefits NMT model training not only by itself, but also in combination with existing sentence-level curricula. Finally, we show that the proposed token-wise curriculum generalizes to other sequence generation tasks: besides machine translation, it shows superior performance on language modeling. Our code is released at https://github.com/cliang1453/token-wise-curriculum-learning.

Background
• NMT models the conditional probability of a target sentence y = (y_1, ..., y_ℓ) given a source sentence x = (x_1, ..., x_m). The density function p(y|x) is parameterized by an encoder-decoder neural network, which generates the target sentence in an auto-regressive manner (Sutskever et al., 2014; Bahdanau et al., 2014). Specifically, the model predicts the probability of the t-th token by p(y_t | y_{<t}, x; θ), where θ denotes the model parameters. It is trained by minimizing the sum of the cross-entropy loss over all sentence pairs, where the loss on each sentence pair (x, y) is

−∑_{t=1}^{ℓ} log p(y_t | y_{<t}, x; θ).  (1)

• Curriculum Learning in NMT. Research on curriculum learning in NMT mainly falls into two categories: measurement of sample difficulty and design of the curriculum schedule (Kocmi and Bojar, 2017). In the first category, some works measure sample difficulty with features derived from lexical statistics, e.g., sentence length and word rarity (Zhang et al., 2018; Platanios et al., 2019). Others measure difficulty with features derived from pre-trained models, e.g., Liu et al. (2020) use the norm of word embeddings from a pre-trained model.
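The per-sentence loss in Equation (1) can be sketched in plain Python, where `log_probs` is a hypothetical stand-in for the decoder's per-token log probabilities under teacher forcing (in practice these come from the model's output):

```python
import math

def sentence_nll(log_probs):
    """Teacher-forcing cross-entropy loss of one sentence pair (x, y):
    the negative sum of log p(y_t | y_<t, x; theta) over target tokens.

    `log_probs[t-1]` holds log p(y_t | y_<t, x; theta) for t = 1, ..., ell,
    as produced by the decoder given the gold prefix (illustrative input).
    """
    return -sum(log_probs)

# A two-token sentence whose gold tokens get probability 0.5 and 0.25:
loss = sentence_nll([math.log(0.5), math.log(0.25)])
```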

Method
We introduce a token-wise curriculum learning approach for NMT.

Hard Curriculum
We propose a token-wise curriculum based on sub-sequence selection. At each training iteration, the model is trained to predict a sub-sequence of each target sentence. We remark that such prediction is conditioned on the complete source sentence. Specifically, at the i-th iteration, the model is updated based on the loss computed on this sub-sequence only, where S_i denotes the set of token indexes in the selected sub-sequence at the i-th iteration.
Left-to-Right Selection Scheme. The selection scheme of S_i can be described as follows: • At the beginning of the curriculum (0-th iteration), we select the sub-sequence from the beginning of each target sentence: S_0 = [1, 2, 3, ..., ⌈λ_0 ℓ⌉], where ℓ is the length of the target sequence and λ_0 ∈ (0, 1) is the initial sub-sequence percentage with respect to the total length.
• We then gradually expand each sub-sequence throughout the curriculum until it covers the whole sentence: S_i = [1, 2, 3, ..., ℓ_i], with the length ℓ_i determined by a linear function,

ℓ_i = min(⌈(λ_0 + (1 − λ_0) · i/I) ℓ⌉, ℓ),

where I is the number of updates in the curriculum. With this selection scheme, the model can be updated by an SGD-type algorithm (e.g., ADAM (Kingma and Ba, 2014)) with stochastic gradients computed based on S_i. The loss of each sentence at the i-th iteration is computed by:

L_i(x, y; θ) = −∑_{t ∈ S_i} log p(y_t | y_{<t}, x; θ).

After the curriculum ends (i.e., i ≥ I), the model continues with standard training.
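A minimal sketch of the left-to-right selection scheme and the resulting hard-curriculum loss; the function names and the ceiling rounding are our own assumptions, as the text only specifies that the prefix grows linearly from λ_0 · ℓ to ℓ over I updates:

```python
import math

def subseq_length(ell, i, I, lam0=0.1):
    """Length of the selected prefix S_i = [1, ..., ell_i] at iteration i.

    Grows linearly from lam0 * ell at i = 0 to the full length ell at
    i >= I (the ceiling rounding is an assumption, not stated in the text).
    """
    frac = min(lam0 + (1.0 - lam0) * i / I, 1.0)
    return max(1, math.ceil(frac * ell))

def hard_curriculum_loss(log_probs, i, I, lam0=0.1):
    """Cross-entropy restricted to the prefix S_i; `log_probs[t-1]` is a
    hypothetical stand-in for log p(y_t | y_<t, x; theta)."""
    ell_i = subseq_length(len(log_probs), i, I, lam0)
    return -sum(log_probs[:ell_i])
```

After i ≥ I the prefix covers the whole sentence, so the loss reduces to the standard cross-entropy in Equation (1).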

Soft Curriculum
In the hard curriculum, the model is trained without regard to the loss on {1, ..., ℓ} \ S_i. However, those tokens may play important roles in sequence generation. For example, the model needs to learn how to end a sentence by predicting the EOS token. Therefore, we propose an alternative method, the soft curriculum, where we place geometrically decaying weights on the loss of all tokens. By allowing weights on all tokens, the model is able to learn from more diverse samples. By placing decaying weights on the end tokens that are difficult to learn, we maintain sample easiness. At the i-th iteration, the re-weighted loss on each target sentence with length ℓ is computed by:

L_i(x, y; θ) = −∑_{t=1}^{ℓ} γ_i^{α_{t,ℓ}} log p(y_t | y_{<t}, x; θ),

where γ_i and α_{t,ℓ} are two factors controlling the rate of geometric decay. The decaying factor γ_i at the i-th iteration is computed by:

γ_i = γ_0 + (1 − γ_0) · min(i/I, 1),

where 0 ≤ γ_0 < 1 is a hyperparameter controlling the scale of the initial weights placed on all tokens. The weights gradually increase as γ_i grows from γ_0 to 1 throughout the curriculum. We remark that while γ_i grows linearly, the weights change at different rates for tokens at different positions: we design the power factor α_{t,ℓ} uniquely for the t-th token in a target sentence of length ℓ:

α_{t,ℓ} = α_0 · t/ℓ.

As illustrated in Figure 3, the weights on tokens gradually decay from the beginning to the end of the sentence, where α_0 > 0 is a hyperparameter controlling this decaying rate.
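The decaying weights can be sketched as follows, assuming the power factor takes the form α_{t,ℓ} = α_0 · t/ℓ (our reconstruction from the surrounding description; names and defaults are illustrative):

```python
def soft_weights(ell, i, I, gamma0=0.7, alpha0=25.0):
    """Per-token loss weights gamma_i ** alpha_{t, ell} for t = 1, ..., ell.

    gamma_i grows linearly from gamma0 to 1 over the I curriculum updates:
    early in training the weights decay sharply toward the end of the
    sentence, and after the curriculum ends all weights become 1.
    """
    gamma_i = gamma0 + (1.0 - gamma0) * min(i / I, 1.0)
    return [gamma_i ** (alpha0 * t / ell) for t in range(1, ell + 1)]

def soft_curriculum_loss(log_probs, i, I, gamma0=0.7, alpha0=25.0):
    """Re-weighted cross-entropy over all target tokens."""
    w = soft_weights(len(log_probs), i, I, gamma0, alpha0)
    return -sum(wt * lp for wt, lp in zip(w, log_probs))
```

Note that, unlike the hard curriculum, every token (including EOS) always receives a non-zero weight.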

NMT Experiments
To demonstrate the effectiveness of our token-wise curriculum design, we present experimental results on NMT tasks.

Data Preparation & Preprocessing
We evaluate our method on widely used language pairs in both low-resource and high-resource settings. Low-resource datasets include English-to-Vietnamese (En-Vi) from IWSLT15 (Cettolo et al., 2015) 3 , German-to-English (De-En) from IWSLT14, French-to-English (Fr-En) from IWSLT16, and Romanian-to-English (Ro-En) from WMT16 (Bojar et al., 2016). The high-resource dataset is English-to-German (En-De) from WMT16. For Fr-En, we use a BPE trained with 32K merge operations and use sentences of up to 200 subword symbols, following Platanios et al. (2019). For Ro-En, we use a BPE trained with 40K merge operations and use sentences of up to 50 subword symbols, as in Gu et al. (2018a,b). We preprocess the De-En data following fairseq 5 . We adopt the preprocessed En-De data released by Google 6 .

Baselines
We compare our token-wise curriculum learning method (TC) with several state-of-the-art sentence-level methods (SC): • SC r-sqrt measures sample difficulty by word rarity, and uses a square-root function as the curriculum schedule (Platanios et al., 2019).
• SC norm measures sample difficulty based on norm of sentence embedding, and uses a threshold function of encoder word embedding norm as curriculum schedule (Liu et al., 2020).
• SC unc measures sample difficulty by data uncertainty, and uses a threshold function of model uncertainty as curriculum schedule (Zhou et al., 2020).

Model & Training
For both SC and TC experiments, we adopt the Transformer-base NMT model (Vaswani et al., 2017) as the baseline model. All implementations are based on the fairseq code-base, and all experiments run on 32G NVIDIA V100 GPUs. For all datasets, we use ADAM (Kingma and Ba, 2014) as the optimizer with β = (0.9, 0.98). For the low-resource datasets, we use a learning rate of 5 × 10 −4 with 8000 warmup updates. For the high-resource dataset En-De, we use a learning rate of 1 × 10 −3 with 4000 warmup updates. See training details in A.2.
We fix λ 0 = 0.1 in TC hard experiments, and fix γ 0 = 0.7 and α 0 = 25 in TC soft experiments. We set the curriculum length I = 8000, 7000, 6500, 1100, 5400 for De-En, En-Vi, Fr-En, Ro-En and En-De. See hyperparameter selection details in A.3. For SC methods, we follow the recommended settings in the original papers with special configurations for the low-resource setting. See training details in A.4.
Consistent with previous practice, we use tokenized BLEU (Papineni et al., 2002) as the evaluation metric. For all low-resource datasets, we report the BLEU score of the best checkpoint using a beam size of 5 and a length penalty of 1. For the high-resource dataset En-De, we report the average of the last 10 checkpoints with a beam size of 10 and a length penalty of 0.6.

Main Results
We compare TC hard and TC soft with the baseline, and report the best testing BLEU among 5 runs with different random seeds in Table 2 and Table 3 (See A.5 for validation scores). As can be seen, TC hard outperforms the baseline in all cases, and TC soft further improves upon TC hard . This implies that TC soft finds a better balance between sample diversity and sample easiness than TC hard .
In the low-resource setting (Table 2), all TC methods uniformly outperform SC methods by around 0.5 BLEU, while SC methods can sometimes hurt the baseline (e.g., on En-Vi and De-En, the two smallest datasets). Under the high-resource setting (Table 3), all TC methods outperform the baseline by around 0.4 BLEU. However, we observe that the TC methods show no clear improvement upon the SC methods. We conjecture the reason is that the samples selected in the high-resource setting are sufficiently diverse for the SC methods to perform well. To further improve performance in the high-resource setting, we combine the TC and SC methods, expecting that this combination selects samples that are not only diverse, but also easier than those selected by any single method. In particular, we first use SC to select sentences, and then use TC to select beginning sub-sequences from these sentences. As can be seen, both TC soft + SC norm and TC soft + SC unc further improve upon the best single method.
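The two-stage combination can be sketched as follows, where `sc_keep` and `prefix_len` are hypothetical callbacks standing in for the sentence-level difficulty filter and the token-wise schedule, respectively:

```python
def combined_curriculum(pairs, i, I, sc_keep, prefix_len):
    """Two-stage selection at iteration i: SC first filters whole sentence
    pairs, then TC truncates each surviving target to its scheduled prefix.

    `pairs` is a list of (source_tokens, target_tokens); `sc_keep(pair, i, I)`
    and `prefix_len(ell, i, I)` are illustrative placeholders for the
    sentence-level criterion and the left-to-right expansion schedule.
    """
    selected = [p for p in pairs if sc_keep(p, i, I)]
    return [(src, tgt[:prefix_len(len(tgt), i, I)]) for src, tgt in selected]
```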
As TC soft uniformly outperforms TC hard, we use TC soft in the following experiments unless stated otherwise.
[Table 3 fragment: SC norm (Liu et al., 2020) 28.51; SC unc (Zhou et al., 2020) 28.55.]

Transfer Learning with Curriculum
We show that our curriculum can be further combined with transfer learning to improve NMT performance. Instead of training from scratch, transfer learning fine-tunes a pre-trained model on the limited parallel data. Specifically, we consider the following transfer learning settings: • Domain Transfer Learning. We consider transferring from a high-resource domain to a low-resource domain. Specifically, we fine-tune the Transformer-big NMT model pre-trained on the News domain (WMT) 7 on the TED domain (IWSLT). Table 4 shows that using our curriculum improves the domain transfer performance.
• Pre-trained Multilingual Language Model Fine-tuning. We also consider the case of transferring from high-resource monolingual data to low-resource parallel data. Specifically, we initialize an NMT model from XLM (Lample and Conneau, 2019), a multilingual language model pre-trained on extensive monolingual En and De data 8 . Then we fine-tune the NMT model on the En-De TED data. Table 4 shows that using our curriculum improves the fine-tuning performance of the pre-trained multilingual language model.

Curriculum under Extremely Low-Resource Setting
We further show that our curriculum can improve NMT performance in both standard training (Table 5) and transfer learning (Table 4) under extremely low-resource settings. In standard training, the model is trained with a randomly sampled 50%/10% subset of all sentence pairs. In transfer learning, the model is fine-tuned with a randomly sampled 50%/10%/1% subset of all target-domain sentence pairs. Table 4 and Table 5 show that TC soft attains a steady performance gain as the training/fine-tuning data becomes more scarce, e.g., the domain transfer learning improvement is over 3 BLEU under the 1% data setting.

Analysis
We first verify our assumption that the error accumulation makes beginning tokens easier to predict. Then we analyze whether our curriculum improves the sample diversity in the early stage of training, and further improves optimization.
• Error Accumulation. To verify that error accumulation is a prevailing phenomenon in machine translation generation, we conduct beam search with a beam size of 5 using the Transformer-base NMT model on the De-En dataset. We compute the error rate of the predictions at different relative positions within sentences. Specifically, we compute the prediction error rate within 10 evenly divided partitions of each sentence and average over all sentences. Since we choose λ_0 invariant to sentence length, we further verify that error accumulation exists for sentences of different lengths. As shown in Figure 4, sentences of different lengths all suffer from error accumulation. We further verify that the token-wise curriculum can effectively alleviate error accumulation in A.6.
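The position-wise error rate measurement can be sketched as follows (a simplification: tokens are compared position by position, whereas a full evaluation would align beam-search hypotheses against references):

```python
def positional_error_rates(hyp_ref_pairs, n_bins=10):
    """Average token error rate within n_bins evenly divided relative
    positions, aggregated over all (hypothesis, reference) sentence pairs."""
    totals = [0] * n_bins
    errors = [0] * n_bins
    for hyp, ref in hyp_ref_pairs:
        ell = min(len(hyp), len(ref))
        for t in range(ell):
            b = min(t * n_bins // ell, n_bins - 1)  # relative-position bin
            totals[b] += 1
            if hyp[t] != ref[t]:
                errors[b] += 1
    return [e / n if n else 0.0 for e, n in zip(errors, totals)]
```

Under error accumulation, the rates returned for the later bins should be noticeably higher than for the earlier ones.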
• Sample Diversity. We compare the diversity of samples selected/created by SC unc and TC hard 9 on the low-resource dataset De-En. Recall that the samples selected by the sentence-level curriculum are a subset of all sentence pairs. In contrast, the samples created by the token-wise curriculum consist of all source sentences as well as the selected sub-sequences from all target sentences. Up till a fixed training iteration (e.g., 25% of the curriculum length), we measure diversity by the number of unique trigrams summed over all selected/created sentences/sub-sequences. As shown in Figure 5, the samples created by TC hard are more diverse at the early stage of training.

8 The pre-trained XLM model and the script for fine-tuning translation models are publicly available at github.com/facebookresearch/XLM.
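The diversity measure described above can be sketched as follows (tokenization and accumulation details are our own assumptions):

```python
def unique_trigrams(token_sequences):
    """Number of distinct token trigrams across all selected/created
    sentences or sub-sequences seen up to the current training iteration."""
    seen = set()
    for toks in token_sequences:
        for j in range(len(toks) - 2):
            seen.add(tuple(toks[j:j + 3]))
    return len(seen)
```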

[Figure 5: Number of unique trigrams (×10^6) for SC unc, TC hard, and w/o curriculum, at 25% and 50% of the curriculum length.]

• Learning Curve. Figure 6 shows the validation performance of SC unc and TC soft in both early and later stages of training. As can be seen, the BLEU score under the token-wise curriculum increases faster and more smoothly than under the sentence-level curriculum in the early stage. Furthermore, the model trained with the token-wise curriculum achieves better generalization performance, while the model trained with the sentence-level curriculum shows signs of over-fitting. We conjecture that such improvement comes from training with more diverse samples in the early stage.

Ablation Study
We ablate some crucial designs of our curriculum, including the design of selecting consecutive tokens and the design of expanding the sub-sequence from the beginning to the end of the sentence (referred to as left-to-right). We only consider TC hard in this section, as TC soft is the improved version of TC hard.
• Consecutive Tokens vs. Random Tokens. Here we study whether selecting consecutive tokens is necessary for the token-wise curriculum. It is natural to compare it with a random sub-sequence curriculum, which uniformly samples the same number of tokens as TC hard (but not necessarily consecutive ones). Table 6 shows that the random curriculum does not improve upon the baseline.
• Teacher-forcing Loss vs. Beam Search Error Rate. The left-to-right design is motivated by the error accumulation of beam search decoding, which makes the latter tokens more difficult to predict (Figure 4). Recall that, unlike beam search, NMT models are trained in a teacher-forcing way. Therefore, we would like to know whether the teacher-forcing training loss can characterize sample difficulty. To answer this question, we select the sub-sequence with the lowest average teacher-forcing loss and the same number of tokens as TC hard. Table 7 shows that the selection based on teacher-forcing loss outperforms the baseline, but does not work as well as the left-to-right design.
• Relative Positions of Sub-sequences. We further explore whether choosing a sub-sequence expansion direction misaligned with the left-to-right decoding order can also improve performance. We select the initial sub-sequence not from the beginning of each sentence, but instead in the range of 30−40%, 60−70%, and 90−100% of each sentence, with the same expansion schedule. For example, by selecting the initial range as 90−100%, the expansion is in the right-to-left direction. By selecting the initial range as 30−40%, the sub-sequence expands bidirectionally. Table 8 shows that by choosing an initial sub-sequence other than the beginning of the sentence, the performance drops even below the baseline in some cases. This implies that the left-to-right design is essential as it aligns with the decoding order.
Table 8: BLEU score (test) comparison to selecting the initial sub-sequence from different relative positions.

Language Modeling Experiments
To demonstrate that our token-wise curriculum can be applied to other sequence generation tasks, we present experimental results on language modeling.

Data Preparation & Processing
We conduct experiments on two popular word-level datasets: a preprocessed version of the Penn Treebank (PTB) (Mikolov et al., 2010) and WikiText-2 (WT2) (Merity et al., 2016). PTB contains about 929K training words, 73K validation words, and 82K test words. All capitalization, numbers, and punctuation are removed as part of the preprocessing. WT2 consists of around 2M words extracted from Wikipedia articles. The dataset is lightly processed, with capitalization, punctuation, and numbers retained. It is tokenized and preprocessed using the Moses tokenizer (Koehn et al., 2007), with a vocabulary size of over 30K.

Model & Training
We use AWD-LSTM (Merity et al., 2017), a 3-layer standard LSTM equipped with drop-connection (Wan et al., 2013) on the recurrent weights. The model is trained with non-monotonically triggered averaged stochastic gradient descent (NT-ASGD), a variant of ASGD (Polyak and Juditsky, 1992). We follow the training settings from Merity et al. (2017) 10 and report performance in perplexity under static evaluation. We fix λ_0, γ_0 and α_0 the same as in Section 4.3. The curriculum length I is set to 2100 and 4200 for PTB and WT2, respectively. See hyperparameter selection details in A.3. Table 9 shows the language modeling performance on PTB and WT2. As can be seen, both TC hard and TC soft outperform the baseline by over 0.5 points of perplexity. Furthermore, TC soft slightly outperforms TC hard on both datasets.
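Perplexity is the exponential of the average per-token negative log-likelihood; a minimal sketch, where `token_log_probs` is an illustrative stand-in for the evaluation loop's outputs:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.
    `token_log_probs` holds the model's natural-log probabilities of the
    gold tokens over the evaluation corpus (hypothetical input)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```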

Conclusion
In this paper, we introduce a novel token-wise curriculum learning method for NMT. We show its superiority in the low-resource setting and its benefit in the high-resource setting. Different from existing works, we only consider a vanilla curriculum schedule, where the created sub-sequences expand linearly, as our focus is to validate the idea of the token-wise design. We leave other potential scheduler designs, e.g., training an adaptive scheduler (Liu et al., 2020; Xu et al., 2020), as future work.

Broader Impact
This paper proposes a new curriculum learning method for training neural language models on sequence-to-sequence prediction tasks. Our designed curriculum neither introduces any social/ethical bias into the model nor amplifies any bias in the data. We do not foresee any direct social consequences or ethical issues.

A.2 TC Methods Implementation Details
• NMT Standard Training Experiments. For all language pairs, we use an inverse square root schedule with a weight decay rate of 1 × 10 −4 , a label smoothing ratio of 0.1, and a dropout rate of 0.3. For the low-resource setting, we share the decoder and encoder output embeddings. We use dynamic batching with a maximum of 4096 tokens per GPU and train on 1 GPU for 60 epochs.
For high resource setting, we share all the embeddings. We use dynamic batching with 14336 tokens per GPU, accumulate gradient for 7 steps, and train for 150K updates.
For extremely low-resource setting, we follow the same hyperparameter setting for each language pair.
• NMT Transfer Learning Experiments. We use 2 NVIDIA V100 GPUs for each experiment. We choose finetuning learning rate from {1×10 −5 , 5× 10 −5 , 5×10 −4 }. We use dynamic batch size, which is limited by GPU memory (16G per GPU). We report the evaluation results by conducting beam search with beam size of 5 and length penalty of 0.6 for datasets in WMT, and beam size of 5 and length penalty of 2 for datasets in IWSLT.
• Selection of I. In the NMT experiments, we determine I in a similar manner as Platanios et al. (2019): we train the baseline model and compute the number of training steps it takes to reach approximately 70% of its final BLEU score. We then set I to this value. In the language modeling experiments, I is determined similarly: we train the baseline model and set I to the number of training steps it takes to reach approximately 30% × initial perplexity + 70% × final perplexity.
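The NMT-side heuristic can be sketched as follows (names are illustrative; `bleu_by_step` maps a checkpoint's update count to its validation BLEU):

```python
def pick_curriculum_length(bleu_by_step, frac=0.7):
    """Set I to the first training step at which the baseline reaches
    `frac` (70% in the text) of its final BLEU score."""
    final_bleu = bleu_by_step[max(bleu_by_step)]
    target = frac * final_bleu
    for step in sorted(bleu_by_step):
        if bleu_by_step[step] >= target:
            return step
    return max(bleu_by_step)
```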

A.4 SC Methods Implementation Details
• SC r-sqrt . We adopt the SR curriculum and c sqrt competence function setting in Platanios et al. (2019). We set initial competence c 0 to 0.01 for all language pairs and set curriculum length T in the same manner following Platanios et al. (2019). In addition, we adopt the special learning rate schedule as proposed in Equation (9) in the original paper, where we set T warmup = 8000.
• SC norm . Following Liu et al. (2020), we extract a word2vec embedding E w2v from a pre-trained Transformer-base model and measure sample difficulty on the source sentence embeddings mapped through E w2v . The initial competence c 0 is set to 0.01 for all language pairs. For En-De, λ m and λ w are set to 2.5 and 0.5, following Liu et al. (2020). For the low-resource datasets, we tune and choose λ m and λ w as 0.25 and 0.05, respectively. • SC unc . We follow Zhou et al. (2020) to use 4 baby steps. We measure sample difficulty using the "joint" source and target uncertainty. It is obtained by evaluating the perplexity measured by a pre-trained 4-gram KENLM model (Heafield, 2011).

A.5 Validation Performance
• NMT Standard Training Experiments. Table 10 shows the validation performance on the low-resource datasets. We report the BLEU score of the best checkpoint. Table 11 shows the validation performance on the high-resource dataset En-De. We report the BLEU score of the averaged last 10 checkpoints. We use the same beam search setting as in Section 4.3.

A.6 Additional Analysis
We further verify that the token-wise curriculum can effectively alleviate error accumulation. We conduct beam search with a beam size of 5 on the Transformer-base model trained on De-En, and compute the averaged prediction error rate over the last 20% of tokens in each sentence. As shown in Table 13, the model trained with TC hard suffers less from error accumulation than the model trained with SC unc. In addition, TC hard particularly alleviates error accumulation in long sentences (i.e., sentences with length larger than 100).