Data Selection Curriculum for Neural Machine Translation

Neural Machine Translation (NMT) models are typically trained on heterogeneous data that are concatenated and randomly shuffled. However, not all of the training data are equally useful to the model. Curriculum training aims to present the data to NMT models in a meaningful order. In this work, we introduce a two-stage curriculum training framework for NMT in which we fine-tune a base NMT model on subsets of data, selected by both deterministic scoring using pre-trained methods and online scoring that considers the prediction scores of the emerging NMT model. Through comprehensive experiments on six language pairs comprising low- and high-resource languages from WMT'21, we show that our curriculum strategies consistently deliver better quality (up to +2.2 BLEU improvement) and faster convergence (approximately 50% fewer updates).


Introduction
The notion of a curriculum comes from the human learning experience: we learn better and faster when learnable examples are presented in a meaningful sequence rather than in random order (Newport, 1990). In machine learning, curriculum training presents data samples to the learner in a meaningful order during training so as to impose structure on the learning task (Bengio et al., 2009).
In recent years, Neural Machine Translation (NMT) has shown impressive performance in high-resource settings (Hassan et al., 2018; Popel et al., 2020). Typically, the training data of NMT systems are a heterogeneous collection from different domains, sources, topics, styles, and modalities. The quality of the training data also varies considerably, as do their linguistic difficulty levels. The usual practice when training NMT systems is to concatenate all available data into a single pool and randomly sample training examples. However, not all of them may be useful: some examples may be redundant, and some data might even be noisy and detrimental to the final NMT system's performance (Khayrallah and Koehn, 2018). NMT systems therefore have the potential to benefit greatly from curriculum training in terms of both speed and quality.

* Work done while Tasnim was interning at Meta AI.
In this work, we propose a two-stage training framework for NMT, consisting of model warm-up and model fine-tuning, and apply the data-selection curriculum in the latter stage. We initially train a base NMT model in the warm-up stage on all available data. In the fine-tuning stage, we adapt the base model on selected subsets of the data. The subset selection is performed by considering data quality and/or usefulness at the current state of the model. We explore two sets of data-selection curriculum strategies: deterministic and online. The deterministic curriculum uses external measures, which require pretrained models, to select the data subset at the beginning of the fine-tuning stage, and then continues training on that subset. In contrast, the online curriculum dynamically selects a subset of the data for each epoch without requiring any external measure. Specifically, it leverages the prediction scores of the emerging NMT model, which are a training by-product.
For picking the data subset in the online curriculum, we investigate two data-selection window approaches: static and dynamic. Although the size of the data-selection window is constant throughout training in the static approach, the samples in the selected subset vary from epoch to epoch due to the change in their prediction scores. In contrast, in the dynamic approach we change the data-selection window size by either expanding or shrinking it.
Comprehensive experiments on six language pairs (12 translation directions) comprising low- and high-resource languages from WMT'21 (Akhbardeh et al., 2021) reveal that our curriculum strategies consistently demonstrate better performance than the baseline trained on all the data (up to +2.2 BLEU). We observe bigger gains on the high-resource pairs than on the low-resource ones. Interestingly, we find that the online curriculum approaches perform on par with the deterministic approaches while not using any external pretrained models. Our proposed curriculum training approaches not only exhibit better performance but also converge much faster, requiring approximately 50% fewer updates.

Proposed Framework
Let s and t denote the source and target languages, respectively, and let $D_g = \{(x_i, y_i)\}_{i=1}^{N}$ denote the general-domain parallel training data containing N sentence pairs, with $x_i$ and $y_i$ coming from languages s and t, respectively. Also, let $D_d \subseteq D_g$ be the in-domain parallel training data and M an NMT model that translates sentences from s to t. The overall training objective of the NMT model is to minimize the total loss over the training data:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log P_\theta(y_i \mid x_i),$$

where $P_\theta(y_i \mid x_i)$ is the sentence-level translation probability of the target sentence $y_i$ given the source sentence $x_i$, and $\theta$ denotes the parameters of M.
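As a minimal illustration of this objective, the sketch below computes the sentence-level log-probabilities and the total negative log-likelihood from per-token log-probabilities. The function names are ours, not from the paper.

```python
def sentence_log_prob(token_log_probs):
    """log P(y_i | x_i): sum of the target-side token log-probabilities."""
    return sum(token_log_probs)

def total_loss(corpus_token_log_probs):
    """Total training loss: negative log-likelihood summed over all pairs."""
    return -sum(sentence_log_prob(t) for t in corpus_token_log_probs)

# Two toy sentence pairs with per-token log-probabilities
corpus = [[-0.1, -0.2], [-0.3]]
loss = total_loss(corpus)
```

In practice these log-probabilities come directly from the decoder's softmax during the forward pass, so the loss is a by-product of ordinary training.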
We propose a two-stage training framework in which, in the model warm-up stage, we train M on the general-domain parallel data $D_g$ for K gradient updates; K is generally smaller than the total number of updates M requires for convergence. Then, in the model fine-tuning stage, we adapt M on selected subsets of the in-domain parallel data $D_d$. Based on the intuition that not all of the training data are useful or non-redundant, and that some samples might be irrelevant or even detrimental to the model, we hypothesize that there exists a $D_s \subset D_d$ such that fine-tuning on it makes M exhibit improved performance.
Our goal is to design a ranking of the training samples that will eventually help us extract $D_s$ from $D_d$. For this, we investigate two sets of data-selection curriculum strategies: deterministic and online. Both strategies require a measure of data quality and/or usefulness at the current state of the model to extract $D_s$. While the deterministic curriculum uses external measures that require pretrained models, the online curriculum leverages the prediction scores of the emerging NMT models.

Deterministic Curriculum
In this strategy, we select a $D_s \subset D_d$ initially and do not change it during the model fine-tuning stage. We first score each parallel sentence pair $(x_i, y_i) \in D_d$ using an external bitext scoring method. We experiment with three such scoring methods, described below.
• LASER This approach utilizes the Language-Agnostic SEntence Representations (LASER) toolkit (Artetxe and Schwenk, 2019), which provides multilingual sentence representations using an encoder-decoder architecture trained on a parallel corpus. We use the sentence representations to score the similarity of a parallel sentence pair using the Cross-Domain Similarity Local Scaling (CSLS) measure (Conneau et al., 2017).
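The CSLS margin score can be sketched as follows. This is a simplified illustration with hypothetical helper names, assuming sentence embeddings have already been computed; it is not the LASER toolkit's actual API.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def csls_score(src_vec, tgt_vec, src_space, tgt_space, k=2):
    """CSLS: 2*cos(src, tgt) minus each vector's mean cosine similarity to
    its k nearest neighbours in the other language's embedding space.
    The neighbour terms penalise 'hub' vectors that are close to everything."""
    r_src = np.mean(sorted((cosine(src_vec, v) for v in tgt_space), reverse=True)[:k])
    r_tgt = np.mean(sorted((cosine(tgt_vec, v) for v in src_space), reverse=True)[:k])
    return 2 * cosine(src_vec, tgt_vec) - r_src - r_tgt
```

Pairs with higher CSLS scores are treated as better-aligned bitext when ranking $D_d$.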
• Dual Conditional Cross-Entropy (DCCE) Junczys-Dowmunt (2018) proposed this method, which requires two inverse translation models trained on the same parallel corpus: a forward model (f) and a backward model (b). It then scores a sentence pair $(x_i, y_i)$ by the maximal symmetric agreement of the two models, which exploits the conditional cross-entropy (H):

$$\text{DCCE}(x_i, y_i) = \left|H_f(y_i \mid x_i) - H_b(x_i \mid y_i)\right| + \frac{1}{2}\left(H_f(y_i \mid x_i) + H_b(x_i \mid y_i)\right),$$

where $H_f(y_i \mid x_i)$ and $H_b(x_i \mid y_i)$ are the length-normalized conditional cross-entropies of the pair under the forward and backward models, respectively; lower scores indicate better pairs.

• Modified Moore-Lewis (MML) MML ranks the sentence pairs by domain relevance using cross-entropy difference scores (Moore and Lewis, 2010; Axelrod et al., 2011). For this, we need to train four language models (LMs): in-domain and general-domain LMs in both the source and target languages. We then find the MML score of a parallel sentence pair $(x_i, y_i)$ as follows:

$$\text{MML}(x_i, y_i) = \sum_{b \in \{s, t\}} \left(H_{in}^{b} - H_{gen}^{b}\right),$$

where $H_{C}^{b}$ is the cross-entropy of the corresponding side of the pair under the LM trained on corpus domain C for bitext side b. Here, $b \in \{s, t\}$ refers to the bitext side and $C \in \{in, gen\}$ refers to the corpus domain.
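Under the definitions above, both scorers reduce to simple arithmetic on cross-entropies. The sketch below, our own illustrative code rather than the released implementations, computes the DCCE score from token log-probabilities of the two inverse models and the MML score from four LM cross-entropies.

```python
def cond_cross_entropy(token_log_probs):
    """Length-normalised conditional cross-entropy from token log-probs."""
    return -sum(token_log_probs) / len(token_log_probs)

def dcce_score(fwd_token_lp, bwd_token_lp):
    """Dual conditional cross-entropy (Junczys-Dowmunt, 2018):
    disagreement |H_f - H_b| plus the mean cross-entropy; lower is better."""
    h_f = cond_cross_entropy(fwd_token_lp)  # H_f(y|x), forward model
    h_b = cond_cross_entropy(bwd_token_lp)  # H_b(x|y), backward model
    return abs(h_f - h_b) + 0.5 * (h_f + h_b)

def mml_score(h_in_src, h_gen_src, h_in_tgt, h_gen_tgt):
    """Modified Moore-Lewis: sum of in-domain minus general-domain LM
    cross-entropy differences on both sides; lower = more in-domain-like."""
    return (h_in_src - h_gen_src) + (h_in_tgt - h_gen_tgt)
```

Either score then induces a ranking over $D_d$, from which the top portion is taken as $D_s$.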

Online Curriculum
Unlike the deterministic curriculum, in this strategy the selected data subset $D_s$ changes dynamically in each epoch of the model fine-tuning stage through instantaneous feedback from the current NMT model. Specifically, in each epoch, we rank $(x_i, y_i) \in D_d$ by leveraging the prediction scores from the emerging NMT model, which assigns a probability to each token in the target sentence $y_i$. We then take the average of the token-level log probabilities to obtain the sentence-level score $P_\theta(y_i \mid x_i)$, which is regarded as the prediction score for the sentence pair $(x_i, y_i)$. Formally,

$$P_\theta(y_i \mid x_i) = \frac{1}{|y_i|} \sum_{j=1}^{|y_i|} \log P_\theta\!\left(y_i^{j} \mid y_i^{<j}, x_i\right).$$

This prediction score indicates the confidence of the emerging NMT model in generating the target sentence $y_i$ from the source sentence $x_i$. Intuitively, if the model can predict the target sentence of a training sample $(x_i, y_i)$ with high confidence, the sample is too easy for the model and might not contain useful information to improve the NMT model further at that state. Algorithm 2 presents the pseudo-code of our online data-selection curriculum strategy. After the model warm-up stage, we fine-tune M for n_epochs on a data subset $D_s$ which is selected in every epoch based on the emerging NMT model's confidence. Specifically, at the beginning of each epoch in the model fine-tuning stage, we find the prediction score $P_\theta(y_i \mid x_i)$ of each sample $(x_i, y_i) \in D_d$. We then rank $D_d$ by these scores and select $D_s \subset D_d$ by picking a data-selection window in the ranked data. Finally, we fine-tune M on $D_s$ for that epoch. We present a conceptual demonstration of our online curriculum strategy in Figure 1. For picking the data-selection window in the ranked $D_d$, we investigate two methods: static and dynamic. To change the data-selection window size, we use a linear scheduler, which can be regarded as a function $\lambda(t)$ mapping the current training epoch t to a scalar value; this value is the data-selection window size at epoch t. Formally,

$$\lambda_{expand}(t) = \lambda_{init} + l_{inc} \cdot t, \qquad \lambda_{shrink}(t) = \lambda_{init} - l_{dec} \cdot t,$$

where $\lambda_{init}$ is the initial window size, which is smaller for expansion and larger for shrinking, and $l_{inc}$, $l_{dec}$ are the hyperparameters of the respective schedulers.
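One epoch of this selection step can be sketched as follows. This is illustrative code with hypothetical names; the paper's actual implementation sits inside the Fairseq training loop.

```python
def prediction_score(token_log_probs):
    """Average token log-probability = sentence-level prediction score."""
    return sum(token_log_probs) / len(token_log_probs)

def select_window(ids_and_scores, lo_frac, hi_frac):
    """Rank pairs by prediction score (ascending) and keep the slice between
    lo_frac and hi_frac, discarding hard/noisy and too-easy extremes."""
    ranked = sorted(ids_and_scores, key=lambda t: t[1])
    n = len(ranked)
    return [i for i, _ in ranked[round(lo_frac * n):round(hi_frac * n)]]

# Static window: keep the middle 40% (discard bottom 30% and top 30%)
scores = [(i, i / 10) for i in range(10)]   # toy (pair_id, score) list
subset = select_window(scores, 0.3, 0.7)    # the middle four pair ids
```

Because the scores are recomputed by the current model each epoch, the pairs falling inside the window change even when the window itself is fixed.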

Experimental Setup
Datasets We conduct experiments on six language pairs: three high-resource, English (En) to/from German (De), Hungarian (Hu), and Estonian (Et); and three low-resource, English (En) to/from Hausa (Ha), Tamil (Ta), and Malay (Ms). We use the datasets provided in WMT 2021: De and Ha are from the News shared task, while the remaining four pairs are from the Large-Scale Multilingual MT shared task. For En↔De, we use newstest2019 as the validation set and report test results on newstest2020. For En↔Ha, we randomly split the provided dev set into validation and test sets. For the other language pairs, we use the official evaluation data (dev and devtest) as validation and test sets. Table 1 presents the dataset statistics after cleaning and deduplication. For the high-resource language pairs, we treat formal-text parallel corpora as in-domain ($D_d \subset D_g$), while for the low-resource pairs we do not differentiate between general-domain and in-domain corpora ($D_d := D_g$). Table 2 shows the in-domain corpora sources for the high-resource language pairs.

Model Settings We use the Transformer (Vaswani et al., 2017) implementation in Fairseq (Ott et al., 2019); details of our model architecture settings are given in Appendix A. We use the sentencepiece library to learn joint Byte-Pair Encoding (BPE) vocabularies of size 32,000 and 16,000 for En↔De and En↔Ha, respectively. For the other language pairs, we use the official sentencepiece model provided in the Large-Scale Multilingual MT shared task. We filter out parallel data longer than 250 tokens during training. All experiments are evaluated using SacreBLEU (Post, 2018).
For LM training in the modified Moore-Lewis method (§2.1), we use the implementation in Fairseq. For in-domain LM training, we use 5M sentences from newscrawl, while we combine 10M commoncrawl sentences with newscrawl, totaling 15M sentences, to train the general-domain LM.
Baselines We compare our methods with the converged model, a standard NMT model trained on all the general-domain data ($D_g$) until convergence. Additionally, we compare both the deterministic and online curriculum approaches with the traditional fine-tuning approach, where we fine-tune the base model from the warm-up stage on all the in-domain training data ($D_d$) until convergence.

Results
The main results for the low- and high-resource languages are shown in Tables 3 and 4, respectively. For low-resource languages, we train the warm-up stage models for 20K updates, while the converged models are trained for approximately 50K updates. For high-resource languages, we train for 50K and 100K updates for the warm-up and converged models, respectively. In traditional fine-tuning (the Traditional Ft. row in the tables), we use all the available in-domain data ($D_d$) in each fine-tuning epoch. On the other hand, for both the deterministic and online curricula, we use at most 40% of the available in-domain data ($D_s \subset D_d$) in each fine-tuning epoch.
Comparing the performance of traditional fine-tuning with the converged model on low-resource languages (Table 3), we see that the two perform on par. This is not surprising, as both approaches use all the data ($D_g$) throughout training (for low-resource languages, $D_d := D_g$). The only difference between the two is that while the converged model continues training the base model from the warm-up stage, the traditional fine-tuning approach resets the base model's meta-parameters (e.g., learning rate, LR scheduler, data loader, optimizer) and then continues training.
For high-resource languages (Table 4), in traditional fine-tuning we fine-tune the base model only on the in-domain training data ($D_d \subset D_g$), while the converged model continues training the base model on all the general-domain data ($D_g$). Here, traditional fine-tuning performs better than the converged model on En-De (+0.4) and En-Et (+0.9) but performs worse on the other four directions by 0.7 BLEU on average.
In the following, we discuss the performance of our data-selection curriculum approaches:

Performance of Deterministic Curricula
First, we consider the performance of the deterministic curriculum approaches on low-resource languages. From Table 3, we see that fine-tuning the base model on the data subset ($D_s$) selected by LASER outperforms the baseline (converged model) on five out of six translation tasks, with a +2.2 BLEU gain on Ha-En. For the other two scoring methods, dual conditional cross-entropy (DCCE) and modified Moore-Lewis (MML), we also see better or similar performance on 5/6 translation tasks. Compared to traditional fine-tuning, the deterministic approaches perform better on most of the tasks: on average, +0.5, +0.4, and +0.2 BLEU gains for LASER, DCCE, and MML, respectively.
In Table 4, we see a similar trend of the deterministic curricula outperforming the converged model on high-resource languages. Specifically, fine-tuning on the data subset selected using the scores of either LASER or DCCE performs better on four out of six translation tasks, while the MML-based method achieves better performance on three tasks. The margins of improvement for the high-resource languages are larger than for the low-resource languages: +1.4, +0.9, and +0.7 BLEU gains on average for DCCE, LASER, and MML, respectively, over the baseline. Compared with traditional fine-tuning, the deterministic curriculum approaches perform better on most of the tasks: on average, +1.2, +0.8, and +0.4 BLEU better for DCCE, LASER, and MML, respectively.
To examine the better performance of the deterministic curriculum approaches more closely, we fine-tune the base model from the warm-up stage on different percentages of the ranked parallel data selected by the bitext scoring methods. Figure 2 shows the results. We observe that there exist multiple subsets of the data ($D_s \subset D_d$) such that fine-tuning the base model on them yields better performance than both the converged model and traditional fine-tuning. For De-En, traditional fine-tuning (on 100% of the data) reduces the BLEU score by 0.3 relative to the base model, while fine-tuning on most of the subsets selected by the deterministic curricula leads to improved performance. For Hu-En, traditional fine-tuning diminishes the performance of the base model by 0.5 BLEU. Unlike De-En, here we could not find a subset selected by the deterministic curricula for which fine-tuning improves the performance of the base model.

Performance of Online Curricula
Our online curriculum approaches perform on par with the deterministic curricula for both low- and high-resource languages, as shown in Tables 3 and 4, respectively. Unlike the deterministic approaches, here we leverage the emerging model's prediction scores without using any external pretrained scoring methods. In our static window approach, we discard the top 30% and bottom 30% of sentence pairs from the ranked $D_d$ and fine-tune the base model from the warm-up stage on the remaining 40% of the data ($D_s$). The selected data in $D_s$ vary dynamically from epoch to epoch due to the change in the prediction scores of the emerging NMT models (Figure 4). From the results (Tables 3 and 4), we observe that data selection with the static window method outperforms the converged model on ten out of twelve translation tasks, with BLEU scores comparable to the deterministic curriculum approaches.
In our dynamic window approach, we either expand or shrink the window size, where the selected window is confined to the range of 30% to 70% of the ranked $D_d$, i.e., $D_s$ can be at most 40% of $D_d$. In window expansion, we start $D_s$ at 10% of $D_d$ and linearly increase it to 40% in subsequent epochs, while in the window shrink method we start $D_s$ at 40% and linearly decrease it to 10% of $D_d$. With dynamic window expansion, we achieve slightly better performance (up to +0.5 BLEU) on ten out of twelve translation tasks compared to the static window method. On the other hand, the dynamic window shrink method performs slightly worse than window expansion on most of the translation tasks.
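The expansion and shrink schedules described here can be sketched with a single clipped linear function. This is our formulation; the 10% and 40% clipping bounds follow the setup above.

```python
def linear_window_size(t, lam_init, rate, lo=0.10, hi=0.40):
    """Data-selection window size (fraction of D_d) at epoch t, clipped to
    [lo, hi]. A positive rate expands the window; a negative rate shrinks it."""
    return min(hi, max(lo, lam_init + rate * t))

# Expansion: start at 10% of D_d and grow toward 40%;
# shrink: start at 40% and decay toward 10%.
expand = [linear_window_size(t, 0.10, 0.05) for t in range(8)]
shrink = [linear_window_size(t, 0.40, -0.05) for t in range(8)]
```

At each epoch, the scheduler's output is centered on the ranked data to pick the actual window of sentence pairs.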

Hybrid Curriculum
To benefit from both the deterministic and online curricula, we combine the two strategies. Specifically, we consider three subsets of data comprising the top 50% of $D_d$ ranked by each of the three bitext scoring methods in §2.1 and keep the common sentence pairs (the intersection of the three subsets). We then apply the static window data-selection curriculum to these sentence pairs, where we discard the top 10% and bottom 10% of pairs (ranked by the emerging model's prediction scores) and fine-tune the base model from the warm-up stage on the remaining bitext. Depending on the language pair, the data percentage for the fine-tuning stage ($D_s$) amounts to 15-20% of $D_d$. Despite this smaller fine-tuning subset, the hybrid curriculum strategy performs better on 10 out of 12 translation tasks compared to the baseline (Tables 3 and 4). Notably, for En-De and De-En, the hybrid curriculum attains +2.0 and +2.1 BLEU relative to the converged model.
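The hybrid pool construction amounts to intersecting the per-method top-50% sets before applying the online window. A minimal sketch with illustrative names, assuming for simplicity that a higher score is better under every method:

```python
def hybrid_pool(scores_by_method, keep_frac=0.5):
    """Keep the pairs that appear in the top keep_frac under *every*
    bitext scoring method (intersection of the per-method top lists)."""
    tops = []
    for scores in scores_by_method.values():   # {method: {pair_id: score}}
        ranked = sorted(scores, key=scores.get, reverse=True)
        tops.append(set(ranked[:round(keep_frac * len(ranked))]))
    return set.intersection(*tops)

pool = hybrid_pool({
    "laser": {0: 4.0, 1: 3.0, 2: 2.0, 3: 1.0},  # top half: pairs 0, 1
    "dcce":  {0: 1.0, 1: 4.0, 2: 3.0, 3: 2.0},  # top half: pairs 1, 2
})
```

The static prediction-score window is then applied within this intersected pool each epoch.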

Are All Data Useful Always?
Our proposed training framework uses all the data ($D_g$) in the model warm-up stage and then utilizes subsets of in-domain data ($D_s$) in the model fine-tuning stage. This resembles the formal education system, where students first learn general subjects with equal weight and later concentrate more on a selected subset of specialized subjects. The first stage teaches them the base knowledge, which is useful in the ensuing stage. We observe a similar phenomenon in our experiments.
From Table 5, we see that the performance of the NMT model using only the in-domain data is worse than using all the general-domain data (-8.1 BLEU on average). Moreover, our data-selection curriculum training framework outperforms the converged model, which uses all the data throughout training, on most of the translation tasks by a sizable margin. This indicates that not all data are useful all the time. Additionally, Figure 2 shows that in most scenarios, fine-tuning on selected data subsets $D_s$ outperforms traditional fine-tuning on all the data. This observation validates our intuition that some data samples are not only redundant but even detrimental to the NMT model's performance.

Do We Need the Two Stages?
For the online curricula, we leverage the model M for selecting $D_s$ based on the prediction scores, while in the deterministic curricula, we do not use the emerging model for selecting the data subset.
One might ask: do we need a base model in the deterministic curricula? Can we get rid of the warm-up stage? To answer these questions, we perform another set of experiments in which we train M from a randomly initialized state on the top p% of the selected data (p = {10, 40}) ranked by the three bitext scoring methods (§2.1) and compare the results with our two-stage training framework, in which we fine-tune the base model from the warm-up stage on the same data subset. From the results in Table 6, it is evident that our proposed training framework utilizing the warm-up stage outperforms the approach without a warm-up stage by a sizable margin on all the tasks.

Comparing Required Update Steps
Our proposed curriculum training approaches not only exhibit better performance but also converge faster than the baseline and the traditional fine-tuning method. In Figure 3, we plot the number of update steps required by each of the settings in Tables 3 and 4. On average, we need about 50% fewer updates compared to the converged model. For high-resource languages, we need far fewer updates in the model fine-tuning stage. For all the language pairs, the hybrid curriculum strategy requires the fewest updates, as the size of its selected subsets is much smaller than in the other approaches.

Performance on Noisy Data
We further evaluate our framework on noisy data. We randomly selected 10M bitext pairs from the En-De ParaCrawl corpus (Bañón et al., 2020). We keep the experimental settings similar to §4 and present the results in Table 7. Fine-tuning on the data subset ($D_s$) selected by the DCCE method outperforms the baseline (converged model) in both directions, with a +4.4 BLEU gain on De-En. All the other deterministic and online curriculum methods perform better than the converged model on the De-En direction by a sizable margin. Compared to traditional fine-tuning, all the curriculum methods perform better in both En to/from De.

Related Work
Curriculum Learning Inspired by human learners, Elman (1993) argues that optimization of neural network training can be accelerated by gradually increasing the difficulty of the concepts. Bengio et al. (2009) were the first to use the term "curriculum learning" for easy-to-hard training strategies in the context of machine learning. Using an easy-to-hard curriculum based on increasing vocabulary size in language model training, they achieved performance improvements. Recent work (Jiang et al., 2015; Hacohen and Weinshall, 2019; Zhou et al., 2020a) shows that manoeuvring the sequence of training data can improve both training efficiency and model accuracy. Several studies show the effectiveness of difficulty-based curriculum learning in a wide range of NLP tasks (Cirik et al., 2016; Liu et al., 2018), including task-specific word representation learning (Tsvetkov et al., 2016), natural language understanding (Sachan and Xing, 2016; Xu et al., 2020a), reading comprehension (Tay et al., 2019), and language modeling (Campos, 2021).
Curriculum Learning in NMT The difficulty-based curriculum in NMT was first explored by Kocmi and Bojar (2017) and further developed in subsequent work (Zhao et al., 2020). In contrast, our proposed two-stage training framework for NMT fine-tunes the base model from the warm-up stage on selected subsets of data. Our data-selection curriculum training framework is more realistic, resembling the formal education system as discussed in §5.2.
Self-paced Learning in NMT Here, the model itself measures the difficulty of the training samples to adjust the learning pace (Kumar et al., 2010). In their approach, Wan et al. (2020) first train the NMT model for M passes over the data and cache the translation probabilities to compute their variance; lower variance of a sample's translation probabilities reflects higher confidence. They then use the confidence scores as factors to weight the loss and control the model updates. For low-resource NMT, Xu et al. (2020b) use the decline of a sample's loss as the difficulty measure and train the model on easier samples (those with a higher loss drop). In our online curriculum, we leverage the prediction scores of the emerging model in the model fine-tuning stage. However, after ranking the samples by prediction score, we employ a variety of data-selection methods to select a better data subset (§2.2).

Conclusion
We have presented a two-stage training framework for NMT in which we apply a data-selection curriculum in the model fine-tuning stage. Our novel online curriculum strategy utilizes the emerging model's prediction scores to select a better data subset. Experiments on six low- and high-resource language pairs show the efficacy of our proposed framework. Our curriculum training approaches exhibit better performance and converge much faster, requiring fewer updates than the baselines.

Limitations
Despite its effectiveness, our proposed data-selection curriculum has some limitations:

• The deterministic curriculum (§2.1) uses external scoring methods that require pretrained models for selecting the data subset to be used in the model fine-tuning stage. Pretraining these scoring methods incurs additional training costs.

• Our online curriculum approach (§2.2), on the other hand, is free from such additional pretraining costs. Nevertheless, it requires an extra forward-propagation pass in each epoch of the model fine-tuning stage to find the prediction score of each sentence pair. One possible way to avoid this extra forward propagation is to cache the prediction scores while calculating the training losses in the previous epoch. However, there will be discrepancies in the samples' prediction scores, as the model producing them will have been updated in the meantime. Although we did not investigate this avenue in our paper, we believe it is an interesting direction for future research.

Appendix A Model Architecture Settings
For En↔Ha, we use a smaller Transformer architecture with five layers, while for the other language pairs we use a larger Transformer architecture with six encoder and six decoder layers. We present the number of attention heads, the embedding dimension, and the inner-layer dimension of both settings in Table 8.

B Variety of Data Samples in Static Data-selection Window
At the beginning of each epoch in the model fine-tuning stage of the static data-selection window approach (§2.2), we rank $D_d$ based on the prediction score of each sentence pair $(x_i, y_i)$. We then pick a fixed data-selection window (confined to a range of data percentages in the ranked $D_d$, e.g., 30% to 70%), discarding too-easy and too-hard/noisy samples. Even though in this approach the size of the selected data subset ($D_s$) remains the same throughout the model fine-tuning stage, the samples in $D_s$ change from epoch to epoch due to the change in their prediction scores under the current model. We present an illustrative example of this phenomenon in Figure 4.
In the current epoch of the fine-tuning stage (Figure 4(a)), samples 2, 3, 4, and 5 are selected to train the model while samples 1 and 6 are discarded: 1 is too hard/noisy and 6 is too easy for the current model. In the next epoch (Figure 4(b)), some samples might be selected again (samples 3 and 5), while some previously selected samples might receive lower prediction scores and be dropped because they are now too hard for the current model (sample 2). Conversely, some previously selected samples might receive higher prediction scores and be dropped because they are now too easy (sample 4). And some samples not selected in the previous epoch can now be selected (samples 1 and 6).

C Schedulers in Dynamic Data-selection Window
To change the data-selection window size in the dynamic approach, we use schedulers that control how the size of the window grows or shrinks in subsequent epochs (§2.2). Apart from the linear scheduler (Eq. 6), we also experiment with two other schedulers:

• Exponential Scheduler We find the data-selection window size (for window expansion and shrink) at epoch t using the following formula:

$$\lambda_{expand}(t) = \lambda_{init} \cdot e^{E_{inc} \cdot t}, \qquad \lambda_{shrink}(t) = \lambda_{init} \cdot e^{-E_{dec} \cdot t},$$

where $\lambda_{init}$ is the initial window size, which is smaller for expansion and larger for shrink, and $E_{inc}$, $E_{dec}$ are the hyperparameters of the exponential schedulers.
• Square-Root Scheduler We find the data-selection window size (for window expansion and shrink) at epoch t using the following formula:

$$\lambda_{expand}(t) = C_1 + S_{inc} \cdot \sqrt{t}, \qquad \lambda_{shrink}(t) = C_2 - S_{dec} \cdot \sqrt{t},$$

where $\lambda_{init}$ is the initial window size ($\lambda_{init} = C_1$ for expansion and $\lambda_{init} = C_2$ for shrink, so it is smaller for expansion and larger for shrink), and $C_1$, $C_2$, $S_{inc}$, $S_{dec}$ are the hyperparameters of the square-root schedulers.
In our initial experiments, we explored the three schedulers: linear, exponential, and square-root. We found that the linear scheduler performs better than the other schedulers. We present the results in Table 9.
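Assuming the functional forms above (our reconstruction of the schedulers, so treat the exact formulas as an assumption), the alternative schedulers can be sketched and compared side by side, with the window size clipped to the 10-40% range used in the experiments:

```python
import math

def exp_window(t, lam_init, rate):
    """Exponential schedule: multiplicative growth (rate > 0) or decay
    (rate < 0) of the window size, clipped to [0.10, 0.40]."""
    return min(0.40, max(0.10, lam_init * math.exp(rate * t)))

def sqrt_window(t, lam_init, rate):
    """Square-root schedule: fast change in early epochs, flattening later."""
    return min(0.40, max(0.10, lam_init + rate * math.sqrt(t)))

exp_sizes = [exp_window(t, 0.10, 0.2) for t in range(8)]    # expansion
sqrt_sizes = [sqrt_window(t, 0.10, 0.05) for t in range(8)]  # expansion
```

The linear scheduler (Eq. 6) grows at a constant rate, whereas these two front-load or back-load the change; per Table 9, the linear variant worked best.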

Figure 1 :
Figure 1: Conceptual demonstration of the online curriculum. We rank the sentence pairs based on the prediction scores of the emerging NMT model M and pick a data-selection window that discards easy and hard/noisy pairs.

Figure 2 :
Figure 2: Fine-tuned warm-up stage model on different sizes of ranked data (deterministic curricula).

Figure 3 :
Figure 3: Number of update steps required for each setting of Tables 3 and 4. We keep the batch size the same in each setting.

Figure 4 :
Figure 4: Illustrative example of how data samples vary in the static data-selection window approach of the online curriculum. Even though the size of the data-selection window is fixed throughout the model fine-tuning stage, the samples in the selected subsets vary from epoch to epoch due to the change in their prediction scores under the emerging model.

Table 1 :
Dataset statistics after cleaning and deduplication.

Table 2 :
In-domain corpora sources for high-resource language pairs.

Table 3 :
Main results for low-resource languages.Here, the data-percentage represents general-domain data (D g ) and we do not differentiate between general-domain and in-domain corpus (D d := D g ).Subscript values denote the BLEU score differences from the respective converged model.

Table 4 :
Main results for high-resource languages. Here, the data percentage represents only in-domain data ($D_d$) from Table 1, and 100%+OOD denotes all data ($D_g$). Subscript values denote the BLEU score differences from the respective converged model.

Table 5 :
Results for high-resource languages on all data ($D_g$) vs. in-domain data ($D_d$) when trained from a random state until convergence.

Table 6 :
Results for the two-stage training framework vs. training without the warm-up stage on the top 10% and 40% of selected data ranked by the three scoring methods (§2.1). Main values denote the results of our two-stage framework utilizing the warm-up stage, while subscript values represent results when the model is trained on the same data subset without the warm-up stage.

Table 7 :
Results for En↔De on the noisy ParaCrawl corpus of 10M bitext pairs. Here, the data percentage corresponds to all 10M bitext pairs ($D_g$) and $D_d := D_g$. Subscript values denote the BLEU score difference from the respective converged model.