Training Dynamics for Curriculum Learning: A Study on Monolingual and Cross-lingual NLU

Curriculum Learning (CL) is a technique of training models by presenting examples in order of increasing difficulty, with the aim of accelerating convergence and improving generalisability. Current approaches for Natural Language Understanding (NLU) tasks use CL to improve in-distribution performance, often via heuristic or task-agnostic difficulty measures. In this work, instead, we employ CL for NLU by taking advantage of training dynamics as difficulty metrics, i.e., statistics that measure the behavior of the model at hand on specific task-data instances during training, and propose modifications of existing CL schedulers based on these statistics. Differently from existing works, we focus on evaluating models on in-distribution (ID), out-of-distribution (OOD) as well as zero-shot (ZS) cross-lingual transfer datasets. We show across several NLU tasks that CL with training dynamics can result in better performance, mostly in zero-shot cross-lingual transfer and OOD settings, with improvements of up to 8.5% in certain cases. Overall, experiments indicate that training dynamics can lead to better performing models with smoother training compared to other difficulty metrics, while being 20% faster on average. In addition, through analysis we shed light on the correlations of task-specific versus task-agnostic metrics.

Contrasting with such approaches that take advantage of additional training data is Curriculum Learning (Bengio et al., 2009, CL), a technique that aims to train models using a specific ordering of the original training examples. This ordering typically follows an increasing difficulty trend, where easy examples are fed to the model first, moving towards harder instances. The intuition behind CL stems from human learning, as humans focus on simpler concepts before learning more complex ones, a procedure called shaping (Krueger and Dayan, 2009). Although curricula have primarily been used for Computer Vision (Hacohen and Weinshall, 2019; Wu et al., 2021) and Machine Translation (Zhang et al., 2019a; Platanios et al., 2019), only a handful of approaches incorporate CL into Natural Language Understanding tasks (Sachan and Xing, 2016; Tay et al., 2019; Lalor and Yu, 2020; Xu et al., 2020a).
Typically, CL requires a measure of difficulty for each example in the training set. Existing methods using CL in NLU tasks rely on heuristics such as sentence length, word rarity or depth of the dependency tree (Platanios et al., 2019; Tay et al., 2019), metrics based on item response theory (Lalor and Yu, 2020), or task-agnostic model metrics such as perplexity (Zhou et al., 2020). Such metrics have been employed to improve in-distribution performance on NLU or Machine Translation; however, their effect is still underexplored in other settings.
In this study, instead, we propose to adopt training dynamics (Swayamdipta et al., 2020, TD) as difficulty measures for CL and fine-tune models with curricula on downstream tasks. TD were recently proposed as a set of statistics collected during the course of a model's training to automatically evaluate dataset quality by identifying annotation artifacts. These statistics offer a 3-dimensional view of a model's uncertainty towards each training example, classifying examples into distinct areas: easy, ambiguous and hard for a model to learn.
We test a series of easy-to-hard curricula using TD, namely TD-CL, with existing schedulers as well as novel modifications of those, and experiment with other task-specific and task-agnostic metrics. We report performance and training times in three settings: in-distribution (ID), out-of-distribution (OOD) and zero-shot (ZS) transfer to languages other than English. To the best of our knowledge, no prior work on NLU considers the impact of CL on all these settings. To consolidate our findings, we evaluate models on different classification tasks, including Natural Language Inference, Paraphrase Identification, Commonsense Causal Reasoning and Document Classification.
Our findings suggest that TD-CL provides better zero-shot cross-lingual transfer, up to 1.2% over prior work, and can gain an average speedup of 20%, up to 51% in certain cases. In ID settings CL has minimal to no impact, while in OOD settings models trained with TD-CL can boost performance by up to 8.5% on a different domain. Finally, TD provide more stable training compared to another task-specific metric (Cross-Review). On the other hand, heuristics can also offer improvements, especially when testing on a completely different domain.

Related Work
Curriculum Learning was initially mentioned in the work of Elman (1993), who demonstrated the importance of feeding neural networks with small/easy inputs at the early stages of training. The concept was later formalised by Bengio et al. (2009), where training in an easy-to-hard ordering was shown to result in faster convergence and improved performance. In general, Curriculum Learning requires a difficulty metric (also known as the scoring function) used to rank training instances, and a scheduler (known as the pacing function) that decides when and how new examples, of different difficulty, should be introduced to the model.
Example Difficulty was initially expressed via model loss in self-paced learning (Kumar et al., 2010; Jiang et al., 2015), increasing the contribution of harder training instances over time. This setting posed a challenge due to the fast-changing pace of the loss during training, so later approaches used human-intuitive difficulty metrics, such as sentence length or the existence of rare words (Platanios et al., 2019), to pre-compute the difficulties of training instances. However, as such metrics do not express difficulty from the model's perspective, model-based metrics have been proposed over the years, such as measuring the loss difference between two checkpoints (Xu et al., 2020b) or model translation variability (Wang et al., 2019b; Wan et al., 2020). In our curricula we use training dynamics to measure example difficulty, i.e. metrics that consider difficulty from the perspective of a model towards a certain task. Example difficulty can also be estimated either in a static (offline) or dynamic (online) manner: in the latter, training instances are evaluated and re-ordered at certain times during training, while in the former the difficulty of each example remains the same throughout. In our experiments we adopt the former setting and consider static example difficulties.
Transfer Teacher CL is a particular family of such approaches that use an external model (namely the teacher) to measure the difficulty of training examples. Notable works incorporate a simpler model as the teacher (Zhang et al., 2018) or a larger-sized model (Hacohen and Weinshall, 2019), as well as similar-sized learners trained on different subsets of the training data. These methods have considered as example difficulty either the teacher model's perplexity (Zhou et al., 2020), the norm of the teacher model's word embeddings (Liu et al., 2020), the teacher's performance on a certain task (Xu et al., 2020a), or simply regard difficulty as a latent variable in a teacher model (Lalor and Yu, 2020). In the same vein, we also incorporate Transfer Teacher CL via teacher and student models of the same size and type. Differently, however, we take into account the behavior of the teacher during the course of its training to measure example difficulty, instead of considering its performance at the end of training or analysing internal embeddings.
Moving on to Schedulers, these can be divided into discrete and continuous. Discrete schedulers, often referred to as bucketing, group training instances that share similar difficulties into distinct sets. Different configurations include accumulating buckets over time (Cirik et al., 2016), sampling a subset of data from each bucket (Xu et al., 2020a; Kocmi and Bojar, 2017) or more sophisticated sampling strategies (Zhang et al., 2018). In cases where the number of buckets cannot be obtained in a straightforward manner, methods either heuristically split examples (Zhang et al., 2018), adopt uniform splits (Xu et al., 2020a) or employ schedulers based on a continuous function. A characteristic approach is that of Platanios et al. (2019), where at each training step a monotonically increasing function chooses the amount of training data the model has access to, sorted by increasing difficulty. As we describe later on, we experiment with two established schedulers and propose modifications of those based on training dynamics.

Methodology
Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ be a set of training data instances. A curriculum comprises two main elements: the difficulty metric, responsible for mapping a training example to a score that represents a notion of difficulty, and the scheduler, which determines the type and number of available instances at each training step $t$. We experiment with three difficulty metrics derived from training dynamics and four schedulers: two of the schedulers are new contributions and the remaining ones come from previous work.

Difficulty Metrics
As aforementioned, we use training dynamics (Swayamdipta et al., 2020), i.e. statistics originally introduced to analyse dataset quality, as difficulty metrics. The suitability of such statistics as difficulty measures for CL rests on three core aspects. Firstly, training dynamics are straightforward: they can be easily obtained by training a single model on the target dataset and keeping statistics about its predictions on the training set. Secondly, training dynamics correlate well with model uncertainty and follow a similar trend to human (dis)agreement in terms of data annotation, essentially combining the view of both worlds. Finally, training dynamics manifest a clear pattern of separating instances into distinct areas (easy, ambiguous and hard examples for a model to learn), something that aligns well with the ideas behind Curriculum Learning. The difficulty of an example $(x_i, y_i)$ can be determined by a function $f$, where example $i$ is considered more difficult than example $j$ if $f(x_i, y_i) > f(x_j, y_j)$. We list three difficulty metrics that use statistics collected during the course of a model's training, as follows:
CONFIDENCE (CONF) of an example $x_i$ is the average probability assigned to the gold label $y_i$ by a model with parameters $\theta$ across a number of epochs $E$. It is a continuous metric with higher values corresponding to easier examples.
VARIABILITY (VAR) of an example $x_i$ is the standard deviation of the probabilities assigned to the gold label $y_i$ across $E$ epochs. It is a continuous metric with higher values indicating greater uncertainty for a training example.
CORRECTNESS (CORR) of an example $x_i$ is the number of epochs (out of $E$) in which the model assigns the correct label to the example. It is a discrete metric taking values in $\{0, \dots, E\}$, with higher values corresponding to easier examples.
Confidence and correctness are the primary metrics used in our curricula, since low and high values correspond to hard and easy examples respectively. Variability, on the other hand, is used as an auxiliary metric, since only high scores clearly indicate uncertain examples, while low scores offer no important information on their own.
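Concretely, writing $p_{\theta^{(e)}}(y \mid x_i)$ for the distribution of the model with parameters $\theta^{(e)}$ at the end of epoch $e$, the three metrics can be expressed as follows (a reconstruction from the definitions above, consistent with Swayamdipta et al. (2020); treating correctness as a count rather than a fraction is our reading, matching the $E+1$ buckets of the annealing scheduler below):

$$\text{conf}(x_i, y_i) = \frac{1}{E} \sum_{e=1}^{E} p_{\theta^{(e)}}\left(y_i \mid x_i\right) \qquad (1)$$

$$\text{corr}(x_i, y_i) = \sum_{e=1}^{E} \mathbb{1}\left[\arg\max_{y} \, p_{\theta^{(e)}}(y \mid x_i) = y_i\right] \qquad (2)$$

$$\text{var}(x_i, y_i) = \sqrt{\frac{1}{E} \sum_{e=1}^{E} \left(p_{\theta^{(e)}}(y_i \mid x_i) - \text{conf}(x_i, y_i)\right)^{2}} \qquad (3)$$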

Schedulers
We consider both discrete and continuous schedulers. Each scheduler is paired with the metric that suits it best, i.e. the discrete correctness metric is combined with annealing and the continuous confidence metric is combined with competence.
The ANNEALING (CORR_ANNEAL) scheduler, proposed by Xu et al. (2020a), assumes that training data are split into buckets $\{d_1 \subset D, \dots, d_K \subset D\}$ with possibly different sizes $|d_i|$. In particular, we group examples into the same bucket if they have the same correctness score (see Equation (2)). In total, this results in $E + 1$ buckets, which are sorted in order of increasing difficulty. Training starts with the easiest bucket. We then move on to the next bucket, also randomly selecting $1/(E + 1)$ examples from each previous bucket. Following prior work, we train on each bucket for one epoch.
The COMPETENCE (CONF_COMP) scheduler was originally proposed by Platanios et al. (2019). Here, we sort examples based on the confidence metric (see Equation (1)) and use a monotonically increasing function to obtain the percentage of available training data at each step. The model can use only the top K most confident examples, as instructed by this function. A mini-batch is then sampled uniformly from the available examples.
In addition to those schedulers, we introduce the following modifications that take advantage of the variability metric. CORRECTNESS + VARIABILITY ANNEALING (CORR+VAR_ANNEAL) is a modification of the Annealing scheduler and CONFIDENCE + VARIABILITY COMPETENCE (CONF+VAR_COMP) is a modification of the Competence scheduler. In both variations, instead of sampling uniformly across available examples, we give higher probability to instances with high variability scores (Equation (3)), essentially using two metrics instead of one. We assume that, since the model is more uncertain about such examples, further training on them can be beneficial. For all curricula, after the model has finished the curriculum stage, we resume training as normal, i.e. by random sampling of training instances.
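As an illustration, the following is a minimal sketch of one sampling step of the CONF+VAR_COMP scheduler, assuming the square-root competence function of Platanios et al. (2019); all function and variable names are our own placeholders, not the paper's code.

```python
import numpy as np

def competence(t, T, c0=0.01):
    """Square-root competence function of Platanios et al. (2019): fraction of
    the difficulty-sorted training data available at step t out of T steps."""
    return min(1.0, np.sqrt(t * (1.0 - c0 ** 2) / T + c0 ** 2))

def sample_batch(order, variability, t, T, batch_size, rng):
    """One CONF+VAR_COMP sampling step (a sketch, not the authors' exact code).

    `order` holds training indices sorted from easy to hard by confidence;
    instead of sampling uniformly from the available prefix, examples are
    drawn with probability proportional to their variability (Equation (3))."""
    k = max(batch_size, int(competence(t, T) * len(order)))
    available = order[:k]
    probs = variability[available] + 1e-8   # avoid zero-probability examples
    probs = probs / probs.sum()
    return rng.choice(available, size=batch_size, replace=False, p=probs)

rng = np.random.default_rng(0)
order = np.argsort(-rng.random(1000))       # stand-in for a confidence ranking
var = rng.random(1000)                      # stand-in for variability scores
batch = sample_batch(order, var, t=50, T=500, batch_size=16, rng=rng)
```

CORR+VAR_ANNEAL modifies the bucket sampling in the same way: within the available buckets, examples are drawn proportionally to variability instead of uniformly.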

Transfer Teacher Curriculum Learning
In order to train a model (the student) with training dynamics provided by another model (the teacher), the latter must first be fine-tuned on the target dataset. In other words, the proposed metrics are used in a transfer teacher CL setting (Matiisen et al., 2019). The two-step procedure that we follow in this study is depicted in Figure 1. Initially, a model (the teacher) is fine-tuned on a target dataset and training dynamics are collected during the course of training. The collected dynamics are then converted into difficulty metrics, following Equations (1)-(3). In the second stage, the difficulty metrics and the original training data are fed into a scheduler that re-orders the examples according to their difficulty (in our case from easy to hard) and feeds them to another model (the student) that is of the same size and type as the teacher.
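Stage 1 can be implemented as a lightweight pass at the end of every teacher fine-tuning epoch. The sketch below assumes HuggingFace-style classification outputs, a data loader whose batches carry a stable example index `idx`, a `gold_probs` mapping from example index to its list of per-epoch probabilities, and a `correct` per-example counter; all of these names are illustrative assumptions.

```python
import torch

@torch.no_grad()
def record_epoch_dynamics(model, loader, gold_probs, correct, device="cuda"):
    """Stage 1 (sketch): after each teacher fine-tuning epoch, store the
    gold-label probability and a correctness flag for every training example."""
    model.eval()
    for batch in loader:  # batches must carry a stable example index `idx`
        labels = batch["labels"].to(device)
        logits = model(input_ids=batch["input_ids"].to(device),
                       attention_mask=batch["attention_mask"].to(device)).logits
        probs = torch.softmax(logits, dim=-1)
        p_gold = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
        hits = probs.argmax(dim=-1).eq(labels)
        for i, p, h in zip(batch["idx"].tolist(),
                           p_gold.tolist(), hits.tolist()):
            gold_probs[i].append(p)
            correct[i] += int(h)
```

After $E$ epochs, `gold_probs[i]` and `correct[i]` directly yield confidence (mean), variability (standard deviation) and correctness (count) for example $i$, as in Equations (1)-(3).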

Datasets
In this work we focus on four NLU classification tasks: Natural Language Inference, where given a premise and a hypothesis the task is to identify whether the hypothesis entails, contradicts or is neutral with respect to the premise; Paraphrase Identification, where the task is to determine whether two sentences are paraphrases of one another; Commonsense Causal Reasoning, where given a premise, a question and a set of choices the task is to find the correct answer to the question based on the premise; and Document Classification, where each document should be assigned the correct category.
We aim for a comparison across three settings: in-distribution (ID), out-of-distribution (OOD) and zero-shot (ZS), hence we select datasets that cover all these settings where possible. We use a small subset from the GLUE benchmark (Wang et al., 2018) (RTE, QNLI and MNLI) and four cross-lingual datasets: XNLI (Conneau et al., 2018), PAWS-X (Yang et al., 2019) for paraphrase detection, XCOPA (Ponti et al., 2020) for commonsense causal reasoning and MLDoc (Schwenk and Li, 2018) for document classification. HANS was selected for RTE because both are binary classification datasets and there is no need to convert the "neutral" label to "non-contradiction" for evaluation. CSQA was chosen as OOD for commonsense reasoning since it targets knowledge related to factual and physical commonsense, in contrast to SIQA or CosmosQA that focus on commonsense required in social/everyday situations. Finally, for MLDoc we could not find a dataset with the same classification categories to serve as OOD. The corresponding statistics are shown in Table 1 and more details can be found in Appendix A.

Evaluation Settings
We use the pre-trained versions of base RoBERTa (Liu et al., 2019) and XLM-R (Conneau et al., 2020).For all datasets, we report accuracy as the main evaluation metric across three random seeds, on the following settings.
In-Distribution (ID) and Out-Of-Distribution (OOD): We first fine-tune a monolingual (English) model on a target dataset and evaluate on its ID test set, e.g. train RoBERTa on MNLI and evaluate on the MNLI-M validation set. We also evaluate it on an OOD dataset, e.g. NLI Diagnostics.
Zero-Shot (ZS): Constitutes the zero-shot cross-lingual transfer setting. In particular, we train a multilingual model on the same dataset, e.g. XLM-RoBERTa on (English-only) MNLI, and evaluate it on a zero-shot cross-lingual set, e.g. the XNLI test set (Hu et al., 2020).
In all experiments, we select the best checkpoint based on English validation set performance. When reporting significance tests, we use the Approximate Randomization test with all seeds (Noreen, 1989). More details about experimental settings can be found in Appendix C.3.
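For reference, a minimal sketch of the Approximate Randomization test for a single paired accuracy comparison (our own implementation outline, not the authors' code):

```python
import numpy as np

def approx_randomization(correct_a, correct_b, trials=10_000, seed=0):
    """Approximate Randomization test (Noreen, 1989) for paired predictions.
    correct_a / correct_b: boolean arrays of per-example correctness
    for the two systems on the same test set."""
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    hits = 0
    for _ in range(trials):
        swap = rng.random(len(a)) < 0.5      # randomly swap the systems' outputs
        a_perm = np.where(swap, b, a)
        b_perm = np.where(swap, a, b)
        if abs(a_perm.mean() - b_perm.mean()) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)         # p-value with add-one smoothing
```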

Model Comparisons
We primarily compare all curricula that use training dynamics against each other and against a baseline (Random) that does not employ any curriculum, using standard random-order training. We also consider as another baseline the transfer-teacher curriculum proposed by Xu et al. (2020a), namely Cross-Review (indicated as CR_ANNEAL in the next sections). This curriculum uses the annealing scheduler but does not employ training dynamics as difficulty scores. Instead, the method splits the training set into $N$ subsets and a model is trained on each subset, containing $1/N$ of the training set. The resulting models are then used to evaluate all examples belonging to different subsets. The difficulty score of an example is the number of its correct classifications across teachers. We split each training set into 10 subsets for all datasets except MLDoc, where we split into 5, and RTE, where we split into 3, following the original paper. The difference between the Cross-Review and the correctness metrics is that Cross-Review uses $N$ fully trained teacher models on subsets of the data, while the latter uses $E$ epochs of a single model trained on the entire training set. Finally, when comparing CR_ANNEAL with our training-dynamics-based curricula, via discrete and continuous schedulers, we ensure that all of them are trained for an equal amount of time, in order to have a one-to-one comparison. To enforce this, after the end of the curriculum phase, training continues as normal for the remaining steps (if any) by randomly sampling examples; otherwise training stops early.

Table 3: Zero-shot performance of the curricula as the average accuracy across languages (mean and standard deviation over 3 random seeds) with XLM-R. We also report prior work results for reference: PAWS-X (Chi et al., 2022), XNLI (Chi et al., 2022)

Performance & Training Time
Results in Tables 2 and 3 show performance and training time for the various datasets. The reported numbers (Time) are calculated as the ratio $S^{*}_{TD} / S_{CR_{ANNEAL}}$, i.e. the number of steps the Training Dynamics curriculum needs to reach its best performance ($S^{*}_{TD}$) divided by the number of steps the Cross-Review method needs to reach its best performance ($S_{CR_{ANNEAL}}$). We focus the comparison between curricula on the trade-off between performance and time (a lower score indicates a larger speedup). In parentheses we report the minimum time obtained across 3 random seeds.
Table 2 shows accuracies for RoBERTa models when tested on ID/OOD data. We observe that CL yields minimal improvements in ID and, in particular, through statistical testing we find that the increases over the Random baseline or Cross-Review are not significant for any of the datasets, except for CONF+VAR_COMP versus Random on MNLI-M. Nevertheless, when tested on OOD data the performance improvement is larger. CONF+VAR_COMP achieves the best performance on TwitterPPDB (+9.15, significant at p < 0.01), CommonSenseQA (+1.23) and HANS (+0.71), while CORR+VAR_ANNEAL performs best for NLI Diagnostics (+0.58) and Adversarial SQuAD (+1.06, p < 0.01) over Random. We speculate that CONF+VAR_COMP is better on OOD thanks to its slow pacing and the more accurate difficulties of confidence. However, this comes at the cost of speedup, requiring either the same number of steps as CR_ANNEAL or a few more.
Investigating the cross-lingual transfer results in Table 3, we first observe that CL with XLM-R seems to have a larger impact in terms of performance. On XNLI there is a +0.73 point increase over Random (p < 0.01). The difference with CR is not significant, but TD achieved a 20% speedup on average. On XCOPA we observe a +1.06 point increase over the random baseline, requiring however more training time with the CORR+VAR_ANNEAL curriculum. It is worth noting that for XCOPA the competence-based curricula also offer better performance with less additional training time. As for the remaining datasets, CL is unable to achieve any performance improvement on MLDoc, while on PAWS-X CORR_ANNEAL improves by +0.2 points over Random and +0.35 over CR_ANNEAL, both statistically significant (p < 0.01), at the cost of no speedup. As another drawback, Cross-Review is generally more resource demanding, since it needs $N$ fully-trained teacher models instead of one.

Comparing Difficulties
We now present a comparison between task-agnostic (TA) and task-specific (TS) difficulty metrics. We re-implement three additional difficulty metrics proposed in prior work for Neural Machine Translation. The first two, introduced in Platanios et al. (2019), correspond to sentence length (LENGTH), computed as the number of words in each sentence, and word rarity (RARITY), computed as the negated sum of the log-frequencies of the words in a sentence, with frequencies computed over the training set. Finally, we experiment with perplexity (PPL) as the difficulty of a sentence (Zhou et al., 2020). We calculate sentence perplexity as the average perplexity of its subwords, masking one subword at a time and using the remaining context to predict it. Since we test on tasks with two-sentence input, we sum the PPL of the two sentences and consider the entire input for LENGTH and RARITY. The sketch below illustrates all three metrics.
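A minimal sketch of the three task-agnostic metrics; the toy corpus, whitespace tokenisation and the choice of `roberta-base` as the masked LM are our assumptions for illustration.

```python
import math
from collections import Counter

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def length(sentence):
    """LENGTH: number of (whitespace-separated) words in the sentence."""
    return len(sentence.split())

def rarity(sentence, freqs, total):
    """RARITY: negated sum of log relative frequencies (training-set counts);
    out-of-vocabulary words are skipped here rather than smoothed."""
    return -sum(math.log(freqs[w] / total)
                for w in sentence.split() if freqs[w] > 0)

tok = AutoTokenizer.from_pretrained("roberta-base")            # assumed MLM
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

@torch.no_grad()
def pseudo_ppl(sentence):
    """PPL: mask each subword in turn, average the surprisal, exponentiate."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for pos in range(1, len(ids) - 1):                          # skip <s>, </s>
        masked = ids.clone()
        masked[pos] = tok.mask_token_id
        logits = mlm(input_ids=masked.unsqueeze(0)).logits[0, pos]
        nlls.append(-torch.log_softmax(logits, dim=-1)[ids[pos]].item())
    return math.exp(sum(nlls) / len(nlls))

# Toy stand-in for the training sentences used to compute frequencies.
train_sentences = ["the cat sat on the mat", "a quick brown fox jumps"]
freqs = Counter(w for s in train_sentences for w in s.split())
total = sum(freqs.values())
s = "the quick cat"
print(length(s), rarity(s, freqs, total), pseudo_ppl(s))
```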
Table 4 shows the results of the comparison between metrics on the PAWS and MNLI datasets. Interestingly, we observe that TA metrics perform on par with TS metrics on ID data, worse on ZS data, and can perform quite well on OOD data. In particular, RARITY is the third best on Twitter and the second best on NLI Diagnostics. This can be explained by the very different language used on Twitter versus the Wikipedia text in the training corpus, as well as the human-created nature of the NLI Diagnostics data. PPL is the best performing metric on Twitter and third best on CSQA. We find a statistically significant improvement (p < 0.01) compared with CONF+VAR_COMP on the Twitter OOD test set. Masked word prediction of unknown words could be an informative signal for a very new domain. In the case of CSQA, LENGTH and RARITY perform much worse than other metrics, possibly because the total length of the question and answer is quite small (approximately 15 tokens on average).
Furthermore, we analyse the relation between different difficulty metrics by calculating the Spearman rank correlation between all possible combinations. As shown in Figure 2, we observe a very high correlation between confidence and correctness, as expected, but also a good correlation with Cross-Review, explaining their close performance. On the contrary, variability is negatively correlated with those metrics, as higher values indicate more uncertainty from the model towards an example. As such, combining these opposing metrics can offer more benefits than combining two already correlated ones. Interestingly, the task-agnostic metrics show almost no (or negative) correlation with the task-specific ones, indicating that the examples a model deems difficult when fine-tuned on a task are very different from those identified before fine-tuning or via heuristics. RARITY and LENGTH correlate highly, as longer sentences are more likely to contain rare words. Finally, PPL is inversely related to them, probably because longer sentences provide more context, making it easier for the model to predict the masked token. Overall, PPL has a slight positive relation with variability, since both measure model uncertainty, and high PPL of words may make the model fluctuate further between its predictions.
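The correlation analysis itself reduces to a few lines once the per-example scores are collected; the arrays below are random stand-ins for the precomputed metric values, aligned by training-set index.

```python
import itertools
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 1000
# Stand-in scores; in practice each array holds one metric's value per example.
metrics = {name: rng.random(n) for name in
           ["confidence", "correctness", "variability",
            "cross-review", "length", "rarity", "ppl"]}

for a, b in itertools.combinations(metrics, 2):
    rho, _ = spearmanr(metrics[a], metrics[b])
    print(f"{a:>12} vs {b:<12} Spearman rho = {rho:+.2f}")
```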

Learning Curves
In order to examine the behavior of the curricula during the course of training, we further plot the average per-language performance on the validation set as a function of the number of training steps, using XLM-R models on the datasets that improved (XNLI and XCOPA). In Figure 3 we draw the best performing curriculum (CONF+VAR_COMP), CR_ANNEAL and the Random baseline.
A first finding is that for CR_ANNEAL we observe a performance drop around 20K steps on XNLI. Further investigation revealed that the drop happens

Training with Limited Budget
Since training a teacher model adds overhead to the overall training process (training a teacher model plus a similar-sized student), we further conduct a minimal experiment on PAWS, where we collect training dynamics for a teacher XLM-R model for different numbers of epochs (stopping training early) and then train a student XLM-R model for longer (10 epochs). Results are reported in Table 5 for the best overall curriculum for this dataset, CORR+VAR_ANNEAL, as the average validation set performance across languages.
We observe that it is not necessary to collect training dynamics for a long period of training (e.g. 10 epochs) as even with much less training, for in-

Conclusion
We presented a set of experiments using training dynamics (Swayamdipta et al., 2020) as difficulty metrics for CL on several NLU tasks. Differently from existing works, we focused our evaluation on in-distribution, out-of-distribution and zero-shot cross-lingual transfer data, testing existing discrete and continuous schedulers as well as modifications of those in a transfer-teacher curriculum setting.
Our findings offer evidence that simply reordering the training examples in a meaningful way mostly has an impact on zero-shot cross-lingual transfer and OOD data, with no improvement on ID. Our proposed continuous scheduler with confidence and variability sampling provided a boost of up to 8.5% on a challenging OOD dataset over prior work. Comparing our proposed application of training dynamics to other transfer-teacher curriculum methods that use more than one teacher model, we observed greater speedups, improved performance and more stable training. In addition, we found that task-agnostic metrics do not perform better than task-specific ones on ID and ZS data but can offer good performance in OOD settings.
Overall, our experiments suggest that no curriculum outperforms the others by a large margin, which is consistent with the findings of Zhang et al. (2018), and that task-agnostic metrics should not be rejected when transferring to challenging new domains. However, we show that training dynamics are potentially better difficulty metrics for CL in both monolingual and multilingual models, even with a limited budget.
Although in this study we focused on using CL in a single language only (English), a reasonable extension is to consider training data from other languages as well and to investigate instance difficulties per language, or to follow efforts towards continual learning (Parisi et al., 2019). Finally, using TD in a dynamic rather than a static curriculum is another interesting direction that can potentially offer further training speedups as well as ways to improve model pre-training (Nagatsuka et al., 2021; Li et al., 2021).

Limitations
The presented work has certain limitations that we acknowledge in this section. Firstly, the experiments are limited to base-sized models, to enable us to conduct more experiments across multiple seeds. Validating that the same conclusions hold for large models is a promising direction. The work is also focused on an offline curriculum approach, where difficulty metrics are obtained via the teacher model before the student model training. This can indeed add an additional overhead to the overall process of collecting training dynamics. This limitation is partially addressed in Section 5.4, to reduce overhead. Nevertheless, converting this approach into a dynamic one could be beneficial. Finally, following the original training dynamics setting, the methods were mainly applied to classification datasets, since it is straightforward to use accuracy as a difficulty metric.

A Datasets

As the OOD test set for QNLI we use Adversarial SQuAD, where adversarial distractor sentences have been inserted into the SQuAD (Rajpurkar et al., 2016) validation set. The original dataset follows the SQuAD format; we thus automatically convert it to a sentence-level one with binary labels, similarly to QNLI. Finally, we use NLI Diagnostics (Wang et al., 2018) as the OOD test set for MNLI, a set of human-annotated examples that reveal model behavior on particular semantic phenomena.
PAWS-X (Yang et al., 2019) is the cross-lingual version of the English Paraphrase Adversaries from Word Scrambling dataset (Zhang et al., 2019b), containing paraphrase identification pairs from Wikipedia. It consists of human-translated pairs in 6 typologically distinct languages. The training set contains only English examples taken from the original PAWS dataset. As ID test set we use the test set of the original PAWS dataset. As OOD we use the TwitterPPDB dataset (Lan et al., 2017), following Desai and Durrett (2020).
XNLI is the cross-lingual NLI dataset (Conneau et al., 2018), an evaluation set created by extending the development and test sets of the MultiNLI dataset (Williams et al., 2018) and translating them into 14 languages. The training data constitutes the original MultiNLI English training set.
XCOPA is the Cross-lingual Choice of Plausible Alternatives (Ponti et al., 2020), a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages. The dataset consists of development and test examples for each language, which are translations of the English COPA (Roemmele et al., 2011) validation and test sets. Following Ponti et al. (2020), we use the Social IQA dataset (Sap et al., 2019) as training data (containing 3 possible choices) and the English COPA development set as validation data (containing 2 possible choices). For ID we report results on the SIQA test set (and validation set in Appendix D). For OOD, we consider the CommonSenseQA (CSQA) dataset (Talmor et al., 2019), which contains 5 possible choices.
MLDoc is a document classification dataset with 4 target categories: corporate/industrial, economics, government/social, and markets (Schwenk and Li, 2018). The dataset is an improved version of the Reuters benchmark (Klementiev et al., 2012), covering 7 languages, and comes with 4 different sets of English training data (1k, 2k, 5k, 10k). Here, we use the 10k set, following prior work (Keung et al., 2020).

B Analysing Data Maps
To better understand the reason for the reported CL benefits, we plot the data maps that result from training an XLM-R model on each dataset in Figure 4, with confidence on the y-axis, variability on the x-axis and correctness in the legend. As observed, the easiest datasets overall, i.e. PAWS-X (4b), MLDoc (4g) and QNLI (4h), result in quite crisp maps with very few hard-to-learn examples, while for XNLI (4d) and SIQA (4f) the data maps are very dense and the number of difficult examples is high. This can potentially explain why CL with XLM-R models was more beneficial on those datasets in terms of performance, confirming that CL can be used to better prepare a model for harder instances.
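A data map of this kind can be reproduced with a short scatter plot; the arrays below are random stand-ins for the per-example statistics collected in Stage 1, assuming $E = 10$ epochs.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
variability = rng.random(25_000)            # stand-in for collected statistics
confidence = rng.random(25_000)
correctness = rng.integers(0, 11, 25_000)   # counts over E = 10 epochs

sc = plt.scatter(variability, confidence, c=correctness,
                 s=4, cmap="coolwarm_r", alpha=0.5)
plt.xlabel("variability")
plt.ylabel("confidence")
plt.colorbar(sc, label="correctness")
plt.show()
```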

C.1 Hyper-parameter Settings
We use base models, XLM-R and RoBERTa with 270M and 125M parameters respectively, from the HuggingFace library (Wolf et al., 2020). We fix the maximum sequence length to 128 for all datasets except MLDoc, where we use 256. We did minimal learning rate tuning on each dataset's English validation set, searching among [7e-6, 1e-5, 2e-5, 3e-5] and choosing the best performing one, reported in Table 7. We clip gradients to 1.0 at each update and use the AdamW optimizer (Loshchilov and Hutter, 2017) without any warmup. All reported experiments use the same 3 random seeds and all models were trained on a single Nvidia V100 16GB GPU.
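The optimisation setup corresponds to a standard loop of the following shape (a sketch; `model` and `train_loader` are assumed to be defined as in the previous sections):

```python
import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)   # per-dataset lr from Table 7

for batch in train_loader:
    loss = model(**batch).loss
    loss.backward()
    # clip the gradient norm to 1.0 before the optimizer step, no warmup
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad()
```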
In terms of training time, Table 6 shows the training time required for each dataset with the above parameters.

C.2 Multiple Choice QA
We treat SIQA/XCOPA as a sentence-pair classification task and feed the model a (premise + question, choice) tuple, converting each cause into a "What was the cause?" question and each effect into a "What was the effect?" question, which is concatenated to the premise. Performance is reported over three random seeds.
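Concretely, the conversion can be sketched as follows (the field names follow the public XCOPA release and should be treated as assumptions):

```python
def copa_to_pairs(example):
    """Convert a COPA/XCOPA item into two (premise + question, choice) pairs,
    one per candidate answer; the model then scores each pair independently."""
    question = {"cause": "What was the cause?",
                "effect": "What was the effect?"}[example["question"]]
    first_sentence = f"{example['premise']} {question}"
    return [(first_sentence, example["choice1"]),
            (first_sentence, example["choice2"])]

item = {"premise": "The man broke his toe.", "question": "cause",
        "choice1": "He got a hole in his sock.",
        "choice2": "He dropped a hammer on his foot."}
print(copa_to_pairs(item))
```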

D Additional Results
In Table 8 we report test and validation set performance on the SIQA dataset. We provide the per-language performance results for the multilingual datasets in Tables 9-12. Finally, we report per-category performance for the NLI Diagnostics dataset in Table 13.

Figure 1: Transfer Teacher Curriculum Learning used in our study. A teacher model determines the difficulty of training examples by collecting training dynamics during fine-tuning (Stage 1). The collected dynamics are converted into difficulty metrics and are given to a student model via a scheduler (Stage 2).

Figure 2: Spearman rank correlation between difficulty metrics using RoBERTa-base. Observations are similar for XLM-RoBERTa-base.

Figure 3: Average validation set accuracy across languages as a function of training steps (in thousands) with XLM-R models. Results are reported over 3 random seeds.

Figure 4: Data map for the training set of each dataset. We plot a maximum of 25K examples for clarity.

Table 1: Dataset statistics. ZS, ID and OOD correspond to zero-shot cross-lingual transfer, in-distribution and out-of-distribution settings, respectively. ZS validation and test statistics are per language.

Table 2: Accuracy results of RoBERTa on in-distribution (ID) and out-of-distribution (OOD) data. Time corresponds to the ratio $S^{*}_{TD} / S_{CR_{ANNEAL}}$, where the numerator is the number of steps a curriculum with TD needs to reach the reported performance and the denominator is the number of steps the CR_ANNEAL baseline requires to reach its performance. Results are reported over 3 random seeds and in parentheses we include the minimum time required across seeds.

Table 6: Training time required for a full model training.

Table 13: NLI Diagnostics results, averaged across 3 seeds.