Curating Datasets for Better Performance with Example Training Dynamics


Introduction
Breakthroughs in NLP are often the result of scaling up existing models in size and depth and, perhaps even more importantly, data (Hoffmann et al., 2022). To improve data quality, it has become common practice to train models on data that has been cleaned to some extent using simple heuristics (e.g., by removing non-language tokens), but not otherwise optimized for better performance. While some data-filtering approaches have been suggested to improve Out-Of-Distribution (OOD) generalization (Le Bras et al., 2020; Swayamdipta et al., 2020), they usually result in a decrease in In-Distribution (ID) performance.
We propose a method for curating datasets for better performance in both ID and OOD testing, enhancing data quality rather than quantity. Our method is orthogonal to model architecture and size, and as such can be used alongside any LM to further improve results on specific tasks.
To implement our method we use the concept of Example Training Dynamics (ETD; Swayamdipta et al., 2020), which builds on the idea that the training process of models sheds light on the relative importance of specific examples within the datasets used. Specifically, Swayamdipta et al. (2020) have shown that over several epochs of training, a model may predict some examples in a dataset less consistently than others, and that those "ambiguous" examples are important for OOD generalization.
We propose a new method for computing ETD, as well as a new paradigm for using them in training. We show that by computing ETD over separate training processes (rather than over consecutive epochs of the same training process), and using ETD to weigh the importance of each example in the dataset, we can train a DeBERTa model (He et al., 2020) on the weighted versions of several NLI and multiple-choice datasets, improving average performance by 0.35% for ID testing and by 0.95% for OOD.
We next demonstrate that ETD can be transferable across models, i.e., ETD computed from the training process of model M_1 can be used to weigh a dataset and train model M_2 on it with improved results, where M_2 differs from M_1 in initial weights, architectural details, and pre-training scheme. Though we only show this for a specific use-case, if transferability of ETD holds generally, it may allow us to create weighted versions of datasets once and use them for multiple training scenarios, including those of future models.
Finally, we propose Dynamic Training, a method for training a model while computing ETD and reweighing the training set between epochs. This method performs on par with our transferability method while requiring no additional compute beyond that of standard training, which makes it even more applicable under low computational budgets.
Our proposed method of computing ETD may allow practitioners to get more value out of their existing datasets, and paves the way towards similar methods for improving large-scale language model training. We publicly release our code, as well as the weighted versions of the datasets used in this work.

Computation of Example Training Dynamics
We introduce a new method for working with Example Training Dynamics, which has two separate components: the computation of ETD, and their application in training new models. Fig. 1 illustrates the full pipeline of the two components as described below. Let D = {(x_i, y_i)}, i = 1..N, be a dataset of N examples x_i with corresponding labels y_i, and let M be a model with initial parameters θ that defines a probability distribution over the possible labels for examples in D.
To compute the ETD of D with respect to M, we train M on D for one epoch, and save the probabilities M assigns to the possible labels of each example in D. We repeat the process E times with the same experimental setting at each iteration, except for the random seed and the ordering of the examples in D, which is random and different at each iteration. After the first time M sees an example during training, it learns a bias towards its true label, and, if the same example is encountered again, this bias might affect the probability assigned to the different labels. To prevent this kind of bias and compute more informative ETD, the parameters θ are reset between iterations.
Using the probabilities accumulated over all E iterations, we compute the two ETD metrics. The confidence of an example x_i w.r.t. M is defined as the mean probability M assigns to x_i's true label, y*_i, across iterations:

confidence_i = (1/E) · Σ_{e=1..E} P_{M_e}(y*_i | x_i)

where P_{M_e}(y*_i | x_i) is the probability M assigns to the gold label y*_i in iteration e. In a slight abuse of notation, we use the term P_{M_e}, though in practice the probability function P may change between examples even within the same iteration, as training progresses and the parameters of M change. The variability of an example w.r.t. M is defined as the standard deviation of said probability:

variability_i = sqrt( (1/E) · Σ_{e=1..E} (P_{M_e}(y*_i | x_i) − confidence_i)² )

We follow Swayamdipta et al. (2020) and refer to high-confidence examples as easy-to-learn (for M), and high-variability examples as ambiguous. Swayamdipta et al. (2020) have shown that training only on high-variability (or ambiguous) examples can lead to better out-of-distribution (OOD) performance, at a small cost in in-distribution (ID) performance. In this work, we show that they can be used to improve both ID and OOD performance.
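The two metrics above amount to a per-example mean and standard deviation over the E independent one-epoch runs. A minimal sketch, assuming the gold-label probabilities have already been collected into an E × N array (the function name and the toy numbers are illustrative):

```python
import numpy as np

def compute_etd(gold_probs):
    """Compute ETD metrics from gold-label probabilities.

    gold_probs: array of shape (E, N), where gold_probs[e, i] is the
    probability the model assigned to example i's gold label in the
    e-th independent one-epoch run.
    Returns (confidence, variability), each of shape (N,).
    """
    gold_probs = np.asarray(gold_probs, dtype=float)
    confidence = gold_probs.mean(axis=0)    # mean over the E runs
    variability = gold_probs.std(axis=0)    # population std over the E runs
    return confidence, variability

# toy example: E = 5 runs over N = 3 examples
probs = [
    [0.9, 0.2, 0.6],
    [0.8, 0.3, 0.4],
    [0.9, 0.1, 0.7],
    [0.8, 0.2, 0.3],
    [0.9, 0.2, 0.5],
]
conf, var = compute_etd(probs)
```

Note that `np.std` with its default `ddof=0` matches the 1/E normalization in the formula above (a population rather than sample standard deviation).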

Using ETD
Let D = {(x_i, y_i)}, i = 1..N, be a dataset whose ETD are computed w.r.t. some model M_1, and let 0 < f ≤ 1 and b ∈ ℕ be hyperparameters. For every example x_i, if its variability places it among the top f fraction of highest-variability examples in D, we add b copies of it to the weighted dataset D^(f,b); otherwise, we add a single copy.

Training with ETD Improves Performance
We follow Swayamdipta et al. (2020) and first test the capacity of our method to improve the performance of a model of the same architecture as the model used for computing the ETD. We use six tasks, divided into three groups: three multiple-choice tasks: WinoGrande (WG; Sakaguchi et al., 2019), Abductive NLI (αNLI; Bhagavatula et al., 2019), and HellaSwag (HS; Zellers et al., 2019); two NLI tasks: SNLI (Bowman et al., 2015) and ANLI (Nie et al., 2019); and one question-answering task: BoolQ (Clark et al., 2019).
For each of the tested tasks, we compute ETD using one copy of DeBERTa-large (He et al., 2020). Following Swayamdipta et al. (2020), we use E = 5 as the number of iterations for the ETD computation process. We then train a second copy of DeBERTa on the ETD-weighted dataset D^(.25,3). The specific values of f and b are chosen using a grid search for the best mean performance over the development sets of all tasks. Due to computational constraints, we do not tune any hyperparameters other than those defined specifically for this work, i.e., f and b. For other hyperparameters such as learning rate and batch size, we follow the values used in Swayamdipta et al. (2020) for training on the SNLI (Bowman et al., 2015) dataset.
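The hyperparameter selection described above is an exhaustive grid search over candidate (f, b) pairs, scored by mean development-set performance. A minimal sketch; the candidate grids and the scoring callback are assumptions for illustration:

```python
import itertools

# hypothetical candidate grids for the two hyperparameters
f_grid = [0.25, 0.33, 0.50]
b_grid = [2, 3, 4]

def grid_search(dev_score):
    """Return the (f, b) pair maximizing dev_score(f, b), where
    dev_score is assumed to return the mean dev-set accuracy
    across all tasks for a model trained on D^(f,b)."""
    return max(itertools.product(f_grid, b_grid),
               key=lambda fb: dev_score(*fb))

# toy scoring function that happens to peak at f=0.25, b=3
best_f, best_b = grid_search(lambda f, b: -abs(f - 0.25) - abs(b - 3))
```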
For each task, we test the trained model on its designated test set to evaluate ID performance, as well as on the test sets of all other tasks from the same group to evaluate OOD performance (e.g., we evaluate a model trained on WinoGrande on αNLI and Hellaswag as well).
As baselines, we train DeBERTa on three additional datasets:
• D, the original, unaltered version of each dataset.
• D^(.33)-NR, the dataset resulting from the approach of Swayamdipta et al. (2020): only the top 33% highest-variability examples are kept, with ETD computed over consecutive epochs of a single training process (No Reset).
• D^(.25,3)-NR, an ablation of our approach: the same ETD-weighted dataset, but with ETD computed without resetting the parameters θ between iterations (No Reset).

Each training process is repeated with s = 5 different seeds, and the reported result is the mean across seeds. For each task, we train for a fixed number of steps regardless of the size of the dataset we train on. The fixed number of steps is task-dependent, and is the number of steps required to pass E = 5 times over D when computing its ETD.
Table 1 shows the full results of this experiment. Table 2 provides a summarized version of the results, as the improvement gained by training on D^(.25,3) compared to D in each task, on both ID and OOD test sets. Our method is the only one to outperform training on D across all 14 categories, obtaining a mean improvement of 0.35% ID and 0.95% OOD. It also outperforms Dataset Cartography's approach (D^(.33)-NR) on 11/14 categories, and the ablation version (D^(.25,3)-NR) on 10/14 categories, demonstrating the importance of the weight reset.
These improvements are statistically significant: modelling the result of the baseline method in each category C as a sample from a normal distribution N_C, the probability of outperforming the baseline when sampling from the same N_C is 0.5. Therefore, the probability of outperforming the baseline on all 14 categories when sampling from their respective distributions can be modelled with a Binomial random variable B(14, 0.5), which gives p-value ≤ P[B(14, 0.5) = 14] ≈ 0.00006.
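The significance calculation above is a direct binomial computation: winning all 14 independent coin flips has probability 0.5^14. A short sketch of the arithmetic:

```python
from math import comb

def binom_pmf(n, k, p=0.5):
    """P[B(n, p) = k] for a binomial random variable."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# probability of beating the baseline in all 14 categories by chance
p_value = binom_pmf(14, 14)   # = 0.5 ** 14 ≈ 0.00006
```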

ETD are Transferable
The process of computing ETD is costly, requiring compute roughly equivalent to that of training the desired model. Thus, computing the ETD of a dataset separately for every model we wish to train is expensive and, depending on the size of the dataset, may become prohibitively so. This problem can be bypassed if ETD are transferable across models, i.e., if ETD can be computed using a model M_1 and then used to train a different model M_2.
To test for transferability, we use DeBERTa as the ETD-computing model and ELECTRA (Clark et al., 2020) as the training model. We conduct experiments similar to those in Section 3, with the exception that the training model M_2 is ELECTRA. Tables 3 and 4 show the results and summarized results of these experiments, respectively.
When computing ETD with DeBERTa and creating the respective D^(.25,4), training ELECTRA on D^(.25,4) outperforms training on D in 5 out of the 6 ID categories, and in 9 out of all 14 categories. Though not as consistent as our main method, the ETD transfer well, with a mean improvement of 0.2% ID and 0.33% OOD.
Dynamic Training outperforms training on D in 5 out of the 6 ID categories, and in 11 out of all 14 categories. Though the ANLI task suffers a decrease in performance, results for the other tasks improve relatively consistently, with a mean improvement of 0.23% ID and 0.38% OOD. Though not as effective as ETD-weighted training, Dynamic Training improves performance in the majority of cases, without any pre-processing of the data.

Robustness to Hyperparameter tuning
Throughout this work we report results of training models on ETD-weighted datasets of the form D^(f,b), with specific values chosen for f and b. These values are hyperparameters, chosen using a grid search for best performance over the development sets of the different tasks. Table 7 shows the other values of f and b we tested in Section 3, along with their results.

Related Work
Previous research has offered various methods that consider the relative importance of examples within a dataset in order to improve training. Methods such as curriculum learning (Bengio et al., 2009) or self-paced learning (Kumar et al., 2010) order examples during training according to their difficulty. Other methods rank the importance of examples with the goal of filtering datasets. Liu et al. (2021) use a train-twice approach with some resemblance to ETD, and upweight training examples that were misclassified in the first round of training. AFLite (Le Bras et al., 2020) ranks examples based on the ability of a linear classifier to solve them, and then filters out easy examples in order to eliminate artifacts and biases from the dataset. Other works have advocated bias and artifact removal, and by extension the removal of easy examples from datasets (Gururangan et al., 2018; Li and Vasconcelos, 2019).
Several approaches have used training metrics to select subsets of data. Core-set selection (Wei et al., 2013) uses submodular functions to find a subset of a dataset representative of the whole, to be used under a low computational budget. Conversely, our approach uses training metrics to find a subset not necessarily representative of the whole, but rather one that can be emphasized within the whole dataset to improve performance, regardless of budget considerations. Similarities can also be drawn between our approach and active learning (Settles, 2009; Peris and Casacuberta, 2018; P.V.S and Meyer, 2019), which searches unlabeled data for the most useful examples to label and uses them for training. Sanh et al. (2020) aim to remove biases from models without re-sampling the dataset, using a Product of Experts between a weak (biased) model and a main model, achieving improvements over OOD test sets. Karimi Mahabadi et al. (2020) suggested a somewhat similar approach of de-biasing a main model by contrasting it with a "bias-only" model, to achieve OOD improvements in tasks where the bias is known. Nam et al. (2020) also used biased models as a foil to a main model, to achieve de-biasing in vision-related tasks. Utama et al. (2020) proposed a debiasing method that improves OOD testing while maintaining ID test results. Their method regulates the model's confidence over biased examples in the dataset, using knowledge distillation in combination with biased models. This approach requires a-priori knowledge of the dataset's biases in order to formulate the biased model, and as such is not applicable to many NLP tasks. It also requires a large amount of compute, as it trains a full-size teacher model and a biased model in addition to the main model.

Conclusion
We presented a new method for computing Example Training Dynamics, which can be used to increase both ID and OOD performance, without any changes to model size or architecture. We demonstrated that ETD can be transferable, i.e., they can be computed once and used many times for different models, reducing the computation cost in the long term. Finally, we have shown that ETD can be computed on the fly using Dynamic Training, which may hold the key to improved performance using ETD at no extra compute cost.
As the field of NLP leans more and more into the self-supervised pre-training paradigm, further research on ETD may focus on adjusting our method for larger and self-supervised datasets, in order to improve and reduce the cost of pre-training as a whole.
Our method aims to expand and improve on Dataset Cartography (Swayamdipta et al., 2020), a method for visualizing and characterizing the different training examples in a given dataset. Dataset Cartography uses Example Training Dynamics (ETD), which are metrics derived by examining the probability distribution that a model assigns to the possible labels for each example in a dataset, and following the changes in that distribution over several epochs of training. ETD derive two metrics for each example in the dataset: confidence, which is the mean score the model assigns to the true label across epochs, and variability, which is the standard deviation of that score. Swayamdipta et al. (2020) have shown that training a new model only on examples with high variability improves performance in OOD testing, as well as shortens training time. However, this comes at a slight cost in ID testing performance. In this work, we propose a new method for computing ETD, which differs from that of Swayamdipta et al. (2020) in two ways: First, rather than following several epochs of a single training process to compute the metrics, we compute them by observing separate training processes of one epoch each. Second, we use the variability metric differently; rather than training a new model only on high-variability examples, we train it on the entire dataset, while emphasizing high-variability examples in training. We formally define our method and compare it to Swayamdipta et al. (2020) below.

Figure 1 :
Figure 1: Computation and usage of ETD. Multiple copies of model M_1 are trained on a dataset D, each for one epoch. The probability scores M_1 outputs are used to compute ETD. The ETD are then used to curate D^(f,b), a new version of D, which reweighs the examples in D. A (potentially different) model M_2 trains on D^(f,b).
If an example is among the top f fraction of highest-variability examples, we add b copies of it to D^(f,b); otherwise, we add 1 copy. That is, every example from the original dataset D is also in D^(f,b), but D^(f,b) is biased towards high-variability examples by a factor of b. The new model M_2 is then trained on D^(f,b). For example, if f = 0.5 and b = 2, D^(f,b) contains each of the top 50% highest-variability examples in D twice, and every other example once. Note that this method differs from Swayamdipta et al. (2020) in that it includes all examples from D in the new dataset D^(f,b). Training on all examples helps prevent the decrease in ID performance observed by Swayamdipta et al. (2020), and even improves ID performance, as we show next.
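The construction of D^(f,b) described above can be sketched as follows; the function name and interfaces are illustrative, not the released implementation:

```python
import numpy as np

def weight_dataset(examples, variability, f, b):
    """Build D^(f,b): duplicate the top-f fraction of highest-
    variability examples b times; keep every other example once."""
    variability = np.asarray(variability)
    n_top = int(round(f * len(examples)))
    # indices of the n_top highest-variability examples
    top = set(np.argsort(variability)[::-1][:n_top].tolist())
    weighted = []
    for i, ex in enumerate(examples):
        weighted.extend([ex] * (b if i in top else 1))
    return weighted

# f = 0.5, b = 2 over four examples: the two most ambiguous
# examples ("b" and "c") appear twice, the rest once
data = ["a", "b", "c", "d"]
var = [0.05, 0.40, 0.35, 0.10]
d_fb = weight_dataset(data, var, f=0.5, b=2)
```

As in the paper's setting, the weighted dataset is then shuffled and trained on for a fixed number of steps, so the duplicated examples act as soft emphasis rather than extra training time.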
Fig. 2 illustrates the flow of a Dynamic Training process. Experiments: To test the effectiveness of Dynamic Training, we use the method to train a DeBERTa model on the six datasets used in our previous experiments. We use the hyperparameters e* = 3, f = 0.33, b = 2 (based on a grid search for the best performance on the development sets). We compare the results against a DeBERTa model trained on D without any form of ETD-weighted training. Tables 5 and 6 show the results and summarized results of these experiments.

Figure 2 :
Figure 2: Dynamic Training with ETD. M is trained on D, accumulating ETD with each epoch. Starting at epoch e* and on each epoch afterwards, M is trained on D^(f,b) instead of D. The ETD and D^(f,b) keep updating until the end of training.
Table 2: Performance improvement when training on D^(.25,3) (our approach) compared to D (the unaltered dataset). Average is the weighted average of the scores in each column (scores are weighted by the number of tasks they represent).

Table 3 :
Transferring ETD from DeBERTa to ELECTRA: accuracy of trained models on ID and OOD test sets, compared between training on the unaltered dataset D and the ETD-weighted dataset of our approach, D^(.25,4).

Table 4 :
Transferring ETD from DeBERTa to ELECTRA: performance improvement when training on D^(.25,4) (our approach) compared to D (the unaltered dataset). OOD is the mean score of all OOD test sets for a given dataset. Average is the weighted average of the scores in each column (scores are weighted by the number of tasks they represent).

Similar to computing regular ETD, as training progresses, we save the model's probabilities for the labels of each example in D as ETD. At some epoch e* > 1, we start using them to weigh D before each epoch. From e* onwards, we train each epoch on D^(f,b). We keep saving the model's output probabilities to update the ETD, and before the start of each epoch, calculate a new D^(f,b). As in the previous experiments, training is set to a fixed number of steps for each task, regardless of the varying size of D^(f,b) during training.

Table 5 :
Dynamic Training results. D is training on the unaltered dataset, and DT-D^(.33,2) is Dynamic Training on D^(.33,2) starting at epoch e* = 3.

Table 7 :
Performance improvement when training DeBERTa-large on an ETD-weighted dataset (ETD computed with DeBERTa-large) over training without ETD, compared between different values of f and b. OOD is the mean score of all OOD development sets. Positive scores are marked in bold font.