We Need to Talk About train-dev-test Splits

Standard train-dev-test splits, used to benchmark multiple models against each other, are ubiquitous in Natural Language Processing (NLP). In this setup, the train data is used for training the model, the development set for evaluating different versions of the proposed model(s) during development, and the test set to confirm the answers to the main research question(s). However, the introduction of neural networks in NLP has led to a different use of these standard splits; the development set is now often used for model selection during the training procedure. As a consequence, comparing multiple versions of the same model on the development data leads to overestimated performance on that data. As a result, researchers have started to compare an increasing number of models on the test data, leading to faster overfitting and "expiration" of our test sets. We propose to use a tune set when developing neural network methods, which can be used for model picking, so that the different versions of a new model can safely be compared on the development data.

1 Dataset Splits in NLP

1.1 Current State
In Natural Language Processing (NLP), a highly empirical field, it is common to benchmark multiple models against each other on a standard dataset. However, since most current models are supervised, and thus require labeled training data, the datasets have to be split. To ensure a fair comparison, most datasets in NLP have standard splits. Most datasets consist of three splits (also visualized in Figure 1(a)):
• train: Used for training models; in some setups this split can be omitted (zero-shot or unsupervised learning).

• development (also called validation/evaluation): Used to compare all different versions of the proposed model(s). Can also be used to get preliminary answers to the main research questions.
• test: Used to confirm the final answer to the research question.

Figure 1: Overview of the use of data splits across the development phase and the test phase (red: test, orange: dev, green: train, yellow: tune). (a) standard splits for traditional machine learning models; (b) standard splits as used for neural network models; (c) our proposed splits for neural network models.
One often-raised worry is that if too many papers are written based on the same test set, overfitting occurs, especially when only positive results are published (Scargle, 2000). It should be noted that we do not refer to overfitting of the models' parameters, but to overfitting of design decisions (hyperparameters etc.), in line with "bias from research design" as defined by Hovy and Prabhumoye (2021). This means that there is a bias towards methods that perform well on this specific set. We agree that this is a danger. Taking a more general perspective on this problem, a certain split becomes more prone to this the more different models are evaluated on this exact same data. Let us assume that there is a threshold N that limits the number of times we can re-use the same split for evaluation. The number of papers that can use the same dataset for a fair comparison is then equal to N divided by the average number of evaluated models per paper. From this it follows that, no matter how large N is, a larger average number of runs per paper will drastically reduce the lifespan of a dataset.
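This back-of-the-envelope relation can be stated explicitly; the threshold N is of course unknown in practice, so the numbers in the example are purely illustrative:

```latex
% Rough lifespan of a shared evaluation split, measured in papers, given a
% hypothetical evaluation budget N and an average of \bar{m} models
% evaluated on that split per paper.
\text{lifespan (in papers)} \approx \frac{N}{\bar{m}}
```

For a hypothetical budget of N = 1,000 evaluations, papers that evaluate 4 models each can share the split across roughly 250 papers, whereas papers that evaluate 15 models each exhaust it after roughly 66.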
For this reason, it used to be common to evaluate all varieties of a newly proposed model on the development data, and only confirm the main findings (e.g., a comparison of the 2 most relevant models) on the test set. This means that if we propose a new model B, and we want to prove that it outperforms existing model A, we would first evaluate and tune all our varieties of model B (B_1...B_n) on the dev set, and then only compare the best version of model B to model A. These varieties of model B can include differences in hyperparameters as well as design decisions. From this it also logically follows that (qualitative) analysis should not be done on the test data.
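As a minimal illustration of this traditional protocol, consider the sketch below; the training and evaluation functions are toy stand-ins (random scores), not the parsers or metrics used later in this paper:

```python
import random

random.seed(0)

def train(config, train_data):
    """Toy stand-in for training one variety of model B (or baseline A)."""
    return {"config": config}

def evaluate(model, data):
    """Toy stand-in for the evaluation metric."""
    return random.random()

train_set, dev_set, test_set = ["..."], ["..."], ["..."]
b_configs = [{"lr": lr} for lr in (1e-3, 1e-4, 1e-5)]  # varieties B_1..B_n

# All varieties of model B are compared on dev only.
best_b = max((train(c, train_set) for c in b_configs),
             key=lambda m: evaluate(m, dev_set))

# Only the final A-vs-B comparison touches the test set.
model_a = train({"baseline": True}, train_set)
print("A on test:", evaluate(model_a, test_set))
print("best B on test:", evaluate(best_b, test_set))
```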
It should be noted that in some situations a hidden test set is enforced to circumvent overfitting, for example in shared tasks, where the test data is only shared at the end, and on benchmark websites, where the test labels remain hidden from the participants (Kim et al., 2011; Wang et al., 2018; Aguilar et al., 2020; Khanuja et al., 2020). This setup is enforced for good reasons and, in our opinion, should be the standard setup in NLP.

1.2 What Has Changed?
Since the introduction of neural networks, the use of the dev set has changed. Neural network models are commonly trained for multiple epochs over the training data; because they are prone to overfitting (on the training data), it is common to evaluate the model on the dev data after each epoch, and to use the model from the epoch with the highest performance on the dev data. This "best model selection" (i.e., picking the best epoch) differs from other hyperparameters, as it is re-tuned in every run. In other words, the development data is integrated into the training procedure. This model selection has been shown to be important for final performance (Chen and Ritter, 2020). A problem now arises when we want to compare our new model B to model A and train multiple models B_1...B_n on the same dev split: the performance on the development data of each model B_i is likely to be overly optimistic.

To confirm that models are indeed increasingly compared on the test data, we counted the number of novel models (i.e., non-baseline) evaluated on the test data for 100 random papers of the ACL 2010 and 2020 proceedings. The results show a clear trend: in 2020, more models are evaluated on the test set per paper (Figure 2). For example, in 2010, 50% of the papers evaluated fewer than 4 models on the test data; in 2020 this was the case for only 25% of the papers.
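The "best model selection" described above boils down to a loop of the following shape (again with toy stand-ins for training and scoring; the point is only that the dev split is queried inside the training procedure):

```python
import copy
import random

random.seed(1)

def train_one_epoch(model, train_data):
    """Toy stand-in for one pass over the training data."""
    model["epochs"] += 1
    return model

def score(model, data):
    """Toy stand-in for a dev metric that fluctuates across epochs."""
    return random.random()

model = {"epochs": 0}
train_data, dev_data = ["..."], ["..."]

best_dev, best_state = float("-inf"), None
for epoch in range(10):
    model = train_one_epoch(model, train_data)
    dev_score = score(model, dev_data)      # the dev split is used inside training
    if dev_score > best_dev:
        best_dev, best_state = dev_score, copy.deepcopy(model)

# best_state is the checkpoint that is kept; its dev score is optimistically biased.
print("best epoch:", best_state["epochs"], "dev score:", round(best_dev, 3))
```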
The annotator (with 7 years of full-time research experience in NLP) observed that in many cases it is not explicitly reported on which split the results are based (especially in 2020), but in most cases this could be derived by comparing the analysis results with the main results, or from the repository. Furthermore, the papers in 2010 more often used non-benchmark datasets created specifically for a study, for which running multiple models on the test data is arguably less severe. More details about the annotation are reported in the appendix.
To sum up, when using the development set for model picking, one is left with a choice for model comparison: use the dev data or the test data. If the dev data is used, performance is easily overestimated, because the model picking was done on the same set. If the test data is used, overfitting of design decisions on the test data happens more quickly, and the test set becomes obsolete faster.
2 Related Work

Gorman and Bedrick (2019) and Çöltekin (2020) propose to use random data splits instead of the standard splits. In other words, they propose to shuffle the whole dataset multiple times, and extract a train, dev, and test split from each random shuffle. This would avoid overfitting, as "the use of a single standard split, may result in avoidable Type I error" (Gorman and Bedrick, 2019). As pointed out by Søgaard et al. (2021), these random splits have another danger. It is good practice to create stratified data splits based on some attribute (e.g., time, speaker, document). This stratified sampling leads to more realistic performance estimates for real-world situations (as we assume we want to employ our models on new samples, from other time periods, speakers, or documents). The problem is that after shuffling and re-splitting, it is very likely that sentences from, for example, the same document end up in both the training data and the test data, which leads to (unrealistically) higher performances in the experiments of both Gorman and Bedrick (2019) and Çöltekin (2020). Therefore, Søgaard et al. (2021) propose other strategies to resplit the data. They show that using biased splits better approximates real-world performance on new samples (i.e., from another dataset) than standard splits, but still leads to a large overestimation of performance. In both of these proposed setups (random and biased splits), the splits are still train-dev-test splits. This means that if the same splits are used across different papers, over-estimation on either dev or test would still occur (depending on which one is used to compare B_1...B_n), and overfitting still occurs. If, instead, new splits are generated for each paper, overestimation still happens, and direct comparison between different papers becomes more complex.
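For concreteness, the difference between a fully random resplit and a document-stratified resplit can be illustrated with scikit-learn's grouped splitters (this paper does not use scikit-learn; the snippet and its data are purely illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit, ShuffleSplit

# Synthetic corpus: 12 sentences drawn from 4 documents.
sentences = [f"sentence {i}" for i in range(12)]
documents = [i // 3 for i in range(12)]  # document id for each sentence

# Fully random resplit: sentences of the same document can end up in train and test.
random_split = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(random_split.split(sentences))

# Document-grouped resplit: every document lands entirely in train or entirely in test.
grouped_split = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
g_train_idx, g_test_idx = next(grouped_split.split(sentences, groups=documents))

print("random test docs:", {documents[i] for i in test_idx})
print("grouped test docs:", {documents[i] for i in g_test_idx})
```

With the grouped split, no document is shared between train and test, which is the property that stratified standard splits aim to preserve.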
Recently, there has been an increasing interest in other aspects of the evaluation of NLP models, including automatic testing of specific abilities (Ribeiro et al., 2020), significance testing (Dror et al., 2018; Sadeqi Azer et al., 2020), the effect of random seeds (Reimers and Gurevych, 2018), and reproducibility (Fokkens et al., 2013; Cohen et al., 2018; Wieling et al., 2018; Branco et al., 2020; Belz et al., 2021). We consider all of these problems (including random/biased splits) orthogonal to the problem of overfitting on the test set, as in all of the proposed setups/solutions train-dev-test splits are still used. This also holds for k-fold cross-validation, a standard method to combat overfitting: within the k folds, there are still dev sets on which one will overfit if, for each fold, hyperparameter tuning, model picking, and analysis are done on the same data.

3 The Tune Split
The solution we propose to the problem introduced in Section 1.2 follows logically from the observation that we do not have a data split left for comparing the models. We simply introduce an additional data split, which we call the tune split (Figure 1(c)). This tune data can be used to pick the best model, thereby leaving the development set out of the training procedure. The best model B out of B_1...B_n to compare against model A can then be picked based on the development data, and the superiority of this model can be confirmed on the test data. This also makes a comparison to traditional machine learning models fairer, as they also do not make use of the dev data during training.
One clear downside of this approach is that there is less data remaining for the other splits. To overcome this, one could pick the best hyperparameters/settings for model B based on the dev split while using the tune split for model picking, and then, for the final comparison, add the tune split to the train split and use the development data for picking the best model. This procedure is the same as in a shared task setup, where the train+dev data can be used however the participants see fit, but the test data remains unseen until the final comparison.
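Putting the proposed protocol together, a minimal sketch (with toy stand-ins for training and evaluation) looks as follows: the tune split takes over model picking during training, dev is used to compare the varieties of B, and only the final comparison touches test:

```python
import random

random.seed(2)

def train_with_model_picking(config, train_data, picking_data):
    """Toy stand-in: train for several epochs and keep the epoch that scores
    best on picking_data (tune in our setup, dev in the standard setup)."""
    return {"config": config, "picking_size": len(picking_data)}

def evaluate(model, data):
    """Toy stand-in for an evaluation metric."""
    return random.random()

train, tune, dev, test = ["..."] * 8, ["..."], ["..."], ["..."]
b_configs = [{"dropout": d} for d in (0.1, 0.2, 0.3)]  # varieties B_1..B_n

# 1) Model picking on tune; the varieties B_1..B_n are compared on dev.
best_config = max(
    b_configs,
    key=lambda c: evaluate(train_with_model_picking(c, train, tune), dev),
)

# 2) Final run: fold tune back into train, pick the best epoch on dev, and
#    confirm the comparison against the existing model A on test.
final_b = train_with_model_picking(best_config, train + tune, dev)
model_a = train_with_model_picking({"baseline": True}, train + tune, dev)
print("A on test:", evaluate(model_a, test), "| B on test:", evaluate(final_b, test))
```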
It should be noted that in cross-domain or cross-lingual setups, similar solutions have recently been proposed. In these setups, people commonly use the dev split of the source dataset for model picking (Keung et al., 2020). To have a pure cross-domain or cross-lingual setup, it is important not to tune on all target domains/languages, as this is likely to overestimate performance for the situation where no target data is available. Artetxe et al. (2020) therefore argue to only use the dev set of one target language and report test results on the other languages. Another case where a similar solution has sometimes been used is the devtest set in machine translation, which has been in use at least since the WMT 2006 shared task (Koehn and Monz, 2006). This split is an effect of having many sequential shared tasks, where new test data is added every year. In some work, the dev split is used for model picking and the devtest split is used as development data. However, to the best of our knowledge, there are no official guidelines on the function of the devtest split. An alternative solution is introduced by Chen and Ritter (2020), who propose methods for picking the best model that do not rely on any labeled data.

4 Case Study
To evaluate the effect of having a separate tune split, we perform a case study in which we fine-tune a transition-based (Nivre, 2008) Bi-LSTM (Graves and Schmidhuber, 2005) parser and a transformer-based (Vaswani et al., 2017) deep biaffine parser (Dozat and Manning, 2017) on the same datasets. We use the Universal Dependencies (UD) 2.8 data (Zeman et al., 2021) as benchmark, and use the UUParser (Smith et al., 2018a) and MaChAmp (van der Goot et al., 2021) implementations of the corresponding parsers.

4.1 Experimental Setup
We use the datasets selected by Smith et al. (2018b). We concatenate the train and dev set (we omit the original test data in these experiments, to avoid overanalyzing it), and resplit the resulting data into four splits: the last 3,000 sentences are used for the test, dev, and tune splits (1,000 sentences each), and the remaining data is used as training data (a minimal sketch of this resplit is shown after the list of setups below). We do not shuffle the sentences, as they are chronologically ordered in many cases, resulting in a (somewhat) stratified split, thereby avoiding the overestimation of performance that occurs when train and test have overlapping sources (as in the experiments of Gorman and Bedrick (2019) and Çöltekin (2020)). We consider two tuning setups:
• train+tune for training, model picking and hyperparameter tuning on dev (Figure 1(a)).
• train for training, model picking on tune, hyperparameter tuning on dev (Figure 1(c), our proposed setup). In this setup, we concatenate train and tune for the final evaluation on the test set with the optimal hyperparameters.
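As mentioned above, the resplit itself is a simple slicing operation; the sketch below uses a placeholder sentence list (the actual experiments operate on CoNLL-U treebanks, and the ordering of test/dev/tune within the last 3,000 sentences is our assumption):

```python
# Resplit concatenated (train + dev) data without shuffling, preserving the
# original (often chronological) sentence order.
sentences = [f"sentence {i}" for i in range(12543)]  # placeholder treebank

train_split = sentences[:-3000]
test_split  = sentences[-3000:-2000]   # assumed order: test, then dev, then tune
dev_split   = sentences[-2000:-1000]
tune_split  = sentences[-1000:]

assert len(test_split) == len(dev_split) == len(tune_split) == 1000
```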
For both parsers we make a selection of hyperparameters to tune, and take the default values as starting point. We use no external embeddings for the UUParser, and initialize MaChAmp with mBERT, to cover a variety of setups. Hence, a fair comparison can only be made between the setups, and not between the parsers. The exact hyperparameter ranges that were evaluated are reported in the appendix. We perform a grid search for each dataset, and compare the performance on test as well as the number of hyperparameters that have a different optimal value across both setups.
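Such a grid search can be expressed with itertools.product; the grids and the scoring function below are illustrative only, the actual ranges are those listed in Table 2:

```python
import random
from itertools import product

random.seed(3)

# Illustrative grids only; the actual evaluated ranges are listed in Table 2.
grid = {
    "learning_rate": [1e-3, 1e-4],
    "dropout": [0.1, 0.3],
    "batch_size": [16, 32],
}

def run_experiment(config):
    """Placeholder for training with model picking and returning a score."""
    return random.random()

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
best = max(configs, key=run_experiment)
print("evaluated", len(configs), "configurations; best:", best)
```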

4.2 Results
Results (Table 1) show that the performance of the two evaluated setups differs only minimally on the test data. Even though different optimal hyperparameters are found for all datasets for MaChAmp, and for 6/9 datasets for the UUParser, none of the differences are significant according to a paired bootstrap test (10,000 resamples), both with and without Bonferroni correction (Bonferroni, 1936). Hence, the results indicate that for the final performance it is irrelevant which splits are used in this setup. However, when the tune split is used, we can perform a much more valuable (qualitative or quantitative) analysis on the development data, which would be less realistic if dev had already been used for hyperparameter tuning as well as model picking.
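A minimal version of such a paired bootstrap comparison (with Bonferroni correction over multiple datasets) is sketched below; the scores are synthetic, whereas the paper compares LAS scores of the -Tune and +Tune setups:

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_p(scores_a, scores_b, resamples=10_000):
    """One-sided paired bootstrap: fraction of resamples on which B does not beat A."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(scores_a)
    idx = rng.integers(0, n, size=(resamples, n))        # resample paired items
    diffs = scores_b[idx].mean(axis=1) - scores_a[idx].mean(axis=1)
    return float(np.mean(diffs <= 0))

# Synthetic per-sentence scores for the two setups (-Tune vs. +Tune).
a = rng.normal(0.80, 0.05, size=1000)
b = a + rng.normal(0.001, 0.05, size=1000)

p = paired_bootstrap_p(a, b)
n_comparisons = 9                                        # e.g., one test per dataset
print("p =", round(p, 4), "| significant after Bonferroni:", p < 0.05 / n_comparisons)
```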

5 Conclusion
We have reflected on the default dataset splits used in NLP (and, in fact, more widely in machine learning) to tune design decisions (architectures, hyperparameters, etc.) of neural-network-based models, which can easily lead to overfitting on the test data. This is an effect of the fact that in standard setups neural networks use the development data during training, and it has thus become more common to compare multiple versions of the same model on the test data. The solution to this problem is simple: we need another data split for model picking, or we should avoid using the dev set in the training procedure by learning which model to pick using other heuristics (Chen and Ritter, 2020). We call this split the tune split. The only downside of using a separate tune split is that there is less data available for the other splits; this can be circumvented by using train+tune for the final (test) runs of the model. We evaluated the effect of the tune split for two common NLP benchmarks by tuning two different types of models. One of them proved to be more robust to the evaluated hyperparameter ranges, whereas the other showed a clear performance improvement when using a tune split. This proposed solution is orthogonal to other practices proposed for a hygienic experimental setup, like significance testing, random splits, and evaluating specific abilities of our models.

E Results of Hyperparameter Search
The optimal hyperparameters for both setups are shown in Table 4 for the UUParser and in Table 5 for MaChAmp.

Figure 2: A boxplot visualizing the median and quantiles of the number of models evaluated on test data for a selection of 100 random papers from the ACL 2010 and ACL 2020 proceedings.

Table 1: Results (LAS) of tuning with both strategies. Dif reports the number of optimal hyperparameters that differ between the two setups; -T(une) uses dev for model picking as well as hyperparameter tuning, and +T(une) is our proposed setup.

Table 2: Evaluated hyperparameters of MaChAmp and the UUParser (defaults are bold).

Table 3: Difference in performance between the dev and test set. Lower scores indicate that performance on the test set is lower compared to dev. The differences are positive, and significant for both the UUParser and MaChAmp (paired bootstrap test, p=0.05; dataset results are used as samples).