We Need To Talk About Random Splits

(CITATION) argued for using random splits rather than standard splits in NLP experiments. We argue that random splits, like standard splits, lead to overly optimistic performance estimates. We can also split data in biased or adversarial ways, e.g., training on short sentences and evaluating on long ones. Biased sampling has been used in domain adaptation to simulate real-world drift; this is known as the covariate shift assumption. In NLP, however, even worst-case splits, maximizing bias, often under-estimate the error observed on new samples of in-domain data, i.e., the data that models should minimally generalize to at test time. This invalidates the covariate shift assumption. Instead of using multiple random splits, future benchmarks should ideally include multiple, independent test sets; if this is infeasible, we argue that multiple biased splits lead to more realistic performance estimates than multiple random splits.


Introduction
It is common practice in NLP to collect and annotate a text corpus, and split it into training, development and test data. These splits are often based on the order in which texts were published or sampled, and are referred to as 'standard splits'. Gorman and Bedrick (2019) recently showed that system ranking results based on standard splits differ from results based on random splits and used this to argue in favor of using random splits. While perhaps less common, random splits are already used in probing (Elazar and Goldberg, 2018), interpretability (Pörner et al., 2018), as well as core NLP tasks (Yu et al., 2019; Geva et al., 2019). Gorman and Bedrick (2019) focus on whether there is a significant performance difference δ between systems $S_1$ and $S_2$, i.e., $M(G_{test}, S_1) - M(G_{test}, S_2)$ in their notation. They argue McNemar's test (Gillick and Cox, 1989) or bootstrap (Efron, 1981) can establish that δ ≠ 0, using random splits to sample from $G_{test}$. This, of course, relies on the assumption that data is representative, i.e., was sampled i.i.d. (Wolpert, 1996).

Figure 1: Data splitting strategies. Each ball corresponds to a sentence represented in (two-dimensional) feature space. Blue (dark)/orange (bright) balls represent examples for training/test. Numbers represent sentence length. Heuristic splits can, e.g., be based on sentence length; adversarial splits maximize divergence.
In reality, what Gorman and Bedrick (2019) call the true difference in system performance, i.e., $\delta = M(G_{test}, S_1) - M(G_{test}, S_2)$, is the system difference on data that users would expect the systems to work well on (see §2 for practical examples), and not just on the corpus that we have annotations for. Our corpus-based estimates of δ can in fact be very misleading, i.e., very different from performance on new samples of data. In this paper, we investigate how misleading our estimates can be: We show that random splits consistently over-estimate performance at test time. This favors systems that overfit. We investigate alternatives across a heterogeneous set of NLP tasks. Based on our experiments, our answer to community-wide overfitting to standard splits is not to use random splits but to collect more diverse data with different biases, or, if that is not feasible, to split the data in adversarial, not random, ways. In general, we observe that estimates of test time error are worst for random splits, slightly better for standard splits (if those exist), better for heuristic and adversarial splits, but error still tends to be higher on new (in-domain) samples; see Figure 1.
Our results not only refute the hypothesis that δ can be estimated using random splits (Gorman and Bedrick, 2019), 2 but also the covariate shift hypothesis (Shimodaira, 2000; Shah et al., 2020) that δ can be estimated using reweightings of the data. While biased splits are useful in the absence of multiple held-out samples, and have been proposed before (Karimi et al., 2015), 3 they often overestimate performance in the wild. Our code is made publicly available at https://github.com/google-research/google-research/talk_about_random_splits.

Experiments
We consider 7 different NLP tasks: POS tagging (like Gorman and Bedrick (2019)), two sentence representation probing tasks, headline generation, translation quality estimation, emoji prediction, and news classification. We experiment with these tasks, because they a) are diverse, b) have not been subject to decades of community-wide overfitting (with the exception of POS tagging), and c) three of them enabled temporal splits (see Appendix §A.5).
Data splits. The datasets which we will use in our experiments are presented in Table 1. For all seven tasks, we will present results for standard splits when possible (POS, PROBING, QE, HEADLINES), random splits, heuristic and adversarial splits, as well as on new samples. In the case of EMOJIS, HEADLINES and NEWS, which are all time-stamped datasets, we leave out historically more recent data as our new samples. All new samples are in-domain samples of data where models are supposed to generalize, i.e., samples from similar text sources. This is a key point: These are samples that any end user would expect decent NLP models to fare well on. Examples include a sample of newspaper articles from newspaper A for a POS tagger trained on articles from newspaper B; tweets sampled the day after the training data was sampled; or news headlines sampled from the same sources, but a year later.

2 Or cross-validation, as more recently proposed in Szymański and Gorman (2020). In this very interesting follow-up paper, about Bayesian inference of δ, the authors write that their "estimates are valid insofar as the data sets used to estimate the Bayesian models comprise a representative sample of a coherent population of data sets." Our results show how off this assumption is.

3 Karimi et al. (2015) discuss temporal splits and splits based on neighbor-based heuristics that are similar in spirit to our worst-case splits.
We resample random splits multiple times (3-10 per task) and report average results. The heuristic splits are obtained by finding a sentence length threshold and putting the long sentences in the test split. We choose a threshold so that approximately 10% of the data ends up in this split. The idea of checking whether models generalize to longer sentences is not new; on the contrary, it goes back at least to early formal studies of recurrent neural networks, e.g., Siegelmann and Sontag (1992). In §A.3, we present a few experiments with alternative heuristic splits, but in our main experiments we limit ourselves to splits based on sentence length.
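The length-threshold heuristic can be sketched as follows. This is a minimal illustration under our own assumptions: sentences are given as token lists, the function name `length_threshold_split` is hypothetical, and ties at the threshold length are kept in the training split, so the test split may be slightly smaller than the requested fraction.

```python
def length_threshold_split(sentences, test_fraction=0.1):
    """Put the longest sentences in the test split.

    `sentences` is a list of token lists. Returns (train, test, threshold),
    where every test sentence is strictly longer than `threshold`.
    """
    n = len(sentences)
    lengths = sorted(len(s) for s in sentences)
    # Number of sentences we would like to see in the test split.
    n_test = min(max(1, round(n * test_fraction)), n - 1)
    # Longest length that still belongs to the training split.
    threshold = lengths[n - n_test - 1]
    train = [s for s in sentences if len(s) <= threshold]
    test = [s for s in sentences if len(s) > threshold]
    return train, test, threshold
```

On a toy corpus with one sentence of each length from 1 to 100, this yields a threshold of 90 and a test split holding the ten longest sentences.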
Finally, the adversarial splits are computed by approximately maximizing the Wasserstein distance between the splits. The Wasserstein distance is often used to measure divergence between distributions (Arjovsky et al., 2017; Tolstikhin et al., 2018; Shen et al., 2018; Shah et al., 2018), and while alternatives exist (Ben-David et al., 2006; Borgwardt et al., 2006), it is easy to compute and parameter-free. Since selecting the worst-case split is an NP-hard problem (e.g., by reduction of the knapsack problem), we have to rely on an approximation. We first compute a ball tree encoding the Wasserstein distances between the data points in our sample. We then randomly select a centroid for our test split and find its k nearest neighbors. Those k nearest neighbors constitute our test split; the rest is used to train and validate our model. We repeat these steps to estimate performance on worst-case splits of our sample. See §A.4 for an algorithm sketch. Random, heuristic, and adversarial results are averaged across five runs.
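The approximation described above can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: a brute-force neighbor search stands in for the ball tree, sentences are assumed to be already encoded as equal-length feature vectors, each vector is read as an empirical one-dimensional distribution over its values, and the names `wasserstein_1d` and `adversarial_split` are hypothetical.

```python
import random


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two equal-length vectors, each read
    as an empirical distribution with uniform weights over its values."""
    assert len(u) == len(v)
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)


def adversarial_split(vectors, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx).

    A randomly chosen centroid and its nearest neighbors under the
    Wasserstein distance form the test split; the rest is train/dev.
    """
    rng = random.Random(seed)
    n = len(vectors)
    k = max(1, round(n * test_fraction))
    centroid = rng.randrange(n)
    # Brute-force search (assumption): replaces the ball tree for clarity.
    by_distance = sorted(
        range(n), key=lambda i: wasserstein_1d(vectors[centroid], vectors[i])
    )
    test_idx = sorted(by_distance[:k])
    in_test = set(test_idx)
    train_idx = [i for i in range(n) if i not in in_test]
    return train_idx, test_idx
```

Calling `adversarial_split` with several different seeds and averaging the resulting scores mirrors the repeated runs described in the text.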
POS tagging. We first consider the task in Gorman and Bedrick (2019), experiment with heuristic and adversarial splits of the original Penn Treebank (Marcus et al., 1993), and add the Xinhua section of OntoNotes 5.0 as our New Sample. Our tagger is NCRF++ with default parameters.

Probing. We also include two SentEval probing tasks (Conneau et al., 2018) with data from the Toronto Book Corpus: PROBING-WC (word classification) and PROBING-BSHIFT (whether a bigram was swapped). Unlike the other probing tasks, these two tasks do not rely on external syntactic parsers, which would otherwise introduce a new type of bias that we would have to take into account in our analysis. We use the official SentEval framework and BERT (Devlin et al., 2019) as our sentence encoder. The probing model is a logistic regression classifier with L2 regularization, tuned on the development set. As our New Samples, we use five random samples of the 2018 Gutenberg Corpus for each task, preprocessed in the same way as Conneau et al. (2018).
Quality estimation. We use the WMT 2014 shared task datasets for QUALITY ESTIMATION. Specifically, we use the Spanish-English data from Task 1.1: scoring for perceived post-editing effort. The dataset comes with a training and test set, and a second, unofficial test set, which we use as our New Sample. In §A.2, we also present results training on Spanish-English and evaluating on German-English. We present a simple model that only considers the target sentence, but performs better than the best shared task systems: we train an MLP over a LASER sentence embedding (Schwenk et al., 2019) with the following hyperparameters: two hidden layers with 100 units each and ReLU activation functions, trained using the Adam stochastic gradient-based optimizer (Kingma and Ba, 2015), a batch size of 200, and an L2 penalty of strength α = 0.01.
Headline generation. We use the standard dataset for headline generation, derived from the Gigaword corpus (Napoles et al., 2012), as published by Rush et al. (2015). The task is to generate a headline from the first sentence of a news article.

Table 2: Error reductions over random baselines on Standard (original) splits, if available, Random splits (obtained using cross-validation), Heuristic splits resulting from a sentence length-based threshold, Adversarial splits based on (five) approximate maximizations of Wasserstein distances between splits, and on New Samples. We boldface the lowest error reduction, i.e., where results differ the most from the random baseline. We see that standard and random splits consistently over-estimate real performance on new samples, which is sometimes even lower than performance on adversarial splits. We also report the mean squared error (MSE) with respect to New Samples, which shows that Adversarial splits estimate empirical error best. Note: While annotator bias could explain POS tagging results, there is no annotator bias in the other tasks. *: For HEADLINES we use an identity baseline. Scores are ROUGE-2; see §A.1 for more. †: For QUALITY ESTIMATION, we report RMSE. The WMT QE 2014 best system obtained RMSE of 0.64; our system is significantly better with 0.50 on the standard split.

Results
Our results are presented in Table 2. Since the results are computed on different subsamples of data, we report error reductions over multinomial random (or, for HEADLINES, identity) baselines, following previous work comparing system rankings across different samples (Søgaard, 2013). More formally, we present error reduction as $r = \frac{p_s - p_b}{1 - p_b}$, where $p_s$ and $p_b$ are the performances of the system at hand and the multinomial random baseline.
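As a small worked example of the error reduction measure, the sketch below computes r for scores in [0, 1] where higher is better. The function name `error_reduction` and the scores used in the usage note are hypothetical and are not taken from Table 2.

```python
def error_reduction(p_system, p_baseline):
    """Error reduction r = (p_s - p_b) / (1 - p_b).

    Both arguments are performance scores in [0, 1], higher is better.
    r = 0 means no gain over the baseline; r = 1 means all of the
    baseline's error has been removed.
    """
    if p_baseline >= 1.0:
        raise ValueError("baseline already has zero error")
    return (p_system - p_baseline) / (1.0 - p_baseline)
```

For instance, a hypothetical system with accuracy 0.9 against a baseline with accuracy 0.5 removes 80% of the baseline's error, since (0.9 - 0.5) / (1 - 0.5) = 0.8.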
Our main observations are the following:

(a) Random splits (and standard splits) consistently under-estimate error on new samples. The absolute differences between error reductions over random baselines for random splits and on new samples are often higher than 20%, and in the case of PROBING-BSHIFT, for example, the BERT model reduces 80% of the error of a random baseline when data is randomly split, but only 45% averaging over five samples of new data from the same domain.

(b) Heuristic splits sometimes under-estimate error on new samples. Our heuristic splits in the above experiments are quite aggressive. We only evaluate our models on sentences that are longer than any of the sentences observed during training. Nevertheless, for 5/7 tasks, this leads to more optimistic performance estimates than evaluating on new samples!

(c) The same story holds for adversarial splits based on approximate maximization of Wasserstein distances between training and test data. While adversarial splits are very challenging, results on adversarial splits are more optimistic than on new samples in 4/7 cases.

Note that the fact that random splits over-estimate real-life performance also leads to misleading system rankings. If, for example, we remove the CRF inference layer from our POS tagger, performance on our Random splits drops to 0.952; on the New Sample, however, performance is 0.930, which is significantly better than with a CRF layer.

Discussion
In the spirit of earlier work (Sakaguchi et al., 2017; Madnani and Cahill, 2018; Gorman and Bedrick, 2019), we provide recommendations for future evaluation protocols: (i) In the absence of multiple held-out samples, using biased splits better approximates real-world performance and can help determine what data characteristics affect performance. (ii) Evaluating on new samples is superior and also enables significance testing across datasets (Demsar, 2006), providing confidence estimates. Several benchmarks already provide multiple, diverse test sets (e.g., Hovy et al., 2006; Petrov and McDonald, 2012; Williams et al., 2018); we hope more will follow.

What explains the high variance across samples in NLP? One reason is the dimensionality of language (Bengio et al., 2003), but in §A.5 we also show significant impact of temporal drift.

Conclusions
We have shown that out-of-sample error can be hard to estimate from random splits, which tend to underestimate error by some margin, but even biased and adversarial splits sometimes underestimate error on new samples. We show this phenomenon across seven very different NLP tasks and provide practical recommendations on how to best bridge the gap between experimental practices and what is needed to produce truly robust NLP models that perform well in the wild.

A Appendices
We present supplementary details about two of our tasks in §A.1 and §A.2 and discuss variations over heuristic splits in §A.3. In §A.4, we present the pseudo-algorithm for how we compute adversarial splits, and finally, in §A.5, we present our results documenting temporal drift.

A.1 Headline Generation
Table 3 reports the error reduction in ROUGE-1, ROUGE-2 and ROUGE-L over the identity baseline (see §2) for the different data splits. The results are consistent with Table 2. Figure 2 gives more details on an interesting drift phenomenon, which contributed to the superior performance of the model trained on the most recent five years (1999-2003). Apparently, the dotless spelling of U.S./US ('United States') became more common over time. Consequently, the model trained on the 1999-2003 part generated US more frequently than the model trained on 1994-1998.

A.2 Quality Estimation
In the results above, we train and test our quality estimation regressor on Spanish-English from WMT QE 2014. We also ran a similar experiment where we used the German-English test data as our New Sample. Here, we see a similar pattern to the one above: The RMSE on the Standard split was 0.630, which is slightly higher than for Spanish-English; with our Heuristic split, RMSE is 0.652; for Adversarial, it is 0.626 (which is slightly better than with standard splits), and on our New Sample, RMSE is 0.813.

A.3 Alternative Heuristic Splits
For both SentEval tasks we experimented with the following alternatives for heuristic splits.
Bootstrap Resampling. Instead of cross-validation, a random split can be generated by bootstrap resampling. For this, we randomly select 10% of the data as the test set and then randomly sample (with replacement) a new training and dev set from the remaining examples.
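One possible reading of this procedure is sketched below. The function name `bootstrap_split` and the `dev_fraction` parameter are our assumptions; the text does not state how the resampled examples are divided between training and dev data. Because sampling is with replacement, the same example can occur more than once and in both the training and dev sets, but never in the test set.

```python
import random


def bootstrap_split(examples, test_fraction=0.1, dev_fraction=0.1, seed=0):
    """Hold out a random test set, then resample train/dev with replacement."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    test, pool = shuffled[:n_test], shuffled[n_test:]
    # Bootstrap sample of the same size as the remaining pool.
    resampled = [rng.choice(pool) for _ in pool]
    n_dev = max(1, round(len(resampled) * dev_fraction))  # assumed dev share
    dev, train = resampled[:n_dev], resampled[n_dev:]
    return train, dev, test
```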
Random Length. As an alternative to the length threshold heuristic in earlier experiments, we randomly sample a length and select all examples having this length to be part of the test set. We repeat this procedure until approximately 10% of the data ends up in the test set. With this procedure we create 5 different test sets. We included this heuristic in order to see how fragile the probing setup is.
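The Random Length procedure might look like the following sketch. It assumes sentences are token lists and that lengths are sampled without replacement; the name `random_length_split` is hypothetical. Different seeds give different test sets.

```python
import random
from collections import defaultdict


def random_length_split(sentences, test_fraction=0.1, seed=0):
    """Move all sentences of randomly drawn lengths into the test split
    until it holds roughly `test_fraction` of the data."""
    rng = random.Random(seed)
    by_length = defaultdict(list)
    for idx, sent in enumerate(sentences):
        by_length[len(sent)].append(idx)
    lengths = sorted(by_length)
    rng.shuffle(lengths)  # random order of distinct lengths
    target = round(test_fraction * len(sentences))
    test_idx = []
    for length in lengths:
        if len(test_idx) >= target:
            break
        test_idx.extend(by_length[length])
    in_test = set(test_idx)
    train_idx = [i for i in range(len(sentences)) if i not in in_test]
    return train_idx, sorted(test_idx)
```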
Rare Words. Another alternative for heuristic splits is to use word frequency information. Here, we assign those sentences containing at least one of the rarest words of the dataset to the test set. This way we again end up with approximately 10% of the data in the test set. Note that this way we create only one dataset, because it is not a random process.
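A deterministic sketch of the Rare Words split is given below. It rests on our own assumptions: words are visited from rarest to most frequent, ties are broken alphabetically, and the loop stops once roughly 10% of the sentences are in the test set. The name `rare_word_split` is hypothetical.

```python
from collections import Counter, defaultdict


def rare_word_split(sentences, test_fraction=0.1):
    """Assign sentences containing the rarest words to the test split."""
    n = len(sentences)
    target = round(test_fraction * n)
    counts = Counter(tok for sent in sentences for tok in sent)
    containing = defaultdict(set)  # word -> indices of sentences with it
    for idx, sent in enumerate(sentences):
        for tok in sent:
            containing[tok].add(idx)
    test = set()
    # Rarest words first; alphabetical tie-break keeps the split deterministic.
    for word in sorted(counts, key=lambda w: (counts[w], w)):
        if len(test) >= target:
            break
        test |= containing[word]
    train_idx = [i for i in range(n) if i not in test]
    return train_idx, sorted(test)
```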
Results. Table 4 lists the results. While bootstrap resampling leads to slightly lower error reduction than cross-validation, we decided to report the latter in the main part of this paper, because it is a more widespread way to randomly split datasets. Random Length results are comparable to standard splits results. The split based on word frequency (Rare Words) leads to a considerable drop in both tasks. However, it is not as strong as the drop of the heuristic split (length threshold) in the main part of the paper.

A.4 Computing adversarial splits
We present the pseudo-algorithm of our implementation of approximate Wasserstein splitting in Algorithm 1. We also make the corresponding code available as part of our code repository for this paper.

Algorithm 1: Splits
Data: Dataset G_train
Result: Adversarial split

Figure 3: Temporal drift in emoji prediction. The correlation between temporal gap and performance is significant (p < 0.05).

A.5 The significance of drift
Some of our splits in the main experiments were based on slicing data into different time periods (HEADLINES, EMOJIS). Since temporal drift is a potential explanation for sampling bias, we analyze this in more detail here. We show that temporal drift is pervasive and leads to surprising drops in performance. We note, however, that temporal drift is not the only cause of sampling bias. Since we have time stamps for two of our datasets, we study these in greater detail. For similar studies of temporal drift, see Lukes and Søgaard (2018); Rijhwani and Preotiuc-Pietro (2020).
Headline generation. Our headline generation data covers the years 1994 to 2004. Having reserved 20,000 sentence-headline pairs from the first half of 2004 for validation and the same amount from the second half for testing, we use 50% of the years 1994-2003 for training three models. The models' architectures and parameters are identical (same as in Sec. 3). The only difference is in what the models are trained on: (a) a random half, (b) the first, or (c) the second half of 1994-2003. The training data sizes are comparable (1.63-1.76M), and the publisher distributions (AFP, APW, CNA, NYT or XIN) are also similar. Hence, the models are expected to perform similarly on the same test set. As Table 5 indicates, shifting the training data by five years to the past results in a big performance drop. Sampling training data randomly or taking the most recent period produces models with similar ROUGE scores, both much better than the identity baseline. However, about half of the gap to the identity baseline disappears when older training data is taken. In §A.1, we give an example of temporal drift in the HEADLINES data: US largely replaces U.S. in the newer training set and the test set.
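The three training conditions can be illustrated with a short sketch. It assumes each record is a `(year, example)` pair; the function name `temporal_training_sets` and the dictionary keys are hypothetical, and the sketch only covers the selection of training data, not the model.

```python
import random


def temporal_training_sets(records, older_years, newer_years, seed=0):
    """Build the random-half, older-half and newer-half training sets.

    `records` is a list of (year, example) pairs; `older_years` and
    `newer_years` are collections of years, e.g. ranges.
    """
    pool = [r for r in records if r[0] in older_years or r[0] in newer_years]
    rng = random.Random(seed)
    return {
        "random": rng.sample(pool, len(pool) // 2),         # (a) random half
        "older": [r for r in pool if r[0] in older_years],  # (b) first half
        "newer": [r for r in pool if r[0] in newer_years],  # (c) second half
    }
```

With `older_years=range(1994, 1999)` and `newer_years=range(1999, 2004)`, the three sets correspond to conditions (a), (b) and (c) above, and all three would then be evaluated on the same held-out 2004 test set.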
Emoji prediction. For emoji prediction, Go et al. (2009) provide data for a temporal span of 62 days. We split the data into single days and keep the splits with more than 25,000 datapoints in which both classes are represented. We use the last of these, June 16, as our test sample and vary the training data from the first day to the day before June 16. Figure 3 (left) visualizes the results.
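The day-level filtering can be sketched as follows, assuming each record is a `(date, label, text)` triple with ISO-formatted date strings, which sort chronologically. The function name `daily_training_splits` is hypothetical.

```python
from collections import defaultdict


def daily_training_splits(records, min_size=25000):
    """Group records by day, keep sufficiently large days on which at least
    two classes occur, and use the last kept day as the test sample."""
    by_day = defaultdict(list)
    for day, label, text in records:
        by_day[day].append((label, text))
    kept = sorted(
        day
        for day, rows in by_day.items()
        if len(rows) > min_size and len({label for label, _ in rows}) >= 2
    )
    if len(kept) < 2:
        raise ValueError("need at least one training day and one test day")
    test_day = kept[-1]
    train_by_day = {day: by_day[day] for day in kept[:-1]}
    return train_by_day, by_day[test_day], test_day
```

Training one model per entry of `train_by_day` and scoring each on the test day gives one performance value per temporal gap.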