Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level

Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)? Some work suggests that larger models have higher out-of-distribution robustness, while other work suggests they have lower accuracy on rare subgroups. To understand these differences, we investigate these models at the level of individual instances. However, one major challenge is that individual predictions are highly sensitive to noise from the randomness in training. We develop statistically rigorous methods to address this, and after accounting for pretraining and finetuning noise, we find that our BERT-Large is worse than BERT-Mini on at least 1-4% of instances across MNLI, SST-2, and QQP, compared to the overall accuracy improvement of 2-10%. We also find that finetuning noise increases with model size and that instance-level accuracy has momentum: improvement from BERT-Mini to BERT-Medium correlates with improvement from BERT-Medium to BERT-Large. Our findings suggest that instance-level predictions provide a rich source of information; we therefore recommend that researchers supplement model weights with model predictions.


Introduction
Historically, large deep learning models (Peters et al., 2018; Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2019) have improved the state of the art on a wide range of tasks and leaderboards (Schwartz et al., 2014; Rajpurkar et al., 2016; Wang et al., 2018), and empirical scaling laws predict that larger models will continue to improve performance (Kaplan et al., 2020). However, little is understood about such improvement at the instance (datapoint) level. Are larger models uniformly better? In other words, are larger pretrained models better on every instance, or are they better on some instances but worse on others?
Prior work hints at differing answers. Hendrycks et al. (2020) and Desai and Durrett (2020) find that larger pretrained models consistently improve out-of-distribution performance, which implies that they might be uniformly better at a finer level. Henighan et al. (2020) claim that larger pretrained image models have lower downstream classification loss for the majority of instances, and they predict this trend to hold for other data modalities (e.g. text). On the other hand, Sagawa et al. (2020) find that larger non-pretrained models perform worse on rare subgroups; if this result generalizes to pretrained language models, larger models will not be uniformly better. Despite all this indirect evidence, it remains inconclusive how many instances larger pretrained models perform worse on.
A naïve solution is to finetune a larger model, compare it to a smaller one, and find instances where the larger model is worse. However, this approach is flawed, since model predictions are noisy at the instance level. On the MNLI in-domain development set, even the same architecture finetuned with different seeds leads to different predictions on ∼8% of the instances. This is due to under-specification (D'Amour et al., 2020): multiple different solutions can minimize the training loss. Since the accuracy improvement from our BERT-BASE to BERT-LARGE is 2%, most signals across different model sizes will be dominated by noise from random seeds.
To account for the noise in pretraining and finetuning, we define instance accuracy as "how often a model correctly predicts an instance" (Figure 1, left), in expectation across pretraining and finetuning seeds. We estimate this quantity by pretraining 10 models with different seeds, finetuning each of them 5 times (Figure 1, middle), and averaging across all runs.
However, this estimate is still inexact, and we might falsely observe smaller models to be better at some instances by chance. Hence, we propose a random baseline to estimate the fraction of false discoveries (Section 3, Figure 1 right) and formally upper-bound the false discoveries in Section 4. Our method provides a better upper bound than the classical Benjamini-Hochberg procedure with Fisher's exact test.
Using the 50 models of each size and our improved statistical tool, we find that, on the MNLI in-domain development set, accuracy "decays" from BERT-MINI to BERT-LARGE on at least ∼4% of the instances, which is significant given that the improvement in overall accuracy is 10%. These decaying instances contain more controversial or wrong labels, but also correct ones (Section 4.2). Therefore, larger pretrained language models are not uniformly better.
We make other interesting discoveries at the instance level. Section 5 finds that instance-level accuracy has momentum: improvement from MINI to MEDIUM correlates with improvement from MEDIUM to LARGE. Additionally, Section 6 attributes the variance of model predictions to pretraining and finetuning random seeds, and finds that finetuning seeds cause more variance for larger models. Our findings suggest that instance-level predictions provide a rich source of information; we therefore recommend that researchers supplement model weights with model predictions. In this spirit, we release all the pretrained models, model predictions, and code here: https://github.com/ruiqi-zhong/acl2021-instance-level.

Data, Models, and Predictions
To investigate model behavior, we considered different sizes of the BERT architecture and finetuned them on Quora Question Pairs (QQP), Multi-Genre Natural Language Inference (MNLI; Williams et al. (2020)), and the Stanford Sentiment Treebank (SST-2; Socher et al. (2013)). To account for pretraining and finetuning noise, we averaged over multiple random initializations and training data orders, and thus needed to pretrain our own models rather than download existing ones. Following Turc et al. (2019), we trained 5 architectures of increasing size: MINI (L4/H256, i.e. 4 layers with hidden dimension 256), SMALL (L4/H512), MEDIUM (L8/H512), BASE (L12/H768), and LARGE (L24/H1024). For each architecture, we pretrained models with 10 different random seeds and finetuned each of them 5 times (50 runs total) on each task; see Figure 1, middle. Since pretraining is computationally expensive, we reduced the context size during pretraining from 512 to 128 and compensated by increasing the number of training steps from 1M to 2M. Appendix A includes more details about pretraining and finetuning and their computational cost, and Appendix B verifies that our cost-saving changes do not affect accuracy qualitatively.
Notation. We use i to index an instance in the evaluation set, s for model sizes, P for pretraining seeds, and F for finetuning seeds. $c$ is a 0/1 random variable indicating whether a prediction is correct: given the pretraining seed P and the finetuning seed F, $c^s_i = 1$ if the model of size s is correct on instance i, and 0 otherwise. To keep the notation uncluttered, we sometimes omit these superscripts or subscripts when they can be inferred from context.
Unless otherwise noted, we present results on the MNLI in-domain development set in the main paper.

Comparing Instance Accuracy
To find the instances where larger models are worse, a naïve approach is to finetune a larger pretrained model, compare it to a smaller one, and find instances where the larger is incorrect but the smaller is correct. Under this approach, BERT-LARGE is worse than BERT-BASE on 4.5% of the instances and better on 7%, giving an overall accuracy improvement of 2.5%.
However, this result is misleading: even if we compare two BERT-BASE models with different finetuning seeds, their predictions differ on 8% of the instances, while their accuracies differ by only 0.1%; Table 1 reports this baseline randomness across model sizes. Changing the pretraining seed changes around 2% of additional predictions beyond finetuning. Table 1 also reports the standard deviation of overall accuracy, which is about 40 times smaller. Such stability starkly contrasts with the noisiness at the instance level, which poses a unique challenge.

Figure 1: Left: Each column represents the same architecture trained with a different seed. We calculate accuracy for each instance (row) by averaging across seeds (columns), whereas accuracy is usually calculated for each model by averaging across instances. Middle: A visual layout of the model predictions we obtain, a binary-valued tensor with 4 axes: model size s, instance i, pretraining seed P, and finetuning seed F. Right: For each instance, we calculate the accuracy gain from MINI to LARGE and plot the histogram in blue, along with a random baseline in red. Since the blue distribution has a bigger left tail, smaller models are better on some instances.

Instance-Level Metrics
To reflect this noisiness, we define the instance accuracy $\text{Acc}^s_i$ to be how often models of size s predict instance i correctly:
$$\text{Acc}^s_i := \mathbb{E}_{P,F}[c^s_i],$$
where the expectation is taken with respect to the pretraining and finetuning randomness P and F. We estimate $\text{Acc}^s_i$ via the empirical average $\widehat{\text{Acc}}^s_i$ across 10 pretraining × 5 finetuning runs.
We histogram $\widehat{\text{Acc}}^s_i$ in Figure 2 (a). On most instances the model always predicts correctly or incorrectly ($\widehat{\text{Acc}} = 0$ or 1), but a sizable fraction of accuracies lie between the two extremes.
Recall that our goal is to find instances where larger models are less accurate, which we refer to as decaying instances. We therefore study the instance difference between two model sizes $s_1$ and $s_2$, defined as
$${}^{s_1}_{s_2}\Delta\text{Acc}_i := \text{Acc}^{s_2}_i - \text{Acc}^{s_1}_i,$$
which is estimated by the difference between the accuracy estimates, ${}^{s_1}_{s_2}\widehat{\Delta\text{Acc}}_i := \widehat{\text{Acc}}^{s_2}_i - \widehat{\text{Acc}}^{s_1}_i$. We histogram ${}^{\text{BASE}}_{\text{LARGE}}\widehat{\Delta\text{Acc}}_i$ in Figure 2 (b). We observe a unimodal distribution centered near 0, with tails on both sides. Therefore, the estimated differences for some instances are negative.
However, due to estimation noise, we might falsely observe this accuracy decay by chance. Therefore, we introduce a random baseline ${}^{s_1}_{s_2}\Delta\text{Acc}'$ to control for these false discoveries. Recall that we have 10 smaller pretrained models and 10 larger ones. Our baseline splits these into a group A of 5 smaller + 5 larger, and another group B of the remaining 5 + 5. Then the empirical accuracies $\widehat{\text{Acc}}^A$ and $\widehat{\text{Acc}}^B$ are identically distributed, so we take our baseline ${}^{s_1}_{s_2}\Delta\text{Acc}'_i$ to be the difference $\widehat{\text{Acc}}^A_i - \widehat{\text{Acc}}^B_i$. We visualize and compare how to calculate ${}^{s_1}_{s_2}\widehat{\Delta\text{Acc}}$ and ${}^{s_1}_{s_2}\Delta\text{Acc}'$ in Figure 3. We histogram this baseline ${}^{\text{BASE}}_{\text{LARGE}}\Delta\text{Acc}'$ in Figure 2 (b), and find that our noisy estimate ${}^{\text{BASE}}_{\text{LARGE}}\widehat{\Delta\text{Acc}}$ has a larger left tail than the baseline. This suggests that decaying instances exist. We similarly compare MINI to LARGE in Figure 2 (c) and find an even larger left tail.
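Concretely, here is a minimal numpy sketch of the estimate and the baseline, assuming per-instance correctness values are stored as binary arrays of shape (10 pretraining seeds, 5 finetuning seeds, instances); the function names and the first-half/second-half split are our illustrative choices, not necessarily those of the released code:

```python
import numpy as np

def instance_acc(preds):
    # preds: (n_pretrain, n_finetune, n_instances) binary correctness values.
    # Average over all seeds -> empirical instance accuracy (Acc-hat).
    return preds.reshape(-1, preds.shape[-1]).mean(axis=0)

def delta_acc(preds_small, preds_large):
    # Noisy estimate of the instance difference (larger minus smaller).
    return instance_acc(preds_large) - instance_acc(preds_small)

def baseline_delta_acc(preds_small, preds_large):
    # Random baseline: group A mixes 5 smaller + 5 larger pretrained models,
    # group B the remaining 5 + 5, so the two groups are identically
    # distributed and their difference reflects seed noise alone.
    group_a = np.concatenate([preds_small[:5], preds_large[:5]])
    group_b = np.concatenate([preds_small[5:], preds_large[5:]])
    return instance_acc(group_a) - instance_acc(group_b)
```

Because both groups mix the two sizes evenly, any left tail in the baseline comes purely from seed noise, which is exactly the quantity we want to control for.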

Quantifying the Decaying Instances
The left tail of $\widehat{\Delta\text{Acc}}$ noisily estimates the fraction of decaying instances, and the left tail of the random baseline $\Delta\text{Acc}'$ counts the false discovery fraction due to the noise. Intuitively, the true fraction of decaying instances can be captured by the difference of these left tails, and we formally quantify this below.

Figure 2: $\widehat{\Delta\text{Acc}}$ (blue) and its baseline $\Delta\text{Acc}'$ (red) compare BASE and LARGE. To better visualize, we truncated the density (y-axis) above 2. Since the blue histogram has a larger left tail than the red one, there are indeed instances where larger models are worse.

Figure 3: The tables are model predictions with the visual notations established in Figure 1, middle. $\widehat{\Delta\text{Acc}}$ (blue) is the mean difference between the left and the right table, each corresponding to a model size. The random baseline $\Delta\text{Acc}'$ (red) is the mean difference between group A (orange) cells and group B (green) cells, which are identically and independently distributed.
Suppose instance i is drawn from the empirical evaluation distribution. Then we can define the true decaying fraction Decay as
$$\text{Decay} := \mathbb{P}_i[\Delta\text{Acc}_i < 0].$$
Since $\Delta\text{Acc}_i$ is not directly observable and $\widehat{\Delta\text{Acc}}_i$ is noisy, we add a buffer and only consider instances with $\widehat{\Delta\text{Acc}}_i \le t$ for a threshold t, which makes it more likely (but still uncertain) that the true $\Delta\text{Acc}_i < 0$. We denote this "discovery fraction" $\widehat{\text{Decay}}(t)$ as
$$\widehat{\text{Decay}}(t) := \mathbb{P}_i[\widehat{\Delta\text{Acc}}_i \le t].$$
Similarly, we define a baseline control (false discovery fraction) $\text{Decay}'(t)$:
$$\text{Decay}'(t) := \mathbb{P}_i[\Delta\text{Acc}'_i \le t].$$
Hence, $\widehat{\text{Decay}}$ and $\text{Decay}'$ are the cumulative distribution functions of $\widehat{\Delta\text{Acc}}$ and $\Delta\text{Acc}'$ (Figure 4).
We have the following theorem, which we formally state and prove in Appendix D:

Theorem 1 (Informal) If all the random seeds are independent, then for all thresholds t,
$$\text{Decay} \ge \mathbb{E}[\widehat{\text{Decay}}(t)] - \mathbb{E}[\text{Decay}'(t)].$$

Proof Sketch Suppose we observe $c^{s_1}_{R_{1\dots 2k}}$ and $c^{s_2}_{R_{2k+1\dots 4k}}$, where there are 2k different random seeds for each model size (we assume an even number of random seeds, since we mix half of the models from each size to compute the random baseline). Then
$$\widehat{\Delta\text{Acc}}_i := \frac{1}{2k}\sum_{j=2k+1}^{4k} c^{s_2}_{R_j,i} - \frac{1}{2k}\sum_{j=1}^{2k} c^{s_1}_{R_j,i},$$
and hence the discovery fraction is $\widehat{\text{Decay}}(t) := \mathbb{P}_i[\widehat{\Delta\text{Acc}}_i \le t]$. For the random baseline estimator, $\Delta\text{Acc}'_i$ mixes half of the seeds from each size in each group, and the false discovery control is $\text{Decay}'(t) := \mathbb{P}_i[\Delta\text{Acc}'_i \le t]$. Formally, the theorem states that
$$\text{Decay} \ge \mathbb{E}[\widehat{\text{Decay}}(t)] - \mathbb{E}[\text{Decay}'(t)],$$
which is equivalent to
$$\mathbb{E}_i\big[\mathbb{1}[\Delta\text{Acc}_i < 0] - \mathbb{P}[\widehat{\Delta\text{Acc}}_i \le t] + \mathbb{P}[\Delta\text{Acc}'_i \le t]\big] \ge 0.$$

Figure 4: The cumulative distribution function of the histogram in Figure 2 (c); only the negative x-axis is shown because it corresponds to decays. The maximum difference between the two curves (6%) is a lower bound of the true decaying fraction.
Hence, we can declare victory if we can prove that for all i with $\Delta\text{Acc}_i \ge 0$,
$$\mathbb{P}[\Delta\text{Acc}'_i \le t] \ge \mathbb{P}[\widehat{\Delta\text{Acc}}_i \le t].$$
This is easy to see, since $\widehat{\Delta\text{Acc}}_i$ and $\Delta\text{Acc}'_i$ are both differences of binomial variables with the same number of draws, but the first has a larger rate (more details are in Appendix D). Roughly speaking, the true decaying fraction is at least the difference between $\widehat{\text{Decay}}(t)$ and $\text{Decay}'(t)$ at every threshold t. Therefore, we take the maximum difference between $\widehat{\text{Decay}}(t)$ and $\text{Decay}'(t)$ to lower-bound the fraction of decaying instances (adaptively picking the best threshold t depending on the data may incur a slight upward bias; Appendix E estimates that the relative bias is at most 10% using a bootstrap method). For example, Figure 4 estimates the true decaying fraction between MINI and LARGE to be at least 6%.
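In code, the lower bound is a maximum over thresholds of the gap between the two empirical CDFs; a sketch under the same assumptions as the snippet above:

```python
import numpy as np

def decay_lower_bound(delta_hat, delta_baseline, thresholds=None):
    # delta_hat, delta_baseline: per-instance Delta-Acc estimates and the
    # random-baseline differences; only negative thresholds matter for decays.
    if thresholds is None:
        thresholds = np.arange(-1.0, 0.0, 0.05)
    discovery = np.array([(delta_hat <= t).mean() for t in thresholds])
    false_discovery = np.array([(delta_baseline <= t).mean() for t in thresholds])
    return max(0.0, (discovery - false_discovery).max())
```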
We compute this lower bound for other pairs of model sizes in Table 2, and the full results across other tasks and model size pairs are in Appendix C. In all of these settings we find a non-zero fraction of decaying instances, and larger model size differences usually lead to more decaying instances.
Unfortunately, applying Theorem 1 as above is not fully rigorous, since some finetuning runs share the same pretraining seeds and hence are dependent (although we anticipate such dependencies do not cause a substantial difference, as discussed in Appendix D). To obtain a statistically rigorous lower bound, we slightly modify our target of interest. Instead of examining individual finetuning runs, we ensemble our model across the 5 different finetuning runs for each pretraining seed; these ensembled predictions are essentially the same as individual finetuning runs, except that the finetuning randomness is averaged out. Hence we obtain 10 independent sets of model predictions with different random seeds, which allows us to apply Theorem 1.
We compare MINI to LARGE using these ensembles and report the discovery fraction $\widehat{\text{Decay}}$ and the baseline $\text{Decay}'$ in Table 3. Taking the maximum difference across thresholds, we estimate that at least ∼4% of instances decay. This estimate is lower than the previous 6% estimate, which used the full set of 50 models' predictions assuming they were independent. However, this is still a meaningful amount, given that the overall accuracy improvement from MINI to LARGE is 10%.

Fisher's Test + Benjamini-Hochberg
Here is a more classical approach to lower-bound the decaying fraction. For each instance, we compute a significance level α under the null hypothesis that the larger model is better, using Fisher's exact test. We sort the significance levels in ascending order, and call the p-th percentile $\alpha_p$. Then we pick a false discovery rate q (say, 25%), find the largest p such that $\alpha_p < pq$, and estimate the decaying fraction to be at least $p(1 - q)$. This calculation is known as the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995).
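For reference, a sketch of this classical baseline; it assumes per-instance counts of correct predictions across seeds, and uses scipy's Fisher's exact test for the one-sided per-instance significance level:

```python
import numpy as np
from scipy.stats import fisher_exact

def bh_decay_lower_bound(correct_small, correct_large, n_seeds, q=0.25):
    # correct_small[i], correct_large[i]: number of seeds (out of n_seeds)
    # that predict instance i correctly.
    pvals = []
    for cs, cl in zip(correct_small, correct_large):
        table = [[cs, n_seeds - cs], [cl, n_seeds - cl]]
        # One-sided test against the null that the larger model is better:
        # small p-value = strong evidence the smaller model is more accurate.
        _, p = fisher_exact(table, alternative="greater")
        pvals.append(p)
    pvals = np.sort(pvals)
    m = len(pvals)
    # Benjamini-Hochberg: largest k (1-indexed) with p_(k) <= (k / m) * q.
    ks = np.nonzero(pvals <= q * np.arange(1, m + 1) / m)[0]
    if len(ks) == 0:
        return 0.0
    frac = (ks[-1] + 1) / m
    return frac * (1 - q)  # at least this fraction are true discoveries
```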
To compare our method with this classical approach, we estimate the lower bound of the decaying fraction for different pairs of model sizes with different numbers of pretrained models available. To make sure our choice of the false discovery rate q does not bias against the classical approach, we adaptively choose q to maximize its performance. Appendix F includes the full results, and Table 4 is a representative subset.

Table 4: We compare our method to the Fisher's exact test + Benjamini-Hochberg (BH) procedure described in Section 4. For all model size pairs and numbers of pretrained models available, ours always provides a higher (better) lower bound on the decaying fraction.
We find that our approach is more powerful, particularly when the true decaying fraction is likely to be small and only a few models are available, which is usually the regime of interest. For example, across all pairs of model sizes, our approach only needs 2 random seeds (i.e. pretrained models) to provide a non-zero lower bound on the decaying fraction, while the classical approach sometimes fails to do this even with 10 seeds. Intuitively, when fewer seeds are available, the smallest possible significance level for each instance is larger than the decaying fraction, hence hurting the classical approach.

Understanding the Decaying Instances
We next manually examine the decaying instances to see whether we can find any interpretable patterns. One hypothesis is that all the decaying instances are in fact mislabeled, and hence larger models are not truly worse on any instances.
To investigate this hypothesis, we examined the group of instances where ${}^{\text{MINI}}_{\text{LARGE}}\widehat{\Delta\text{Acc}}_i \le -0.9$. MINI is almost always correct on these instances, while LARGE is almost always wrong, and the false discovery fraction is tiny. For each instance, we manually categorize it as either: 1) Correct, if the label is correct; 2) Fine, if the label might be controversial but we could see a reason why this label is reasonable; 3) Wrong, if the label is wrong; or 4) Unsure, if we are unsure about how to label this instance. Each time we annotate, with 50% probability we randomly sample either a decaying instance or an instance from the remaining dataset as a control. We are blind to which group it comes from.

Table 5: MINI vs. LARGE. We examine whether there are mislabels for the decaying fraction (superscript D) and the rest of the dataset (control group, superscript C). The decaying fraction contains more mislabels, but includes correct labels as well.
For each task of MNLI, QQP, and SST-2, the first author annotated 100 instances (decay + control group) (Table 5). We present all the annotated decaying instances in Appendix J.
Conclusion We find that the decaying fraction has more wrong or controversial labels compared to the remaining instances. However, even after we adjust for the fraction of incorrect labels, the decaying fraction still exceeds the false discovery control. This implies that MINI models are better than LARGE models on some correctly labeled instances. The second author followed the same procedure and reproduced the same qualitative results.
However, we cannot find an interpretable pattern for these correctly labeled decaying instances by simply eyeballing them. We discuss future directions for discovering interpretable categories in Section 7.

Correlation of Instance Difference
We next investigate whether instance accuracy improvements have momentum: for example, if the instance accuracy improves from MINI ($s_1$) to MEDIUM ($s_2$), is it more likely to improve from MEDIUM ($s_2$) to LARGE ($s_3$)?
The naïve approach is to calculate the Pearson correlation coefficient between ${}^{\text{MINI}}_{\text{MEDIUM}}\Delta\text{Acc}$ and ${}^{\text{MEDIUM}}_{\text{LARGE}}\Delta\text{Acc}$, and we find the correlation to be zero. However, this is partly an artifact of accuracies being bounded in [0, 1]. If MEDIUM drastically improves over MINI from 0 to 1, there is no room for LARGE to improve over MEDIUM. To remove this inherent negative correlation, we calculate the correlation conditioned on the accuracy of the middle-sized model. Specifically, we bucket instances by their estimated MEDIUM accuracy into intervals of size 0.1, and we find the correlation to be positive within each bucket (Table 6, row 2). This fixes the problem with the naïve approach by getting rid of the negative correlation, which could have misled us to believe that improvements by larger models are uncorrelated.
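A sketch of this bucketed analysis (array names are hypothetical; bucket width 0.1 as above):

```python
import numpy as np
from scipy.stats import pearsonr

def bucketed_correlation(delta_lo_mid, delta_mid_hi, acc_mid, width=0.1):
    # Correlate the two improvements within buckets of MEDIUM accuracy,
    # removing the inherent negative correlation from bounded accuracies.
    results = {}
    for lo in np.arange(0.0, 1.0, width):
        mask = (acc_mid >= lo) & (acc_mid < lo + width)
        if mask.sum() > 2:  # need enough instances for a meaningful correlation
            r, _ = pearsonr(delta_lo_mid[mask], delta_mid_hi[mask])
            results[(round(lo, 1), round(lo + width, 1))] = r
    return results
```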
We additionally find that the correlations between improvements become stronger when model size differences are smaller. Table 6, row 1 reports results for another model size triplet with a smaller size difference, i.e. $(s_1, s_2, s_3)$ = (SMALL, MEDIUM, BASE), and the correlation is larger for all buckets. Results for more tasks and size triplets are in Appendix G, and the same conclusions hold qualitatively.

Variance at the Instance Level
Section 3 found that the overall accuracy has relatively low variance, but model predictions are noisy. This section formally analyzes variance at the instance level. For each instance, we decompose its loss into three components: Bias², variance due to pretraining randomness, and variance due to finetuning randomness. Formally, we consider the 0/1 loss
$$L_i := \mathbb{E}_{P,F}[(1 - c_i)^2],$$
where $c_i$ is a 0/1 random variable indicating whether the prediction is correct, with respect to randomness in pretraining and finetuning. By the bias-variance decomposition and the law of total variance, we have
$$L_i = \text{Bias}^2_i + \text{PretVar}_i + \text{FineVar}_i,$$
where, using P and F for the pretraining and finetuning random seeds,
$$\text{Bias}^2_i := (1 - \mathbb{E}_{P,F}[c_i])^2, \quad \text{PretVar}_i := \text{Var}_P[\mathbb{E}_F[c_i|P]], \quad \text{FineVar}_i := \mathbb{E}_P[\text{Var}_F[c_i|P]],$$
capturing "how wrong the average prediction is", the variance due to pretraining, and the variance due to finetuning seeds, respectively. We can directly estimate FineVar by first calculating the sample variance across finetuning runs for each pretraining seed, and then averaging these variances across the pretraining seeds. Estimating PretVar is more complicated. A naïve approach is to calculate the empirical variance, across pretraining seeds, of the average accuracy across finetuning seeds. However, the estimated average accuracy for each pretraining seed is itself noisy, which causes an upward bias in the PretVar estimate. We correct this bias by estimating the variance of the estimated average accuracy and subtracting it from the naïve estimate; see Appendix H for details, as well as a generalization to more than two sources of randomness. Finally, we estimate Bias² by subtracting the two variance estimates from the estimated loss.
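A minimal numpy sketch of this decomposition under the 0/1 loss, including the debiasing correction described above; the array layout is our assumption:

```python
import numpy as np

def decompose_loss(preds):
    # preds: binary array of shape (n_pretrain_seeds, n_finetune_seeds,
    # n_instances); preds[j, k, i] is c_i under seeds (P_j, F_jk).
    n_pre, n_fine, _ = preds.shape
    var_across_fine = preds.var(axis=1, ddof=1)   # per pretraining seed
    fine_var = var_across_fine.mean(axis=0)       # E_P[Var_F[c | P]]
    fine_means = preds.mean(axis=1)               # (n_pre, n_instances)
    naive_pret_var = fine_means.var(axis=0, ddof=1)
    # Each finetuning mean is noisy, with variance ~ Var_F / n_fine; subtract
    # this to debias the naive pretraining-variance estimate.
    pret_var = naive_pret_var - fine_var / n_fine
    loss = (1.0 - preds).mean(axis=(0, 1))        # 0/1 loss: (1 - c)^2 = 1 - c
    bias_sq = loss - pret_var - fine_var
    return bias_sq, pret_var, fine_var
```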
For each of these three quantities, Bias², PretVar, and FineVar, we estimate it for each instance, average it across all instances in the evaluation set, and report the result in Table 7. The variances at the instance level are much larger than the variance of the overall accuracy, by a factor of roughly 1000.
We may conclude from Table 7 that larger models have larger finetuning variance and smaller pretraining variance. However, lower bias also inherently implies lower variance. To see this, suppose a model has perfect accuracy and hence zero bias; then it always predicts the same label (the correct one) and hence has zero variance. This might favor larger models and "underestimate" their variance, since they have lower bias. Therefore, we calculate and compare the variances conditioned on the bias,
$$\text{PretVar}^s(b^2) := \mathbb{E}_i[\text{PretVar}^s_i \mid \text{Bias}^{2,s}_i = b^2].$$
We estimate $\text{PretVar}^s(b^2)$ using Gaussian process regression and plot it against $b^2$ in Figure 5. We find that larger models still have lower pretraining variance across all levels of bias on the specific task of MNLI under the 0/1 loss. To further check whether our conclusions are general, we tested them on other tasks and under the squared loss $L_i := (1 - p_i)^2$, where $p_i$ is the probability assigned to the correct class. Below are the conclusions that generally hold across different tasks and loss functions.
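The conditional-variance curves can be estimated with off-the-shelf Gaussian process regression, e.g. scikit-learn's; the kernel below is an illustrative assumption, not necessarily the paper's choice:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def variance_vs_bias(bias_sq, variance, n_grid=50):
    # Smooth per-instance variance as a function of Bias^2, so that model
    # sizes can be compared at matched bias levels.
    kernel = RBF(length_scale=0.1) + WhiteKernel()
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gpr.fit(bias_sq.reshape(-1, 1), variance)
    grid = np.linspace(0.0, 1.0, n_grid).reshape(-1, 1)
    return grid.ravel(), gpr.predict(grid)
```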
Conclusion We find that 1) larger models have larger finetuning variance; 2) LARGE has smaller pretraining variance than BASE, though the ordering between other sizes varies across tasks and losses; and 3) finetuning variance is 2-8 times as large as pretraining variance, and the ratio is bigger for larger models.

Discussion and Future Directions
To investigate model behaviors at the instance level, we produced massive amounts of model predictions in Section 2 and treated them as raw data. To extract insights from them, we developed better metrics and statistical tools, including a new method to control the false discoveries, an unbiased estimator for the decomposed variances, and metrics that compute variance and correlation of improvements conditioned on instance accuracy. We find that larger pretrained models are indeed worse on a nontrivial fraction of instances and have higher variance due to finetuning seeds; additionally, instance accuracy improvements from MINI to MEDIUM correlate with improvements from MEDIUM to LARGE .
Overall, we treated model prediction data as the central object and built analysis tools around them to obtain a finer understanding of model performance. We therefore refer to this paradigm as "instance level understanding as data mining". We discuss three key factors for this paradigm to thrive: 1) scalability and the cost of obtaining prediction data, 2) other information to collect for each instance, and 3) better statistical tools. We analyze each of these aspects below.
Scalability and Cost of Data Data mining is more powerful with more data. How easy is it to obtain more model predictions? In our paper, the main bottleneck is pretraining. However, once the pretrained models are released, individual researchers can download them and only need to repeat the cheaper finetuning procedure.
Furthermore, model prediction data are under-shared: while many recent research papers share their code or even model weights to help reproduce results, it is not yet standard practice to share all the model predictions. Since much recent research follows almost the same recipe of pretraining and finetuning (McCoy et al., 2020; Desai and Durrett, 2020; Dodge et al., 2020), much computation could be saved if model predictions were shared. On the other hand, as state-of-the-art model sizes increase at a staggering speed, most researchers will not be able to run inference on even a single instance. The trend that models are becoming larger and more similar necessitates more prediction sharing.

Meta-Labels and Other Predictions
Data mining is more powerful with more types of information. One way to add information to each instance is to assign "meta-labels". In the HANS (McCoy et al., 2019) dataset, the authors tag each instance with a heuristic that holds for the training distribution but fails on this instance. Naik et al. (2018a) and Ribeiro et al. (2020) associate each instance with a particular stress-test type or subgroup, for example, whether the instance requires the model to reason numerically or handle negations. Nie et al. (2020) collect multiple human responses to estimate human disagreement for each instance. This meta-information can potentially help us identify interpretable patterns in the disagreeing instances where one model is better than the other. On the flip side, identifying disagreeing instances between two models can also help us generate hypotheses and decide what subgroup information to annotate.
We can also add performance information from other tasks to each instance. For example, Pruksachatkun et al. (2020) studied the correlation between syntactic probing accuracy (Hewitt and Liang, 2019) and downstream task performance. Turc et al. (2019) and Kaplan et al. (2020) studied the correlation between language modeling loss and downstream task performance. However, they did not analyze correlations at the instance level. We may investigate whether their results hold at the instance level: if an instance is easier to tag by a probe or easier to predict by a larger language model, is its accuracy likely to be higher?
Statistical Tools Data mining is more powerful with better statistical tools. Initially we used the Benjamini-Hochberg procedure with Fisher's exact test, which required us to pretrain 10 models to formally verify that decaying instances exist. However, we later realized that 2 is in fact enough using our approach introduced in Section 4. We could have saved 80% of the pretraining computation if this approach had been known before we started.
Future work can explore more complicated metrics and settings. We compared at most 3 different model sizes at a time; higher-order comparisons require novel metrics. We studied two sources of randomness, pretraining and finetuning, but other sources of variation can be interesting as well, e.g. differences in the pretraining corpus, different model checkpoints, etc. To deal with more sophisticated metrics, handle different sources and hierarchies of randomness, and reach conclusions that are robust to noise at the instance level, researchers need to develop new inference procedures.
To conclude, for better instance-level understanding, we need to produce and share more prediction data, annotate more diverse linguistic properties, and develop better statistical tools for inference under noise. We hope our work can inform researchers about the core challenges underlying instance-level understanding and inspire future work.

A Pretraining and Finetuning Details
Here we explain how we obtain the model predictions, which are analyzed in later sections. To obtain these predictions under the "pretraining and finetuning" framework (Devlin et al., 2019), we need to choose a model size, perform pretraining, finetune on a training set with a choice of hyperparameters, and test the model on an evaluation set. We discuss each of these aspects below.
Since pretraining introduces randomness, for each model size s we pretrain 10 times with different random seeds P; since finetuning also introduces noise, for each pretrained model we finetune 5 times with different random seeds F. Additionally, we evaluate the model at the checkpoints after E epochs, where E ∈ {3, 3⅓, 3⅔, 4}. Pretraining 10 models for all 5 model sizes altogether takes around 3840 hours on a TPU v3 with 8 cores. Finetuning all of them 5 times on all three tasks in our paper requires around 1200 hours.

B Compare Our Models to the Original
Since we decreased the pretraining context length to save computation, our models are not exactly the same as the original BERT releases by Devlin et al. (2019) and Turc et al. (2019). We benchmark our models against theirs to ensure that their performance is still reasonable and the qualitative trend still holds. For each size and task, we finetune the original model 5 times and calculate the average overall accuracy.
The comparison can be seen in Table 8. We find that our models do not substantially differ from the original ones on QQP and SST-2. On MNLI, the performance of our BERT-BASE and BERT-LARGE is 2-3% below the original release, but the qualitative trend that larger models have better accuracy still holds robustly.

C Full Results Across Tasks and Model Size Pairs

Similar to Figure 4, for all 10 pairs of model sizes and all in-distribution instances of MNLI, SST-2, and QQP, we plot the cumulative densities of $\widehat{\Delta\text{Acc}}$ and $\Delta\text{Acc}'$, i.e. $\widehat{\text{Decay}}(t)$ and $\text{Decay}'(t)$, in Figures 6, 7, and 8.
Additionally, for each pair of model sizes $s_1$ and $s_2$, we estimate "how many instances are getting better/worse accuracy?" by taking the maximum difference between the red curve and the blue curve. We report these results for MNLI, SST-2, and QQP in Table 9. We find that larger model size gaps lead to larger decaying fractions, but also larger improving fractions.

D Proof of Theorem 1
Formal Setup Our goal is to show that if all the random seeds are independent,
$$\text{Decay} \ge \mathbb{E}[\widehat{\text{Decay}}(t)] - \mathbb{E}[\text{Decay}'(t)].$$
More concretely, suppose each instance is indexed by i, the set of all instances is T, and the random seed is R; then $c^s_R \in \{0,1\}^{|T|}$ is a random |T|-dimensional vector, where $c^s_{R,i} = 1$ if the model of size s correctly predicts instance i under the random seed R. We are comparing model sizes $s_1$ and $s_2$, where $s_2$ is larger; to keep notation uncluttered, we omit these indices whenever possible.
Suppose we observe $c^{s_1}_{R_{1\dots 2k}}$ and $c^{s_2}_{R_{2k+1\dots 4k}}$, where there are 2k different random seeds for each model size (we assume an even number of random seeds, since we will mix half of the models from each size to compute the random baseline). Then
$$\widehat{\Delta\text{Acc}}_i := \frac{1}{2k}\sum_{j=2k+1}^{4k} c^{s_2}_{R_j,i} - \frac{1}{2k}\sum_{j=1}^{2k} c^{s_1}_{R_j,i},$$
and hence the discovery rate $\widehat{\text{Decay}}(t)$ is defined as
$$\widehat{\text{Decay}}(t) := \mathbb{P}_i[\widehat{\Delta\text{Acc}}_i \le t].$$
For the random baseline estimator, we have
$$\Delta\text{Acc}'_i := \frac{1}{2k}\Big(\sum_{j=1}^{k} c^{s_1}_{R_j,i} + \sum_{j=2k+1}^{3k} c^{s_2}_{R_j,i}\Big) - \frac{1}{2k}\Big(\sum_{j=k+1}^{2k} c^{s_1}_{R_j,i} + \sum_{j=3k+1}^{4k} c^{s_2}_{R_j,i}\Big),$$
and the false discovery control $\text{Decay}'$ is defined as
$$\text{Decay}'(t) := \mathbb{P}_i[\Delta\text{Acc}'_i \le t].$$
To reiterate, the definition of the true decay rate is
$$\text{Decay} := \mathbb{P}_i[\Delta\text{Acc}_i < 0].$$
Our goal is to prove that
$$\text{Decay} \ge \mathbb{E}[\widehat{\text{Decay}}(t)] - \mathbb{E}[\text{Decay}'(t)]. \quad (22)$$
Proof By re-arranging terms and linearity of expectation, Equation 22 is equivalent to the following:
$$\mathbb{E}_i\big[\mathbb{1}[\Delta\text{Acc}_i < 0] - \mathbb{P}[\widehat{\Delta\text{Acc}}_i \le t] + \mathbb{P}[\Delta\text{Acc}'_i \le t]\big] \ge 0. \quad (23)$$
Hence, we can declare victory if we can prove that for all i,
$$\mathbb{1}[\Delta\text{Acc}_i < 0] - \mathbb{P}[\widehat{\Delta\text{Acc}}_i \le t] + \mathbb{P}[\Delta\text{Acc}'_i \le t] \ge 0. \quad (24)$$
To prove Equation 24, we observe that if $\Delta\text{Acc}_i < 0$, since the probabilities are bounded by 0 and 1, its left-hand side must be positive. Therefore, we only need to prove that when $\Delta\text{Acc}_i \ge 0$,
$$\mathbb{P}[\Delta\text{Acc}'_i \le t] \ge \mathbb{P}[\widehat{\Delta\text{Acc}}_i \le t],$$
which will be proved in Lemma 1.
Lemma 1 If $\Delta\text{Acc}_i \ge 0$, then for all t, $\mathbb{P}[\Delta\text{Acc}'_i \le t] \ge \mathbb{P}[\widehat{\Delta\text{Acc}}_i \le t]$.

For m = 1, 2, define $p^{s_m}_i := \mathbb{E}_R[c^{s_m}_{R,i}]$. Since $c^{s_1}_i$ and $c^{s_2}_i$ are both Bernoulli random variables with rates $p^{s_1}_i$ and $p^{s_2}_i$ respectively, we can write down the probability distributions of $\widehat{\Delta\text{Acc}}_i$ and $\Delta\text{Acc}'_i$ as the sum/difference of several independent binomial variables, i.e.
$$2k\,\widehat{\Delta\text{Acc}}_i \sim \text{Bin}(2k, p^{s_2}_i) - \text{Bin}(2k, p^{s_1}_i),$$
$$2k\,\Delta\text{Acc}'_i \sim \text{Bin}(k, p^{s_1}_i) + \text{Bin}(k, p^{s_2}_i) - \text{Bin}(k, p^{s_1}_i) - \text{Bin}(k, p^{s_2}_i).$$
When $\Delta\text{Acc}_i \ge 0$, i.e. $p^{s_2}_i \ge p^{s_1}_i$, a standard coupling argument shows that $\widehat{\Delta\text{Acc}}_i$ stochastically dominates $\Delta\text{Acc}'_i$, which implies the lemma.

D.1 Independent Seed Assumption
We note that Theorem 1 requires the seeds R to be independent. This assumption does not hold on our data, since some finetuning runs share the same pretraining seeds. Therefore, the above proof no longer holds. Specifically, Lemma 1 fails because $\widehat{\Delta\text{Acc}}$ and $\Delta\text{Acc}'$ are no longer differences of binomial variables, and the former does not necessarily stochastically dominate the latter. Here is a counter-example when the seeds are not entirely independent.
Hypothetically, suppose we are comparing a smaller model $s_1$ and a larger model $s_2$. For the smaller model, with probability .1 it finds a perfect pretrained model that always predicts correctly across all finetuning runs, and with probability .9 it finds a bad pretrained model that always predicts incorrectly. For the larger model, with probability 1 it finds an average pretrained model that predicts correctly on .2 of the finetuning runs. The larger model is on average better, because it has .2 > .1 probability of being correct; hence $\Delta\text{Acc} > 0$. Suppose we observe 2 independent pretraining seeds for each size and an infinite number of finetuning seeds for each pretraining seed, and let us consider the threshold −0.8. Then the event that $\widehat{\Delta\text{Acc}}_i \le -0.8$ happens with probability 0.01, when both of the two small pretrained models have good pretraining seeds, whereas $\Delta\text{Acc}'_i$ is at least −0.5 and will never be less than −0.8.

Figure 6: Similar to Figure 4, on the MNLI in-domain development set, for each pair of model sizes, we plot the cumulative density function of instance differences.

The key idea behind this counter-example is that even if the larger model has a better average, the distribution of average finetuning accuracy across pretraining seeds might not stochastically dominate the one with a lower average, because of outliers. A priori, this is unlikely to happen in practice, since pretraining variance is generally small and we have multiple pretraining seeds to average out the outliers. Nevertheless, future work is needed to make a more rigorous argument.

E Upward Bias of Adaptive Thresholds
In Section 3 we picked the best threshold that maximizes the lower bound, which can incur a slight upward bias. Here we estimate, with a bootstrapping method, that the bias is at most 10% relative to the unbiased lower bound.
We use the empirical distribution of the 10 pretrained models as the ground-truth distribution for bootstrapping. We first compute a best threshold with 10 sampled smaller and 10 sampled larger pretrained models, and then compute the lower bound L with this threshold on another sample of 10 smaller and 10 larger models. Intuitively, we use one bootstrap sample (which contains 10 smaller pretrained models and 10 larger pretrained models) as the development set to "tune the threshold", and then use this threshold on a fresh bootstrap sample to compute the lower bound. We refer to the lower bound that uses the best threshold as L*, and compute the relative error $\mathbb{E}[L^* - L]/\mathbb{E}[L]$, where the expectation is taken with respect to bootstrap samples.
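A sketch of this bootstrap, assuming the per-instance accuracy of each pretrained model has already been averaged (or ensembled) over finetuning runs; the helper structure and resampling details are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def delta_and_baseline(ps, pl):
    # ps, pl: (n_pretrained_models x n_instances) per-instance accuracies.
    d_hat = pl.mean(axis=0) - ps.mean(axis=0)
    half = ps.shape[0] // 2
    d_base = (np.concatenate([ps[:half], pl[:half]]).mean(axis=0)
              - np.concatenate([ps[half:], pl[half:]]).mean(axis=0))
    return d_hat, d_base

def bound_at(t, d_hat, d_base):
    return (d_hat <= t).mean() - (d_base <= t).mean()

def relative_bias(ps, pl, thresholds, n_boot=200):
    n = ps.shape[0]
    l_star, l_fresh = [], []
    for _ in range(n_boot):
        def resample(p):
            return p[rng.integers(0, n, n)]
        d1, b1 = delta_and_baseline(resample(ps), resample(pl))
        t_star = max(thresholds, key=lambda t: bound_at(t, d1, b1))
        l_star.append(bound_at(t_star, d1, b1))   # L*: tuned and evaluated together
        d2, b2 = delta_and_baseline(resample(ps), resample(pl))
        l_fresh.append(bound_at(t_star, d2, b2))  # L: tuned threshold, fresh sample
    return (np.mean(l_star) - np.mean(l_fresh)) / np.mean(l_fresh)
```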
We report all results in Table 10. In general, we find that the upward bias is negligible, which is at most around 10%.

F Comparison with Significance Testing
We also experimented with the classical approach that calculates a significance level for each instance and then uses the Benjamini-Hochberg procedure to lower-bound the decaying fraction. To make sure that we compare with this approach fairly, we lend it additional power by picking the false discovery rate that maximizes the true discovery counts. We report the decaying fraction on the MNLI in-domain development set found by this classical method and compare it with our method for different model size differences in Table 11; we also simulate situations where we have fewer models.
In general, we find that our method always provides a tighter (higher) lower bound than the classical method, and 2 models are sufficient to verify the existence (i.e. lower bound > 0) of the decaying fraction; in contrast, the classical method sometimes fails to do this even with 10 models, e.g., when comparing BASE to LARGE.
Intuitively, our approach provides a better lower bound because it makes better use of the information that on most instances, both the smaller and the larger models agree and predict completely correctly or incorrectly. For an extreme example, suppose we only observe 2 smaller models and 2 larger models, and an infinite number of datapoints whose predictions are independent. On 99.98% of the datapoints, both models have instance accuracy 1; on 0.01%, the smaller model is completely correct while the bigger is completely wrong, and on the remaining 0.01% the smaller is completely wrong but the bigger completely correct. Setting the threshold to be 2, our decay estimate $\widehat{\text{Decay}}$ is 0.01%, while $\text{Decay}' = 0$: since the models either predict completely correctly or completely wrongly, there is never a false discovery. Therefore, our method provides the tightest possible lower bound, 0.01%, in this case. On the other hand, since we only have 4 models in total, the lowest significance level given by Fisher's exact test is 17% ≫ 0.1%, hence the number of discoveries made by the Benjamini-Hochberg procedure is 0.

G More Results on Momentum
We report more results on the correlation between instance differences. Specifically, for one triplet of model sizes (e.g. MINI ⇒ MEDIUM ⇒ LARGE), for each group of instances that have similar $\text{Acc}^{\text{MEDIUM}}$, we calculate the correlation between instance differences, i.e. the Pearson correlation between ${}^{\text{MINI}}_{\text{MEDIUM}}\Delta\text{Acc}$ and ${}^{\text{MEDIUM}}_{\text{LARGE}}\Delta\text{Acc}$. All results can be seen in Table 12.
We observe that:
• For nearly all buckets, the improvements are positively correlated.
• When the model size gap becomes larger (e.g. MINI, MEDIUM, LARGE has the largest model size differences), the correlation decreases.

H Loss Decomposition and Estimation
In this section, under the bias-variance decomposition and total variance decomposition framework, we decompose the loss into four components: bias, variance brought by pretraining randomness, variance brought by finetuning randomness, and variance brought by fluctuations across checkpoints. The main paper focused on scenarios with 2 sources of randomness: pretraining and finetuning. We discuss the case with 3 sources of randomness in this appendix, rather than 2 as in the main paper, because it is easier to understand the general estimation strategy in the case of 3.

H.1 Formalizing Decomposition
Recall that P is the pretraining seed, F is the finetuning seed, E represents a model checkpoint, and i indexes each instance (datapoint). $c^{s,i}_{P,F,E} = 1$ if the model of size s with pretraining seed P and finetuning seed F, trained for E epochs, is correct on datapoint i, and 0 otherwise. Notice that we move the instance index from the subscript to the superscript, since we now use subscripts for random seeds, and the instance index can be omitted in most of our derivations.
The expected squared loss L of model s on instance i can then be written as
$$L^{s,i} := \mathbb{E}_{P,F,E}[(1 - c^{s,i}_{P,F,E})^2].$$
Since we will analyze this term at a datapoint level, we drop the indices s and i to keep the notation uncluttered. By the standard bias-variance decomposition and total variance decomposition, we decompose the loss L into four terms:
$$L = \text{Bias}^2 + \text{PretVar} + \text{FineVar} + \text{CkptVar}. \quad (34)$$
We will walk through the meaning and definition of these four terms one by one. Bias² captures how bad the average prediction is, defined as
$$\text{Bias}^2 := (1 - \mathbb{E}_{P,F,E}[c])^2.$$
PretVar captures the variance brought by randomness in pretraining, and is defined as
$$\text{PretVar} := \text{Var}_P[\mathbb{E}_{F,E}[c|P]].$$
Similarly, we define the variance brought by randomness in finetuning,
$$\text{FineVar} := \mathbb{E}_P[\text{Var}_F[\mathbb{E}_E[c|P,F]\,|\,P]],$$
and that brought by fluctuations across checkpoints,
$$\text{CkptVar} := \mathbb{E}_{P,F}[\text{Var}_E[c|P,F]].$$

H.2 Unbiased Estimation
We first describe the data on which we apply our estimator. Suppose we pretrain $\mathcal{P}$ models with different random seeds, for each of the $\mathcal{P}$ pretrained models we finetune with $\mathcal{F}$ different random seeds, and we evaluate at $\mathcal{E}$ different checkpoints. Then for all $j \in [\mathcal{P}]$, $k \in [\mathcal{F}]$, $l \in [\mathcal{E}]$, where $[L] := \{l : l \in \mathbb{N}, l \in [0, L-1]\}$, we observe $P_j$, $F_{jk}$, $E_{jkl}$, and $c_{P_j,F_{jk},E_{jkl}}$, where each observed P, F, and E is i.i.d. Our goal is to estimate from c the four quantities described in the previous section.

H.2.1 Estimating CkptVar
It is straightforward to estimate CkptVar. The estimator $\widehat{\text{CkptVar}}$ defined below is unbiased:
$$\widehat{\text{CkptVar}} := \frac{1}{\mathcal{P}\mathcal{F}}\sum_{j\in[\mathcal{P}]}\sum_{k\in[\mathcal{F}]}\widehat{Var}_{P_j,F_{jk}},$$
where
$$\widehat{Var}_{P_j,F_{jk}} := \frac{1}{\mathcal{E}-1}\sum_{l\in[\mathcal{E}]}\big(c_{P_j,F_{jk},E_{jkl}} - \bar{c}_{P_j,F_{jk}}\big)^2, \qquad \bar{c}_{P_j,F_{jk}} := \frac{1}{\mathcal{E}}\sum_{l\in[\mathcal{E}]} c_{P_j,F_{jk},E_{jkl}}.$$
$\widehat{\text{CkptVar}}$ is unbiased, since $\widehat{Var}_{P_j,F_{jk}}$ is an unbiased estimate of the variance of c with fixed $P_j$ and $F_{jk}$ and randomness E, i.e.
$$\mathbb{E}[\widehat{Var}_{P_j,F_{jk}}] = \text{Var}_E[c|P_j,F_{jk}]$$
for all $j \in [\mathcal{P}], k \in [\mathcal{F}]$, and hence by linearity of expectation $\mathbb{E}[\widehat{\text{CkptVar}}] = \mathbb{E}_{P,F}[\text{Var}_E[c|P,F]] = \text{CkptVar}$.

H.2.2 Estimating FineVar
As before, by linearity of expectation, we can declare victory if we can develop an unbiased estimator for the following quantity and then average across $P_j$:
$$\text{Var}_F[\mathbb{E}_E[c_{P,F,E}]\,|\,P = P_j],$$
which verbally means "the variance, across different finetuning seeds, of the mean of c over different checkpoints E, conditioned on the pretraining seed $P_j$." Since $P_j$ is fixed for this estimator, we drop the subscript P to keep notation uncluttered. Therefore, we want to estimate
$$\text{Var}_F := \text{Var}_F[\mathbb{E}_E[c_{F,E}]]. \quad (46)$$
A naive solution is to first take the mean $\bar{c}_{F_k}$ of c for each $F_k$, i.e.
$$\bar{c}_{F_k} := \frac{1}{\mathcal{E}}\sum_{l\in[\mathcal{E}]} c_{F_k,E_{kl}},$$
and then calculate the sample variance $\widetilde{Var}_F$ of $\bar{c}$ with respect to F:
$$\widetilde{Var}_F := \frac{1}{\mathcal{F}-1}\sum_{k\in[\mathcal{F}]}(\bar{c}_{F_k} - \bar{c})^2, \qquad \bar{c} := \frac{1}{\mathcal{F}}\sum_{k\in[\mathcal{F}]}\bar{c}_{F_k}.$$
However, this creates an upward bias: the empirical mean $\bar{c}_{F_k}$ is a noisy estimate of the population mean $\mathbb{E}_E[c_{F_k,E}]$, and this noise makes $\widetilde{Var}_F$ overestimate the variance. Imagine a scenario where $\text{Var}_F$ is in fact 0; since $\bar{c}_{F_k}$ is a noisy estimate, $\widetilde{Var}_F$ will sometimes be positive but never below 0. As a result, $\mathbb{E}[\widetilde{Var}_F] > 0$, so the estimator is biased.
We introduce the following general theorem to correct this bias.
Theorem 2 Suppose $D_k, k \in [\mathcal{F}]$, are independently sampled from the same distribution $\Xi$, which is a distribution of distributions; $\hat{\mu}_k$ is an unbiased estimator of $\mathbb{E}_{X\sim D_k}[X]$, and $\hat{\phi}_k$ is an unbiased estimator of the variance of $\hat{\mu}_k$. Then
$$\widehat{Var} := \frac{1}{\mathcal{F}-1}\sum_{k\in[\mathcal{F}]}(\hat{\mu}_k - \bar{\mu})^2 - \frac{1}{\mathcal{F}}\sum_{k\in[\mathcal{F}]}\hat{\phi}_k \quad (50)$$
is an unbiased estimator for
$$\text{Var}_{D\sim\Xi}[\mathbb{E}_{X\sim D}[X]],$$
where $\bar{\mu} := \frac{1}{\mathcal{F}}\sum_{k\in[\mathcal{F}]}\hat{\mu}_k$. In this estimator, the first term "pretends" that the $\hat{\mu}_k$ are perfect estimates of the population means and calculates their variance, while the second term corrects for the fact that the empirical mean estimates are not perfect. Notice the theorem only requires that $\hat{\mu}$ and $\hat{\phi}$ be unbiased, and is agnostic to the actual computation procedure of these estimators.
Proof We define the population mean of $D_k$ to be $\mu_k$, i.e. $\mu_k := \mathbb{E}_{X\sim D_k}[X]$, and the population mean of $\mu_k$ across the randomness in D to be $\mu$, i.e. $\mu := \mathbb{E}_{D_k\sim\Xi}[\mu_k]$. We look at the first term of the estimator in Equation 50; expanding the square yields five types of summands, and computing the expectation of each and putting them together gives
$$\mathbb{E}\Big[\frac{1}{\mathcal{F}-1}\sum_{k\in[\mathcal{F}]}(\hat{\mu}_k - \bar{\mu})^2\Big] = \text{Var}_{D\sim\Xi}[\mu_k] + \frac{1}{\mathcal{F}}\sum_{k\in[\mathcal{F}]}\mathbb{E}[\hat{\phi}_k].$$
Then from Equation 50, we can tell that $\widehat{Var}$ is unbiased.

Now we come back to developing an unbiased estimator for $\text{Var}_F$ as defined in Equation 46. To utilize Theorem 2, we need two components: an unbiased estimator $\hat{\mu}_{F_k}$ of $\mathbb{E}_E[c_{F,E}\,|\,F = F_k]$, and an unbiased estimator $\hat{\phi}_{F_k}$ of the variance of $\hat{\mu}_{F_k}$. The first is easy: $\hat{\mu}_{F_k} = \bar{c}_{F_k}$. For the second, since $\bar{c}_{F_k}$ is the mean of $\mathcal{E}$ i.i.d. checkpoint evaluations, it suffices to have an unbiased estimate of $\text{Var}_E[c_{F,E}\,|\,F = F_k]$ scaled by $1/\mathcal{E}$. We define
$$\hat{\phi}_{F_k} := \frac{1}{\mathcal{E}(\mathcal{E}-1)}\sum_{l\in[\mathcal{E}]}(c_{F_k,E_{kl}} - \bar{c}_{F_k})^2, \quad (63)$$
and we can plug $\hat{\phi}_{F_k}$ and $\hat{\mu}_{F_k} = \bar{c}_{F_k}$ into Theorem 2 to obtain an unbiased estimator for $\text{Var}_F[\mathbb{E}_E[c_{P,F,E}]\,|\,P = P_j]$; we then average these estimates across the $P_j$ to obtain an unbiased estimate of FineVar.
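To make Theorem 2 concrete, here is a minimal numpy sketch for the simplest case, where each group-mean estimate is the empirical mean of i.i.d. draws within the group:

```python
import numpy as np

def debiased_between_group_variance(samples):
    # samples: (n_groups, n_per_group) i.i.d. draws; returns an unbiased
    # estimate of the variance, over groups, of the true group means.
    mu_hat = samples.mean(axis=1)                 # unbiased group-mean estimates
    # Unbiased estimate of each mu_hat's own sampling variance (phi_hat):
    phi_hat = samples.var(axis=1, ddof=1) / samples.shape[1]
    # The sample variance of mu_hat "pretends" the means are exact; subtracting
    # the average phi_hat corrects for their estimation noise (Theorem 2).
    return mu_hat.var(ddof=1) - phi_hat.mean()
```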

H.2.3 Estimating PretVar
We next estimate
$$\text{PretVar} = \text{Var}_P[\mathbb{E}_{F,E}[c_{P,F,E}]].$$
We can still apply the idea from Theorem 2, which requires: an unbiased estimator $\hat{\mu}_{P_j}$ for $\mathbb{E}_{F,E}[c_{P,F,E}\,|\,P = P_j]$, and an unbiased estimator $\hat{\phi}_{P_j}$ for the variance of $\hat{\mu}_{P_j}$. Again, the first is easy to obtain: $\hat{\mu}_{P_j} = \bar{c}_{P_j}$ is an unbiased estimator, where
$$\bar{c}_{P_j} := \frac{1}{\mathcal{F}\mathcal{E}}\sum_{k\in[\mathcal{F}]}\sum_{l\in[\mathcal{E}]} c_{P_j,F_{jk},E_{jkl}}.$$
However, we cannot straightforwardly estimate $\text{Var}_{F,E}[\hat{\mu}_{P_j}]$ as before, since the samples $c_{P_j,F_{jk},E_{jkl}}$ are no longer independent: evaluations sharing a finetuning seed are correlated. We instead use the law of total variance to decompose the desired quantity (its left-hand side is exactly what we want):
$$\text{Var}_{F,E}[\bar{c}_{P_j}] = \frac{1}{\mathcal{F}}\,\text{Var}_F[\mathbb{E}_E[c|F]] + \frac{1}{\mathcal{F}\mathcal{E}}\,\mathbb{E}_F[\text{Var}_E[c|F]], \quad (57)$$
and we already know how to estimate these two summands from the previous discussion on estimating FineVar.

H.2.4 Estimating Bias 2
It is easy to see that the following $\hat{L}$ is an unbiased estimator for the loss L:
$$\hat{L} := \frac{1}{\mathcal{P}\mathcal{F}\mathcal{E}}\sum_{j\in[\mathcal{P}]}\sum_{k\in[\mathcal{F}]}\sum_{l\in[\mathcal{E}]}(1 - c_{P_j,F_{jk},E_{jkl}})^2.$$
We then estimate $\text{Bias}^2$ by subtracting the three variance estimates from $\hat{L}$.

H.3 Generalization
We can generalize this estimation strategy to decompose the variance into an arbitrary number of sources of randomness. In general, we want to estimate a quantity of the following form,
$$\mathbb{E}_{r_1\dots r_{n-1}}\big[\text{Var}_{r_n}[\mathbb{E}_{r_{n+1}\dots r_N}[c_{r_1\dots r_N}]\,|\,r_1\dots r_{n-1}]\big],$$
from data that has a hierarchical tree structure of randomness.
For the goal of developing an unbiased estimator, we can get rid of the outer expectation over $r_1\dots r_{n-1}$ easily by linearity of expectation: simply estimate the variance conditioned on $r_{1\dots n-1}$ and average the estimates together, as discussed in Section H.2.1.
To estimate
$$\text{Var}_{r_n}[\mathbb{E}_{r_{n+1}\dots r_N}[c_{r_1\dots r_N}]\,|\,r_1\dots r_{n-1}],$$
we make use of Theorem 2, which requires:
• an unbiased estimator $\hat{\mu}$ for the quantity $\mathbb{E}_{r_{n+1}\dots r_N}[c_{r_1\dots r_N}]$, which we can straightforwardly obtain by averaging the samples that share the same random variables $r_{1\dots n}$ (e.g. $\bar{c}_{P_j}$), and
• an unbiased estimator for the variance of $\hat{\mu}$. If $N = n+1$, we can directly compute the sample variance of the c as our estimate (e.g. Equation 63). Otherwise, we use Equation 57 to decompose the desired quantity into two parts, and estimate them recursively by applying Theorem 2 and Equation 57.
For readability we wrote the proof with the assumption that, in the tree of randomness, the number of branches is the same for each node at the same depth. However, our proof does not make use of this assumption and can be applied to a general tree structure of randomness, as long as the number of children is at least 2 for each non-terminal node.

I Variance Conditioned on Bias
Since lower bias usually implies lower variance, to tease out the latent effect, we estimate the variance given a fixed level of bias $\text{Bias}^2 = b^2 \in [0, 1]$, i.e.
$$\text{PretVar}^s(b^2) := \mathbb{E}_i[\text{PretVar}^s_i \mid \text{Bias}^{2,s}_i = b^2],$$
and analogously for $\text{FineVar}^s(b^2)$. We estimate $\text{PretVar}^s(b^2)$ and $\text{FineVar}^s(b^2)$ using Gaussian process regression and plot them against $b^2$ in Figure 9 for MNLI, QQP, and SST-2. We find that larger models usually have larger finetuning variance across all levels of bias (except for MEDIUM and MINI on SST-2), and the BASE model always has larger pretraining variance than LARGE.
We also experimented with the squared loss
$$L_i := (1 - p_i)^2,$$
where $p_i$ is the probability assigned to the correct label for instance i. We plot the same curves in Figure 10 and observe the same trend.

J Example Decaying Instances
We manually examined the group of instances where ${}^{\text{MINI}}_{\text{LARGE}}\widehat{\Delta\text{Acc}}_i \le -0.9$ in Table 3. In other words, MINI is almost always correct on these instances, while LARGE is almost always wrong. For each instance in this group, we manually categorize it into one of four categories: 1) Correct, if the label is correct; 2) Fine, if the label might be controversial but we could see a reason why this label is reasonable; 3) Wrong, if the label is wrong; and 4) Unsure, if we are unsure how to label this instance. As a control, we also examined the remaining fraction of the dataset. Each time we annotate an instance, with 50% probability it is sampled from the decaying fraction or the remaining fraction, and we do not know which group it comes from. We show below all the annotated instances from this decaying fraction and their categories for MNLI (Section J.1), QQP (Section J.2), and SST-2 (Section J.3).

Figure 10: The same figure as Figure 9, except using the squared loss function $L = (1-p)^2$, where p is the probability assigned to the correct label, instead of the 0/1 loss.

J.1 MNLI
MNLI is the abbreviation of Multi-Genre Natural Language Inference (Williams et al., 2020). In this task, given a premise and a hypothesis, the model needs to classify whether the premise entails the hypothesis, contradicts it, or neither. The instances can be seen below.
Premise: and that you're very much right but the jury may or may not see it that way so you get a little anticipate you know anxious there and go well you know
Hypothesis: Jury's operate without the benefit of an education in law.

Premise: An organization's activities, core processes, and resources must be aligned to support its mission and help it achieve its goals.
Hypothesis: An organization is successful if its activities, resources, and goals align.
Label: Entailment
Category: Fine

Premise: A more unusual dish is azure, a kind of sweet porridge made with cereals, nuts, and fruit sprinkled with rosewater.
Hypothesis: Azure is a common and delicious food made with cereals, nuts and fruit.
Label: Entailment
Category: Wrong

Premise: once you have something and it's like i was watching this program on TV yesterday in nineteen seventy six NASA came up with Three D graphics right
Hypothesis: I was watching a program about gardening.
Label: Contradiction
Category: Correct

Premise: , First-Class Mail used by households to pay their bills) and the household bill mail (i.e.
Hypothesis: Second-Class Mail used by households to pay their bills
Label: Contradiction
Category: Unsure

Premise: Rightly or wrongly, America is seen as globalization's prime mover and head cheerleader and will be blamed for its excesses until we start paying official attention to them.
Hypothesis: America's role in the globalization movement is important whether we agree with it or not.

Premise: Part of the reason for the difference in pieces per possible delivery may be due to the fact that five percent of possible residential deliveries are businesses, and it is thought, but not known, that a lesser percentage of possible deliveries on rural routes are businesses.
Hypothesis: We all know that the reason for a lesser percentage of possible deliveries on rural routes being businesses, is because of the fact that people prefer living in cities rather than rural areas.
Label: Neutral
Category: Correct

Premise: right oh they've really done uh good job of keeping everybody informed of what's going on sometimes i've wondered if it wasn't almost more than we needed to know
Hypothesis: I don't think I have shared enough information with everyone.
Label: Contradiction
Category: Correct

Premise: To reach any of the three Carbet falls, you must continue walking after the roads come to an end for 20 minutes, 30 minutes, or two hours respectively.
Hypothesis: There are three routes to the three Carbet falls, each a different length and all continue after the road seemingly ends.
Label: Entailment
Category: Correct

Premise: But when the cushion is spent in a year or two, or when the next recession arrives, the disintermediating voters will find themselves playing the roles of budget analysts and tax wonks.
Hypothesis: The cushion will likely be spent in under two years.
Label: Entailment
Category: Correct

Premise: But, Slate protests, it was [Gates'] byline that appeared on the cover.
Hypothesis: Slate was one hundred percent positive it was Gates' byline on the cover.
Label: Neutral
Category: Correct

Premise: But it's for us to get busy and do something."
Hypothesis: "We don't do much, so maybe this would be good for us to bond and be together for the first time in a while.".
Label: Neutral
Category: Fine

Premise: Pearl Jam detractors still can't stand singer Eddie They say he's unbearably self-important and limits the group's appeal by refusing to sell out and make videos.
Premise: When a GAGAS attestation engagement is the basis for an auditor's subsequent report under the AICPA standards, it would be advantageous to users of the subsequent report for the auditor's report to include the information on compliance with laws and regulations and internal control that is required by GAGAS but not required by AICPA standards.
Hypothesis: The report is required by GAGAS but not AICPA.
Label: Entailment
Category: Correct

Premise: i'm on i'm in the Plano school system and living in Richardson and there is a real dichotomy in terms of educational and economic background of the kids that are going to be attending this school
Hypothesis: The Plano school system only has children with poor intelligence.
Label: Contradiction
Category: Correct

J.2 QQP

Question 1: Why do Muslims think they will conquer the whole world?
Question 2: Do you think Muslims will take over the world?
Label: Non-paraphrase
Category: Correct

Question 1: Is dark matter a sea of massive dark photons that ripple when galaxy clusters collide and wave in a double slit experiment?
Question 2: Does a superfluid dark matter which ripples when Galaxy clusters collide and waves in a double slit experiment relate GR and QM?
Label: Paraphrase
Category: Correct

Question 1: What is Batman like?
Question 2: What is Batman's personality like?
Label: Non-paraphrase
Category: Correct

J.3 SST-2
SST-2 is the abbreviation of the Stanford Sentiment Treebank (Socher et al., 2013). In this task, the model needs to recognize whether phrases or sentences reflect positive or negative sentiment.