Follow the leader(board) with confidence: Estimating p-values from a single test set with item and response variance

Among the problems with leaderboard culture in NLP has been the widespread lack of confidence estimation in reported results. In this work, we present a framework and simulator for estimating p-values for comparisons between the results of two systems, in order to understand the confidence that one is actually better (i.e. ranked higher) than the other. What has made this difficult in the past is that each system must itself be evaluated by comparison to a gold standard. We define a null hypothesis that each system's metric scores are drawn from the same distribution, using variance found naturally (though rarely reported) in test set items and in the individual labels on an item (responses) to produce the metric distributions. We create a test set that evenly mixes the responses of the two systems under the assumption that the null hypothesis is true. Exploring how to best estimate the true p-value from a single test set under different metrics, tests, and sampling methods, we find that the presence of response variance (from multiple raters or multiple model versions) has a profound impact on p-value estimates for model comparison, and that the choice of metric and sampling method is critical to providing statistical guarantees on model comparisons.


Introduction
AI and NLP evaluation is facing a scientific reproducibility crisis that, despite increasing awareness, continues to worsen (Gundersen and Kjensmo, 2018). Published results often show only epsilon improvements over the state of the art, with no effort to estimate whether or not the results are statistically significant. The reasons for this crisis are complex, and it is easy to implicate the culture created by leaderboards (e.g., Wang et al. (2018)).
Our work is motivated by the need to provide statistical testing alongside NLP results in order to reliably demonstrate model improvement,* as opposed to solely depending on leaderboards. Our work naturally ensues from studies of rater response disagreement (see, e.g., Artstein; Snow et al., 2008; Aroyo, 2013; Plank et al., 2014; Fornaciari et al., 2022, among others). Further, the issue of insufficient statistical analysis in NLP work is well-documented, with many ACL papers not reporting statistical significance (Dror et al., 2018). Considering the reliance on system comparison for benchmarking and leaderboards, statistical guarantees that consider the performance of both systems are critical, yet understudied.

* Work completed while interning at Google.
Statistical tests for paired data (e.g. McNemar (1947)) are not appropriate for this setting because of their reliance on strong assumptions about the data (Dietterich, 1998b); even extensions of McNemar's test, such as the Cochran-Mantel-Haenszel test (Mantel, 1963), only apply when the metric can be applied independently to each item or responder (human or machine) and then aggregated. These tests are therefore not applicable to our use case, in large part due to three challenges: (1) three sets of data are involved in the comparison, (2) there is variance in all three of those sets, and (3) many different metrics are used in NLP evaluation. Moreover, variance can arise at the item or response level, due to stochastic inference or training, changes in training data such as cross-validation, or annotator disagreement in gold labels.
We investigate the use of null hypothesis significance tests (NHST) to add a dimension of confidence to NLP evaluations. The purpose of NHST is to determine whether differences between multiple sets of observations are significant, after accounting for sampling variance in the observations. When comparing two NLP systems, each is first compared to a gold standard, resulting in some metric score (e.g. BLEU (Papineni et al., 2002)), and then those metric scores for the two models are compared to each other. While all p-values are estimates, there are many ways to sample and measure the results from a single test set, each producing a different p-value estimate. We explore how to determine which method (of sampling, aggregating, and measuring responses) produces the most accurate p-value estimate from a single test set, in comparison to the true/ground-truth p-value.
In this work, we present a framework for effective p-value estimation when comparing two systems against gold judgments, with the aim of identifying with statistical rigor whether one system outperforms the other. Our findings indicate that the amount of response variance has an impact on p-value estimates, that item-wise mean absolute error is consistently a reliable metric, and that, while most metrics and sampling methods perform well when the two systems' outputs are dissimilar, the choice of metric and sampling method is especially critical when the performance of the two systems is similar.
Our primary contributions include:
• combining, for the first time, the related notions of response disagreement from machines (Gorman and Bedrick, 2019) and from raters (Aroyo and Welty, 2015);
• a new framework for NHST that allows comparisons across different test metrics and sampling strategies;
• a simulator capable of producing informative null hypotheses and computing p-values that account for both item and response variance;
• a thorough evaluation of how well eight metrics and six re-sampling strategies estimate the "true" p-value from a single test, on simulated data; and
• a demonstration of our framework on real-world data.

Our findings give insight into which statistics are most informative when designing NHSTs for contemporary NLP systems, and are applicable to any NLP setting that makes use of comparisons to quantitative gold judgments (e.g. sentiment analysis, semantic similarity, search relevance, translation quality, etc.), when response variance is prevalent. We plan to share our code upon publication.

Related Work
Our approach to generating a statistical guarantee associated with the comparison of two NLP systems (a p-value) is rooted in the statistical inference method of NHST. Our formulation also incorporates variance in rater and system responses.

NHST for evaluation
Existing notions of p-values are built on a null hypothesis H 0 which states that the effect size between the control and test set is zero.The p-value is then the probability that an effect of the observed size or greater would occur under the assumption that H 0 is true.Here, the "control" and "test" sets are the outputs of distinct models that we wish to compare, and the effect size represents the performance of the first system compared to the second on gold standard data.
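To make this concrete, a p-value can be read off an empirical null distribution of effect sizes. The sketch below uses an illustrative toy null (a standard normal), not anything specific to the paper:

```python
import numpy as np

def p_value(observed_effect, null_effects):
    """Two-sided p-value: the fraction of null-hypothesis effect sizes
    at least as extreme as the observed effect."""
    null_effects = np.asarray(null_effects)
    return np.mean(np.abs(null_effects) >= abs(observed_effect))

rng = np.random.default_rng(0)
# Toy null distribution: effect sizes seen when H0 is true.
null = rng.normal(0.0, 1.0, size=100_000)

print(p_value(0.0, null))         # a zero effect is never "extreme"
print(p_value(3.0, null) < 0.05)  # a 3-sigma effect is significant
```

The same recipe applies whatever the effect size is; the hard part, addressed in this paper, is producing a faithful null distribution when the "effect" is a comparison of two systems against gold data.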
Dietterich (1998a) considers hypothesis testing on machine learning problems (specifically, comparing the performance of two learning algorithms with a small amount of data), but does not consider response variance or the accuracy of the p-value estimate. Our approach builds on Dietterich's (1998a); we also observe that the standard null hypotheses do not quite fit the use case of comparing the output of two systems, since the error is the result of a comparison with a third, gold-standard, dataset, and we investigate the effect of different sources of variance, as well as different metrics, on the p-value estimate from a single test set.

Søgaard et al. (2014) explore the effects of sample size, covariates (such as sentence length), and the variance introduced by multiple metrics, and conclude that current approaches to p-value tests are not reproducible or sufficient. They suggest that the usual upper bound of p < 0.05 is too high, and that p < 0.0025 provides a better guarantee that the false positive rate is less than 5%. One problem faced in coming to this conclusion was how to determine what the correct p-value actually is. Note that they use the false positive rate as the target of the guarantee, which is an intuitive but entirely non-standard approach to hypothesis testing. We address this by utilizing a simulator that is capable of generating thousands of test sets, which allows us to make a better estimate of the true p-value and to compare the effects of many more sources of variance.
Related work has surveyed statistical significance testing techniques for NLP systems (Dror et al., 2020) and studied permutation and bootstrapping methods for computing significance tests and confidence intervals on text summarization evaluation metrics (Deutsch et al., 2021). Haroush et al. (2021) observe that out-of-distribution detection can be recast as a p-value problem, using p-values for inference rather than significance testing.
Prior work critical of the utility of the p-value cites the impact of sample size and bias on the level of significance (Sullivan and Feinn, 2012; Thiese et al., 2016), as well as the variability of p-values across samples (Halsey et al., 2015). Kohavi et al. (2022) examine the misunderstandings and errors related to statistics reported on A/B test experiments, including the erroneous perception that the p-value indicates the chance of a false positive. They suggest that p-values are widely misapplied even by experts and that intentional efforts need to be made to report meaningful statistical measures.
Though there is some criticism of the use of p-values, we propose that they can be useful in addressing the lack of confidence estimation in NLP system evaluations. Further, we aim to address the effect of variability across samples by using a large number of samples to determine the best approach to p-value generation.
Null hypothesis statistical testing alone also does not lead to reproducibility, due to the use of inconsistent train-test splitting in evaluation (Gorman and Bedrick, 2019). We address this in our approach by incorporating response variance, discussed in more detail below.

Response Variance
For each item in a test set, a human rater can provide a response, such as a class label or a Likert-scale rating. Prior work indicates the importance of eliciting such responses from multiple raters per item, to account for ambiguity and different perspectives (Aroyo, 2014; Uma et al., 2021). Regardless of the task, gathering multiple responses results in disagreement. Machine systems also provide a response for each item, and these responses can vary with stochastic training conditions, hyperparameter changes, cross-validation, and other causes. System variance can be incorporated into model prediction by merging answers rather than simply ranking them (Gondek et al., 2012).
Response variance may be indicative of true features of the data and can thus be incorporated into the model (Reidsma and Carletta, 2008). Recent work has indicated that majority-vote aggregation may not be effective at resolving or incorporating annotator variance (Davani et al., 2022; Barile et al., 2021).
Prior work has explored the role of variance and data collection in metrics on human-annotated datasets (Welty et al., 2019). Homan et al. (2022) provide a framework for analyzing the amount of variance, and the types of disagreement, in crowdsourced datasets. Wong et al. (2021) address the variance in crowdsourced annotations by presenting a more contextualized measure of inter-rater reliability based on Cohen's kappa. Bayesian models of annotation have also been used and evaluated as potential methods for identifying annotator accuracy and item difficulty (Paun et al., 2018). Recent work has also considered incorporating logical justifications of human viewpoints as a two-dimensional judgment (Draws et al., 2022).
Our simulator produces scores with variance according to different distributions (specified as hyperparameters), allowing us to include response variance in our evaluation.

Problem Formulation
Comparing two NLP systems often involves measuring a baseline B and a candidate A against gold judgments G, to determine whether A is an improvement over B. This comparison is made using a metric δ run over a test set that is drawn from a population of data. For each item i in the test set, both A and B have a distribution of responses, A_i and B_i, and it is possible to have multiple responses for each item. In addition, due to rater disagreement, there is a distribution of human responses, G_i. The metric δ compares each system's responses to the human responses and produces a pair of metric scores, δ(A, G) and δ(B, G). Finally, the per-system metric scores are compared to each other, so that when δ(A, G) > δ(B, G) we can say A is an improvement over B.
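The comparison pipeline described above can be sketched as follows; the metric, data shapes, and function names are illustrative assumptions, not the paper's code:

```python
import numpy as np

# A generic comparison: a metric delta scores each system against the
# gold responses, and the two scores are then compared.
def compare(delta, A, B, G):
    score_a, score_b = delta(A, G), delta(B, G)
    return score_a, score_b, score_a > score_b  # True: A improves on B

# Toy delta: negative item-wise mean absolute error (higher is better).
delta = lambda S, G: -abs(S.mean(axis=1) - G.mean(axis=1)).mean()

rng = np.random.default_rng(0)
G = rng.normal(0.5, 0.1, (1000, 5))     # gold: 1000 items, 5 raters
A = G + rng.normal(0.0, 0.05, G.shape)  # candidate close to gold
B = G + rng.normal(0.2, 0.05, G.shape)  # baseline biased away from gold

print(compare(delta, A, B, G)[2])
```

Note that each system is an (items x responses) matrix, not a vector: the response dimension is exactly where the variance discussed in this section lives.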
The null hypothesis, denoted H_0, is that the two sets of responses being compared (i.e. A_{i,j} and B_{i,j}, where i is an item and j is a response for a given item) are drawn from the same distribution. This is compared against an alternative hypothesis, denoted H_1, that A_{i,j} and B_{i,j} are true to the underlying distributions from which A, B, and G were drawn, and therefore that the comparison of δ(A, G) and δ(B, G) is a fair representation of the comparison between A and B. We aim to provide a p-value for this comparison.
By contrast, in the vast majority of NHST settings, A and B are sets of individual responses and there is no notion of variance in i once it is drawn; the only source of variance comes from the sampling of the items. For simple test statistics like the mean, a closed-form estimate such as a paired t-test (Student, 1908) will suffice.
However, many metrics used in NLP are not amenable to such closed-form estimates, and the presence of response-level variance means that even many simple metrics cannot be reliably estimated in closed form. Therefore, it is necessary to rely on resampling methods, such as bootstrapping or permutation sampling, to estimate p-values.
Here we focus on bootstrapping variants, where variance is estimated by resampling from a dataset with replacement.
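A minimal sketch of such a bootstrap estimate, assuming a pooled-responses null and an item-wise MAE metric (both simplifications of the paper's Algorithms 1-2; names are illustrative):

```python
import numpy as np

def mae(S, G):
    """Item-wise MAE between per-item mean responses."""
    return np.mean(np.abs(S.mean(axis=1) - G.mean(axis=1)))

def bootstrap_null_pvalue(A, B, G, metric, n_boot=2000, seed=0):
    """Estimate a p-value by pooling A's and B's responses per item
    (the H0 assumption that both come from one distribution) and
    bootstrapping items with replacement."""
    rng = np.random.default_rng(seed)
    observed = abs(metric(A, G) - metric(B, G))
    pooled = np.concatenate([A, B], axis=1)       # mix responses per item
    n, k = A.shape
    hits = 0
    for _ in range(n_boot):
        items = rng.integers(0, n, size=n)              # bootstrap items
        cols_a = rng.integers(0, 2 * k, size=(n, k))    # mixed responses
        cols_b = rng.integers(0, 2 * k, size=(n, k))
        a = np.take_along_axis(pooled[items], cols_a, axis=1)
        b = np.take_along_axis(pooled[items], cols_b, axis=1)
        if abs(metric(a, G[items]) - metric(b, G[items])) >= observed:
            hits += 1
    return hits / n_boot

rng = np.random.default_rng(1)
G = rng.normal(0.5, 0.1, (200, 5))
A = G + rng.normal(0.0, 0.05, G.shape)
B = G + rng.normal(0.3, 0.05, G.shape)  # clearly worse baseline
print(bootstrap_null_pvalue(A, B, G, mae, n_boot=500))  # small: reject H0
```

With a baseline this far from the candidate, the observed metric gap is far out in the tail of the mixed-null distribution, so the estimated p-value is small.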
Usually, the most important design issue in NHST is whether the sample has enough statistical power to detect a difference between A and B when one exists; in our setting, there are two equally fundamental questions: what approach to resampling to use in order to estimate variance, and what metric to use for reliably estimating p-values. These design issues led us to the following research questions:

RQ1. Can response-level variance be used to estimate p-values?
RQ2. What method of sampling response variance generates the most accurate p-value?
RQ3. What metrics generate the most accurate p-value?
RQ4. How sensitive are the measurements as two systems' responses draw closer to each other?

Simulator
To produce and analyze p-value estimates from a test set, we built a simulator that operates in three stages.The main idea is to sample a reference test set from a known, fixed underlying distribution of items and responses and use a resampling method to estimate the p-value of that test set.Then, we use the same underlying distribution to directly estimate the "true" p-value of the distribution.

Generating reference test data
First, we generate the reference test data, described in detail in Algorithm 3 in Appendix A. For the true-p-value (true) H_1 data, Algorithm 1 constructs the samples corresponding to each of A, B, and G by ignoring A_ref, B_ref, and G_ref and instead sampling directly from the underlying distribution described in §3.2.1. For the H_0 data, we use the same underlying distribution, except that, in order to operate under the H_0 assumption that A_ref and B_ref are drawn from the same distribution, each item i and response for each of A and B (the process is unchanged for G) is sampled by first uniformly drawing X ∼ {A, B} and then sampling from the normal distribution parameterized by (µ_i + ε_i^X, σ_i).
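The H_0 sampling scheme (uniformly choosing X ∈ {A, B} per response, then drawing from normal(µ_i + ε_i^X, σ_i)) might be sketched like this, with illustrative hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1000, 5                    # items, responses per item
eps_A, eps_B = 0.0, 0.3           # illustrative perturbation levels

mu = rng.uniform(0.0, 1.0, N)     # per-item response means
sigma = rng.uniform(0.0, 0.2, N)  # per-item response std devs
nu = {"A": rng.uniform(-eps_A, eps_A, N),
      "B": rng.uniform(-eps_B, eps_B, N)}

# H0 sample: for each response, first pick X ~ Uniform{A, B}, then draw
# from normal(mu_i + nu_i^X, sigma_i), so A and B share one distribution.
def h0_sample():
    X = rng.choice(["A", "B"], size=(N, K))
    shift = np.where(X == "A", nu["A"][:, None], nu["B"][:, None])
    return rng.normal(mu[:, None] + shift, sigma[:, None])

A0, B0 = h0_sample(), h0_sample()
print(abs(A0.mean() - B0.mean()) < 0.05)  # both centered on the mixture
```

Because every response is drawn from the same item-wise mixture, any observed metric gap between A0 and B0 reflects sampling noise only, which is exactly what a null sample should capture.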

Applying hypothesis tests to (sub)sampled distributions
Finally, for each of the reference test sets and the true distribution, we sample from the distribution M times and feed the output to Algorithm 2, which estimates p-values with respect to a given metric.
Algorithm 1 SAMPLE
Input parameters:
  G, A, B: pointers to reference data or underlying distributions
  Φ: item index sampler
  Π: response sampler
  r ∈ {rts, true}: whether to use the input sets for re-sampling or to sample directly from the true underlying distribution
Output:
  G*, A_alt, B_alt: vector (or matrix) samples

Results
We perform a set of experiments on datasets where N = 1000 and K = 5. These numbers are representative of the number of items in typical test sets and of the number of responses in test sets where multiple responses are reported. We consider 6 sampling methods, 8 metrics, and 5 levels of perturbation of system B (we fix the perturbation ε_A = 0 and treat it as an ideal model).

Sampling strategies for response variance
We experiment with 6 test set sampling methods to calculate a p-value. By implementing these methods, we are able to determine which of these approaches on a single test set best approximates the true p-value.
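One way to picture the sampling methods is as (item sampler, response sampler) pairs; the sketch below is an assumption about their structure, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Item samplers: which rows of the test set to use.
def all_items(n, rng):        return np.arange(n)
def bootstrap_items(n, rng):  return rng.integers(0, n, size=n)

# Response samplers: which responses to keep for each selected item.
def take_all(row, rng):       return row
def first_element(row, rng):  return row[:1]
def sample_k(k):
    return lambda row, rng: rng.choice(row, size=k, replace=True)

def resample(data, item_sampler, response_sampler, rng):
    idx = item_sampler(len(data), rng)
    return [response_sampler(data[i], rng) for i in idx]

data = rng.normal(0.5, 0.1, size=(1000, 5))
rts = resample(data, bootstrap_items, sample_k(5), rng)
print(len(rts), len(rts[0]))  # 1000 items, 5 responses each
```

Combinations such as (bootstrap_items, all) or (all_items, sample(5)) then correspond directly to the strategies named in the results tables.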

Metrics
We implement 8 metrics to compare the gold scores and the system outputs:
• Mean absolute error (MAE). Calculate the error for each item, i.e. the distance (absolute value of the difference) from gold to system responses, then take the mean of the item-wise errors. Note that if the size of the response sample per item is greater than 1, the responses per item are aggregated to the mean.
• (Inverse) Mean-squared error (MSE). Mean squared error (inverted so that higher is better) across all items.
• Item-wise metric wins (Wins_δ). Compare the system responses to gold for each item using a metric δ, and count the number of items in the set for which each system performs better (i.e. wins).
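Hedged sketches of the first three metrics, aggregating multiple responses per item to their mean as described above (illustrative code, not the paper's):

```python
import numpy as np

def mae(S, G):
    """Mean absolute error over items; multiple responses per item
    are aggregated to their mean."""
    return np.mean(np.abs(S.mean(axis=1) - G.mean(axis=1)))

def inv_mse(S, G):
    """Inverse mean-squared error, so that higher is better."""
    return -np.mean((S.mean(axis=1) - G.mean(axis=1)) ** 2)

def wins_mae(A, B, G):
    """Item-wise metric wins (Wins_MAE): count the items on which
    each system is strictly closer to gold."""
    ea = np.abs(A.mean(axis=1) - G.mean(axis=1))
    eb = np.abs(B.mean(axis=1) - G.mean(axis=1))
    return int((ea < eb).sum()), int((eb < ea).sum())

rng = np.random.default_rng(0)
G = rng.normal(0.5, 0.1, (1000, 5))
A = G + rng.normal(0.0, 0.05, G.shape)
B = G + rng.normal(0.1, 0.05, G.shape)
a_wins, b_wins = wins_mae(A, B, G)
print(mae(A, G) < mae(B, G), a_wins > b_wins)
```

Note the structural difference: MAE and MSE aggregate errors into one continuous score, while Wins_δ reduces each item to a binary outcome before aggregating.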

Results of simulation study
We examine which of the metrics and sampling methods on a single test set best estimate the true p-value, by calculating the error between the estimated p-value and the true p-value across five response distribution perturbations (ε_A = 0, ε_B ∈ {0.0, 0.05, 0.1, 0.3, 0.7}, q.v. Alg. 3). We expect that as the amount of perturbation applied to system B increases, it should become clearer that the data is drawn from two separate distributions. A metric that is more sensitive to the effect of perturbation/distance should have a smaller difference between the estimated p-value and true p-value when the perturbation is increased. Consequently, the metric should have a harder time producing the estimated p-value when the systems are closer together, meaning a larger difference between the estimated and true p-values when the perturbation is smaller.
Table 1 shows the estimation error for each of the sampling methods (minimized across all metrics). The estimation error is the difference between the p-value estimated from a single test set and the true p-value. With ε_B = 0.1, the methods (all_items, sample(5)), (bootstrap_items, all), and (bootstrap_items, sample(5)) all perform well. These three sampling methods are clearly the best performers for all ε > 0. On the other hand, sampling strategies which reduce the number of responses per item (i.e. sample(1) and first_element) are not as effective. These findings indicate that incorporating the variance into the evaluation enables a more accurate statistical comparison.
Table 2 shows the estimation error for each of the metrics (minimized across all sampling methods). To illuminate trends across perturbation levels, Figure 1 visualizes the results from Table 2, and some interesting patterns emerge. As discussed above, we expect a good method to decrease its p-value estimates as the perturbation of B (the x-axis) increases.
Multiple metrics (cosine similarity, Wins_MAE, and Spearman Rho) show lower minimum differences at each increasing interval of perturbation. This suggests that these metrics may behave most predictably when operating under unknown conditions or distances between system A and system B. Wins_MAE has the lowest difference between true and estimated p-values for ε > 0, making it the preferred metric.
The least consistent metric is Aggregated EMD vectorized, which increased, decreased, and increased again in minimum difference between estimated and true p-values at increasing levels of perturbation (Table 2).
It is important to note that p = 0.05 is a critical value when considering statistical guarantees, so differences between estimated and true p-values close to or exceeding 0.05 are not acceptable; if the difference is close to 0.05, the estimation error alone is large enough to make it appear that there is evidence of a model difference when in fact there is none.

Application to real-world data
To apply our method on actual data, we need the item and response data for the ground truth and the two machines (G ref , A ref , and B ref , respectively).
For our example, we chose Kumar et al. (2021), a dataset of 107,620 social media comments, each labeled by five annotators for toxicity on a 5-level Likert scale from 0-4. We randomly sampled 1000 items from it for G_ref, normalizing the annotations into [0,1], yielding possible responses {0, 0.2, 0.4, 0.6, 0.8}. Next, we match the hyperparameters of Algorithm 3 to the actual underlying distributions. We assume that each response G_{i,k} is drawn from a normal distribution with a specific mean and standard deviation for each item, as before, except that rather than assuming they come from uniform distributions as in Algorithm 3, we now take parameterized models foldednormal([0, 0.28]) and triangular([-0.05, 0.21, 0.45]) for the means and standard deviations, respectively, fitted to the 107,620-comment dataset. We visually inspect the histograms to determine the probabilistic model to use, and then choose the hyperparameters that minimize the mean absolute error between the observed data distributions and those predicted by the models. This process is described in Appendix B. We also assume that, after sampling from a normal distribution, results in the range [0, 0.2) are converted to 0.2, those in the range [0.2, 0.4) are converted to 0.4, and so on, to simulate the discrete nature of Likert responses. With these parameters set, we can run the framework described in §3.2, with the toxicity dataset sample as our reference test set and simulated system responses, to choose the best metric and sampling method to use on G_ref.

Sampling Method                  | n | ε_B = 0 | ε_B = 0.05 | ε_B = 0.1 | ε_B = 0.3 | ε_B = 0.7
(all_items, all)                 | 6 | 0.00108 | 0.00545 | 0.02909 | 0.01037 | < 10^-5
(all_items, sample(5))           | 9 | 0.02020 | 0.00390 | 0.00585 | < 10^-5 | < 10^-5
(bootstrap_items, all)           | 9 | 0.00014 | 0.00120 | 0.00604 | < 10^-5 | < 10^-5
(bootstrap_items, first_element) | 6 | 0.10096 | 0.03511 | 0.02359 | 0.01801 | < 10^-5
(bootstrap_items, sample(1))     | 6 | 0.00193 | 0.00893 | 0.02965 | 0.00958 | < 10^-5
(bootstrap_items, sample(5))     | 9 | 0.00406 | 0.00265 | 0.03939 | < 10^-5 | < 10^-5

Table 3: On real toxicity data: minimum p-value estimation error by sampling method (a tuple of item and response samplers), based on n experiments per method, for five different levels of M2 perturbation (ε_B), with ε_A = 0.
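The distribution assumptions and Likert binning just described might be sketched as follows; the folded-normal and triangular parameters are taken from the reported fit and should be treated as illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative per-item parameter models for the toxicity data.
def item_means(n):
    # folded normal, clamped to [0, 0.8]
    return np.clip(np.abs(rng.normal(0.20, 0.16, n)), 0.0, 0.8)

def item_sds(n):
    # triangular, clamped to be non-negative
    return np.clip(rng.triangular(-0.05, 0.21, 0.45, n), 0.0, None)

# Snap continuous normal draws onto the normalized 5-point Likert grid,
# following the binning described in the text ([0, 0.2) -> 0.2, etc.).
EDGES = np.array([0.2, 0.4, 0.6, 0.8])
def to_likert(x):
    idx = np.searchsorted(EDGES, np.clip(x, 0.0, 0.8), side="right")
    return EDGES[np.minimum(idx, len(EDGES) - 1)]

mu, sd = item_means(1000), item_sds(1000)
G = to_likert(rng.normal(mu[:, None], sd[:, None], (1000, 5)))
print(sorted(set(G.ravel())))
```

Every simulated response lands on one of the admissible Likert levels, which is what makes small perturbations (ε_B well under the 0.2 bin width) hard to detect on this dataset.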
We expect to see results similar to those from the pure simulation study, although, because responses are now discrete rather than continuous, there will be sharper differences in performance between different values of ε_B.
The results on the toxicity dataset (Table 3, Table 4, Figure 2) exhibit some of the same patterns seen in the pure simulation results. (bootstrap_items, all) is the best sampling strategy, and the strategies that take only one response per item do the worst. Among the metrics, Spearman Rho has the best overall performance.
However, it should be noted that for ε_B ∈ {0, 0.05, 0.1} the maximum amount of perturbation is relatively small compared to the 0.2 interval between successive elements in the response domain. There is not much observable difference in the performance between A and B until ε_B = 0.3. At this point, many of the metrics do well. Another interesting pattern in the metric results is that Spearman Rho is among the better performers in most cases, particularly for ε_B ∈ {0.3, 0.7}.

Discussion
These experiments suggest answers to our four research questions.

RQ1. Can response-level variance be used to estimate p-values? Yes. In Table 1 and Table 3 we see that (bootstrap_items, first_element), the only sampling method that does not make use of response variance, generally performs poorly in the three lowest perturbation settings, and becomes competitive with the best approach only at ε_B = 0.7. The response variance appears to make the measurements more sensitive to smaller differences between evaluated systems.
RQ2. What method of sampling response variance generates the most accurate p-value? The most promising sampling method is (bootstrap_items, all) (Table 1 and Table 3).

RQ3. What metrics generate the most accurate p-value? In the purely simulated data, Wins_MAE is the best (Table 2). In the toxicity dataset, MSE, MAE, Wins_MAE, and Spearman Rho all do very well for ε_B ≥ 0.3, and MSE and Spearman Rho do better than the rest for smaller perturbation levels (Table 4).
RQ4. How sensitive are the measurements as two systems' response distributions draw closer to each other? On the purely simulated data, Wins_MAE is the most consistent metric when considering sensitivity to the distance between system A and system B (Figure 1).
Compared to the purely simulated data, the real toxicity dataset exhibits a much sharper difference among the performance of the better metrics as ε_B increases. This is likely due to binning the responses into five discrete levels, meaning that levels of perturbation that are detectable in a continuous domain (as in the purely simulated data) are negligible over a discrete domain (as in the toxicity data) when they are much smaller than the size of each bin. However, as the perturbation levels approach or exceed the bin size, the binning quite suddenly creates starker differences in the toxicity dataset than in the purely simulated data (Table 2 and Table 4).
Our results suggest that, of the methods explored here, the Wins_MAE metric, in combination with the (bootstrap_items, all) sampling technique, provides the most effective p-value estimate on a single test set. That Wins_MAE performed poorly at ε_B = 0 (or, on the toxicity dataset, ε_B ≤ 0.1) should not distract from its superior performance for other choices of ε_B. Recall that ε_B = 0 (or, on the toxicity dataset, ε_B ≤ 0.1) means that the null hypothesis is (effectively) true (Hung et al., 1997; Boos and Stefanski, 2011; Colquhoun, 2014) and p-values are very large, so larger errors are less critical.
Compared to MAE, Wins_MAE likely outperforms because it counts wins rather than averaging, given the small sample of responses per item (no more than five each). With so few responses it is hard to estimate an item's mean with any precision, and when these noisy means are aggregated over all 1000 items, the imprecision accumulates. Although we cannot reliably estimate a mean from five samples, when comparing two samples of size five it is still possible to tell which mean is likely greater: a win is a binary variable, whereas the mean is a continuous one, so the mean carries more information and is therefore harder to estimate at low sample sizes.
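The intuition that five responses per item support a reliable win/loss call but only a noisy mean estimate can be checked with a quick simulation (illustrative numbers, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
trials, k = 10_000, 5
# Two response distributions whose true means differ by 0.1.
a = rng.normal(0.50, 0.10, (trials, k))
b = rng.normal(0.60, 0.10, (trials, k))

# How often does a size-5 sample correctly call the better system?
correct_win = np.mean(a.mean(axis=1) < b.mean(axis=1))

# How precise is a single size-5 mean estimate (std of the estimator)?
mean_spread = a.mean(axis=1).std()

print(round(correct_win, 3), round(mean_spread, 3))
```

The win/loss call is right the vast majority of the time, while the standard error of a single five-response mean is of the same order as the gap being measured.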
Our results suggest that, among the metrics and sampling methods studied here, the choice of best metric is independent of the choice of sampling method, and vice versa.

Conclusion 8.1 Overview & Findings
In this work we address the lack of statistical rigor in system evaluation and propose a framework to help tackle this problem. We constructed a statistical approach to comparing two systems against gold/human judgments. After developing a simulator to test the utility of sampling methods and metrics on many test sets, we experimented with 6 sampling methods, 8 metrics, and 5 levels of distance between system A (the proposed system) and system B (the baseline). We find that sampling methods which incorporate variance perform better, and that Wins_MAE and Spearman Rho are reliable metrics.

Recommendations for Practitioners
While this testing regime is our general recommendation for future work evaluating NLP systems, our findings indicate that the evaluation protocol requires tuning to the specific task and data. Generally, our results show that incorporating variance into the sampling strategy enables more rigorous statistical evaluation, and both Wins_MAE and Spearman Rho are metrics that appear consistently strong in their sensitivity to perturbation.
These methods are useful for designing an experiment, as they can indicate an optimal metric or sampling strategy, as well as number of necessary items or annotators for the task.
Beyond specific recommendations for metrics and sampling methods, our results demonstrate that machine similarity (distance in distribution between the baseline and the proposed system), sampling method, and metric chosen affect leaderboard performance, and statistical guarantees should be provided when claiming that a proposed model outperforms an existing model.

Future Work
In future work, we would like to consider further hyperparameters, such as the effect of number of responses on the measurement sensitivity, categorical responses as opposed to continuous numerical data, and different item and response distributions.In the latter case, we believe that understanding the item and response distributions of an evaluated system will be an important element in choosing sampling strategies and metrics.

Limitations
As the contributions of this work include a framework and preliminary experimentation, there are a number of constraints that we leave to future work. Firstly, we considered only one family of response distributions. We chose normal distributions because their behavior is well-understood and they are easy to work with. However, the structural similarity between normal distributions and the best performing metrics (namely, absolute error) suggests that, more generally, the best test metrics for NHST may vary depending on the underlying response distributions. Therefore, we recommend that use of our framework be adapted to the dataset under consideration, which may exhibit other distributions commonly found in model and gold-standard items and responses, such as exponential or multinomial distributions.
Similarly, we only considered p-value estimators based on bootstrap sampling. Future use of our framework would benefit from matching the estimator to the test metric. For instance, permutation tests are the most common way to estimate p-values for Spearman correlation, and analytical tests such as Student's t-test or McNemar's test are commonly used even when the underlying assumptions on which they are based are unlikely to hold (as we expect is the case here). As such, the sampling method could change based on which metric is best for the task and data.

A Algorithm used by the simulation framework
Here we include formalizations of the algorithms used in our work. In Algorithm 3 we specify the process for generating the reference test data. To fit the distribution of the simulated system responses to the dataset, we take the mean and standard deviation of the responses for each item in the dataset. We then inspect histograms of these values. We note that the distribution of the item-wise means (Figure 3, left) appears to follow a folded normal distribution clamped to the range [0, 0.8] (i.e., values falling outside that range are assigned to the nearest value in the range, namely 0 or 0.8).

C Complete Results
Here we include the full results for our experiments on both simulated and real data.
Figure 1: Minimum difference between estimated p-score and true p-score for each of the 8 metrics, at the 5 levels of perturbation.
Figure 2: For our application to real-world data: minimum difference between estimated p-score and true p-score for each of the 8 metrics and 5 levels of perturbation.
Algorithm 3 GENERATE REFERENCE DATA
Input parameters:
  N: test set size
  K: number of responses per item
  ε_A: perturbation of A scores from G
  ε_B: perturbation of B scores from G
Results:
  µ_i: response means per item
  σ_i: response standard deviations per item
  G: item, response matrix for human annotations
  A: item, response matrix for test system
  B: item, response matrix for baseline system

for all i ∈ [0, N) do
  µ_i ∼ uniform([0, 1])
  σ_i ∼ uniform([0, 0.2])
  ν_A ∼ uniform([−ε_A, ε_A])
  ν_B ∼ uniform([−ε_B, ε_B])
  for all k ∈ [0, K) do
    G_{i,k} ∼ normal(µ_i, σ_i)
    A_{i,k} ∼ normal(µ_i + ν_A, σ_i)
    B_{i,k} ∼ normal(µ_i + ν_B, σ_i)
  end for
end for

B Fitting the mean and standard deviation models to the toxicity dataset

Figure 3: Distribution of item-level response means in the toxicity dataset (left), and from a 1000-item sample of a folded normal distribution with mean 0.20 and standard deviation 0.16, where values less than zero have been assigned to 0 and values greater than 0.8 have been assigned to 0.8 (right).

Figure 4: Distribution of item-level response standard deviations in the toxicity dataset (left), which has a mean of 0.19 and standard deviation of 0.11, and from a 1000-item sample of a triangular distribution with minimum −0.05, apex 0.21, and maximum 0.45, where values less than zero have been assigned to zero (right).
In Algorithm 3 we generate three reference samples (G_ref, A_ref, and B_ref). Each sample has N items and K responses per item; the responses are continuous values in the interval [0,1]. To construct G_ref, for each item i we sample a mean µ_i and a standard deviation σ_i from specified uniform distributions, then sample K responses from a normal distribution parameterized by µ_i and σ_i. For A_ref and B_ref, we use the same sample of means and standard deviations as for G_ref, but with µ_i replaced by µ_i + ε_i^X, where ε_i^X is chosen uniformly at random over the interval [−ε_X, ε_X] for X ∈ {A, B}, respectively. This process makes the items in the three sets the same, while keeping the responses in each set independent (conditioned on each item i); the magnitudes of difference in the response distributions are parameterized by ε_A and ε_B. In each case it produces data supporting H_0 and H_1.

For the reference test set (rts) H_1, we construct three samples corresponding to A_ref, B_ref, and G_ref by resampling from each according to the given sampling strategy. For the reference test set H_0, we reuse the sample of G_ref constructed for H_1. For A_ref and B_ref, we operate under the H_0 assumption that they are drawn from the same underlying distribution (when in fact they were drawn from similar distributions, perturbed according to ε_A and ε_B). We do this by first combining A_ref and B_ref into a single set A_ref|B_ref, where each item i in the combined set has all of the responses from both A_ref_i and B_ref_i; we then sample responses for each of A_ref and B_ref from A_ref|B_ref.

Sampling strategies provide rules for resampling from a dataset, such as "sample the items, take all responses" or "sample the items, then sample from the responses for each item." Algorithm 1 is used twice: once for the data needed to estimate the p-value based on the reference test set, and once for the true p-value.
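The H_0 pooling of A_ref and B_ref described above can be sketched as follows (array shapes and names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
A_ref = rng.normal(0.50, 0.1, (1000, K))
B_ref = rng.normal(0.55, 0.1, (1000, K))

# Under H0, merge the responses of A_ref and B_ref per item, then draw
# pseudo-A and pseudo-B responses for each item from the merged pool.
pooled = np.concatenate([A_ref, B_ref], axis=1)  # shape (1000, 2K)

def draw(pool, rng):
    cols = rng.integers(0, pool.shape[1], (pool.shape[0], K))
    return np.take_along_axis(pool, cols, axis=1)

A_null, B_null = draw(pooled, rng), draw(pooled, rng)

# The pooled draws erase the systematic gap between A_ref and B_ref.
print(abs(A_null.mean() - B_null.mean()) < abs(A_ref.mean() - B_ref.mean()))
```

Because both null samples come from the same per-item pool, any remaining difference between them reflects response-sampling noise, which is exactly the variability the null hypothesis is meant to capture.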
In Table 2, we show the results only for Wins_MAE.

Table 2:
Minimum p-value estimation error by metric, based on n experiments per metric, for five different levels of M2 perturbation (ε_B), with ε_A = 0. n is the number of experiments using a given metric (i.e. the number of sampling methods used in combination with that metric).

Table 4:
On real toxicity data: minimum p-value estimation error by metric, based on n experiments per metric, for five different levels of M2 perturbation (ε_B), with ε_A = 0.