Faithful Model Evaluation for Model-Based Metrics

Statistical significance testing is used in natural language processing (NLP) to determine whether the results of a study or experiment are likely to be due to chance or whether they reflect a genuine relationship. A key step in significance testing is the estimation of the confidence interval, which is a function of the sample variance. Sample variance calculation is straightforward when evaluating against ground truth. However, in many cases a metric model is used for evaluation instead. For example, to compare the toxicity of two large language models, a toxicity classifier is used for evaluation. Existing works usually do not consider the variance change due to metric model errors, which can lead to wrong conclusions. In this work, we establish the mathematical foundation of significance testing for model-based metrics. With experiments on public benchmark datasets and a production system, we show that accounting for metric model errors when calculating sample variances for model-based metrics changes the conclusions in certain experiments.


Introduction
In the field of natural language processing (NLP), continuous progress hinges upon the development of novel techniques that outperform existing ones. However, accurately assessing the effectiveness of these new techniques requires a comprehensive evaluation framework. Model evaluation serves as the foundation for assessing the performance and impact of NLP advancements. Significance testing is a crucial tool in the evaluation process, enabling us to derive accurate conclusions. It allows us to determine whether the obtained evaluation results hold significance or are merely coincidental.
As the cost of human annotation for evaluating models using deterministic metrics is substantial, there is a growing trend towards utilizing model-based metrics for evaluation purposes. Model-based metrics evaluate the outputs of an NLP model using another machine learning model, such as using a toxicity classifier to evaluate the toxicity of texts generated by a text generation model, while deterministic metrics evaluate the outputs of an NLP model using annotated ground-truth labels. In significance testing, computing the confidence interval plays a pivotal role in reaching precise conclusions. The computation of this interval relies on the sample variance, which differs depending on whether deterministic metrics or model-based metrics are used. In the case of deterministic metrics, the sample variance corresponds to the variance of the collected samples. However, for model-based metrics, where results are predicted by machine learning models, the sample variance is influenced by the model's prediction errors. Existing works using model-based metrics for model evaluation do not consider prediction errors in significance testing, risking inaccurate conclusions. In this work, we establish the mathematical foundation of significance testing for model-based metrics.

* These authors contributed equally to this work.
We conduct several experiments using model-based metrics, including hate speech detection (Hartvigsen et al., 2022) and user-perceived defect detection. The experimental results show that considering prediction errors in significance testing changes the conclusions in certain experiments. Thus, we propose that the research community utilize our framework when performing statistical testing with model-based metrics. In the following sections, we derive the mathematical equations for significance testing with model-based metrics and conduct experiments on several public benchmark datasets and a production system. Finally, we discuss related work and draw conclusions.

Background - Significance Testing
Significance testing is a statistical analysis used to estimate the relationship between two statistical variables. When evaluating two models, we want to know if the performance of the two models is significantly different. In this work, we assume the model evaluation is a binary classification task, such as whether the classified domain is correct or not in a domain classification task, or whether the generated text is toxic or non-toxic in a toxicity classification task. Given two models $C$ and $T$, their outputs are evaluated using a deterministic metric. The evaluation results are

$f_{C,1}, f_{C,2}, \dots, f_{C,N_C}$ (1)

$f_{T,1}, f_{T,2}, \dots, f_{T,N_T}$ (2)

where $f_{C,i}, f_{T,i} \in \{0, 1\}$ are the evaluation results for outputs generated by models $C$ and $T$ respectively, and $N_C$ and $N_T$ are the numbers of samples used to evaluate models $C$ and $T$. Their performance is estimated as the mean of the results,

$\bar{f}_C = \frac{1}{N_C} \sum_{i=1}^{N_C} f_{C,i}$ (3)

$\bar{f}_T = \frac{1}{N_T} \sum_{i=1}^{N_T} f_{T,i}$ (4)

and the variances of the mean of the evaluation results for $C$ and $T$ using a deterministic metric are, respectively,

$\mathrm{Var}_D(C) = \frac{1}{N_C(N_C - 1)} \sum_{i=1}^{N_C} (f_{C,i} - \bar{f}_C)^2$ (5)

$\mathrm{Var}_D(T) = \frac{1}{N_T(N_T - 1)} \sum_{i=1}^{N_T} (f_{T,i} - \bar{f}_T)^2$ (6)

where $D$ represents the deterministic metric.
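As a minimal sketch, the per-model mean and the (Bessel-corrected) variance of the mean can be computed as follows; the binary result list below is illustrative, not from the paper's data:

```python
import numpy as np

def deterministic_stats(f):
    """Mean and variance of the mean for binary evaluation results f_i in {0, 1}."""
    f = np.asarray(f, dtype=float)
    p = f.mean()                          # performance estimate (sample mean)
    # sample variance (ddof=1) divided by N; for binary data this
    # equals p * (1 - p) / (N - 1)
    var_mean = f.var(ddof=1) / len(f)
    return p, var_mean

# toy evaluation results for one model
p_c, var_c = deterministic_stats([1, 0, 0, 1, 0, 0, 0, 0])
```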
We want to know if their performance is significantly different, which is formally stated as a null hypothesis $H_0$ and an alternative hypothesis $H_a$:

$H_0: \bar{f}_C = \bar{f}_T \qquad H_a: \bar{f}_C \neq \bar{f}_T$ (7)

According to the central limit theorem, the two sample means are statistically different if the following symmetric confidence interval does not contain 0 (Smithson, 2003):

$\bar{f}_d \pm z_{\alpha/2} \sqrt{\mathrm{Var}_D(d)}$ (8)

where $z_{\alpha/2}$ is the critical value, $\alpha$ is the significance level, $\bar{f}_d = \bar{f}_T - \bar{f}_C$, and $\mathrm{Var}_D(d) = \mathrm{Var}_D(T - C) = \mathrm{Var}_D(C) + \mathrm{Var}_D(T)$ (for the case where $C$ and $T$ are dependent, the formula is derived in Appendix A.3). For a 95% confidence level, $z_{\alpha/2} = 1.96$.
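A sketch of this interval check, assuming independent samples; the means and variances passed in are made-up values for illustration:

```python
from math import sqrt

Z_95 = 1.96  # critical value z_{alpha/2} at the 95% confidence level

def diff_confidence_interval(mean_c, var_c, mean_t, var_t, z=Z_95):
    """Symmetric confidence interval for the difference of two sample means,
    assuming the two samples are independent."""
    d = mean_t - mean_c
    var_d = var_c + var_t  # independence assumed; the dependent case differs
    half = z * sqrt(var_d)
    return d - half, d + half

lo, hi = diff_confidence_interval(0.25, 0.0234, 0.10, 0.0112)
significant = not (lo <= 0 <= hi)  # reject H0 only if 0 lies outside the interval
```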

Significance Testing with Model-based Metrics
For model-based metrics, the performance of models $C$ and $T$ is evaluated by a metric model $M$, which is usually a statistical model with prediction errors. Thus, the sample variances calculated by Equations 5 and 6 are the variances of the observed evaluation values instead of the true evaluation values. In this section, we derive the sample variance considering the prediction errors. Note that the following equations apply to both models $C$ and $T$. Assume we have $N$ independent and identically distributed (IID) evaluation samples, and let $N^O_+$ be the random variable denoting the number of observed positive samples. As we assume a binary classification task, each observation follows a Bernoulli distribution. Therefore, the random variable $N^O_+$ follows a binomial distribution with success probability $p^O$ (which can be estimated using Equation 3 or 4):

$N^O_+ \sim \mathrm{Bin}(N, p^O)$ (9)

We aim to estimate the variance of the distribution of the real positive samples, $N^R_+ \sim \mathrm{Bin}(N, p^R)$. Towards this goal, we derive the probability $p^R = P(R = 1)$ as follows:

$p^R = P(R{=}1 \mid O{=}1) P(O{=}1) + P(R{=}1 \mid O{=}0) P(O{=}0) = p^{R|O} p^O + p^{R|O'} (1 - p^O)$ (10)

where $p^{R|O}$ and $p^{R|O'}$ are the precision and false omission rate, respectively, which can be estimated from the metric model $M$'s performance on its test data. The variance of a binomial distribution is $Np(1-p)$; therefore, the variance of $N^R_+$ is

$\mathrm{Var}_M(N^R_+) = N p^R (1 - p^R)$ (11)

where $M$ represents the model-based metric.
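The correction of the observed positive rate is a single weighted sum; a small sketch (the observed rate, precision, and false omission rate below are illustrative values, not the paper's estimates):

```python
def corrected_positive_rate(p_obs, precision, false_omission_rate):
    """Estimate the true positive rate from the observed rate via
    P(R=1) = P(R=1|O=1) P(O=1) + P(R=1|O=0) P(O=0)."""
    return precision * p_obs + false_omission_rate * (1 - p_obs)

# hypothetical observed toxicity rate of 5%, with a metric model whose
# precision is 0.9 and false omission rate is 0.2
p_r = corrected_positive_rate(0.05, 0.9, 0.2)
```

Note that even a modest false omission rate can move the corrected rate far from the observed one, which is what later inflates the variance estimate.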
The variance of the model performance (the mean of the evaluation results) is

$\mathrm{Var}_M(\bar{f}) = \frac{\mathrm{Var}_M(N^R_+)}{N^2} = \frac{p^R (1 - p^R)}{N}$ (12)

Since the population mean is unknown and the variance is estimated with the sample mean $p^O$, the above estimator is biased. The corrected unbiased estimate using Bessel's correction (So, 2008), which accounts for the decreased degrees of freedom, is

$\widehat{\mathrm{Var}}_M(\bar{f}) = \frac{p^R (1 - p^R)}{N - 1}$ (13)

Therefore, the 95% confidence interval for model-based metrics is

$\bar{f}_d \pm z_{\alpha/2} \sqrt{\mathrm{Var}_M(d)}$ (14)

where $\mathrm{Var}_M(d) = \mathrm{Var}_M(C) + \mathrm{Var}_M(T)$. Note that if the metric model is perfect, the variance and confidence interval become the same as in Equations 5, 6 and 8; for a proof, see Appendix A.1. The formula can also be easily extended to the multi-class case (see Appendix A.2).
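Putting the pieces together, a sketch of the corrected variance and confidence interval. It assumes independent samples, that the difference is taken between the observed means, and that both models are scored by the same metric model; all numeric inputs are illustrative:

```python
from math import sqrt

def model_based_var(p_obs, n, precision, false_omission_rate):
    """Bessel-corrected variance of the mean under a model-based metric."""
    p_r = precision * p_obs + false_omission_rate * (1 - p_obs)
    return p_r * (1 - p_r) / (n - 1)

def model_based_ci(p_obs_c, n_c, p_obs_t, n_t, precision, for_rate, z=1.96):
    """95% CI for the difference in observed means, with the variances
    corrected for metric-model errors (independent samples assumed)."""
    d = p_obs_t - p_obs_c
    var_d = (model_based_var(p_obs_c, n_c, precision, for_rate)
             + model_based_var(p_obs_t, n_t, precision, for_rate))
    half = z * sqrt(var_d)
    return d - half, d + half

# illustrative comparison: two observed positive rates, 1001 samples each
lo, hi = model_based_ci(0.10, 1001, 0.08, 1001, precision=0.9, for_rate=0.2)
```

With a perfect metric model (precision 1, FOR 0) the corrected rate reduces to the observed rate, recovering the deterministic-metric variance.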

Experiments
We perform several experiments on public benchmark datasets and a production system to validate the proposed framework. In this section, we first introduce the experimental details on public benchmark datasets and then describe the experiments on the production system. Finally, we report the experimental results and analysis.

Experiments on Public Datasets
We select toxicity detection in natural language generation as the base task. The goal of this task is to detect whether the generated text is toxic using a toxicity classifier. We adopt a state-of-the-art toxicity classifier, RoBERTa-ToxiGen (Hartvigsen et al., 2022), to detect toxicity in the generated text. We estimate the precision and false omission rate (FOR) of this classifier on the manually annotated test set from ToxiGen. The estimated precision and FOR are 0.8897 and 0.22769, respectively.
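Precision and FOR can be estimated from the metric model's confusion counts on an annotated test set; a sketch with hypothetical counts (not the ToxiGen figures):

```python
def precision_and_for(tp, fp, tn, fn):
    """Precision P(R=1|O=1) and false omission rate P(R=1|O=0)
    from a metric model's confusion counts on annotated test data."""
    precision = tp / (tp + fp)            # of predicted positives, truly positive
    false_omission_rate = fn / (fn + tn)  # of predicted negatives, truly positive
    return precision, false_omission_rate

# hypothetical confusion counts for a toxicity classifier
prec, fo_rate = precision_and_for(tp=80, fp=10, tn=95, fn=15)
```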
We compare two text generation models, GPT2 (Radford et al., 2019) and GPT-Neo (Black et al., 2021). To generate the text, we utilize prompts from BOLD (Dhamala et al., 2021) and RealToxicityPrompts (Gehman et al., 2020). BOLD is a manually curated dataset for bias measurement in open-ended language generation, which consists of 23,679 English text generation prompts for bias benchmarking in five domains: profession, gender, race, religion, and political ideology. RealToxicityPrompts has 100K naturally occurring, sentence-level prompts extracted from a large corpus of English web text.

Result Analysis
Table 1 shows the experimental results. In the table, we show the average toxicity score, average treatment effect (ATE), variance, and confidence interval. ATE is calculated as the difference between the average toxicity scores of the two models; specifically, it is the average toxicity score of GPT-Neo minus the average toxicity score of GPT2 (we consider GPT2 as the baseline).
From the table, we can see that there is a significant increase in variance when we consider the metric model errors. For example, on the BOLD dataset, the variance of GPT2 changes from 1.92e-7 to 7.50e-6 (a 39x increase). Disregarding the metric model errors, the confidence interval is (-0.00325, -0.00114), leading to the conclusion that we can reject the null hypothesis and that GPT-Neo produces output with significantly lower toxicity than GPT2. However, when we consider the metric model errors, the confidence interval is (-0.00978, 0.00538), which shows an insignificant difference, and we cannot reject the null hypothesis. In this case, considering metric model errors changes the final conclusion. On the RealToxicityPrompts dataset, we also see a large change in variance, but the conclusion does not change.
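The decision rule here is mechanical: reject the null hypothesis only if the interval excludes 0. Checking it against the BOLD confidence intervals reported above:

```python
def rejects_null(ci):
    """Reject H0 iff the confidence interval excludes 0."""
    lo, hi = ci
    return not (lo <= 0 <= hi)

# BOLD confidence intervals from Table 1
assert rejects_null((-0.00325, -0.00114))      # deterministic variance: significant
assert not rejects_null((-0.00978, 0.00538))   # corrected variance: not significant
```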

Experiments in Production System
Besides conducting experiments on public models and benchmark datasets, we also perform experiments on live traffic in the production system of a leading voice agent. The task is natural language understanding (NLU), such as domain classification and intent classification. We compare the performance of two NLU models, estimated by a machine learning model based on customer utterances and system responses (Gupta et al., 2021). The precision and FOR of the metric model are estimated on manually annotated datasets. The dataset used for the experiment is de-identified.

Result Analysis
Table 2 shows the experimental results in a production system. The experiments are conducted on different NLU domains and devices. From the results, we can see that considering metric model errors has a large impact on variance estimation and also changes the final conclusions of the experiments.

Related Work
In their paper, Dror et al. (2018)

Conclusion
Significance testing is an important tool for drawing accurate conclusions when evaluating NLP models. Existing evaluation works using model-based metrics do not consider metric model errors in significance testing, which can lead to wrong conclusions. In this work, we lay the mathematical foundation of significance testing for model-based metrics. We conduct experiments on public benchmarks and a production system. The significance testing results in these experiments show that metric model errors need to be considered and incorporated for accurate evaluation.
In this work, we focus primarily on computing confidence intervals with model-based metrics that use binary classification. In the future, we plan to extend our work to more general types of model-based metrics. Further, we assumed that the samples are independent and identically distributed. In practice, we often have a score associated with the metric model that can be used to relax this assumption. We leave this as future work.

Limitations
This work mainly focuses on significance testing of binary categorical metrics and the two-sample t-test. We do not explore other types of metrics and statistical tests. We leave them to future work.

Table 1 :
Experiment Results on Public Benchmark Datasets. In the table, Mean GPT2 means the average toxicity score of the GPT2 model. Similarly, Mean GPTNeo is the average toxicity score of the GPT-Neo model. Var_D GPT2 means the variance of the GPT2 model using the deterministic metric, and Var_M GPT2 means the variance of the GPT2 model using the model-based metric. CI_D means the confidence interval using the deterministic metric, and CI_M means the confidence interval using the model-based metric.

Table 2 :
Experiment Results in a Production System. The notation is the same as in Table 1.