On the Effectiveness of Automated Metrics for Text Generation Systems

A major challenge in the field of Text Generation is evaluation, because we lack a sound theory that can be leveraged to extract guidelines for evaluation campaigns. In this work, we propose a first step towards such a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets. The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems in a given setting. We showcase the application of the theory on the WMT 21 and Spot-The-Bot evaluation data and outline how it can be leveraged to improve the evaluation protocol regarding the reliability, robustness, and significance of the evaluation outcome.


Introduction
The field of Text Generation is a subfield of Natural Language Processing (Celikyilmaz et al., 2020). We define text generation tasks as those where many different texts may constitute an optimal solution to a given problem. Examples are automated summarization, machine translation, dialogue systems, paraphrasing, caption generation, or natural language generation.
One unsolved issue in the field of Text Generation is evaluation, be it human or automated. Human evaluation is more reliable but more cost- and time-intensive, while automated evaluation is error-prone but performed in a fraction of the time and cost (Amidei et al., 2019; Hashimoto et al., 2019; Celikyilmaz et al., 2020; Deriu et al., 2021). One of the main issues is the lack of theoretically founded guidelines for running an evaluation. For instance, how many samples are needed to significantly distinguish the performance of two systems? How do we handle the errors made by automated metrics? Under which circumstances is it still possible to run an evaluation campaign that yields significant results? In this work, we make a first step towards developing such a theoretical foundation, which can be used as a guideline to answer the above questions. For this, we consider what we call binary metrics: metrics that classify the output of a text generation system as either adequate or inadequate. This allows us to measure the performance of a text generation system as the ratio of adequate responses it generates, and it allows us to reason about the performance of the metric in terms of true positives and true negatives. For this setting, we derive various theoretically founded guarantees and guidelines that can be used to run an evaluation campaign. For instance, consider Figure 1 (derived from our theory). If we assume a binary metric with an accuracy of 70%, and we have access to 1000 automatically rated samples (blue line), then we can reliably distinguish between two text generation systems whose performance differs by 10 percentage points. To distinguish two systems with a smaller difference, for instance 2 percentage points, we would need a better metric and many more samples: for instance, a metric with an accuracy of at least 85% and 10'000 samples rated by this metric.
Our theory provides analogous assessments of how many human evaluations are required to reliably distinguish text generation systems. When we say that the performance of two systems can be reliably distinguished, we mean that the difference in their performance is statistically significant. Similarly, a measurable difference in performance is one that leads to statistical significance given the experiment parameters.
In addition, our theory allows for mixing human and automated evaluation. For this, consider Table 1, where we depict the number of human and automated ratings required by a metric with 70% accuracy. For instance, to distinguish two text generators whose performance differs by 2 percentage points, we need either at least 5000 human ratings, or 2500 human ratings mixed with 10'000 automated ratings.
Our theoretical framework allows us to design our evaluation with theoretical guarantees regarding the significance of the resulting measurements. Given a monetary budget and our theory, one can decide whether to invest in more human annotations, in developing better automated metrics, or in sampling more automated ratings. Our approach can also be used to showcase the limits of a given setting: for instance in Figure 1, we see that using only 1000 automated ratings leads to a minimally measurable difference of 4% even with a perfect metric.
In the remainder of the paper, we derive the theoretical framework for binary metrics and apply it to two showcases: the WMT21 metrics shared task (Freitag et al., 2021b) and the Spot-The-Bot evaluation (Deriu et al., 2020). We analyse how well these evaluations adhere to the constraints imposed by our theory and demonstrate how the quality of the evaluations can be improved. To serve the community, we will release the formulas as code and as a web interface that allows practitioners to enter their evaluation settings and receive an analysis of the measurable differences in their settings.

Definitions
In this section, we introduce the basic definitions that we need for the derivations. First, we define the general setting of Text Generation, then we cover binary metrics, and finally we describe text generation systems.

General Setting
Definition 1 (Text Generation Environment) A text generation environment is composed of a triple ⟨I, O, Φ⟩, where I denotes the set of inputs, O the output space, and Φ : I × O → {0, 1} an oracle that assesses whether an output is adequate for a given input.
For instance, for Machine Translation, I denotes all sentences in the source language and O all sentences in the target language, while for a chatbot, I contains all dialogue contexts and O all possible responses in a dialogue. Note that I and O can be of infinite size. We regard Φ as an oracle that segments the output space for a given input into adequate and inadequate outputs.
Definition 2 (Adequate Responses) ∀i ∈ I, we call R_i^+ = {o ∈ O | Φ(i, o) = 1} the set of adequate responses for input i, and R_i^- = {o ∈ O | Φ(i, o) = 0} the set of inadequate responses.

Binary Metric
In this work, we focus on binary metrics, i.e., metrics that classify the output of a text generation system as either adequate or inadequate. The choice of binary metrics allows us to reason about the performance of a text generation (TG) system as the ratio of adequate responses.
We first define the notion of a binary metric; then we show what it means for a binary metric to be error-free or error-prone with regard to Φ.

Definition 3 (Binary Metric) A binary metric is a function M_b : I × O → {0, 1} which takes a pair of input and output and returns either 0 or 1. We interpret a return value of 1 as claiming that the output is adequate for the given input, and 0 as claiming that it is not.
Next, we define the notion of an error-free metric. That is, how we expect the metric to behave in the optimal case (i.e. its ability to replicate the oracle Φ).
Definition 4 (Error-Free Binary Metric) A binary metric M*_b is error-free iff ∀(i, o) ∈ I × O : M*_b(i, o) = Φ(i, o). That is, an error-free binary metric always rates an adequate output as 1 and an inadequate output as 0. Since most metrics do not perform perfectly with regard to Φ, we formulate the cases where a metric makes mistakes, and the calculation of its performance, as follows.
Definition 5 ((ρ, η)-optimal Binary Metric) A binary metric M_b^{ρ,η} is (ρ, η)-optimal iff Pr[M_b^{ρ,η}(i, o) = 1 | o ∈ R_i^+] = ρ and Pr[M_b^{ρ,η}(i, o) = 0 | o ∈ R_i^-] = η.
That is, we define the performance of a binary metric as its probability of correctly classifying an output as adequate or not. Thus, the error of a binary metric can be assessed similarly to the error of a binary classifier: ρ is equivalent to the true positive rate and η to the true negative rate. Note that ρ = η = 1 defines an error-free binary metric, whereas all other cases are error-prone. In the case where ρ and η have the same value, ρ = η, this value is the accuracy of M_b^{ρ,η}. Note that in practice, ρ and η must be estimated from data.

Text Generation
We define a text generation system as a function that takes an input from the input-space and generates an output.
Definition 6 ((Optimal) Text Generator) A Text Generator (TG) is a mapping π : I → O which generates for each input i an output o. A TG is optimal ⇐⇒ ∀i ∈ I : π(i) ∈ R_i^+.
Next, we introduce the notion of an imperfect text generator. There are many ways in which the errors of a TG can be modeled; we model them via the TG's capability of generating adequate responses.
Definition 7 (α-optimal TG) Let π be a TG and α ∈ [0, 1]. Then π is an α-optimal TG if Pr[π(i) ∈ R_i^+] = α for all i ∈ I. That is, the probability of the text generation system generating an adequate output is denoted as α. The task of a binary metric is to estimate the α value of a TG system, which has a concrete meaning. Assume that we compare two systems with α_{π_1} = 0.5 and α_{π_2} = 0.49; these numbers have a clear semantics: π_1 produces an adequate output in 50% of cases and π_2 in 49% of cases. Thus, one system generates adequate outputs more often than the other. We denote the difference in performance as ϵ. In the following, we use α_π to denote the rate at which a system π generates adequate responses, and π_α to refer to a system which is α-optimal.
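To make Definition 7 concrete, the following sketch (our own illustration, not part of the paper's tooling) simulates an α-optimal TG as a sequence of Bernoulli trials and recovers α as the observed ratio of adequate outputs:

```python
import random

def simulate_adequacy_ratings(alpha, n, seed=0):
    """Simulate n oracle ratings of an alpha-optimal TG: each output
    is adequate (rating 1) with probability alpha (Definition 7)."""
    rng = random.Random(seed)
    return [1 if rng.random() < alpha else 0 for _ in range(n)]

# The ratio of adequate responses estimates alpha (here alpha = 0.5).
ratings = simulate_adequacy_ratings(alpha=0.5, n=100_000)
alpha_hat = sum(ratings) / len(ratings)
```

With 100'000 simulated ratings, the estimate lands close to the true α, illustrating why larger sample sizes make smaller differences ϵ measurable.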

Theory: Estimating α with Binary Metrics
In this section, we show how binary metrics can be used to estimate the performance α of text generation systems. For the remainder of the text, assume that T_Φ = {(i_j, o_j, r*_j) | 1 ≤ j ≤ n_Φ} is a set of input-output-rating triples of size n_Φ, where the i_j are inputs, o_j = π_α(i_j) denotes the output generated by an α-optimal TG system for input i_j, and r*_j = M*_b(i_j, o_j) denotes the error-free rating of the j-th input-output pair. Analogously, let T_M = {(i_j, o_j, r_j) | 1 ≤ j ≤ n_M} be a set of input-output-rating triples of size n_M, where r_j = M_b^{ρ,η}(i_j, o_j) denotes the rating of an error-prone (ρ, η)-optimal binary metric.
We consider three different cases: 1) the error-free case, 2) the error-prone metric case, and 3) the mixed case. In the error-free case, we have access to r*_j; for instance, human evaluation can be interpreted as an example of the error-free case. In the error-prone metric case, we have access only to a (ρ, η)-optimal binary metric. Finally, the mixed case is a novel approach that combines error-free ratings, which are usually costly to obtain, with error-prone ratings, which are cheaper but, as we will see, are needed en masse for automated metrics with low ρ and η values. Usually, in evaluation campaigns, either the first or the second setting is applied.
We apply a Bayesian approach to estimate α by treating it as a random variable, which allows us to model various sources of uncertainty stemming from α, ρ and η, which all need to be estimated from data. The full derivations are given in Appendix A.

Error-Free Case
Here, we start with the simplest case and introduce the formula to estimate α given error-free ratings r*_j. Given n_Φ error-free ratings, α is estimated by α̂ = n^+ / n_Φ, where n^+ = Σ_{j=1}^{n_Φ} r*_j. This formula can be derived via the frequentist approach or the Bayesian one. For the Bayesian approach, we assume a uniform prior over α (i.e., α ∼ Beta(1, 1)). The resulting posterior distribution for α given n^+ is α | n^+ ∼ Beta(n^+ + 1, n_Φ − n^+ + 1), and the value of α is estimated using the mode of Beta(n^+ + 1, n_Φ − n^+ + 1), which corresponds to n^+ / n_Φ.
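The error-free estimate and its Beta posterior can be computed directly; the helper below is an illustrative sketch (the function name is ours, not from the paper):

```python
def error_free_posterior(n_plus, n_phi, a=1, b=1):
    """Posterior over alpha after observing n_plus adequate ratings
    out of n_phi error-free ratings, with a Beta(a, b) prior.
    Returns the posterior parameters and the posterior mode."""
    a_post = n_plus + a           # Beta(n+ + 1, n_phi - n+ + 1) for a uniform prior
    b_post = n_phi - n_plus + b
    # Mode of Beta(a, b) is (a - 1) / (a + b - 2); equals n+ / n_phi when a = b = 1.
    mode = (a_post - 1) / (a_post + b_post - 2)
    return a_post, b_post, mode
```

For example, 60 adequate ratings out of 100 yield the posterior Beta(61, 41) with mode 0.6, matching the frequentist estimate n^+ / n_Φ.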

Error-Prone Metric Case
In the error-prone metric case, the probability that r j = 1 depends on ρ and η. Hence, if r j = 1, we cannot assume that r * j = 1 as well, since the binary metric can be error-prone. For the error-prone setting, we consider two cases, one where ρ and η are provided (e.g. from an earlier evaluation campaign), and one where ρ and η must be estimated from data (i.e., from comparison to error-free ratings).

Provided ρ, η
Here, we assume that the exact values of ρ and η are known. The probability that the binary metric returns a positive label is then given by P(r_j = 1) = α·ρ + (1 − α)·(1 − η) = α(ρ + η − 1) + (1 − η). From this, we derive the formula to estimate α using the Bayesian formulation.
Theorem 1 (Estimate α with error-prone metric) Let m^+ = Σ_{j=1}^{n_M} r_j ∼ Binom(P(r_j = 1), n_M) be the number of pairs (i_j, o_j) rated as adequate, i.e., M_b^{ρ,η}(i_j, o_j) = 1. Then we estimate α by computing the mode of the posterior distribution p(α | m^+) ∝ P(m^+ | α) · p(α). If we assume a uniform prior over α, i.e., α ∼ U(0, 1), this reduces to α̂ = (m^+/n_M + η − 1) / (ρ + η − 1). Note that the above formulation does not allow ρ + η = 1, in which case our estimator would be undefined. In the following, we assume that ρ + η > 1. This is a relatively safe assumption, since in the case where ρ + η < 1, we can derive a new metric M_b^{ρ′,η′} with ρ′ + η′ > 1 by flipping the predictions of M_b^{ρ,η}.
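A minimal sketch of the Theorem 1 estimator (our own illustrative code; clipping to [0, 1] is an implementation choice, since sampling noise can push the raw value outside the unit interval):

```python
def estimate_alpha(m_plus, n_m, rho, eta):
    """Mode of the posterior for alpha under a uniform prior, given
    m_plus positive ratings out of n_m from a (rho, eta)-optimal metric."""
    assert rho + eta > 1, "estimator is undefined for rho + eta = 1"
    raw = (m_plus / n_m + eta - 1) / (rho + eta - 1)
    return min(1.0, max(0.0, raw))  # clip to the valid range for alpha

# Sanity check: if alpha = 0.6 and rho = eta = 0.8, then
# P(r = 1) = 0.6 * 0.8 + 0.4 * 0.2 = 0.56, so m+ = 560 of 1000 recovers 0.6.
```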

Estimated ρ, η
Here, we assume that ρ and η must be estimated from data, which introduces uncertainty. In our case, we estimate ρ and η from error-free ratings (i.e., from how well the error-prone metric agrees with the error-free ratings). In practice, the error-free assessments stem from human annotations, which are regarded as the ground truth. To weave the estimation of ρ and η into the Bayesian framework, we treat them as random variables. For this, assume that we have access to a dataset T_{ρ,η} = {(i_j, o_j, r*_j, r_j) | 1 ≤ j ≤ M} of both error-free and error-prone ratings for pairs of inputs and outputs. Denote by T^+_{ρ,η} = {(i_j, o_j) | r*_j = 1} the set of true positive samples, and by T^-_{ρ,η} = {(i_j, o_j) | r*_j = 0} the set of true negative samples. Assuming a uniform prior over ρ, we apply the same reasoning as in Section 3.1 to compute the posterior distribution ρ ∼ Beta(m_TP + 1, |T^+_{ρ,η}| − m_TP + 1), where m_TP denotes the number of true positive samples rated as positive by M_b^{ρ,η}. Analogously, η ∼ Beta(m_TN + 1, |T^-_{ρ,η}| − m_TN + 1), where m_TN denotes the number of true negative samples rated as negative by M_b^{ρ,η}. Note that to estimate ρ and η, a large sample size for both T^+_{ρ,η} and T^-_{ρ,η} is important; otherwise, the estimates of ρ or η would have higher uncertainty.
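Counting agreements against the error-free ratings gives the two Beta posteriors directly; a sketch (function and variable names are ours):

```python
def rho_eta_posteriors(pairs):
    """pairs: list of (r_star, r) with r_star the error-free rating and
    r the metric's rating for the same input-output pair.
    Returns Beta parameters for rho (TPR) and eta (TNR) under uniform priors."""
    n_pos = sum(1 for r_star, _ in pairs if r_star == 1)   # |T+|
    n_neg = len(pairs) - n_pos                             # |T-|
    m_tp = sum(1 for r_star, r in pairs if r_star == 1 and r == 1)
    m_tn = sum(1 for r_star, r in pairs if r_star == 0 and r == 0)
    rho_post = (m_tp + 1, n_pos - m_tp + 1)  # rho ~ Beta(m_TP + 1, |T+| - m_TP + 1)
    eta_post = (m_tn + 1, n_neg - m_tn + 1)  # eta ~ Beta(m_TN + 1, |T-| - m_TN + 1)
    return rho_post, eta_post
```

For example, 8 of 10 positives and 6 of 10 negatives correctly rated give ρ ∼ Beta(9, 3) and η ∼ Beta(7, 5); the small counts translate into wide, uncertain posteriors.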
To incorporate the uncertainty of ρ and η into the estimation of α, we need to marginalize ρ and η out of the joint likelihood P(m^+, ρ, η | α) to obtain P(m^+ | α).
Theorem 2 (Est. α, ρ, η with error-prone metric) Let m^+ = Σ_{j=1}^{n_M} r_j ∼ Binom(P(r_j = 1), n_M) be the number of samples rated positively by M_b^{ρ,η}. Then we estimate α by computing the mode of the posterior distribution p(α | m^+) ∝ P(m^+ | α) · p(α), where P(m^+ | α) = ∫∫ P(m^+ | α, ρ, η) p(ρ) p(η) dρ dη. Note that we are not aware of a closed-form solution for this distribution or for the computation of its mode. Thus, in practice we approximate the solution using numerical methods (see Appendix B).
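Since Theorem 2 has no closed form, the posterior can be approximated on a grid, marginalizing ρ and η numerically. The sketch below is our own simplified stand-in for the paper's numerical procedure (Appendix B), using midpoint grids and log-space arithmetic:

```python
import math

def beta_logpdf(x, a, b):
    return ((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def binom_logpmf(k, n, p):
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def alpha_posterior_mode(m_plus, n_m, rho_post, eta_post, grid=50):
    """Approximate the mode of p(alpha | m+), marginalizing
    rho ~ Beta(*rho_post) and eta ~ Beta(*eta_post) on a midpoint grid."""
    pts = [(k + 0.5) / grid for k in range(grid)]
    # Unnormalized grid weights for the rho and eta posteriors; the common
    # normalizer does not move the mode, so it can be dropped.
    w_rho = [math.exp(beta_logpdf(r, *rho_post)) for r in pts]
    w_eta = [math.exp(beta_logpdf(e, *eta_post)) for e in pts]
    best_alpha, best_val = None, -1.0
    for a in pts:  # uniform prior over alpha
        lik = 0.0
        for r, wr in zip(pts, w_rho):
            for e, we in zip(pts, w_eta):
                p1 = a * (r + e - 1) + (1 - e)  # P(r_j = 1 | alpha, rho, eta)
                lik += wr * we * math.exp(binom_logpmf(m_plus, n_m, p1))
        if lik > best_val:
            best_alpha, best_val = a, lik
    return best_alpha
```

With sharply peaked ρ and η posteriors (e.g., Beta(800, 200), mean 0.8) and 560 positive ratings out of 1000, the mode lands near α = 0.6; widening the ρ, η posteriors spreads the α posterior, which is exactly the effect of a small T_{ρ,η}.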

Mixed Case
The mixed case combines the error-free and the error-prone cases. Here, we assume that we are given a small number of error-free samples (human annotations), which are costly to obtain, and a larger set of error-prone samples (ratings by an automated metric), which are easier to obtain.
Then we estimate α by computing the mode of the posterior distribution p(α | m^+, n^+) ∝ P(m^+ | α) · p(α | n^+). Note that the difference to the error-prone case is that the prior p(α) is replaced by p(α | n^+), which can be expressed as a closed-form Beta distribution (see Section 3.1). Thus, we can compute the mixed case by first computing the error-free case to get an initial estimate of α, and then estimating the error-prone case. More generally, this approach also lets us combine ratings from multiple different error-prone metrics by applying Equation 5 iteratively: one plugs in the posterior from one metric as the prior for the next.
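For known ρ and η, the mixed-case posterior can be evaluated on a simple grid by multiplying the error-prone likelihood with the error-free Beta posterior used as prior. An illustrative sketch (our own code, assuming exact ρ and η for simplicity):

```python
import math

def mixed_posterior_mode(n_plus, n_phi, m_plus, n_m, rho, eta, grid=1000):
    """Mode of p(alpha | m+, n+), i.e., the error-prone binomial likelihood
    times the Beta posterior from the error-free ratings used as prior.
    rho and eta are assumed known here."""
    a, b = n_plus + 1, n_phi - n_plus + 1  # error-free posterior Beta(a, b)
    best_alpha, best_logp = None, -math.inf
    for k in range(1, grid):
        alpha = k / grid
        p1 = alpha * (rho + eta - 1) + (1 - eta)  # P(r_j = 1)
        logp = (m_plus * math.log(p1) + (n_m - m_plus) * math.log(1 - p1)
                + (a - 1) * math.log(alpha) + (b - 1) * math.log(1 - alpha))
        if logp > best_logp:
            best_alpha, best_logp = alpha, logp
    return best_alpha
```

For example, 30 of 50 human ratings combined with 560 of 1000 metric ratings (ρ = η = 0.8) yield a mode at α = 0.6; both evidence sources agree, and the combined posterior is tighter than either alone.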
Having outlined the estimation of α for different scenarios, we now show how they can be used to determine the minimal number of samples needed to distinguish TGs in a significant manner.

Minimal Number of Samples Needed to Make Reliable Distinctions between TG Systems
We now come back to the main question of this paper: how many samples are needed to significantly distinguish the performance of two text generation systems? The intuition is that the closer the performance of the two TG systems, the more samples are needed. Thus, we investigate the setting where their difference in performance |α_{π_1} − α_{π_2}| = ϵ is small. Using the formulas from Section 3, we can compute the estimates shown in Table 1. There are seven variables involved in this computation:
• ρ and η denote the (unknown) performance of the automated binary metric. The better it is, the fewer samples are needed.
• α denotes the (unknown) performance of the TG system to be evaluated.
• γ denotes the significance level that is to be achieved.
• |T_Φ| denotes the size of the set of rated input-output pairs that stem from an error-free binary metric.
• |T_M| denotes the size of the set of rated input-output pairs that stem from an error-prone binary metric. (Note that our setting also allows T_Φ ⊆ T_M.)
• |T_{ρ,η}| denotes the size of the set of samples needed to estimate ρ and η.
To determine whether one system is significantly better, the probability of one system being better than the other must be compared to the significance level (e.g., 0.05). We compute the probability that α_1 > α_2 as P(α_1 > α_2) = ∫∫_{α_1 > α_2} p(α_1) p(α_2) dα_1 dα_2. The difference between π_{α_1} and π_{α_2} is significant at the γ-level if P(α_1 > α_2) > 1 − γ/2 or P(α_1 > α_2) < γ/2. Equation 6 holds for any two random variables; in the particular case of normal distributions, it is a reformulation of a two-sided z-test of the null hypothesis that both variables have the same mean. Equation 6 is therefore applicable to all three cases of α estimation (i.e., error-free, error-prone, and mixed) by inserting the corresponding posterior distributions.
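With normal approximations to the two posteriors, P(α_1 > α_2) reduces to a Gaussian tail probability. A sketch for Beta posteriors (our own helper functions):

```python
import math

def prob_first_better(a1, b1, a2, b2):
    """P(alpha_1 > alpha_2) for independent Beta(a1, b1) and Beta(a2, b2)
    posteriors, using a normal approximation to each."""
    m1, m2 = a1 / (a1 + b1), a2 / (a2 + b2)
    v1 = a1 * b1 / ((a1 + b1) ** 2 * (a1 + b1 + 1))
    v2 = a2 * b2 / ((a2 + b2) ** 2 * (a2 + b2 + 1))
    z = (m1 - m2) / math.sqrt(v1 + v2)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z

def significantly_different(p, gamma=0.05):
    """Two-sided decision: significant if P(alpha_1 > alpha_2) is extreme."""
    return p > 1 - gamma / 2 or p < gamma / 2
```

For example, posteriors Beta(601, 401) and Beta(501, 501) (roughly α̂ = 0.6 vs. 0.5 from 1000 error-free ratings each) give a probability well above 0.975, so the pair is distinguishable at γ = 0.05.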
By applying normal approximations to p(α_1) and p(α_2) and using simulations, we can compute the minimal distinguishable difference ϵ for a given set of fixed parameters. The details of the simulations are given in Appendix B.
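Under the normal approximation, the error-free case even admits a closed-form back-of-the-envelope estimate of the smallest distinguishable difference. A sketch (our own helper, hard-coding the 97.5% normal quantile for γ = 0.05):

```python
import math

def min_epsilon_error_free(n, alpha=0.5, z_quantile=1.96):
    """Smallest difference epsilon between two systems that is significant,
    given n error-free ratings per system and systems whose performance is
    around alpha (alpha = 0.5 is the worst case). z_quantile is the normal
    quantile for the chosen significance level (1.96 for gamma = 0.05)."""
    # sd of the difference of two proportion estimates, each Binom(alpha, n) / n
    sd_diff = math.sqrt(2 * alpha * (1 - alpha) / n)
    return z_quantile * sd_diff
```

For n = 5000 this gives ϵ ≈ 0.02, consistent with the roughly 5000 human ratings quoted in Table 1 for distinguishing a difference of 2 percentage points.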

Showcases: Application in Practice
In order to show that the theoretical findings translate to practical applications, we apply our theory to two real-world settings: the WMT21 metrics shared task (Freitag et al., 2021b) and the Spot-The-Bot data (Deriu et al., 2020). Since the two tasks have significantly different settings (e.g., machine translation and dialogue systems, different types of human annotations, and different types of metrics), this shows that our theory is applicable to a variety of text generation tasks. The showcases highlight the different dimensions that can be manipulated when designing an evaluation. In showcase 1, we highlight the number of ratings needed, whereas in showcase 2 we focus on the influence of the metric performance.

Showcase 1: WMT Metrics Shared Task
For the WMT21 metrics shared task, the authors evaluated the performance of 15 automated metrics by comparing their ratings to human ones on the outputs of several MT systems and several language pairs. In this work, we focus only on the English-to-German language pair and the news domain, where seven machine translation systems were evaluated. The data provided by the shared task can be expressed in our notation as follows. We regard the expert human Multidimensional Quality Metrics (MQM) (Lommel et al., 2014) annotations as our error-free ratings. We binarize the scalar output of this metric by stating that only translations without any mistakes are regarded as adequate. This means only responses that have been judged as completely correct by all annotators are considered adequate. For this setting, there are |T_Φ| = 527 error-free annotated samples for each machine translation system. We can reuse these annotations to estimate ρ and η; thus, |T_{ρ,η}| = 527. For the error-prone metric outputs, WMT provides |T_M| = 1000 samples for each machine translation system and each error-prone metric. For the error-prone metrics, we use BleuRT (Sellam et al., 2020) as the metric with the highest ρ and η estimates, and SacreBLEU (Post, 2018) as the most popular metric. We consider three machine translation systems.

WMT: Theoretical Bounds of ϵ
Here, we showcase the theoretical bounds of the ϵ values that can be distinguished significantly, depending on the number of ratings and the performance of the metrics. (Note that we estimate ρ and η for each machine translation system separately, since most trained metrics perform differently depending on the machine translation system; see Appendix C.) We consider BleuRT with an estimated ρ = η ≈ 0.6 (see Section 3.2.2 on how to compute these estimates), SacreBLEU with ρ = η ≈ 0.52, and machine translation systems whose performance is around α ≈ 0.65 (see Section 3.3 on how to compute this estimate). For instance, with 527 error-free samples (|T_Φ|) and 1000 error-prone samples (|T_M|), we can distinguish an ϵ of 5.6% for both BleuRT and SacreBLEU. Thus, the impact of the automated metrics is low for larger numbers of human ratings. However, for |T_Φ| = 100 the impact of the metric performance is larger: ϵ = 0.112 vs. ϵ = 0.13. The effect is even larger with access to more automated ratings. Thus, using 10'000 BleuRT ratings with 100 human ratings allows us to distinguish the same ϵ as 527 human ratings with 1000 SacreBLEU ratings, which is much costlier.

WMT: Practical Results
Here we analyse the results obtained when applying the theoretical framework to real data to estimate α, and assess whether the pairwise differences are significant or not. Table 2 (predicted WMT21 evaluation using BleuRT and SacreBLEU on three machine translation systems) shows the results for four scenarios: using all 527 error-free ratings; using only 100 error-free ratings (low-cost scenario); using 100 error-free ratings with an additional 1000 error-prone ratings from SacreBLEU; and using 100 error-free ratings with an additional 1000 error-prone ratings from BleuRT. The results include, for each pair of systems, the estimated ϵ values and the probability that the first TG system is better than the second. In the first scenario, we see that FBAI and VT cannot be significantly distinguished, which is consistent with the theory stating that only ϵ > 0.057 can be distinguished (see Figure 2), whereas the other system pairs can be distinguished. In the second scenario, we reduce the number of error-free samples to only |T_Φ| = 100, which makes all the TG systems indistinguishable from each other. Again, this is consistent with the theory, which states that only ϵ > 0.131 can be distinguished using 100 error-free samples. When we add error-prone ratings, the probabilities of the first TG being better than the second increase, however not enough to be significantly distinguishable. This holds for both automated metrics and is still consistent with the theory. The problem is that the performance of the automated metrics is too low to have a strong impact on the evaluation. For instance, the theory predicts that using 10'000 error-prone SacreBLEU samples will only allow distinguishing ϵ > 0.120. In this setting, adding even more error-prone samples will not help (even with |T_M| = 10^9), since the uncertainty of ρ and η is too high due to |T_{ρ,η}| = 527.
Thus, the practical application shows that the outcomes on real data are consistent with the theory. Unfortunately, the setting does not allow us to distinguish FBAI and VT; for this, more error-free ratings or better metrics are needed.

Showcase 2: Spot The Bot (STB)
For the second showcase, we use the Spot The Bot (STB) data, where dialogues between two dialogue systems are sampled and humans classify each interlocutor as human or bot. STB contains pairwise ratings for six dialogue systems. In our setting, we use three of them: Blenderbot (BL) (Roller et al., 2021), Lost in Conversation (LiC), and KVMemNN (KV) (Dinan et al., 2020). In this setting, the error-free metric is the (aggregated) human judgment, which is already binary. We consider a response as adequate if all annotators labelled it as coming from a human. For the error-prone metric, we use the USR metric (Mehri and Eskenazi, 2020), which is a scalar metric that we binarize with a threshold. The STB dataset yields |T_Φ| = |T_{ρ,η}| ≈ 600 error-free ratings per dialogue system. To create T_M, we sample new pairwise dialogues and let USR rate each turn of the dialogue. This yields |T_M| = 10'000 samples per dialogue system. Figure 3 shows the theoretical ϵ values that can be achieved depending on |T_{ρ,η}|. The values are depicted for three different settings of |T_Φ| (i.e., human ratings). Each setting shows the measurable ϵ for three different ρ = η combinations. The figure reveals the impact of |T_{ρ,η}| for |T_{ρ,η}| < 1000. For instance, for |T_{ρ,η}| = 600, a metric with ρ = η = 0.6 is only able to distinguish an ϵ = 0.11; however, when increasing |T_{ρ,η}| to 5000, a difference of ϵ = 0.08 can be measured. On the other hand, when the performance of the metric is too low (e.g., ρ = η = 0.52), the impact of a higher |T_{ρ,η}| is negligible regardless of |T_Φ|. Table 3 shows the measured values of α and ϵ for three scenarios. The first two scenarios are analogous to the WMT setting, where we use |T_Φ| = 600 error-free ratings in the first scenario and |T_Φ| = 100 error-free ratings in the second scenario (assuming that we labeled only 100 samples due to cost reasons).
For the third scenario, we again use |T_Φ| = 100 error-free ratings, combined with |T_M| = 10'000 error-prone ratings from the USR metric. The results show that in the first scenario all pairs of systems are distinguishable, which is consistent with the theory and the original Spot The Bot results. When reducing the number of error-free samples to |T_Φ| = 100, only the pair BL-KV is distinguishable. This is consistent with the theory, which predicts that two systems with ϵ > 0.126 are significantly distinguishable. However, adding |T_M| = 10'000 error-prone ratings only increases the probability of the first TG system being better than the second by a small amount. The reason is that the performance of USR is too low to have a strong impact, which is consistent with the theory. Thus, to benefit from automated evaluation, one needs a better metric and more samples to estimate ρ and η.

Related Work

There are few efforts to underpin (parts of) the TG evaluation paradigm with a theory-grounded base. To theoretically solidify human NLG evaluation and provide more statistically significant results in pairwise evaluations, a recent approach leverages utility theory from economics (Ethayarajh and Jurafsky, 2022) to showcase issues arising from the use of Likert-scale ratings and averaging them. Chaganty et al. (2018) propose a method to combine automated metrics with human rankings to debias a metric under a budget constraint. They provide a theory-grounded proof that their calculated mix of human and automated ratings is optimal and conclude that error-prone evaluation metrics are a bottleneck for reducing the cost of evaluations. Related to our Bayesian approach of modelling uncertainty in the evaluation of systems, a number of approaches aim to model uncertainty in the annotation process and the aggregation of annotations using a Bayesian approach (e.g., Paun et al., 2018). Card et al.
(2020) analyze the statistical power of different evaluation scenarios prevalent in NLP. In particular, they study the number of samples needed to detect a difference of 1 BLEU point as significant. However, to the best of our knowledge, no effort to model the uncertainties ingrained in TG evaluation in a holistic theory has been proposed so far.

Conclusion
We introduced a theoretical framework for binary metrics that can be used to extract guidelines for designing evaluations of text generation systems. The framework estimates the performance of a text generation system from a mix of human and automated ratings and gives guarantees about the level of significance that can be achieved. Using the formulas, one can design the evaluation setup and compute estimates of how many human and automated samples are needed for a significant evaluation. We applied the theory to two very different real-world cases and exemplified how it can be leveraged to improve the significance of the results. We provide a tool that allows the computation of the formulas so that different settings can be tested.
The current theory is limited to binary metrics, but in future work, we will extend the theory to more types, such as comparative or scalar metrics. Furthermore, we will apply the theory to a wider range of tasks and domains. In general, we hope to have set in motion efforts to arrive at a sound formalization of the evaluation of text generation systems to increase the robustness, reliability, and significance of future evaluation campaigns.

Limitations
Human Ratings. We assume that human ratings are perfect, which is not the case (Clark et al., 2021). While the MQM ratings might be close to error-free, there is no guarantee. To handle the fact that human ratings are not error-free, we would need to measure their reliability, which could be done via agreement scores.
Uniform Inputs and Outputs. We assume that each input and each output are equally difficult to evaluate. However, it is more likely that, in practice, each metric has different ρ and η values depending on the input. This is, however, very hard to include in the theory.
Uniform Text Generation Systems. Similarly to the above point, we assume that ρ and η are independent of the text generation system. However, preliminary experimental results (see Appendix C) showed that metrics tend to have different performances for different TG systems. Thus, ρ and η need to be estimated separately for each TG system.
Domain Dependence. The same argument can also be made about the domain. Metrics trained on one domain will perform differently when applied to another domain. Thus, the ρ and η values must be measured again for each domain.
Binary Metrics. The current theory is limited to binary metrics. However, in practice there are many different types of metrics and evaluation types. For instance, in a next step, the theory should be extended to cover comparative metrics (i.e., metrics that state which of two outputs is better).
Approximations. The estimations of the mixed case and the estimated ρ, η case must be approximated numerically since we did not find a closed form solution. This will inevitably lead to mistakes in the estimated values. This can be circumvented by making the numerical approximation more precise with the downside of needing more computational power (see Appendix B).

A Derivations for α-estimation
In Section 3 we have introduced several ways to estimate the success rate α of a Text-Generator π.
We will now elaborate some of these in more detail. First, we want to estimate α based on error-free ratings from M*_b. For this, we need a set of inputs, the corresponding outputs from π, and the ratings from M*_b: T_Φ = {(i_j, o_j, r*_j) | 1 ≤ j ≤ n_Φ}, where o_j = π(i_j) and r*_j = M*_b(i_j, o_j). We note that, in this case, the probability that a given pair is rated adequate is α, since P(r*_j = 1) = P(Φ(i_j, o_j) = 1) = P(π(i_j) ∈ R_{i_j}^+) = α. We can therefore treat the r*_j as outcomes of Bernoulli trials with success probability α. The number of successful trials N^+ is therefore a random variable with a binomial distribution: N^+ ∼ Binom(α, n_Φ). The concrete outcome for a given experiment is n^+ = Σ_{j=1}^{n_Φ} r*_j. To estimate α, we use the proportion of successful trials, i.e., the fraction of adequate responses: α̂ = n^+ / n_Φ. By the Law of Large Numbers, this converges to the expected value E[r*_j] = α.
Bayesian Formulation. We choose to work in a Bayesian framework, as it provides a convenient way to unify the multiple sources of evidence and uncertainty we want to tackle. The first source of information comes from T_Φ. In particular, we have seen that the number of input-output pairs rated as adequate, N^+, follows a binomial distribution. This means that P(N^+ = n^+ | α) = C(n_Φ, n^+) · α^{n^+} · (1 − α)^{n_Φ − n^+}, where C(n_Φ, n^+) is the binomial coefficient. We want to derive a posterior distribution for α based on the evidence: p(α | N^+ = n^+). For this, we can apply Bayes' theorem: p(α | N^+ = n^+) ∝ P(N^+ = n^+ | α) · p(α), where p(α) expresses our prior belief about the possible values of α. In this setting, p(α) is called the prior, P(N^+ = n^+ | α) the likelihood, and p(α | N^+ = n^+) the posterior. Since we in general cannot assume anything about α, we choose a uniform prior α ∼ U(0, 1). This means that, before seeing any evidence, we consider every possible value of α to be equally likely. Of course, there are other reasonable choices of priors, but in general uniform priors are a good choice, since the resulting estimators closely match traditional frequentist approaches.
Another approach is to choose a so-called conjugate prior based on the type of likelihood we are confronted with. A conjugate prior for a given likelihood results in a posterior from the same family as the prior (but with different parameters). In our case, the Beta distribution is a conjugate prior for a binomial likelihood. Beta distributions have two shape parameters a and b; assuming α ∼ Beta(a, b), the density is p(α) = α^{a−1} (1 − α)^{b−1} / B(a, b). Here B(a, b) is the beta function of a and b and serves as the normalizing constant, ensuring that p(α) integrates to 1. The beta function is defined in terms of the Gamma function Γ, an extension of the factorial.
Luckily, we can show that U(0, 1) and Beta(1, 1) are the same distribution. We first note that both distributions are defined on the same domain (0, 1). In particular, the uniform distribution is constant 1 over this domain, and by definition of the Beta distribution, if α ∼ Beta(1, 1) then p(α) = α^0 (1 − α)^0 / B(1, 1) = 1. Next we show how to compute the posterior for the general case where α ∼ Beta(a, b): p(α | N_+ = n_+) ∝ α^{n_+} (1 − α)^{n_ϕ − n_+} · α^{a−1} (1 − α)^{b−1} = α^{n_+ + a − 1} (1 − α)^{n_ϕ − n_+ + b − 1}. We see that the resulting posterior is indeed another Beta distribution: α | N_+ = n_+ ∼ Beta(n_+ + a, n_ϕ − n_+ + b). In particular, if we choose a = b = 1, i.e., a uniform prior, we get α | N_+ = n_+ ∼ Beta(n_+ + 1, n_ϕ − n_+ + 1) as in Section 3.
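The conjugate update above is a one-liner in practice. The following sketch (an illustration with our own function names, not code from the paper) computes the posterior parameters and the Beta density:

```python
from math import lgamma, exp, log

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, computed via log-Gamma for stability."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = lgamma(a) + lgamma(b) - lgamma(a + b)  # log B(a, b)
    return exp((a - 1) * log(x) + (b - 1) * log(1 - x) - log_norm)

def posterior_error_free(n_plus, n_phi, a=1.0, b=1.0):
    """Conjugate update: Beta(a, b) prior plus n_plus adequate ratings out
    of n_phi error-free ratings gives Beta(a + n_plus, b + n_phi - n_plus)."""
    return a + n_plus, b + (n_phi - n_plus)
```

For example, 70 adequate ratings out of 100 with a uniform prior yield Beta(71, 31), whose mode (a − 1)/(a + b − 2) = 0.70 recovers the naive estimate α̂; note also that beta_pdf(x, 1, 1) = 1, matching the identification of Beta(1, 1) with U(0, 1).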
Error-prone Metric Next, we want to estimate α given a set of inputs, outputs from π, and ratings from an error-prone metric M^{ρ,η}_b with known ρ and η. We define T_M = {(i_j, o_j, r_j) | 1 ≤ j ≤ n_M}, where o_j = π(i_j) and r_j = M^{ρ,η}_b(i_j, o_j). The probability that any given r_j is 1 is P(r_j = 1) = αρ + (1 − α)(1 − η) = α(ρ + η − 1) + (1 − η). What we can concretely measure (or count) on T_M is the number of times the error-prone metric gives an adequate rating, m_+ = Σ_{j=1}^{n_M} r_j. Since we sum n_M Bernoulli trials with success probability α(ρ + η − 1) + (1 − η), the sum has a binomial distribution: M_+ ∼ Binom(α(ρ + η − 1) + (1 − η), n_M). To simplify notation we write f(α) = α(ρ + η − 1) + (1 − η) with f′(α) = d/dα f(α) = ρ + η − 1. Our likelihood is therefore P(M_+ = m_+ | α, ρ, η) = (n_M choose m_+) f(α)^{m_+} (1 − f(α))^{n_M − m_+}. We notate the likelihood as P(M_+ = m_+ | α, ρ, η) to indicate the dependence on ρ and η, even though they are assumed deterministic. Unfortunately, we are not aware of any conjugate prior for α that would allow us to derive a closed-form posterior from this likelihood. Nevertheless, we can show that for α ∼ U(0, 1) the mode of the posterior is at (m_+/n_M + η − 1) / (ρ + η − 1). For this we have to find the point where the derivative of the posterior with respect to α is 0.
We first compute the derivative of the posterior with respect to α using a uniform prior (i.e., p(α) = 1): d/dα [f(α)^{m_+} (1 − f(α))^{n_M − m_+}] = f′(α) [m_+ f(α)^{m_+ − 1} (1 − f(α))^{n_M − m_+} − (n_M − m_+) f(α)^{m_+} (1 − f(α))^{n_M − m_+ − 1}]. To find the mode we set the derivative to zero and solve for α, using the convenient fact that f′(α) = ρ + η − 1 is constant in α: m_+ (1 − f(α)) = (n_M − m_+) f(α), hence f(α) = m_+/n_M and α = (m_+/n_M + η − 1) / (ρ + η − 1).

Uncertainty in ρ and η If we do not already know the specific ρ and η of a given error-prone metric, we have to estimate them from data. For this we need ratings from the error-prone metric as well as from an error-free metric to compare against. Assume we are given the set T_{ρ,η} = {(i_j, o_j, r_j, r*_j) | 1 ≤ j ≤ n_{ρ,η}}, where r_j = M^{ρ,η}_b(i_j, o_j) and r*_j = M*_b(i_j, o_j). Note that, unlike for T_Φ and T_M, we make no assumptions about how o_j was generated.
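With ρ and η treated as known, the mode derived above amounts to inverting f at the raw adequacy rate. A minimal sketch (our own naming; clipping to [0, 1] is our addition for degenerate cases where the raw rate falls outside the attainable range):

```python
def corrected_alpha_mode(m_plus, n_m, rho, eta):
    """Posterior mode of alpha under a uniform prior: invert
    f(alpha) = alpha * (rho + eta - 1) + (1 - eta) at the raw
    adequacy rate m_plus / n_m, clipped to the valid range [0, 1]."""
    raw = m_plus / n_m
    alpha = (raw + eta - 1.0) / (rho + eta - 1.0)
    return min(1.0, max(0.0, alpha))
```

For instance, a metric with ρ = η = 0.8 that rates 56 of 100 outputs as adequate implies α = (0.56 + 0.8 − 1) / 0.6 = 0.6, i.e., the correction undoes the systematic bias of the error-prone metric.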
To estimate ρ we have to count the number of times r_j = 1 when r*_j = 1 too, in other words the number of true adequate ratings: n_TP = Σ_{(i,o,r,r*) ∈ T^+_{ρ,η}} r, where T^+_{ρ,η} = {(i, o, r, r*) ∈ T_{ρ,η} | r* = 1}. By definition we know that ρ = P(r = 1 | r* = 1) and therefore N_TP ∼ Binom(ρ, |T^+_{ρ,η}|). We can apply the same Bayesian reasoning as at the start of this appendix to derive a posterior distribution for ρ. Assuming a uniform prior over ρ, we have that ρ | n_TP ∼ Beta(n_TP + 1, |T^+_{ρ,η}| − n_TP + 1). The estimation of η over T^−_{ρ,η} = {(i, o, r, r*) ∈ T_{ρ,η} | r* = 0} is exactly analogous.
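Counting the confusion statistics and forming the two Beta posteriors might look as follows (a sketch under uniform priors; all names are ours):

```python
def rho_eta_posteriors(pairs):
    """Beta posterior parameters (uniform priors) for rho and eta from
    pairs (r, r_star) of error-prone and error-free ratings in {0, 1}."""
    n_pos = n_tp = n_neg = n_tn = 0
    for r, r_star in pairs:
        if r_star == 1:          # sample belongs to T+
            n_pos += 1
            n_tp += r            # true adequate rating
        else:                    # sample belongs to T-
            n_neg += 1
            n_tn += 1 - r        # true inadequate rating
    rho_post = (n_tp + 1, n_pos - n_tp + 1)
    eta_post = (n_tn + 1, n_neg - n_tn + 1)
    return rho_post, eta_post
```

With 8 true positives among 10 adequate samples and 7 true negatives among 10 inadequate ones, this yields ρ ∼ Beta(9, 3) and η ∼ Beta(8, 4).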
At this point we could simply use point estimates for ρ and η and treat them as deterministic, as above. Unfortunately, doing so risks substantially biasing the resulting point estimate (mode) of α.
We therefore consider the joint likelihood P(M_+ = m_+, ρ, η | α) and marginalize over ρ and η, reusing results from above. Recall that we were given the set T_M, on which we counted the number of adequate ratings m_+ = Σ_{j=1}^{n_M} r_j, and that P(r_j = 1 | α, ρ, η) = α(ρ + η − 1) + (1 − η). Based on that we can compute the likelihood as P(M_+ = m_+ | α) = ∫∫ P(M_+ = m_+ | α, ρ, η) p(ρ) p(η) dρ dη. We show how to approximate this numerically in Appendix B.
Combining error-free and error-prone ratings Finally, we show how we can combine both error-free and error-prone ratings into a single estimate for α. Here we assume that we have estimates for ρ and η, for example in the form of the Beta posteriors derived previously: ρ ∼ Beta(a_ρ, b_ρ) and η ∼ Beta(a_η, b_η). Similarly, we build upon the previous setting where we counted the number of adequate ratings from the error-free metric, N_+ ∼ Binom(α, n_ϕ), and from the error-prone metric, M_+ ∼ Binom(α(ρ + η − 1) + (1 − η), n_M). Our observed n_+ and m_+ have the joint likelihood P(N_+ = n_+, M_+ = m_+ | α) = P(N_+ = n_+ | α) P(M_+ = m_+ | α), where we assume that M_+ and N_+ are independent when conditioned on α.
We are now ready to compute the posterior for α. Using a Beta prior α ∼ Beta(a_α, b_α) we get p(α | N_+ = n_+, M_+ = m_+) ∝ P(M_+ = m_+ | α) P(N_+ = n_+ | α) p(α). Looking at the last step, we see that we can combine the prior p(α) with the partial likelihood P(N_+ = n_+ | α) to obtain a partial posterior p(α | N_+ = n_+), which is then multiplied with the likelihood of M_+. We have already seen that since α has a Beta prior and N_+ a binomial likelihood, α | N_+ is again Beta distributed. This suggests a two-step procedure: in the first step we derive a posterior from the error-free ratings; in the second step we use that posterior as the new prior when deriving the posterior from the error-prone ratings.
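The two-step procedure can be sketched on a discretized grid. For brevity this illustration treats ρ and η as fixed point values rather than marginalizing them as in the text, works in log space for numerical stability, and uses our own function names:

```python
from math import exp, log

def two_step_posterior(n_plus, n_phi, m_plus, n_m, rho, eta, n_grid=2000):
    """Discretized posterior over alpha.  Step 1: fold the error-free
    ratings into a Beta(n_plus + 1, n_phi - n_plus + 1) partial posterior.
    Step 2: multiply by the error-prone binomial likelihood with success
    probability f(alpha) = alpha * (rho + eta - 1) + (1 - eta)."""
    grid = [(k + 0.5) / n_grid for k in range(n_grid)]
    a, b = n_plus + 1, n_phi - n_plus + 1
    log_post = []
    for alpha in grid:
        lp = (a - 1) * log(alpha) + (b - 1) * log(1 - alpha)   # step 1
        f = alpha * (rho + eta - 1) + (1 - eta)
        lp += m_plus * log(f) + (n_m - m_plus) * log(1 - f)    # step 2
        log_post.append(lp)
    peak = max(log_post)
    weights = [exp(lp - peak) for lp in log_post]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(x * p for x, p in zip(grid, probs))
    var = sum((x - mean) ** 2 * p for x, p in zip(grid, probs))
    return grid, probs, mean, var
```

With both evidence sources consistent (e.g., 70/100 error-free adequate ratings and 660/1000 error-prone adequate ratings from a metric with ρ = η = 0.9), the posterior concentrates near α = 0.7 with a smaller variance than either source alone would give.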
Notes on T_Φ, T_M, and T_{ρ,η} Note that in practice there are some considerations to be made. Since we use human ratings, we can use them both to estimate ρ and η and to estimate α. Thus, we use T_Φ = T_{ρ,η}, which is also necessary since ρ and η differ between TG systems (see the example in Appendix C). It is therefore often not advisable to use the ratings of other systems to estimate ρ and η. However, this phenomenon needs to be explored in more detail.
For the estimation of ρ and η, we need to make sure that T^+_{ρ,η} and T^−_{ρ,η} are large enough: if there are only a few samples in T_{ρ,η} with r*_j = 0, then the estimate of η will be uncertain. This can be problematic when evaluating very strong or very poor systems (e.g., α > 0.9 or α < 0.1), as there will be only few samples with r*_j = 0 or r*_j = 1, respectively. In many cases we can reuse the samples in T_Φ for T_M, i.e., T_Φ ⊆ T_M, since we can use the automated metric to rate the samples that were annotated by humans. However, it is not clear what effect this has on the final estimate of ϵ. Exploring this phenomenon is part of future work.

B Derivations for ϵ-simulation
In this section we show how we derive the values for the minimally distinguishable difference between two systems. We do this by first simulating a concrete experiment based on theoretical parameters and substituting the simulated experiment into Equation 5. We also show how we numerically approximate Equation 5.
Simulation Until now we have considered the case where α, and possibly ρ and η, are unknown and need to be estimated from data. In that case we use Equation 5 to derive a posterior estimate for α. The whole estimation is based on counts from three sources: T_Φ, T_M, and T_{ρ,η}. Assume we know the following properties: α, ρ, η, n_ϕ = |T_Φ|, n_M = |T_M|, n_{ρ,η} = |T_{ρ,η}|, as well as the proportion ψ of truly adequate responses in T_{ρ,η}.
To simulate the number of adequate ratings from the error-free metric, n_+, we round its expected value E[n_+] = α n_ϕ to the nearest integer: n^sim_+ = ⌊α n_ϕ + 1/2⌋. To simulate the number of adequate ratings from the error-prone metric, m_+, we round its expected value E[m_+] = (α(ρ + η − 1) + (1 − η)) n_M to the nearest integer: m^sim_+ = ⌊(α(ρ + η − 1) + (1 − η)) n_M + 1/2⌋. We have seen that to estimate ρ we need the number of true positive ratings n_TP of the error-prone metric as well as the total number of positive ratings in T_{ρ,η}, which we notated as |T^+_{ρ,η}| = n*_p. We simulate the latter by rounding its expected value E[n*_p] = ψ n_{ρ,η} to the nearest integer: n^sim_p = ⌊ψ n_{ρ,η} + 1/2⌋. To simulate n_TP we plug the simulated n^sim_p into the expected value: n^sim_TP = ⌊ρ n^sim_p + 1/2⌋. Finally, we follow the same process to simulate the data for η. Let n*_n = |T^−_{ρ,η}|, which we simulate as n^sim_n = n_{ρ,η} − n^sim_p. The number of true negatives of the error-prone metric is simulated as n^sim_TN = ⌊η n^sim_n + 1/2⌋.
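The simulation of all counts reduces to rounding expected values. A direct transcription of the formulas above (the function name is ours):

```python
def simulate_counts(alpha, rho, eta, n_phi, n_m, n_rho_eta, psi):
    """Replace each observable count by its rounded expected value, as in
    the epsilon-simulation.  Returns (n_plus, m_plus, n_p, n_tp, n_n, n_tn)."""
    n_plus = int(alpha * n_phi + 0.5)
    m_plus = int((alpha * (rho + eta - 1) + (1 - eta)) * n_m + 0.5)
    n_p = int(psi * n_rho_eta + 0.5)   # simulated positives in T_rho_eta
    n_tp = int(rho * n_p + 0.5)        # simulated true positives
    n_n = n_rho_eta - n_p              # simulated negatives in T_rho_eta
    n_tn = int(eta * n_n + 0.5)        # simulated true negatives
    return n_plus, m_plus, n_p, n_tp, n_n, n_tn
```

For example, with α = ψ = 0.6, ρ = η = 0.8, n_ϕ = n_{ρ,η} = 100, and n_M = 1000, the simulated counts are n^sim_+ = 60, m^sim_+ = 560, n^sim_p = 60, n^sim_TP = 48, n^sim_n = 40, and n^sim_TN = 32.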
We can then use these simulated values to calculate the posterior p^sim(α) based on Equation 5. For this we first simulate our belief over ρ and η: ρ^sim ∼ Beta(n^sim_TP + 1, n^sim_p − n^sim_TP + 1) and η^sim ∼ Beta(n^sim_TN + 1, n^sim_n − n^sim_TN + 1). We again set a uniform prior, α ∼ Beta(1, 1), and compute the simulated posterior according to Equation 5. For the tables in Appendix E we make the following simplifying assumption: the input-output pairs in T_Φ and T_{ρ,η} are the same, which means that n_{ρ,η} = n_ϕ and ψ = α.
Computing ϵ_γ We will now show how we use p^sim(α) to compute the minimal distinguishable difference between two systems π_1 with success rate α_1 and π_2 with success rate α_2.
Assume we know the distributions p(α_1) and p(α_2); we can then compute their means µ_i = E[α_i] and variances σ²_i = V[α_i]. These can be used to derive normal approximations for the α_i: α^N_i ∼ N(µ_i, σ²_i). In that case the difference ϵ = α^N_1 − α^N_2 also follows a normal distribution: ϵ ∼ N(µ_1 − µ_2, σ²_1 + σ²_2). We can now formulate a z-test to check whether there is a significant difference between α^N_1 and α^N_2. The null hypothesis H_0 is that both systems perform the same, meaning µ_1 = µ_2 or ϵ = 0. Under H_0 we have that ϵ / √(σ²_1 + σ²_2) ∼ N(0, 1). To reject H_0 at significance level γ, we have to show that |ϵ / √(σ²_1 + σ²_2)| > Φ^{−1}(1 − γ/2), where Φ^{−1} is the inverse cumulative distribution function of the standard normal distribution; we notate Z_γ = Φ^{−1}(1 − γ/2). In that case, all |ϵ| > √(σ²_1 + σ²_2) · Z_γ are significant under this test. The minimal significant difference at level γ is then ϵ_γ = √(σ²_1 + σ²_2) · Z_γ. Given our simulated posterior p^sim(α) we can compute its mean µ_sim = ∫₀¹ α p^sim(α) dα and variance σ²_sim = ∫₀¹ (α − µ_sim)² p^sim(α) dα. We have to make one final assumption: if we estimate α_1 and α_2 under exactly the same conditions, meaning with the same n_ϕ, n_M, n_{ρ,η} and the same error-prone metric M^{ρ,η}_b, and their difference ϵ is relatively small, then their variances should be the same. Using this assumption we compute ϵ^sim_γ = √(2σ²_sim) · Z_γ.
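The final computation of ϵ_γ only needs the posterior variances and a standard normal quantile. A minimal sketch using the Python standard library (function names are ours):

```python
from math import sqrt
from statistics import NormalDist

def z_gamma(gamma):
    """Standard normal quantile Z_gamma = Phi^{-1}(1 - gamma / 2)."""
    return NormalDist().inv_cdf(1 - gamma / 2)

def min_significant_difference(var1, var2, gamma=0.05):
    """epsilon_gamma = sqrt(var1 + var2) * Z_gamma: the smallest difference
    of posterior means a two-sided z-test rejects at level gamma."""
    return sqrt(var1 + var2) * z_gamma(gamma)

def epsilon_gamma_sim(var_sim, gamma=0.05):
    """Equal-variance shortcut used for the simulated posterior."""
    return sqrt(2 * var_sim) * z_gamma(gamma)
```

For instance, two posteriors each with variance 10⁻⁴ give ϵ_γ = √(2 · 10⁻⁴) · 1.96 ≈ 0.028 at γ = 0.05.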
Caveats At this point we reflect on the several layers of approximation we go through to arrive at a numerical estimate of ϵ_γ. We start by simulating an experiment in which all key observables are replaced by their expected values under the experimental assumptions (i.e., the chosen fixed values of α, ρ, η and the sample sizes). In a real-world setting those values could of course deviate from their expectations by chance, which would influence both the mean and the variance of the resulting estimate. We then compute the simulated posterior using numerical approximation (see the next paragraph), which may be imprecise. Finally, we further approximate the posterior by a normal distribution. In practice, we work with sample sizes large enough that the normal approximation should be reasonably accurate.
The overall implication is that the theoretical values of ϵ_γ we use throughout this work provide a useful guideline, but it is unclear how exact they are.
Numerical Approximation of Posteriors A problem we face repeatedly is that we are interested in the expected value of a function of a continuous random variable, such as ∫_a^b f(x) p(x) dx, which might not have an easily computable closed form. This is, for example, the case for the integrals over ρ and η in Equation 5, but also when computing the mean and variance of the posterior.
We now elaborate how we approximate expected values of a continuous variable by middle Riemann sums. Assume we are given a random variable x with domain (0, 1), its density function p(x), and its cumulative distribution function F(x). The main idea is to partition the domain into a discrete number of equally sized slices. Every slice is identified by its midpoint and the total density within that slice. Let N_x be the number of slices; the larger N_x, the more precise our approximation. We define the midpoints x_k = (k − 1/2)/N_x and the weights w_k = F(k/N_x) − F((k − 1)/N_x) for 1 ≤ k ≤ N_x, so that ∫₀¹ f(x) p(x) dx ≈ Σ_{k=1}^{N_x} w_k f(x_k). If we want to apply this discretization to α, ρ, and η, we need access to their cumulative distribution functions. In our framework, these variables are either uniformly or, more generally, Beta distributed. The cumulative distribution functions of both are available in most numerical software libraries, and computing the discretization is therefore straightforward.
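The middle-Riemann discretization might be implemented as follows (a sketch with our own naming; given a closed-form CDF, the weights are exact slice masses):

```python
def discretize(cdf, n_slices):
    """Midpoint discretization of a distribution on (0, 1): returns the
    slice midpoints and the probability mass of each slice."""
    mids = [(k + 0.5) / n_slices for k in range(n_slices)]
    mass = [cdf((k + 1) / n_slices) - cdf(k / n_slices)
            for k in range(n_slices)]
    return mids, mass

def expect(f, cdf, n_slices=1000):
    """Approximate E[f(X)] by a middle Riemann sum over the discretization."""
    mids, mass = discretize(cdf, n_slices)
    return sum(w * f(x) for x, w in zip(mids, mass))
```

For a uniform variable (cdf(x) = x), this recovers E[X] = 1/2 exactly and E[X²] ≈ 1/3 up to the midpoint-rule error of order 1/N_x².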
Applying this discretization to α, ρ, and η, we can restate Equation 5 in approximate form, replacing the integrals over ρ and η by weighted sums over their discretized values. This results in a discretized form of the posterior with the same granularity N_α as the prior, i.e., probability masses p_k at the midpoints α_k. We can then approximate the mean of the posterior as µ ≈ Σ_k α_k p_k and the variance as σ² ≈ Σ_k (α_k − µ)² p_k. We use N_α = 2000 and N_ρ = N_η = 1000 in all our experiments.

C ROC Curves of Metrics
While our theory assumes binary metrics that produce only 0 or 1 ratings, most real-world automated metrics produce scalar ratings in R, and all metrics under consideration here do. To apply our framework we have to transform scalar ratings into binary ratings. We can do this by selecting a threshold τ that partitions the ratings into two classes (one on either side of the threshold). We define a scalar metric as a function from input-output pairs to the reals: M_s : I × O → R. We interpret the rating as a preference, such that if M_s(i, o_1) > M_s(i, o_2), then according to M_s, o_1 fits i better than o_2. Given a scalar metric M_s and a threshold τ ∈ R, we can derive the associated binary metric, which rates an output as adequate if and only if M_s(i, o) ≥ τ (Equation 8). The question is now how to select τ. This is a well-known problem in binary classification. Intuitively, every possible threshold τ is associated with a corresponding pair of ρ and η. Figure 4 shows the Receiver Operating Characteristic (ROC) curves for BleuRT as a predictor of M*_b for three machine translation systems. In an ROC plot, the true positive rate is plotted against the false positive rate at various thresholds τ. We note that in our framework the true positive rate is ρ and the false positive rate is 1 − η.
Assume we are given a set of inputs and outputs, the ratings from M_s, and the ratings from an error-free binary metric M*_b: {(i_j, o_j, s_j, r*_j) | 1 ≤ j ≤ n}, where s_j = M_s(i_j, o_j) and r*_j = M*_b(i_j, o_j). We can consider the values s_j as candidate thresholds, as these are exactly the points where the predictions switch in Equation 8. For each candidate threshold, we binarize the predictions and compute the associated ρ and η. We select the threshold that minimizes |ρ − η|, to be consistent with our examples, where for simplicity we usually assumed ρ = η. This selection is shown in Figure 4 by the markers and the red diagonal.
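The threshold selection can be sketched as follows (our own naming; assumes both adequate and inadequate samples are present, and binarizes with score ≥ τ, which is our reading of Equation 8):

```python
def select_threshold(scores, labels):
    """Pick the candidate threshold tau minimizing |rho - eta|, where rho
    is the true positive rate and eta the true negative rate of the
    binarized metric (score >= tau -> adequate).  labels are in {0, 1}."""
    pos = sum(labels)
    neg = len(labels) - pos
    best = None
    for tau in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 0)
        rho, eta = tp / pos, tn / neg
        cand = (abs(rho - eta), tau, rho, eta)
        if best is None or cand[0] < best[0]:
            best = cand
    _, tau, rho, eta = best
    return tau, rho, eta
```

On a toy example where the scores perfectly separate the classes, the selected threshold sits at the boundary and yields ρ = η = 1.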
One thing to note in Figure 4 is that the curves of the three MT systems differ from each other. This means that the specific ρ and η of BleuRT, when used as a binary metric, depend on the system that produced a given output. In the framework laid out in Section 3, however, we assumed that ρ and η are independent of how a given output o is produced. This calls for further analysis in future work.

D Full Show Cases Tables
In this appendix, we show the full tables for the show cases with all the systems from the WMT and STB settings.

D.1 WMT21
For the WMT task, we have four scenarios (see Section 5.1); for all of them we show the pairwise comparisons in Tables 4, 5, 6, and 7. Each table shows for each system the estimated α value in parentheses, and in each cell the ϵ value with P(α_1 > α_2) in parentheses. All pairs that are significantly distinguishable are set in bold. Table 4, where all the human ratings are used, shows that FBAI, VT-G, OW, NE, and VT-A are not significantly distinguishable from each other, as their ϵ < 0.06. For the other three scenarios, none of the systems are distinguishable. This is consistent with the theoretical predictions: from Table 1, we see that at least 5000 human ratings are needed to significantly distinguish all pairs of systems (i.e., for ϵ < 0.02). Thus, in this case the problem is that the TG systems are too close to each other in terms of performance, and the automated metrics are too weak to boost the evaluation at low cost.

D.2 STB
Tables 8, 9, and 10 show the full evaluation of the three STB scenarios (see Section 5.2). Each table shows for each system the estimated α value in parentheses, and in each cell the ϵ value with P(α_1 > α_2) in parentheses. All pairs that are significantly distinguishable are set in bold. For the STB case, the six systems from the original paper are used: Blenderbot (BL) (Roller et al., 2021), Lost in Conversation (LiC), KVMemNN (KV) (Dinan et al., 2020), Huggingface (HF), Bert-Rank (BR) (Deriu et al., 2020), and Seq2Seq-NN (S2S) (Deriu et al., 2020). Note that BR and S2S were custom-trained baselines by the STB authors. In the STB case, almost all pairs of systems are significantly distinguishable, which is in line with the theory and the original STB paper. Our theory reveals that this is mostly due to the fact that the difference in α between the TG systems is large, so not many samples are needed to discriminate between them.

Tables 11 and 12 and the following tables list estimated ϵ_γ values. Table 14: Estimated ϵ_γ for α = 0.60, ρ = 0.51, η = 0.51, and γ = 0.05.