Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

A major challenge in the field of Text Generation is evaluation: Human evaluations are cost-intensive, and automated metrics often display considerable disagreement with human judgments. In this paper, we propose a statistical model of Text Generation evaluation that accounts for the error-proneness of automated metrics when used to generate preference rankings between system outputs. We show that existing automated metrics are generally over-confident in assigning significant differences between systems in this setting. However, our model enables an efficient combination of human and automated ratings to remedy the error-proneness of the automated metrics. We show that using this combination, we only require about 50% of the human annotations typically used in evaluations to arrive at robust and statistically significant results, while yielding the same evaluation outcome as the pure human evaluation in 95% of cases. We showcase the benefits of our approach for three text generation tasks: dialogue systems, machine translation, and text summarization.


Introduction
The field of Text Generation (TG) has witnessed substantial improvements over the past years. The gain in performance is mainly due to the application of large-scale pre-trained language models (Devlin et al., 2019; Raffel et al., 2020) based on the Transformer architecture (Vaswani et al., 2017), which allows fast processing of large amounts of data. This has spawned myriads of new systems for TG. The most prominent example is GPT-3 (Brown et al., 2020), which showcases impressive performance on a variety of tasks in a zero-shot learning regime.
One major hurdle for further progress is the evaluation of TG systems. Currently, the most reliable approach to evaluating TG systems is a human-based evaluation (Celikyilmaz et al., 2020), which is time-consuming and cost-intensive. Furthermore, human evaluation suffers from a set of problems such as low annotator agreement (Amidei et al., 2018), and it needs to be designed with care to be reproducible (Belz et al., 2021). These problems motivated the development of automated evaluation metrics, which take the input of a TG system and the generated text (and potentially one or multiple reference texts) as their input, and return a rating.

* These authors contributed equally.
Generally, there are two types of automated metrics: trained and untrained metrics (Celikyilmaz et al., 2020). The most prominent untrained metrics are the BLEU score (Papineni et al., 2002) and the ROUGE score (Lin, 2004), developed for the evaluation of machine translation and automated summarization systems, respectively. More recently, trained metrics have been proposed. One of the first such approaches is the PARADISE framework by Walker et al. (1997) for task-oriented dialogue systems, which learns to match interaction statistics to user satisfaction scores. Current approaches are based on large pre-trained language models. For conversational dialogue systems, there are ADEM (Lowe et al., 2017), USR (Mehri and Eskenazi, 2020b), FED (Mehri and Eskenazi, 2020a), and MAUDE (Sinha et al., 2020), among others (for a more complete overview, we refer the reader to Yeh et al. (2021) and Deriu et al. (2021)). For machine translation, the most prominent trained metrics are COMET (Rei et al., 2020) and BleuRT (Sellam et al., 2020). For a more in-depth treatment of different automated metrics for TG systems, we refer the reader to Celikyilmaz et al. (2020).
Some metrics already achieve correlations with human judgments of 50% and above (Yeh et al., 2021; Fabbri et al., 2021; Freitag et al., 2021). For this reason, it is tempting to use automated metrics to rate and rank TG systems. A typical approach to comparing two systems is to use preference ratings, where the generated outputs of two systems for the same input are given, and the metric is used to decide which output is preferred or whether they are of similar quality (Mathur et al., 2020; Kocmi et al., 2021). Such preferences are then aggregated over several sample inputs to decide which system is "better". One important open question, which we tackle in this paper, is how erroneous ratings from an automated metric on the sample level influence the system-level evaluation.
Motivating Example. Assume that we are given a set of TG systems, and the goal is to rank them according to some criterion (e.g., relevance of generated summaries for text summarization systems). Assume that we are given an automatic preference metric. The naive application of this metric to determine which of two TG systems is better is to apply the metric to the outputs of the two systems for a test set of a fixed size. Then one would apply a statistical significance test (Coakley and Heise, 1996) to determine if one system is preferred significantly more often than the other by the metric. This process is repeated for each pair of systems, and a partial ordering can then be derived from the pairwise decisions. To assess the outcome of the automated evaluation, the same procedure is repeated with a human evaluation. A good metric is one that recreates the same system-level preference ranking or the same pairwise results as a human evaluation would generate.
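The pairwise significance test in this example can be sketched with an exact two-sided sign test on the decisive (non-draw) preferences. This is a minimal illustration, not necessarily the exact test used in the paper, and the counts are invented:

```python
from math import comb

def sign_test(wins_a, wins_b, alpha=0.05):
    """Two-sided exact sign test on paired preference counts (draws excluded).

    Under the null hypothesis that neither system is preferred, the number of
    wins of system A follows Binomial(n, 0.5).
    """
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return p_value < alpha, p_value

# Hypothetical counts: system A preferred on 70 of 110 decisive samples
significant, p = sign_test(70, 40)
```

Repeating this test for every pair of systems and keeping only the significant outcomes yields the partial ordering described above.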
In this setting, there are four types of outcomes with respect to a human evaluation at the system level:
• No Error. There are two sub-cases of no errors: 1) the human evaluation states that two systems are significantly different, and the automated evaluation states the same (Green); 2) the human evaluation states that two systems are not significantly different, and the automated evaluation states the same (Olive).
• Inversion Error. The human evaluation significantly prefers system A over system B, but the automated evaluation results in the opposite preference (Red).
• Omission Error. The human evaluation states that two systems are significantly different, but the automated evaluation states that they are of the same quality (Blue).
• Insertion Error. The human evaluation states that two systems are not significantly different, but the automated evaluation states that they are (Yellow).
We have evaluated the performance of several automated metrics in comparison to human preferences. More precisely, we examined four TG tasks, namely chatbots, summarization (coherence), summarization (consistency), and machine translation, and analyzed the performance of a popular automated metric for each of these tasks when used to derive system rankings based on pairwise comparisons of the metrics' sample-level scores.
Figure 1 highlights the error-proneness of the analysed automated metrics. The main findings are:
• Only around 50% of the pairwise system comparisons agree with the human evaluation.
• The Insertion Error is the most prominent, with an average of 30%.
• Inversion Errors appear in around 10% to 20% of the comparisons on average.
• There are almost no Omission Errors.
We hypothesize that the large discrepancy between the outcomes of the automated and the human evaluation stems from three different sources of uncertainty that are not accounted for when applying the automated metric: 1) the sample size used to run the evaluation, 2) the errors of the metric, and 3) the sample size used to estimate the extent of the metric errors. Thus, naively applying automated metrics leads to overconfident predictions, which yield wrong outcomes of the evaluation.
Contributions. This paper has two main contributions: First, we propose a novel Bayesian statistical model of TG evaluation which integrates the various sources of uncertainty mentioned above. The model yields a more robust evaluation and has the flexibility to combine human and automated evaluations. The model can be used to determine whether two systems are significantly different or of equal quality. The second contribution is an evaluation protocol that leverages the statistical model and reduces the number of human ratings required.
We investigate the performance of the evaluation protocol in a case study on three different TG tasks: chatbots, text summarization, and machine translation. Our case study shows that, using our contributions, we can almost completely correct for the errors emerging in the naive application of the metrics, and that the number of human ratings needed to produce robust evaluation outcomes is reduced by more than 50%.

Figure 1: Comparison between the naive application of automated metrics and human evaluation. For each metric, the difference to the human evaluation is shown on the system-pair level. Green and Olive show agreement, Blue is an Omission Error, Yellow an Insertion Error, and Red denotes an Inversion Error.

Definition of Preference Metrics
In this section, we formally define preference metrics, their errors, and how to mitigate them. We then use this formalism to derive an effective evaluation protocol that can handle error-prone metrics. For the remainder of the paper, we define I as the set of all possible inputs (e.g., for machine translation, all sentences in the source language), and O as the set of all possible outputs (e.g., all sentences in the target language). We start by defining a TG system as a function that takes an input and generates an output, π : I → O.
On an abstract level, we can define a preference metric as a function that takes as input a triple consisting of: the input to a TG system (e.g., a sentence in the source language to be translated), the output of a system A, and the output of a system B (e.g., the translated sentences in the target language), and returns a preference rating. This is formalized as follows:

Definition 1 (Preference Metrics). We call functions of the form M : I × O × O → {>, =, <} preference metrics. We call an outcome of ">" a win, "=" a draw, and "<" a loss.
Note that the semantics of M(i, o_1, o_2) is to determine whether output o_1 is preferred over o_2. At this point, the notion of an output being preferred is to be taken abstractly; in a real-world application, it would be realized by a concrete feature (better fluency, higher relevance, etc.).
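In code, Definition 1 amounts to a three-way function on an input and two outputs. The following is a toy sketch; the length-based `toy_metric` is purely illustrative and not a real quality metric:

```python
from typing import Callable, Literal

# Alias mirroring Definition 1: M : I x O x O -> {>, =, <}
Preference = Literal[">", "=", "<"]
PreferenceMetric = Callable[[str, str, str], Preference]

def toy_metric(i: str, o1: str, o2: str) -> Preference:
    """Toy stand-in metric: prefers the longer output (illustration only)."""
    if len(o1) > len(o2):
        return ">"
    if len(o1) < len(o2):
        return "<"
    return "="
```

Any real metric with this signature (e.g., a wrapper around a trained scorer) can be plugged in wherever M appears below.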
Next, we introduce an oracle, which constitutes the ground truth. When constructing an oracle in a real-world application, we would usually resort to human annotations.
Definition 2 (Preference Oracle). The preference oracle is an error-free preference metric Ω : I × O × O → {>, =, <} whose ratings constitute the ground truth.

Now we define the notion of an error-prone metric.
Definition 3 (Error-prone Metric). A preference metric with independent confusion errors M is an error-prone metric where the probability of a given outcome depends only on the oracle rating. Its confusion probabilities are defined as µ_{c,c′} = Pr(M(i, o_1, o_2) = c | Ω(i, o_1, o_2) = c′) for c, c′ ∈ {>, =, <}. The errors made by an error-prone preference metric can thus be represented by a confusion matrix with normalized columns, such that each entry in the matrix is a probability. The matrix µ is called the mixture matrix of M. Note that the mixture matrix of an error-free metric is the identity matrix.
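A minimal simulation of Definition 3: given an assumed mixture matrix µ (the values below are invented for illustration), an error-prone rating is sampled from the column corresponding to the oracle rating:

```python
import random

# Columns of MU are P(metric outcome | oracle outcome), ordered (>, =, <).
# An error-free metric would have the identity matrix here.
MU = [
    [0.80, 0.15, 0.10],  # P(metric says ">" | oracle says >, =, <)
    [0.10, 0.70, 0.10],  # P(metric says "=" | ...)
    [0.10, 0.15, 0.80],  # P(metric says "<" | ...)
]
OUTCOMES = [">", "=", "<"]

def corrupt(oracle_rating: str, rng: random.Random) -> str:
    """Sample an error-prone metric rating given the oracle rating."""
    col = OUTCOMES.index(oracle_rating)
    probs = [MU[row][col] for row in range(3)]
    return rng.choices(OUTCOMES, weights=probs, k=1)[0]

rng = random.Random(0)
ratings = [corrupt(">", rng) for _ in range(1000)]
```

With the values above, an oracle ">" is reported correctly about 80% of the time; the remaining mass goes to "=" and "<".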

Statistical Model for Preference Metrics
In this section, we introduce the statistical model that is used to compare two TG systems π_a and π_b. The model encompasses three main sources of uncertainty.
1. Uncertainty due to the sample size of both error-free and error-prone ratings
2. Uncertainty introduced by errors of the error-prone metric
3. Uncertainty over the true error rates of the error-prone metric

We build up the statistical model step by step by discussing each source of uncertainty. We apply the Bayesian approach, which allows us to describe the process in terms of probability distributions that can be sampled using Markov Chain Monte Carlo (MCMC) sampling (refer to Appendices A and B for additional details).
For the rest of this section, assume that we have access to a set of inputs {i_1, ..., i_n} ⊆ I, and the corresponding system outputs of π_a and π_b, i.e., o_j^a = π_a(i_j) and o_j^b = π_b(i_j).

Step One: Direct Estimation of the Win-Rate Significance
For this, assume that we have access to the preference oracle Ω itself. Let r_j^Ω = Ω(i_j, o_j^a, o_j^b) be the output of the preference oracle. Let I_x[y] denote an indicator function with I_x[y] = 1 ⇐⇒ x = y, and 0 otherwise. Then let n_> = Σ_{j=1}^n I_>[r_j^Ω] denote the number of times o_j^a was rated as being better than o_j^b. We analogously define n_< = Σ_{j=1}^n I_<[r_j^Ω] as the number of times o_j^a was rated as being worse than o_j^b, and n_= as the number of draws. We use a Dirichlet distribution to model the posterior distribution for the given observations:

p | n_>, n_=, n_< ∼ Dirichlet(1 + n_>, 1 + n_=, 1 + n_<),   (3)

where p = (p_>, p_=, p_<) denotes the probability vector for the win rate p_>, the draw rate p_=, and the loss rate p_<.
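Sampling from the Dirichlet posterior of Equation 3 can be sketched without external libraries by normalizing independent Gamma draws (a standard construction for Dirichlet variates); the counts below are hypothetical:

```python
import random

def dirichlet_posterior_samples(n_win, n_draw, n_loss, n_samples=4000, seed=0):
    """Sample p = (p_>, p_=, p_<) from Dirichlet(1+n_>, 1+n_=, 1+n_<),
    the posterior under a uniform prior, by normalizing Gamma draws."""
    rng = random.Random(seed)
    alphas = (1 + n_win, 1 + n_draw, 1 + n_loss)
    samples = []
    for _ in range(n_samples):
        g = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(g)
        samples.append(tuple(x / total for x in g))
    return samples

# Hypothetical oracle counts over 100 samples
samples = dirichlet_posterior_samples(60, 20, 20)
win_rate_mean = sum(s[0] for s in samples) / len(samples)
```

The posterior mean of the win rate is (1 + n_>) / (3 + n), here 61/103 ≈ 0.59, which the Monte Carlo average recovers.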

Step Two: Integrate the Metric-Errors
In this section, we assume a metric that makes mistakes, i.e., an error-prone metric M. Let r_j^m = M(i_j, o_j^a, o_j^b) denote the rating given by an error-prone metric for sample j. Analogous to before, we define m_> = Σ_{j=1}^n I_>[r_j^m] as the number of times the error-prone metric prefers the output of system π_a to that of π_b. The counts for equality and being worse are denoted by m_= and m_<, respectively. Since M is an error-prone metric, the counts m_{>,=,<} are not equal to the true counts n_{>,=,<}, which are yielded by an oracle. The errors made by the metric are characterized by its mixture matrix µ. In this section, we assume that the precise values of µ are known. We note that the true probabilities p = (p_>, p_=, p_<) are transformed by the mixture matrix into the error-prone ones, p̃ = (p̃_>, p̃_=, p̃_<) = µp. We want to model the posterior distribution p(p | M_> = m_>, M_= = m_=, M_< = m_<) of the true probabilities p given the observed error-prone ratings. This is done by combining the prior belief over p with the likelihood of the observed m_{>,=,<} values, which can be modeled using a Multinomial distribution with probabilities µp. We use a Dirichlet prior p ∼ Dirichlet(α_>, α_=, α_<). The parameters α_c are either chosen according to Equation 3, if we have access to oracle ratings, or set to 1, which corresponds to a uniform prior.
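When µ is known, the posterior over p under the Multinomial likelihood with probabilities µp has no simple closed form, so one option is a small random-walk Metropolis sampler over the simplex. This is a rough sketch under an assumed µ and invented counts, not the paper's implementation (which uses generic MCMC), and the renormalized Gaussian proposal is only approximately symmetric:

```python
import math
import random

# Mixture matrix assumed known in this step (values are illustrative).
MU = [[0.80, 0.15, 0.10],
      [0.10, 0.70, 0.10],
      [0.10, 0.15, 0.80]]

def log_posterior(p, counts):
    """Log of (uniform prior x Multinomial likelihood) with p_tilde = mu @ p."""
    p_tilde = [sum(MU[r][c] * p[c] for c in range(3)) for r in range(3)]
    return sum(m * math.log(pt) for m, pt in zip(counts, p_tilde))

def metropolis(counts, n_steps=20000, burn_in=5000, seed=0):
    """Random-walk Metropolis over the probability simplex (sketch)."""
    rng = random.Random(seed)
    p = [1 / 3, 1 / 3, 1 / 3]
    cur = log_posterior(p, counts)
    samples = []
    for _ in range(n_steps):
        prop = [max(1e-9, x + rng.gauss(0.0, 0.05)) for x in p]
        s = sum(prop)
        prop = [x / s for x in prop]  # project back onto the simplex
        cand = log_posterior(prop, counts)
        if cand - cur > math.log(rng.random() + 1e-300):
            p, cur = prop, cand
        samples.append(tuple(p))
    return samples[burn_in:]

# Error-prone counts generated from a true p of roughly (0.6, 0.2, 0.2):
# mu @ (0.6, 0.2, 0.2) = (0.53, 0.22, 0.25), so counts (530, 220, 250).
samples = metropolis([530, 220, 250])
mean_win = sum(s[0] for s in samples) / len(samples)
```

Even though the metric only reports a 53% win rate, the sampler concentrates near the true win rate of about 0.6, illustrating how the model corrects for the metric's errors.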

Step Three: Integrate Uncertainty over Error Measurements
In a real-world scenario, the values of µ must be estimated from data. This is achieved by comparing the error-prone metric outputs to a set of oracle outputs. For this, we use the confusion counts n_{c,c′} = Σ_j I_c[r_j^m] · I_{c′}[r_j^Ω], ∀c, c′ ∈ {>, =, <}. Thus, n_{<,=} denotes the number of times the error-prone metric returns < and the oracle returns =. In Bayesian terms, each column of the mixture matrix is modeled as a Dirichlet distribution:

µ_{•c′} ∼ Dirichlet(1 + n_{>,c′}, 1 + n_{=,c′}, 1 + n_{<,c′}).   (4)
Thus, the mixture matrix is treated as a random variable. Putting everything together, we define a joint posterior for p and µ given the error-prone metric observations and the priors for p and µ.
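Sampling the mixture matrix from its column-wise Dirichlet posteriors can be sketched as follows; the confusion counts are invented for illustration:

```python
import random

def sample_mixture_matrix(confusion, rng):
    """Draw one mixture matrix from the posterior: each column c' of mu
    is Dirichlet(1 + n_{>,c'}, 1 + n_{=,c'}, 1 + n_{<,c'}), sampled via
    normalized Gamma draws."""
    cols = []
    for c_prime in range(3):
        g = [rng.gammavariate(1 + confusion[r][c_prime], 1.0) for r in range(3)]
        s = sum(g)
        cols.append([x / s for x in g])
    # transpose back to row-major mu[metric outcome][oracle outcome]
    return [[cols[c][r] for c in range(3)] for r in range(3)]

# confusion[r][c'] = # times the metric said outcome r while the oracle said c'
confusion = [[40, 5, 3],
             [6, 30, 4],
             [4, 5, 33]]
rng = random.Random(0)
mu = sample_mixture_matrix(confusion, rng)
col_sums = [sum(mu[r][c] for r in range(3)) for c in range(3)]
```

Drawing a fresh µ for every posterior sample of p is what propagates the uncertainty over the error rates into the final decision.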

Decision Function
Algorithm 1 shows how to apply the framework to one pair of systems π_a, π_b for a set of inputs I and a set of human annotations A. First, the metric M is used to generate the set of automated ratings M. Then the confusion counts n_{c,c′} are computed based on the human annotations and the metric ratings, and are used to create the distributions of the mixture matrix µ. Then the human annotations are used to estimate the prior distribution Pr(p) of the comparison results. The metric samples are then used to estimate the posterior distribution Pr(p, µ | m_>, m_=, m_<). Each of the three steps presented above yields a posterior distribution for p = (p_>, p_=, p_<). In order to decide whether system π_a is better than system π_b, we need to check whether p_> and p_< are significantly different. For this, we draw a number of samples p^(i) from the posterior. In general, this can be done using Markov Chain Monte Carlo sampling (the posterior in Equation 3 can be sampled directly). We define a significance level γ (e.g., γ = 0.05) and consider the fraction of samples where p^(i)_> > p^(i)_<. If this fraction is greater than 1 − γ/2, then we regard the difference as significant. Conversely, if the fraction is smaller than γ/2, then π_a is significantly worse than π_b.
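The decision rule at the end of Algorithm 1 can be written compactly; `samples` stands for posterior draws of p obtained by whichever sampler is used:

```python
def decide(samples, gamma=0.05):
    """Decision function: given posterior samples of p = (p_>, p_=, p_<),
    declare significance at level gamma (two-sided)."""
    samples = list(samples)
    frac = sum(1 for p in samples if p[0] > p[2]) / len(samples)
    if frac > 1 - gamma / 2:
        return ">"   # pi_a significantly better than pi_b
    if frac < gamma / 2:
        return "<"   # pi_a significantly worse than pi_b
    return "="       # no significant difference detected

# Degenerate example: every posterior draw has p_> > p_<
verdict = decide([(0.6, 0.2, 0.2)] * 1000)
```

With real posterior samples, the fraction reflects both the observed counts and the uncertainty over the metric's error rates.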

Evaluation Protocol
In this section, we present an evaluation protocol combining human and automated metric ratings, assuming a limited budget for human annotations. More formally, given a set of TG systems π_1, ..., π_S, we want to create a partial order where π_i > π_j if the win rate of π_i is significantly greater than that of π_j. The evaluation protocol is depicted in Algorithm 2. The protocol leverages the statistical framework to reduce the number of human annotations needed by exploiting the metric judgments. This works as follows: We are given a set of inputs I (which corresponds to a test set), a set of TG systems {π_1, ..., π_S} to be ranked, an automated metric M, and an annotation budget B, which is the maximum allowed number of annotations. The result of the protocol is a (potentially) partial order of the TG systems.
The protocol starts with a set of undecided system pairs, which initially consists of all pairs of systems, and an empty set of human annotations A_ij for each pair of systems. In a first step, the metric M computes the scores M_ij for each pair of systems. That is, for all inputs in I, all TG systems generate their outputs, which are then evaluated using M.
Then we repeat the following process until our budget is empty. First, we extend A_ij with a batch of N human annotations for each pair of undecided systems. We then iterate over the undecided system pairs and use the decision function from Algorithm 1 (see Section 3) to decide whether two given systems are significantly different given the current set of annotations and metric ratings. If so, the pair is removed from the set of undecided pairs. When the budget is empty or all system pairs are decided, a (potentially) partial order is computed. The decision function leverages human and automated ratings to state whether one system is significantly better than the other.
The advantage of the protocol is twofold. First, it exploits the fact that some system pairs are easier to distinguish than others: in cases where |p_> − p_<| is large, we need fewer human annotations to reach the significance threshold. Compared to a setting where we allocate the same number of human annotations to each pair of systems, this allows us to spend more of the annotation budget on difficult system pairs. This approach can be used even in the absence of automated ratings. Second, our framework allows for a seamless combination of ratings from both humans and an automated metric.
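As an illustration (not the paper's actual implementation), the budget loop of Algorithm 2 might be sketched as follows; `decide_pair` and `annotate_batch` are hypothetical stand-ins for the decision function and the human-annotation step:

```python
from itertools import combinations

def evaluation_protocol(systems, decide_pair, annotate_batch, budget, batch_size):
    """Sketch of Algorithm 2: spend the annotation budget only on pairs
    that are still undecided.

    decide_pair(pair, annotations) -> '>', '<', '=' once significant, else None;
    annotate_batch(pair, k) -> k new human annotations for that pair."""
    undecided = set(combinations(systems, 2))
    annotations = {pair: [] for pair in undecided}
    decisions = {}
    while undecided and budget > 0:
        for pair in list(undecided):
            k = min(batch_size, budget)
            if k == 0:
                break
            annotations[pair].extend(annotate_batch(pair, k))
            budget -= k
            verdict = decide_pair(pair, annotations[pair])
            if verdict is not None:
                decisions[pair] = verdict
                undecided.discard(pair)
    return decisions, undecided

# Demo with stub components: every pair is decided after 10 annotations.
decisions, undecided = evaluation_protocol(
    systems=["A", "B", "C"],
    decide_pair=lambda pair, anns: ">" if len(anns) >= 10 else None,
    annotate_batch=lambda pair, k: [">"] * k,
    budget=100,
    batch_size=10,
)
```

Easy pairs drop out of `undecided` early, so later iterations concentrate the remaining budget on the hard pairs.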

Case Studies -Setup
In this section, we present three case studies in which we apply the evaluation protocol outlined in Algorithm 2. As showcases, we use three domains: the WMT21 metrics task data (Freitag et al., 2021) for machine translation, the SummEval data (Fabbri et al., 2021) for summarization, and data collected for conversational dialogue systems (see Appendix C). Table 1 gives an overview of the setting. For each domain, we investigate a set of metrics applied to the outputs of a set of TG systems. We provide the details of the TG systems and the metrics in Appendix C.

Chatbot: For the chatbot domain, we used the ParlAI framework (Miller et al., 2017) to generate 1000 outputs for 5 different TG systems on the BlendedSkillTalk (BST) dataset (Smith et al., 2020). We then used the DialEval framework by Yeh et al. (2021) to run the outputs through 5 different metrics: DEB (Sai et al., 2020), GRADE (Huang et al., 2020), HolisticEval (Pang et al., 2020), MAUDE (Sinha et al., 2020), and USL-H (Phy et al., 2020). In addition, we used Amazon Mechanical Turk to annotate 50 pairwise outputs.

SummEval: For the summarization domain, we used the SummEval framework (Fabbri et al., 2021), which provides the outputs of 16 different summarization tools on the CNN/DailyMail corpus (Nallapati et al., 2016), as well as 100 expert annotations. We ran the outputs through the following metrics: (… et al., 2015), Rouge-L (Lin, 2004), S3 (Peyrard et al., 2017), SummaQA (Scialom et al., 2019), and SUPERT (Gao et al., 2020).

WMT21: For machine translation, we used the WMT21 metrics task data (Freitag et al., 2021). In this work, we only focus on the English-to-German language pair and the news domain, where eight machine translation systems were evaluated, plus three human references for each input, which were also regarded as TG systems (resulting in eleven TG systems). Although the WMT21 metrics task inspected 15 different automated metrics, we only focused on four of the most prominent ones: BleuRT (Sellam et al., 2020), COMET (Rei et al., 2020), C-SPEC (Takahashi et al., 2021), and sentence-level BLEU (Papineni et al., 2002). For each TG system there are 500 expert MQM annotations, and for each metric there are 1000 metric ratings.

Case Studies -Results
In this section, we discuss the results of the case studies. We use the error measures that we presented in List 1 in the introduction. Furthermore, for each system pair we compute the Kullback-Leibler Divergence (KLD) between the mode of the posterior in Equation 3 based on all human annotations, p_hum, and the mean estimated by running Algorithm 2, p_prot. We then report the average over all pairs of systems. Note that in Tables 2 and 3, we only report the Relevance part of the SummEval data due to space limitations. The results for the other features are in Appendix D.
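The per-pair KLD between the two categorical distributions over (>, =, <) can be computed directly; the two distributions below are hypothetical:

```python
import math

def kld(p, q, eps=1e-12):
    """KL divergence D(p || q) between two categorical distributions
    over the outcomes (>, =, <); eps guards against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

p_hum = (0.55, 0.25, 0.20)   # hypothetical human posterior mode
p_prot = (0.50, 0.28, 0.22)  # hypothetical protocol estimate
score = kld(p_hum, p_prot)
```

Averaging this score over all system pairs gives the reported KLD; identical distributions yield 0, and larger values indicate stronger disagreement with the human evaluation.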
The naive application of metrics yields many errors. When applying the metrics naively, i.e., by simply checking whether m_> is significantly larger than m_<, the Insertion Error is by far the most frequent outcome. That is, in all domains, the metrics have a strong tendency to suggest differences between systems that are not statistically significant according to the human evaluation. The rate of Inversion Errors depends on the domain and metrics used. For the chatbot domain, the average Inversion Error rate lies at 10%. For SummEval-Relevance, the Inversion Error rates lie at an average of 23%. The average KLD scores are high, which indicates that the naive application of metrics yields distributions that are in high disagreement with the human evaluation.
The evaluation protocol is able to recreate the original results. Table 3 shows the results of applying the protocol described in Algorithm 2. We also report the results achieved when applying the protocol using only human ratings (i.e., leaving M_ij empty), as well as the result of an ideal metric in the SummEval-Relevance case. First, we note that there are no Inversion Errors, almost no Omission Errors, and a high Correctness score. For the Chatbot and SummEval domains, the outcomes agree with those of the human evaluation in around 90% of cases. For the WMT21 domain, the agreement is lower, at 65%. The most common error type is the Insertion Error. In our setting, this can be explained by the fact that we are using the outcomes of significance tests to compare the human evaluation to the protocol evaluation. Thus, using corrected metric samples increases the number of samples, which leads to pairs being rated as significantly different.
Since the ratings are based on our decision function, which takes into account different sources of uncertainty, the Insertions are not necessarily wrong. In fact, one reason to use automated evaluation is to find differences between systems that would be too expensive to discover with human annotations. A different view for comparing the outcomes of the evaluations is given by the KLD score, which reports how close the distribution p_prot is to the original human evaluation. This view removes the significance test from the equation and better showcases the disagreement between the protocol and the original human evaluation. In all cases, the KLD scores are very low, which shows that the protocol yields results comparable to the original human evaluation. In terms of the number of annotations needed, there are two measures. First, we compare the number of annotations needed by the protocol to the number needed by the full human evaluation.
Here, the application of our protocol reduces the number of human annotations by more than half in most cases. For WMT21, we can even reduce it by two-thirds. The second view compares the number of annotations needed by the protocol to the annotations needed when the protocol is applied to human ratings only. For the Chatbot and SummEval domains, leveraging automated metrics results in less data being needed (up to 10% less for the Chatbot domain, and 5% for the SummEval domain). For the WMT21 domain, only a 1% difference is measured. We assume that this is due to the fact that the metrics are not yet of high enough quality to yield the boost needed to have a large impact.

Summary. Figure 2 summarizes the main outcomes of this work. The figure shows the number of annotations (x-axis) in relation to the negative log-KLD score achieved (y-axis). The full human evaluation is set as the reference, that is, using 100% of annotations and a KLD score of 0 (thus, not shown). The figure shows that, on average, using the full protocol with real-world metrics requires 40% of the annotations and achieves a KLD score of 0.08. On the other hand, not using metric ratings in the protocol requires 43% of the annotations and achieves a worse KLD score of 0.6. The naive application of the metrics does not need any annotations but yields high KLD scores (1.6 on average). To showcase an upper limit, we also added the KLD divergence for an ideal metric, which we simulated using the Bayesian model with a fixed µ (see Appendix E for details). The ideal metric only needs 38% of the annotations and achieves a KLD score of 0.02. An ideal metric would also achieve a low KLD when applied naively.

Related Work
We here focus on approaches that discuss theory-driven analysis of metrics-based evaluation of TG systems that involve human annotations. Chaganty et al. (2018) propose an approach to combine human and metrics-based evaluation using control variates to reduce biases in automated metrics and save annotation costs. They explore automated scalar metrics in the Summarization and QA domains and find that their approach can lead to marginal reductions of the required human annotations. They conclude that further improvement of automated metrics and exploration of ranking-based evaluation are potential future directions. One interesting take-away from this work is the influence of the quality of the human annotations. Currently, we approximate the oracle preference ratings through human annotations without explicitly modeling uncertainty stemming from annotator disagreement. Wei and Jia (2021) pick up this point and apply a statistical approach to identify a setting in which automated metrics for Machine Translation and Summarization are more precise than human annotations: When the qualitative difference is small and there are only a few human annotations. They argue that the reason is that while human annotations are unbiased, they have a high variance.
Conversely, automatic metrics tend to have a high bias but low variance. Furthermore, they apply the bias-variance-noise decomposition from Domingos (2000) to analyse sources of errors in evaluation and assess bias levels in automated metrics. Our analysis, in comparison, is more fine-grained in terms of categorizing metric errors, and we propose how to combine human and metric evaluation under a budget constraint. Similar to this work, von Däniken et al. (2022) propose a model that captures uncertainties stemming from imperfect metrics and insufficiently sized test sets. With their framework, the required size of a test set that is needed to distinguish a given difference in performance between two systems with a given automated metric can be calculated. Their investigation is limited to the case when scalar metrics are converted to binary metrics, however. Card et al. (2020) also analyse the required data set sizes that enable the detection of significant differences between systems, but they do not account for metric errors explicitly. Hashimoto et al. (2019) propose an evaluation approach that combines human and automated evaluation in a statistical framework to estimate diversity and quality of NLG systems jointly. Their focus is the creation of a novel metric, while our goal is to evaluate existing ones and to combine them with human annotations to obtain robust evaluations.

Conclusion
In this work, we introduced a novel Bayesian model that explicitly handles errors from automated metrics at the sample level. We then proposed an evaluation protocol that leverages this statistical model to reduce the amount of human annotations needed while yielding similar evaluation outcomes. We applied the protocol to three tasks in a case study, namely dialogue systems, summarization, and machine translation. The results show that the Bayesian model is able to successfully include various types of uncertainty, which leads to more trustworthy applications of automated metrics. When applying the protocol, we achieve similar results as a purely human evaluation with only half the annotations needed.

… and internal funding by the Zurich University of Applied Sciences.

Limitations
Human Ratings as Oracle. In this work, we make the strong assumption that human ratings are equivalent to the oracle. As noted in the introduction, human evaluation is hard to set up and does not always lead to satisfactory agreement scores. However, for SummEval and WMT21, the human ratings are provided by experts, and thus can be seen as close to oracle ratings. For the Dialogue domain, the ratings were collected via crowdsourcing, where we applied MACE (Hovy et al., 2013) to obtain the highest-quality ratings. In future work, we will integrate the uncertainty of the human evaluation into the Bayesian model as well, which is not trivial.
Pairwise µ. We noted that, for each pair of systems, the mixture matrix is different. As a consequence, the errors made by the metric must be computed for each pair of systems separately, which is more cost-intensive. In future work, we aim to develop methods to transfer the knowledge from one system pair to another. This also highlights one issue of automated metrics, namely, that they are biased towards certain output types, which are exhibited by certain TG systems.
Draws are ignored. One issue with preference-based ratings is the question of how to handle draws on the sample basis. Currently, we use p_> and p_< to decide if two systems are significantly different. However, if we consider the case of p_> = 0.02, p_< = 0.01, and p_= = 0.97, then with enough samples, we will measure a statistically significant difference between the two systems. However, in 97% of cases the outputs are of equal quality. Thus, can we really state that one system is better than the other?

Statistical Significance Decision. To compare the outcomes of the automated evaluation to the human evaluation, we rely on statistical significance testing. For this, we use the standard approaches, which are widely adopted. However, we noted that the significance decisions are rather arbitrary and make it hard to compare two evaluations; especially the interpretation of Insertion Errors is not trivial. The large amounts of additional automated metric ratings result in some pairs being rated as significantly different. However, it is not clear whether this is a mistake or whether we were able to distinguish two systems that were not distinguishable due to too little data. The KLD score gives better insights here, as it compares distributions.
Differences in Samples. Currently, we disregard the fact that some samples are harder to rate than others. In fact, we treat each sample as being equal in Definition 3. However, the sample difficulty could be leveraged to distinguish different systems from each other. For instance, if two machine translation systems are evaluated only on easy samples, then they might be rated as being of equal quality. However, a test on a harder sample might show the difference in capabilities between the two systems.
Conversion to Preference Ratings. Current automated metrics are built such that they return a scalar value ∈ R to rate a given pair of input and output. We have to transform these values into preference ratings by looking at the sign of the difference between the ratings of two outputs (see Appendix C). This leads to a few problems. First, there are only a few draws, since metrics rarely return the exact same floating-point value for two different outputs. Second, we disregard the magnitude of the scalar value. The magnitude can be used to assess the certainty of the preference of one output over another. In preliminary experiments, we tried including a minimal threshold that the difference needs to surpass in order to be regarded as a preference decision. This will have to be explored in more detail in future work.
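The conversion described here, including the experimental minimal threshold, can be sketched as follows; `tau` is the hypothetical draw margin:

```python
def scores_to_preference(score_a, score_b, tau=0.0):
    """Convert two scalar metric scores into a preference rating.
    tau is a minimal margin below which the pair counts as a draw;
    tau = 0 recovers the plain sign-of-the-difference rule."""
    diff = score_a - score_b
    if diff > tau:
        return ">"
    if diff < -tau:
        return "<"
    return "="
```

With tau = 0, nearly every pair becomes a win or a loss; a positive tau reintroduces draws at the cost of choosing the margin, which is the open question raised above.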
Current Metric Performance. Since current metrics are not yet of sufficiently high quality, their impact on the protocol is small. This might give the impression that the protocol does not offer any remedy. However, the results show that our Bayesian model is able to rectify the overconfidence of low-performing metrics: where a metric is of low quality, its impact is reduced.

A Derivations
Dirichlet. We will first explain the usage of Dirichlet distributions in Equations 3 and 4. The Dirichlet distribution of order K is defined for all K-dimensional probability vectors p = (p_1, ..., p_K) such that ∑_{i=1}^{K} p_i = 1 and p_i ≥ 0. It has K parameters α = (α_1, ..., α_K) and its density is

f(p; α) = (1 / B(α)) ∏_{i=1}^{K} p_i^{α_i − 1},

where B(α) is the multivariate Beta function used to normalize the distribution. Note that if all α_i = 1, then the density is constant at all points p, meaning it is equivalent to the uniform distribution in that case. Our main interest in the Dirichlet distribution is that it is the conjugate prior of the Multinomial distribution. In Section 3, the counts from the oracle ratings n_>, n_=, and n_< follow a Multinomial distribution with unknown probabilities p = (p_>, p_=, p_<), meaning that

P(n_>, n_=, n_< | p) ∝ p_>^{n_>} p_=^{n_=} p_<^{n_<}.

If we assume a Dirichlet prior p ∼ Dirichlet(α_>, α_=, α_<), then we can compute its posterior:

f(p | n_>, n_=, n_<) ∝ P(n_>, n_=, n_< | p) · f(p; α) ∝ p_>^{n_>} p_=^{n_=} p_<^{n_<} · p_>^{α_> − 1} p_=^{α_= − 1} p_<^{α_< − 1} = p_>^{α_> + n_> − 1} p_=^{α_= + n_= − 1} p_<^{α_< + n_< − 1}.

We left out the normalization constants in this derivation. The final expression is the (unnormalized) density of an updated Dirichlet distribution, Dirichlet(α_> + n_>, α_= + n_=, α_< + n_<).
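Numerically, the conjugate update amounts to adding the observed counts to the prior parameters; the counts below are hypothetical:

```python
import numpy as np

# Uniform Dirichlet prior (alpha_> = alpha_= = alpha_< = 1) over
# the oracle outcome probabilities p = (p_>, p_=, p_<).
alpha = np.ones(3)

# Hypothetical oracle counts (n_>, n_=, n_<) from human ratings.
counts = np.array([40, 25, 15])

# Conjugacy: the posterior is again Dirichlet with updated parameters.
posterior = alpha + counts
print(posterior)                 # [41. 26. 16.]

# The posterior mean of p_c is (alpha_c + n_c) / sum(alpha + counts).
print(posterior / posterior.sum())
```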
Setting all α_i = 1, we get our result in Equation 3. We can apply the same principle to the columns of µ. The first column µ_{•>} = (µ_{>>}, µ_{=>}, µ_{<>})^T denotes the conditional probabilities of getting a specific outcome from the metric, conditioned on the oracle rating being >. The associated confusion counts n_{>>}, n_{=>}, and n_{<>} again follow a Multinomial (here: Trinomial) distribution with outcome probabilities µ_{•>}. If we again assume a uniform prior for µ_{•>}, then we can derive the posteriors in Equation 4.
Mixture. In Sections 3.2 and 3.3 we use the fact that the probability vector associated with the counts of metric ratings is p̃ = µp. We know that p̃_> = P(M(i_j, o_j^a, o_j^b) = >). Using the law of total probability:

p̃_> = ∑_{c ∈ {>,=,<}} P(M(i_j, o_j^a, o_j^b) = > | Ω(i_j, o_j^a, o_j^b) = c) · P(Ω(i_j, o_j^a, o_j^b) = c).

We note that P(Ω(i_j, o_j^a, o_j^b) = c) = p_c and P(M(i_j, o_j^a, o_j^b) = > | Ω(i_j, o_j^a, o_j^b) = c) = µ_{>c}, and therefore p̃_> = ∑_{c ∈ {>,=,<}} µ_{>c} p_c, and analogously for p̃_= and p̃_<. This leads us to our original statement p̃ = µp.
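As a sanity check, the identity p̃ = µp is just a matrix–vector product; with any column-stochastic confusion matrix µ (the values below are made up), the metric outcome probabilities again sum to one:

```python
import numpy as np

# Hypothetical confusion matrix mu: mu[r, c] = P(metric says r | oracle says c),
# rows/columns ordered (>, =, <); each column sums to 1.
mu = np.array([
    [0.8, 0.3, 0.1],
    [0.1, 0.4, 0.1],
    [0.1, 0.3, 0.8],
])
p = np.array([0.5, 0.2, 0.3])  # hypothetical oracle probabilities

# Law of total probability: p_tilde = mu @ p gives the outcome
# probabilities of the error-prone metric ratings.
p_tilde = mu @ p
print(p_tilde)                 # a valid probability vector again
```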
Using Annotations Multiple Times. In Algorithm 1, it can be unclear which subsets of annotations should be used for the various counts. In general, the test set of inputs I can be split into three subsets: I_{M,A}, the samples for which we have paired ratings from both the metric and humans; I_M, the samples for which we only have ratings from the automated metric; and I_A, the samples for which we only have human ratings. It is relatively obvious that we can use I_M to count m_>, m_=, m_<, I_A for n_>, n_=, n_<, and I_{M,A} for the confusion counts n_{cc′}. The question is whether it is sound to use the ratings from I_{M,A} to augment n_c and m_c. In general, it should not be an issue to use the human ratings in I_{M,A} as additional counts for n_>, n_=, n_<, but we must not use the metric ratings to get additional counts m_>, m_=, m_<. This means that, in principle, I_A could be empty.
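The counting rules above can be sketched as follows; the rating dictionaries are hypothetical, with sample ids as keys and preference ratings as values:

```python
# Hypothetical ratings: human covers I_A plus I_{M,A},
# metric covers I_M plus I_{M,A}.
human = {1: ">", 2: "=", 3: ">", 4: "<"}
metric = {3: ">", 4: ">", 5: "=", 6: "<"}

paired = human.keys() & metric.keys()  # I_{M,A}

# Human counts n_c may use all human ratings, including the paired ones.
n = {c: sum(1 for r in human.values() if r == c) for c in "<=>"}

# Metric counts m_c must only use metric-only samples (I_M), never I_{M,A}.
m = {c: sum(1 for i, r in metric.items() if i not in paired and r == c)
     for c in "<=>"}

# Confusion counts n_{cc'} come from the paired samples I_{M,A}.
confusion = {}
for i in paired:
    key = (human[i], metric[i])
    confusion[key] = confusion.get(key, 0) + 1

print(n, m, confusion)
```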

B Markov Chain Monte Carlo (MCMC) Sampling
Markov Chain Monte Carlo (MCMC) methods are often used in Bayesian modelling when there is no analytic closed-form solution for the resulting posterior. The main idea is that expected values of functions of the posterior can be reasonably approximated by averaging over samples drawn from it (Metropolis and Ulam, 1949). Samples are generated sequentially: the next sample is usually generated by first modifying the current sample randomly and then either accepting or rejecting it based on its likelihood. We refer the interested reader to Andrieu et al. (2003) for an introduction. We use the NumPyro library (Bingham et al., 2019; Phan et al., 2019) to implement the framework laid out in Section 3, with the built-in No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014). When running the decision function laid out in Algorithm 1, we run 5 chains in parallel. We use a warm-up period of 2000 samples per chain, which are discarded, and draw 10000 samples per chain to keep and compute the difference in win rates.
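We do not reproduce the NUTS machinery here, but the accept/reject idea can be illustrated in a few lines of plain Python. The sketch below is a random-walk Metropolis sampler targeting a toy Beta(3, 2) posterior, not the model from Section 3; the step size and warm-up length are arbitrary choices:

```python
import math
import random

random.seed(0)

def log_post(p: float) -> float:
    """Unnormalized log density of a toy Beta(3, 2) target on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return 2 * math.log(p) + math.log(1 - p)

# Metropolis: randomly perturb the current sample, then accept or
# reject the proposal based on the likelihood ratio.
samples, cur = [], 0.5
for step in range(12_000):
    prop = cur + random.gauss(0.0, 0.1)
    if math.log(random.random()) < log_post(prop) - log_post(cur):
        cur = prop
    if step >= 2_000:            # discard a warm-up period, as in the text
        samples.append(cur)

# The sample mean approximates E[p] = 3/5 for Beta(3, 2).
print(sum(samples) / len(samples))
```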

C Case Study Details
For the case studies, we require two types of ratings: preference ratings made by humans to simulate the oracle Ω, and metric ratings for the error-prone ratings M. We collect this data for three domains: Conversational Dialogue Systems, Automated Text Summarization, and Machine Translation.
Since most metrics return a scalar value, we need to transform it into a preference rating, which is done as follows:
Definition 4 (Scalar Metric). We call real-valued functions of inputs and outputs scalar metrics: M_s : I × O → ℝ.
A preference metric can be constructed from a scalar metric as follows:
Definition 5. The derived comparison metric M of a given scalar metric M_s is defined as M(i, o^a, o^b) = > if M_s(i, o^a) > M_s(i, o^b), = if M_s(i, o^a) = M_s(i, o^b), and < if M_s(i, o^a) < M_s(i, o^b).

C.1 Dialog System Data Collection
For the Dialog domain, we used the ParlAI framework (Miller et al., 2017) to generate the outputs of the systems. For this, we selected 5 state-of-the-art dialogue systems and used ParlAI to generate a response for a static context. The dialogue systems are:
• Blenderbot 1.0 - 400distill (BL400distill).
• SeekerDial3B. SeekerDial (Shuster et al., 2022a) improves on the internet search proposed in BlenderBot 2.0.
We used ParlAI to generate the response of each of the dialogue systems for 1000 static contexts from the Blended Skill Talk (BST) test set. From this set, we selected 50 contexts and generated all pairwise outputs between all 5 dialogue systems and the human reference. That is, for each context, there are 15 pairs of outputs to be rated.
We let workers on Amazon Mechanical Turk perform a preference rating. That is, for each pair of outputs, the workers decided which output is more appropriate. Figure 3 shows the annotation tool. Each sample was annotated by three workers. Each worker is paid 15 cents per annotation; at a rate of 1.5 annotations per minute on average, they achieve a wage of $12 per hour. We used workers with the Master status and restricted their geographic location to English-speaking countries (USA, UK, Canada, Ireland, and Australia). Figure 4 shows the instructions given to the annotators. The annotations are ratings of the overall adequacy of the utterances. Since each sample is annotated by three different workers, we aggregate the ratings using the MACE (Hovy et al., 2013) software, which computes a trustworthiness score for each annotator and generates a weighted average to obtain the final label. Note that this annotation scheme directly yields preference ratings according to our framework.
To generate the metric ratings, we used the DialEval framework by Yeh et al. (2021), which integrates a large pool of metrics. We selected five metrics that were easy to set up and achieved decent correlations with human judgments in the evaluation by Yeh et al. (2021). Since the metrics return scalar values for each sample, we create preference ratings as suggested in Definition 5. Because the scalar metrics yield real-valued scores, there are almost no cases where the derived preference rating yields a draw.

C.2 Summarization Data Collection
For the Summarization domain, we use the data provided by the SummEval framework (Fabbri et al., 2021). It contains data from the DailyMail/CNN dataset (Nallapati et al., 2016), which has a test set of 11k samples. The SummEval framework contains the outputs of 23 summarization systems (for a detailed description of these systems, we refer the reader to Section 3.2 of Fabbri et al. (2021)). For 16 of the 23 summarization systems, the authors let 100 generated outputs be rated by three experts on four characteristics: fluency, consistency, coherence, and relevance. Since these ratings are on a Likert scale, we transformed them into preference ratings by averaging the three ratings per sample. We chose 7 automated metrics based on their popularity and ease of setup. We applied the SummEval framework to generate the automated ratings for each of the 16 summarization systems on the full test set. Analogous to above, we converted the scalar ratings to pairwise ratings by applying Definition 5.

C.3 Machine Translation Data Collection
For the Machine Translation domain, we used the WMT-21 metrics task data (Freitag et al., 2021). Due to space limitations, we used only the EN→DE section of the data. The WMT-21 dataset consists of 15 different metrics for 8 machine translation systems and 3 human references. The human annotations consist of 500 MQM ratings (Lommel et al., 2014) done by expert translators. To create preference ratings, we computed the average score of a sample and compared the scores of the outputs of two different systems for the same input by applying Definition 5.
The WMT-21 dataset already contains the ratings of the automated metrics for 1000 samples, out

Figure 3: Screenshot of the Dialogue Annotation Tool.

Figure 4: Screenshot of the Instructions of the Dialogue Annotation Tool.

Table 1: Overview of the data used. The ratings refer to the number of ratings available for each pair of TG systems.

Table 3: Frequency of error types for all metrics if the protocol is applied: Correct (Cor.), Inverted (Inv.), Omission (Omi.), Insertion (Ins.), and the fraction of annotations needed with the protocol (Ann.).
bigger than m_<, then this introduces many cases where systems that are not significantly different are rated as such. In Table 2, the average error rates for each error type are shown (the full result table is found in Appendix D). Overall, the Insertion error type dominates (averaging between 23% and 33%).
• BlenderBot 2.0. BlenderBot 2.0 extends BlenderBot 1.0 with an internet search module (Komeili et al., 2022) and a long-term memory (Xu et al., 2022).
• DialoGPT. DialoGPT (Zhang et al., 2020) is a decoder-only dialogue system that fine-tunes GPT-2 (Radford et al.) on the Reddit dataset proposed by the DialoGPT authors.
• PolyEncoder. The PolyEncoder (Humeau et al., 2020) is a retrieval-based dialogue system, which selects the most suitable response from a set of candidates.