The statistical advantage of automatic NLG metrics at the system level

Estimating the expected output quality of generation systems is central to NLG. This paper qualifies the notion that automatic metrics are not as good as humans in estimating system-level quality. Statistically, humans are unbiased, high variance estimators, while metrics are biased, low variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Measuring this error is complicated: predictions are evaluated against noisy, human predicted labels instead of the ground truth, and metric predictions fluctuate based on the test sets they were calculated on. By applying a bias-variance-noise decomposition, we adjust this error to a noise-free, infinite test set setting. Our analysis compares the adjusted error of metrics to humans and a derived, perfect segment-level annotator, both of which are unbiased estimators dependent on the number of judgments collected. In MT, we identify two settings where metrics outperform humans due to a statistical advantage in variance: when the number of human judgments used is small, and when the quality difference between compared systems is small.


Introduction
Automatic metrics are involved in many developmental settings for natural language generation (NLG) systems. In machine translation (MT), metrics like BLEU (Papineni et al., 2002) enable settings where the amount of human effort required would be infeasible, such as architecture or hyperparameter search (Britz et al., 2017). As objective, reproducible quantities, BLEU scores facilitate cross-paper comparisons (Post, 2018). Historically, progress in MT has been attributed to its use (Callison-Burch et al., 2006).¹ Metrics are an active research area in many NLG subfields, including summarization (Lin, 2004), dialogue (Tao et al., 2018), and image captioning (Anderson et al., 2016), which seek to realize the goal of quick and reliable automatic evaluation.

¹ The data and code to reproduce our analyses are available at https://github.com/johntzwei/metric-statistical-advantage.

Figure 1: Distribution of estimators for the true difference in system quality $\delta^H_{S,S'}$ between two generation systems (for illustrative purposes). Notation is defined in §2.3. An estimate incurs prediction error if its sign is opposite to the true difference. While humans provide an unbiased estimator of the difference, a biased estimator derived from a metric can have a smaller error probability (shaded areas) due to its lower variance. Evidence supporting the illustration can be found in §5.
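To make the figure's intuition concrete, here is a minimal numerical sketch; the means and standard deviations are hypothetical, chosen only for illustration, and are not fitted to any dataset in this paper.

```python
from scipy.stats import norm

delta = 1.0  # hypothetical true quality difference (system A minus B)

# Unbiased, high-variance "human" estimator: the probability of a
# sign error is the mass of the estimator's distribution below 0.
p_err_human = norm.cdf(0, loc=delta, scale=2.0)          # ~0.31

# Biased, low-variance "metric" estimator (bias of -0.5 points).
p_err_metric = norm.cdf(0, loc=delta - 0.5, scale=0.25)  # ~0.02

print(p_err_human, p_err_metric)
```

Even with a bias of half the true difference, the low-variance estimator predicts the wrong sign an order of magnitude less often.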
In all these subfields, the primary goal when conducting evaluation is typically to compare NLG systems. Both human annotators and automatic metrics produce segment-level scores, i.e., scores for individual examples, so comparing systems requires aggregating segment-level scores into an overall system-level score for each system. Ideally, we would compare systems by their expected human annotator score (an average over infinite human judgments), which we term the true system quality. In practice, we can only estimate this expectation with a sample mean over a finite number of human judgments. Metrics offer a cheaper alternative: we can instead compare systems by their aggregate metric scores on a number of system outputs. When comparing systems, we care primarily about how well we estimate the difference of their true system qualities, and in particular the sign of this difference (i.e., which system is better), which we term the true pairwise label.
There is a gap in our understanding of system-level metrics. To recount a perplexing anecdote, in the most recent edition of the WMT metrics shared task (Mathur et al., 2020b), the initial human evaluation disagreed with most metrics on a pairwise prediction for two translation systems. In a manual re-evaluation, the second-round results favored the metrics. Our paper offers a statistical explanation for how humans could go "wrong": even if human estimation of the difference in system quality is unbiased, it has high variance. On the other hand, while estimators based on metrics are biased, they have low variance. It is therefore possible for metrics to give a more accurate pairwise prediction than humans when the bias is small (see illustration in Figure 1). Our paper explores this distinction through the following three questions: (1) How can we evaluate system-level metrics? When observing estimator error in terms of pairwise predictions, predictions are evaluated against noisy, human predicted labels rather than the ground truth. In addition, metric predictions fluctuate based on the sample of outputs from the generation system. To disentangle these properties, we examine observed estimator error under a bias-variance-noise decomposition. Under simulation, we find that label noise and metric variance account for a small fraction of observed error in both MT and summarization.
(2) How good are these metrics? We compare the errors of metric estimators computed on an infinite number of system outputs, against human estimators with varying amounts of human judgment. We also derive the error of a perfect segment-level annotator (i.e. they provide noiseless/expected human scores for each output), which is also unbiased and judgment dependent. Empirically, some MT metrics exceed the performance of unbiased estimators with a small number of judgments.
(3) What are the limits of system-level evaluation? The perfect segment-level annotator, as the noiseless human, provides an optimistic estimate for the number of human judgments necessary to achieve a fixed performance. With a power analysis, we can analytically calculate the number of judgments necessary to detect differences between systems of varying sizes. When differences in system quality are small, a prohibitively large number of perfect annotator judgments are required to give a correct pairwise prediction.

System-level scores
We will now formalize scoring at the system level, adopting notation from Chaganty et al. (2018). Let X be a distribution over inputs (e.g. source sentences), and S be a set of systems (e.g. all translation systems in WMT). Each system S ∈ S takes input x ∼ X and returns output z = S(x) (e.g. z is a translation). Let H(z) be a random variable representing a human judgment according to some evaluation prompt (e.g. translation adequacy, from 0-100). A central quantity of interest is the quality of system S, defined as

$$\mu^H_S = \mathbb{E}_{x \sim X}\left[\,\mathbb{E}\left[H(S(x))\right]\,\right], \quad (1)$$

which is not directly observable as it requires infinite human judgment. We can estimate (1) with a finite test set of n examples. Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{i.i.d.}}{\sim} X$ be a sampled test set and $z^{(1)}, \ldots, z^{(n)}$ be the set of outputs where each $z^{(i)} = S(x^{(i)})$. Human judgments are sampled independently as $y^{(i)} \sim H(z^{(i)})$. The sample mean

$$\hat{\mu}^H_S = \frac{1}{n} \sum_{i=1}^{n} y^{(i)} \quad (2)$$

is an unbiased estimator of (1). Only (2) is observable, which is a noisy approximation of (1).

A cheaper alternative to estimating the true quality scores is an estimator based on an automatic metric. Let M (e.g. BERTSCORE) be an automatic metric that takes as input any number of outputs from a system S and produces the score

$$\hat{\mu}^M_S = M\left(z^{(1)}, \ldots, z^{(n)}\right),$$

where $\hat{\mu}^M_S$ is a biased estimator of $\mu^H_S$. As the test set is sampled, the metric score has non-zero variance. Note that while we use the Greek letter µ, only some system-level metrics (e.g. ROUGE) are averages of their segment-level counterparts (their score decomposes to $\hat{\mu}^M_S = \frac{1}{n} \sum_{i=1}^{n} M(z^{(i)})$). Empirically, we find that metrics using other aggregation strategies have convergent properties similar to an average (see Appendix B). We sidestep this by defining the "true" metric score as

$$\mu^M_S = M\left(z^{(1)}, \ldots, z^{(m)}\right)$$

for test sets of size m sufficiently large that this true score is nearly constant.
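As a minimal sketch of the two estimators (toy numbers of our own, not the released analysis code), both are just sample means over a test set:

```python
import numpy as np

def human_system_score(judgments):
    """Unbiased estimate of system quality: the sample mean of human
    judgments y^(i) ~ H(z^(i)) over the sampled test set (eq. 2)."""
    return np.mean(judgments)

def metric_system_score(segment_scores):
    """System-level score for a metric that averages its segment-level
    scores, e.g. ROUGE. BLEU-style metrics aggregate differently but
    behave similarly as the test set grows (Appendix B)."""
    return np.mean(segment_scores)

# Toy usage: 0-100 direct assessment judgments for n = 5 outputs.
print(human_system_score([72.0, 65.0, 80.0, 55.0, 90.0]))
```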

Problems in evaluating with correlation
Research in system-level metrics has a tradition of evaluating a metric's correlation to human judgment with the Pearson correlation coefficient (Reiter, 2018). Formally, these evaluations compare

$$r_M = \mathrm{Corr}_S\left(\hat{\mu}^H_S, \hat{\mu}^M_S\right).$$

Mathur et al. (2020a) highlight two issues with the use of correlation: First, Pearson's r is neither interpretable nor reflective of system-level metric use in practice. Second, outlier systems (systems with very high/low human/metric scores) can arbitrarily inflate Pearson's r, and outlier systems often exist. Mathur et al. (2020a) propose evaluating metric accuracy in pairwise prediction (can the metric differentiate which generation system is better?) as an alternative that mitigates the issues mentioned above.
We add two points that apply to any measure of metric performance, whether correlation or pairwise predictions: First, metrics cannot be perfect due to noise in human labels. For instance, while r ranges over [−1, 1], even the "metric" that predicts the true quality $\mu^H_S$ exactly has $\mathrm{Corr}_S(\hat{\mu}^H_S, \mu^H_S) < 1$ due to noise in $\hat{\mu}^H_S$. It is unclear what true upper bound of performance we can expect to achieve. Second, direct measurement of any performance measure on our datasets introduces sample bias (Engstrom et al., 2020). For correlation, $r_M$ could be high because $\hat{\mu}^H_S$ and $\hat{\mu}^M_S$ happened to align for this data collection, but a repeat experiment could yield different results. A more holistic view is to give an estimate of average-case performance.² The evaluation methodology we derive in §4 addresses the latter points for pairwise predictions and mean squared error (which has a direct relationship to correlation). However, we also believe that pairwise predictions are a step in the right direction, and our discussion continues with them.

² Pearson's r was not formulated for settings where each data point $\hat{\mu}^H_S$, $\hat{\mu}^M_S$ is itself a draw from its own distribution, so applying Williams's test (Graham and Baldwin, 2014) also falls short here.

Pairwise predictions
We will now formalize pairwise predictions. For systems S, S' ∈ S, define the true difference in their system scores as

$$\delta^H_{S,S'} = \mu^H_S - \mu^H_{S'} \quad (5)$$

and the observed difference as

$$\hat{\delta}^H_{S,S'} = \hat{\mu}^H_S - \hat{\mu}^H_{S'}, \quad (6)$$

and likewise for the differences $\delta^M_{S,S'}$ and $\hat{\delta}^M_{S,S'}$ w.r.t. a metric M. In practice, we are interested in the pairwise prediction for S and S', i.e. whether $\delta^H_{S,S'} > 0$, given that we have collected human judgments (we observe $\hat{\delta}^H_{S,S'} \gtrless 0$) or computed metric scores (we observe $\hat{\delta}^M_{S,S'} \gtrless 0$). Refer to Figure 1 for an illustration.
To operationalize the pairwise prediction for S and S', let the true pairwise label

$$\Delta^H_{S,S'} = \mathbb{1}\left[\delta^H_{S,S'} > 0\right]$$

be the central quantity of interest. Define the human predicted pairwise label as

$$\hat{\Delta}^H_{S,S'} = \mathbb{1}\left[\hat{\delta}^H_{S,S'} > 0\right],$$

which is what we typically compare against when we calculate metric pairwise accuracy from our datasets.
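A sketch of the predicted label, under the same toy setup as above (hypothetical judgment values):

```python
import numpy as np

def pairwise_label(scores_a, scores_b):
    """Predicted pairwise label: 1 if the observed score difference
    between systems A and B is positive, else 0 (eq. 6 applied to
    either human judgments or metric scores)."""
    return int(np.mean(scores_a) - np.mean(scores_b) > 0)

# The human predicted label and a metric's predicted label are
# computed the same way, just from different score arrays.
print(pairwise_label([72, 65, 80], [70, 68, 74]))  # -> 1
```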

Datasets
WMT16-19 metrics shared task

Data. We use the past 4 years of to-English translation data from the WMT metrics shared task (Bojar et al., 2016b, 2017; Ma et al., 2018, 2019), with about 1312-5612 judgments depending on the year and language pair. For ease of interpretation, we always use raw direct assessment judgments, which range from 0-100.

Metrics. We evaluate the performance of the three metrics included in SacreBLEU (BLEU, TER, CHRF; Post, 2018; Koehn et al., 2007). These three have also participated in every year of the metrics task as baselines. In addition, we include two recently developed metrics: BERTSCORE (Zhang et al., 2020) and BLEURT (Sellam et al., 2020). Both metrics are found to effectively utilize contextual embeddings (Devlin et al., 2019), and BLEURT is a learned metric (tuned on data outside of WMT2019). For all metrics, we use the default settings for scoring. Since BLEURT is trained on WMT15-18, we test it only on WMT2019 pairs.

SummEval
Data. The SummEval dataset (Fabbri et al., 2020) contains 100 outputs from each of 17 summarization systems, resulting in 136 pairwise examples. Each system output has 3 expert judgments, and 11 references are available for metric scoring. Each summary is judged in four categories on a 1-5 scale: coherence, consistency, fluency, and relevance. To compute the system-level human score for a system, we first average over categories to obtain an aggregate expert score, and then average the aggregated expert scores per system. Metric scores for system outputs were computed against as many references as possible.
Metrics. We evaluate the performance of several metrics that were found to be effective at the system level in Fabbri et al. (2020). These include the traditional ROUGE-4 (Lin, 2004) summarization metric, its extension ROUGE-WE (Ng and Abrecht, 2015), and METEOR (Lavie and Agarwal, 2007). In addition, we include two metrics based on BERT (Devlin et al., 2019): BERTSCORE (Zhang et al., 2020), also present in the WMT analysis, and SUPERT (Gao et al., 2020), a reference-less metric for summarization.

Decomposing observed metric error
Two sources of variation distinguish the observed pairwise error (11) from the true error in (10): the noise in the human predicted labels due to finite judgments, and the variance in the metric due to finite test sets. Approximating (11) is straightforward with the bootstrap, but disentangling the error from these two sources of variation requires more care. With the bias-variance-noise decomposition, we can adjust our observed error estimates to the noise-free, infinite test set setting of the true error.

The bias-variance-noise decomposition
The bias-variance-noise decomposition due to Domingos (2000) decomposes the observed pairwise error in (11) w.r.t. two constant labels for any pairwise example on systems S, S' ∈ S:

• The true pairwise label for this example is

$$\Delta^{H*}_{S,S'} = \mathbb{1}\left[ P\left( \hat{\delta}^H_{S,S'} > 0 \right) > \tfrac{1}{2} \right], \quad (12)$$

and the estimator that produces these true labels has, by definition, the lowest observed error. In the decomposition, the human predicted label noise and the metric bias are defined relative to the true labels. Assuming the central limit theorem (proof in Appendix A), we actually have $\Delta^{H*}_{S,S'} = \Delta^H_{S,S'}$, the true pairwise label derived from the true difference in (5).
• The main prediction of a metric for this example is

$$\Delta^{M*}_{S,S'} = \mathbb{1}\left[ P\left( \hat{\delta}^M_{S,S'} > 0 \right) > \tfrac{1}{2} \right],$$

and we assume that the metric prediction converges to the main prediction as the test data for S and S' increases (empirically validated in Appendix B). In the decomposition, the metric variance is defined relative to the main prediction.
Starting from the loss incurred by M on this pairwise example, the decomposition gives us

$$\mathbb{E}\left[ \mathbb{1}\left[ \hat{\Delta}^M_{S,S'} \neq \hat{\Delta}^H_{S,S'} \right] \right] = c_0\,\mathrm{Noise}\left(\hat{\Delta}^H_{S,S'}\right) + \mathrm{Bias}\left(\hat{\Delta}^M_{S,S'}\right) + c_1\,\mathrm{Var}\left(\hat{\Delta}^M_{S,S'}\right),$$

where the noise

$$\mathrm{Noise}\left(\hat{\Delta}^H_{S,S'}\right) = P\left( \hat{\Delta}^H_{S,S'} \neq \Delta^{H*}_{S,S'} \right)$$

is an irreducible loss incurred by computing pairwise accuracy against the human predicted labels instead of the true labels. Note that this noise term also exactly corresponds to the lowest achievable observable error (see §4.2);

where the bias

$$\mathrm{Bias}\left(\hat{\Delta}^M_{S,S'}\right) = \mathbb{1}\left[ \Delta^{M*}_{S,S'} \neq \Delta^{H*}_{S,S'} \right]$$

is 0 if the main prediction is correct (w.r.t. the true label), and 1 otherwise. Note that this term is also the true error of a metric estimator in a noise-free, infinite test set setting. For unbiased estimators this term is zero, as their main prediction matches the true label;

and where the variance

$$\mathrm{Var}\left(\hat{\Delta}^M_{S,S'}\right) = P\left( \hat{\Delta}^M_{S,S'} \neq \Delta^{M*}_{S,S'} \right)$$

is the likelihood that the estimator deviates from its main prediction under random sampling.
• $c_0 = 2P\left( \hat{\Delta}^M_{S,S'} = \Delta^{H*}_{S,S'} \right) - 1$, which means that the influence of label noise on the error becomes small if the estimator's predictions are close to random chance. When the estimator gives constant predictions, the sign of $c_0$ depends on whether the estimator is correct.
• $c_1 = 1$ if $\Delta^{M*}_{S,S'} = \Delta^{H*}_{S,S'}$ and $c_1 = -1$ otherwise. Variance can both increase and decrease the observed error. If the estimator is unbiased, variance causes the prediction to deviate from the correct main prediction, increasing the error. On the other hand, for a biased estimator, deviation from its incorrect main prediction occasionally decreases the error. (A minimal sketch of the full computation follows this list.)
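The promised sketch, for a single pairwise example; function and variable names are ours, and it assumes bootstrap samples of the predicted labels are already available (§4.3 describes how they are drawn):

```python
import numpy as np

def decompose_pairwise_error(metric_labels, human_labels, true_label):
    """Bias-variance-noise decomposition (Domingos, 2000) of the 0-1
    pairwise error for one example. `metric_labels` and `human_labels`
    are 0/1 arrays of bootstrapped predicted labels; `true_label` is
    the optimal constant label from eq. (12)."""
    metric_labels = np.asarray(metric_labels)
    human_labels = np.asarray(human_labels)

    main = int(np.mean(metric_labels) > 0.5)        # main prediction
    noise = np.mean(human_labels != true_label)     # label noise
    bias = float(main != true_label)                # 0 or 1
    var = np.mean(metric_labels != main)            # deviation from main

    c0 = 2 * np.mean(metric_labels == true_label) - 1
    c1 = 1.0 if main == true_label else -1.0

    # The observed error decomposes as c0 * Noise + Bias + c1 * Var.
    return {"noise": noise, "bias": bias, "var": var, "c0": c0,
            "c1": c1, "observed_error": c0 * noise + bias + c1 * var}
```

One can verify algebraically (two cases: main prediction correct or not) that the returned `observed_error` equals the expected 0-1 disagreement between independent metric and human labels.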
Unlike the decomposition for mean squared error, the interaction between the $c_0$ and Var terms only allows the error of two hypothetical settings to be read off directly from the table: when Noise → 0, corresponding to estimator error when computed against the ground truth; or when Noise + Var → 0, when the ground truth is used and metrics have access to an infinite test set for scoring.

A lower bound for the observed error
By definition, the constant estimator that produces the true pairwise labels $\Delta^{H*}_{S,S'}$ (defined in (12)) for each pairwise example has the lowest possible observable error. The observable error of this optimal estimator is

$$\mathbb{E}\left[ \mathbb{1}\left[ \Delta^{H*}_{S,S'} \neq \hat{\Delta}^H_{S,S'} \right] \right] = \mathrm{Noise}\left(\hat{\Delta}^H_{S,S'}\right).$$

Since this estimator is constant it has no variance, and since it is instantiated by definition it has no bias. Analytically, the observed error of any estimator is lower bounded by $\mathrm{Noise}(\hat{\Delta}^H_{S,S'})$, the rate at which our human predicted labels disagree with the ground truth.

Best-faith estimation with the bootstrap
Assuming the bootstrap (Efron and Tibshirani, 1993), a common procedure in NLP (Dror et al., 2018), we can estimate the expectation quantities in the decomposition. By assuming that sampling with replacement from our datasets approximates true sampling, we can repeatedly simulate the quantity inside an expectation. Taking the mean over trials gives the bootstrap estimate of the expectation. We emphasize that this is a regular application of a widely accepted technique; the bootstrap assumption allows us to study questions that would otherwise be impossible to answer due to the cost of repeat experiments.
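A sketch of how the bootstrap supplies the label samples consumed by the decomposition sketch above (names ours; it assumes per-judgment score arrays for each system):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_labels(scores_a, scores_b, trials=1000):
    """Resample each system's scores with replacement and recompute
    the predicted pairwise label, once per trial."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    labels = np.empty(trials, dtype=int)
    for t in range(trials):
        labels[t] = int(rng.choice(a, size=len(a)).mean()
                        > rng.choice(b, size=len(b)).mean())
    return labels

# e.g. the noise term is the bootstrap rate at which resampled human
# labels flip away from the true label:
# noise = np.mean(bootstrap_labels(human_a, human_b) != true_label)
```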

Results
The following analyses refer to the error components (averaged over all examples) from the simulated decomposition presented in Table 1.
The noise component almost always accounts for a small fraction of the total error. We found this counterintuitive: while the lowest observable error (optimal predictions, see §4.2) is about 5% on both datasets, the influence of the noise is much smaller than those errors suggest. For the constant $c_0$ scaling the noise, $c_0 \approx 0$ if the metric prediction is near random. Since the $c_0 \cdot \mathrm{Noise}$ term is small on average, one of two cases holds: when humans are uncertain about an example (large noise term), metrics are as well (small $c_0$ term), and when metrics are certain about an example (large $c_0$ term), humans are as well (small noise term). The second case empirically shows that studying the sampling distribution of metrics (Koehn, 2004; Berg-Kirkpatrick et al., 2012) is effective, as metric certainty in the difference of system quality often implies human certainty.
Metric variance contributes little to the pairwise error, because it is low. Put differently, metrics stand to gain little from using more test set examples. In MT, dropping both the noise and variance components from the error results in at most a 1 or 2 percent reduction in the observed error (see §9 for the implications for metrics research). Metrics generally have low variance, so at the test set sizes of WMT and SummEval, they are likely to have converged to their main predictions.

Comparing to the human estimator
In §4, several MT metrics approach the error of the WMT human evaluation. The WMT human evaluation is expensive, using thousands of judgments per translation system. While each human judgment has an associated monetary cost, once a large test set is collected, running metrics incurs only computational cost. This section explores this asymmetry and seeks to understand how much metric predictions are worth in terms of human judgments.

Noise-free, variance-free error estimates
We wish to give our best comparison between metrics and unbiased estimators (humans or the perfect annotator). Ideally, metrics would be given their best chance to perform by using an infinite test set. With the decomposition, we can adjust metric error estimates to a noise-free, infinite test set setting by taking only their bias component. For the human and perfect annotator estimators, we can adjust their errors to a noise-free setting by taking only the variance component. The following sections compare these adjusted errors.

Simulating the perfect annotator
While we can estimate the lower bound on the pairwise error for a given dataset (in §4.2), it is achieved by a constant estimator using system-level ground truth. Comparing segment-level metrics against the unbiased "perfect annotator", the best possible scorer at the segment level, is more informative. At a high level, we can simulate scoring with the perfect annotator at n judgments by using the human estimator at n' > n judgments, chosen to match the variance of the perfect annotator estimator.
Let's start from the unbiased human estimator $\hat{\mu}^H_S$ (2). Recall that the estimator is a sample mean, so its variance is $\mathrm{Var}(\hat{\mu}^H_S) = \mathrm{Var}(H(x))/n$. An insight from Chaganty et al. (2018) gives us the decomposition of the variance of H(x) via the law of total variance:

$$\mathrm{Var}(H(x)) = \mathbb{E}\left[\mathrm{Var}(H(x) \mid x)\right] + \mathrm{Var}\left(\mathbb{E}[H(x) \mid x]\right). \quad (15)$$

In words, the variance term can be thought of as the variance of each output sentence's true quality score (some translations produced by S are better than others), and the expectation term is the noise introduced by humans when estimating the quality of a sentence (human scores have mean-0 noise around an output's true quality score).
One intuition is that even if a perfect annotator gives the correct score for each sentence, every time, there is still some unavoidable variance in the estimator due to the variance of the true quality scores across outputs. To formalize this notion, let $P(x) = \mathbb{E}[H(x) \mid x]$ be the scoring function of a "perfect annotator", and let the estimator $\hat{\mu}^P_S$ be an empirical mean of n independent samples from P(x), similar to eq. (2). As a sample mean, $\mathrm{Var}(\hat{\mu}^P_S) = \mathrm{Var}(P(x))/n$. Relating this to (15),

$$\mathrm{Var}(H(x)) = \mathbb{E}\left[\mathrm{Var}(H(x) \mid x)\right] + \mathrm{Var}(P(x)), \quad (16)$$

and while Var(P(x)) is not directly observable, we can calculate Var(H(x)) with the sample variance over all the human judgments, and $\mathbb{E}[\mathrm{Var}(H(x) \mid x)]$ with a pooled variance over repeat human judgments on the same output sentence.
Our final step considers the efficiency ratio $r = \mathrm{Var}(H(x))/\mathrm{Var}(P(x))$. If we are interested in the perfect annotator estimator at n judgments, the human estimator at n' = rn judgments has variance

$$\mathrm{Var}\left(\hat{\mu}^H_S\right) = \frac{\mathrm{Var}(H(x))}{rn} = \frac{\mathrm{Var}(P(x))}{n} = \mathrm{Var}\left(\hat{\mu}^P_S\right),$$

and we invoke the central limit theorem to claim both $\hat{\mu}^P_S$ and $\hat{\mu}^H_S$ are normal. This completes our reasoning that, for scoring at the system level, sampling n' = rn human judgments is nearly equivalent to sampling n perfect annotator judgments. See Appendix C for step-by-step derivations of the perfect annotator variance in our datasets.
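A minimal sketch of this estimation (names ours; it assumes repeat judgments grouped by output sentence, with at least some outputs judged more than once):

```python
import numpy as np

def efficiency_ratio(judgments_by_output):
    """Estimate r = Var(H(x)) / Var(P(x)) from repeat judgments.
    `judgments_by_output`: one array of repeat judgments per output."""
    all_judgments = np.concatenate(judgments_by_output)
    var_h = np.var(all_judgments, ddof=1)             # Var(H(x))

    # Pooled within-output variance, estimating E[Var(H(x)|x)].
    groups = [np.asarray(g) for g in judgments_by_output if len(g) > 1]
    within = (sum((len(g) - 1) * np.var(g, ddof=1) for g in groups)
              / sum(len(g) - 1 for g in groups))

    var_p = var_h - within                            # by eq. (16)
    return var_h / var_p

# Sampling n' = r * n human judgments is then nearly equivalent to
# sampling n perfect annotator judgments.
```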

Results
The following analyses refer to the comparison of metric estimators to unbiased estimators at varying numbers of judgments for WMT in Figure 2.
Judgments from the perfect annotator have low variance, like those of professional linguists. While we do not have data from professional linguists, we can qualitatively compare them to the perfect annotator. A growing body of MT literature focuses on professional linguists (Freitag et al., 2020;Mathur et al., 2020b), and there are at least two known properties of their judgments: their judgments have better interannotator agreement (contain less noise), and they are more sensitive to linguistic phenomena. The perfect annotator has no noise, as they assign a constant score to each segment. However, the perfect annotator in WMT is better described as a noiseless crowdworker. With the biases of crowdworkers, the perfect annotator may not share the sensitivity property, and our use of crowdworkers may be biased w.r.t. professional linguists.
In terms of average pairwise error, MT metrics are equivalent to a high number of human judgments. Since the error of the human estimator monotonically increases as the number of judgments decreases, each MT metric has a break-even point. Metrics outperform human estimators that use fewer judgments than this threshold. BERTSCORE is as accurate as a human estimator with 600 judgments per system, or the perfect annotator estimator with 300 judgments, across the WMT dataset. We highlight the statistical advantage in variance that many metrics share, and that this advantage makes it possible for metrics to outperform humans, depending on which human estimator the metric is compared against. This is a consequence of the general fact that humans are unbiased, high-variance estimators, and metrics are biased, low-variance estimators, as depicted in Figure 1. For metrics such as BERTSCORE or CHRF, the bias is low as well, which gives them remarkably good error properties.

The limits of human evaluation
The perfect annotator provides optimistic figures for human annotation, giving the best performance for a fixed number of judgments and requiring the fewest judgments for a fixed performance. In §5, we saw that the perfect annotator is weak at low numbers of judgments, due to its non-zero variance. In this section we identify another consequence of the perfect annotator's variance: estimating small differences in system quality is hard.

Power analysis of the perfect annotator
The performance of an unbiased estimator depends on its variance and the effect size it is trying to detect. This section performs a power analysis to determine how much annotator effort is needed to reliably detect the correct pairwise judgment between two systems (Card et al., 2020). To make an optimistic estimate, we assume our annotator variance is close to that of a perfect annotator. We make two assumptions to apply a basic power analysis for estimating the difference in system quality between two systems: normality and equal variance across groups. For parameters α = 0.05 (false positive rate) and β = 0.95 (statistical power, i.e. the complement of the false negative rate), we can analytically compute the number of judgments needed to ensure our pairwise judgment is at least β(1 − α) ≈ 90% accurate. Table 2 contains power analyses for different instantiations of annotator variance and effect size.
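The calculation is the standard two-sample sample-size formula; a sketch (the annotator standard deviation below is hypothetical, chosen only to illustrate the order of magnitude discussed next):

```python
from scipy.stats import norm

def judgments_needed(effect_size, annotator_sd, alpha=0.05, power=0.95):
    """Per-system sample size for a two-sample comparison under
    t-test assumptions (normality, equal variance across groups)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (annotator_sd * z / effect_size) ** 2

# With a hypothetical perfect-annotator SD of 25 points on the 0-100
# DA scale, detecting a 1-point difference needs ~16K judgments per
# system, consistent with the >10K figure reported below.
print(round(judgments_needed(effect_size=1.0, annotator_sd=25.0)))
```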
In WMT, detecting a difference of 1 point requires at least 10K perfect annotator judgments, across the different instantiations of its variance. To put this in perspective, the top 5 zh-en translation systems in WMT19 differed by less than 3 points (Barrault et al., 2019). Depending on how much is paid per judgment, this cost can quickly become infeasible. The merit of such a task may be debated, as knowing that a small difference exists between two systems may not always be productive. From a scientific perspective, however, many NLG techniques yield small improvements, and not being able to detect small differences means we will not know whether these techniques are useful.

Metrics more easily achieve significance
Since metrics tend to have lower variance, metrics often achieve significance in estimating the difference in system qualities when humans cannot. For instance, BERTSCORE achieves significance in estimating quality differences on over half of the pairwise examples where humans do not (see Appendix E). In extreme cases, human evaluation is nearly as bad as flipping a coin, but the metric can still offer a consistent prediction between two systems. When comparing systems similar in quality, practitioners must accept that the number of possible analyses is limited. In ablation studies, where similar systems are often compared, metrics may be our only insight into system performance. With white-box metrics such as BLEU, value can be derived from qualitative insight (e.g. systems with high BLEU scores have high n-gram overlap with the reference set). In addition, we may qualitatively analyze output statistics not intended to correlate with human judgment at all (Neubig et al., 2019).
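As one concrete instantiation, a paired bootstrap test in the style of Koehn (2004) can declare significance from segment-level metric scores alone. The sketch below uses our own names, and the decision rule shown is one common convention, not necessarily the exact procedure used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def significantly_better(scores_a, scores_b, trials=1000, alpha=0.05):
    """Paired bootstrap: resample test-set indices jointly and check
    how often system A's mean segment score beats system B's."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    n = len(a)
    wins = 0
    for _ in range(trials):
        idx = rng.integers(0, n, size=n)  # same resample for both systems
        wins += a[idx].mean() > b[idx].mean()
    return wins / trials > 1 - alpha
```

Because metric scores have low variance, the resampled sign is consistent far more often than the resampled sign of human judgments on the same pair.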

Caveats to the analysis
Our analysis assumes that human judgments are unbiased. In WMT16-19, direct assessment (Graham et al., 2013) was used to elicit judgments from a combination of crowdworkers and researchers. Direct assessment (DA) uses an adequacy evaluation prompt ("Rate how much you agree that the output translation adequately expresses the meaning of the reference translation") and asks contributors to rate on a 0-100 scale.
The unbiased ground truth is not a fixed goalpost. A number of factors are known to change the eventual ranking of translation systems under human scoring. Employing a different collection methodology, such as human translation edit rate (HTER) instead of DA, can result in divergent system rankings. In an earlier edition of WMT, DA judgments were collected with both a grammaticality prompt and an adequacy prompt, corresponding to different system rankings by the respective attribute (Bojar et al., 2016a). Several studies have shown scoring differences between professional linguists and crowdworkers, due in part to the fact that linguists are more sensitive to linguistic phenomena (Fabbri et al., 2019; Freitag et al., 2019).
The goals of an evaluation should be decided by the practitioner. We do not give suggestions on any particular goals, and practitioners should understand what their application is, and which evaluation is the best approximation (refer to Gatt and Krahmer, 2018). Unfortunately, since the existing data in this domain is limited, our analyses are limited as well. However, the statistical techniques apply to any empirical method. We hope that our analysis inspires others to think about statistical limits in this domain.

Pushing the limits of evaluation
To push the limits of what can be evaluated, we need to improve on fundamental aspects of human evaluation. On the human side, we may focus on creating larger effect sizes or reducing noise by adopting new annotation schemes (Läubli et al., 2018;Shapira et al., 2019) or employing professional linguists (Fabbri et al., 2020;Toral et al., 2018). To make the human estimator more efficient, we may consider adaptive data collection techniques to stop data collection early when significance is achieved, in a statistically sound manner (Johari et al., 2017).
Strategies combining human and metric evaluation have also been shown to have potential. Variance reduction techniques can be applied to the human estimator by taking advantage of strong metrics (Chaganty et al., 2018). Another bottleneck in human evaluation is the random sampling of the test set. Metrics could form the basis of an importance sampling procedure that chooses test sets which best differentiate two systems, as a form of robust evaluation (Chaganty et al., 2017).
On the metric side, if we can reliably estimate metric bias, we can skip human evaluation altogether when the metric is known to be good. Probabilistic reinterpretations of current metrics could be a useful technique for confidence estimation (Keith and O'Connor, 2018). Optimistically, metrics could have provable guarantees, ensuring the correctness of metric decisions (Jia et al., 2019).

Best practices for metrics research
We reinterpret the problems with evaluating metrics via correlation (§2.2) as a set of guidelines for metrics research. To next year's organizers of the WMT metrics shared task and the broader metrics community, we suggest the following: (1) Pairwise accuracy has desirable properties as an evaluation measure for metrics. Our bias-variance-noise decomposition shows that the observed pairwise accuracy is very close to the true pairwise accuracy of a noise-free, infinite test set setting (§4.4). We suggest the use of pairwise accuracy as it reflects metric performance well (which may be verified using this analysis). As a normalized form of pairwise accuracy, Kendall's τ is also a suitable measure (see the relation after these suggestions).
(2) Since pairwise accuracy is computed against noisy human predicted labels, on average it should be impossible for metrics to achieve perfect accuracy. We suggest providing an upper bound on metric performance (§4.2) to clarify how much improvement is possible for metrics on a given dataset.
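Returning to point (1): when there are no tied system pairs, Kendall's τ is a linear rescaling of pairwise accuracy, since

$$\tau = \frac{C - D}{C + D} = 2 \cdot \frac{C}{C + D} - 1 = 2 \cdot \mathrm{accuracy} - 1,$$

where C and D are the numbers of concordant and discordant system pairs between the metric and human rankings.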

Related work
The fact that a manual evaluation can be weak, and that an automatic one can be better, is gaining attention in the metrics community. Mathur et al. (2020b) studied a disagreement between crowdworkers and metrics, where a re-evaluation favored the metrics over the human prediction. Recently, Freitag et al. (2021) showed that metrics can achieve higher agreement with professional linguists than crowdworkers do in judging translation systems. Their results fit into our formalization: if we assume professional linguists are unbiased, the bias and variance properties of metrics combined are superior to those of crowdworkers. Our analysis assumes that crowdworkers are unbiased, whereas theirs assumes professional linguists are instead.
We wish to highlight several works that inspired elements of ours. Chaganty et al. (2018) and Hashimoto et al. (2019) formalize metrics as statistical estimators and provide an understanding of their statistical properties and limits. In their replication of ImageNet, Engstrom et al. (2020) found that dataset bias accounted for classifier performance differences between the original and the replicated dataset, and provide a decomposition for the sources of error. In automated essay scoring, scorers are often evaluated against noisy human judgment, and Loukina et al. (2020) developed the PRMSE to calculate the MSE between scorer predictions and the true judgments, rather than noisy judgments. Finally, in bioinformatics, Li et al. (2020) derive an upper bound on the $R^2$ coefficient due to experimental noise when regressing on experiment-derived results.

Conclusion
Through rigorous comparison between metrics, humans, and the perfect segment-level annotator, we identify the settings where metrics outperform humans due to a statistical advantage in variance. These results challenge the notion that metrics are always secondary to human evaluation. Instead, we encourage practitioners to understand when human evaluation is weak, and when metrics are necessary. Finally, we hope to provide tools for analysis and future directions for evaluation.
A Equivalence between optimal prediction and true system differences

There is a slight difference between the definition of the true difference in (5), which we can alternatively state as

$$\delta^H_{S,S'} > 0, \quad (19)$$

and the definition of the optimal prediction $\Delta^{H*}_{S,S'}$ in (12), which is positive when

$$P\left( \hat{\delta}^H_{S,S'} > 0 \right) > \frac{1}{2}, \quad (20)$$

and the two are not immediately equivalent. However, if we assume that the central limit theorem applies (which can be reasonable as our sample means always have n > 100), then

$$\hat{\delta}^H_{S,S'} \sim \mathcal{N}\left( \delta^H_{S,S'}, \sigma^2 \right)$$

and

$$P\left( \hat{\delta}^H_{S,S'} > 0 \right) = \phi\left( \frac{\delta^H_{S,S'}}{\sigma} \right),$$

where φ is the CDF of the standard normal distribution. Since the standard normal is centered and symmetric, $\phi(x) > 1/2 \iff x > 0$. Together we have

$$P\left( \hat{\delta}^H_{S,S'} > 0 \right) > \frac{1}{2} \iff \delta^H_{S,S'} > 0,$$

where the left and right hand sides are equivalent to (20) and (19), respectively.

B Convergence of metric predictions to the main prediction
A key assumption in interpreting the results from the bias-variance-noise decomposition in §4 is that, as system-level metrics gain access to more outputs for evaluation, their predictions converge to the main prediction. For many metrics, the system-level score is the mean of the segment-level scores (e.g. BLEURT, BERTSCORE, ROUGE); this is true for all the summarization metrics we consider. For these metrics, assuming the central limit theorem allows us to prove convergence to the main prediction, similar to the proof in Appendix A. However, some MT metrics (BLEU, TER, and CHRF) are not simple averages of their segment-level scores, making them harder to analyze.
For system-level metrics that are not simple averages, we analytically observe that their aggregation method is similar to a mean (e.g. BLEU is a macro-average). We empirically verify that as a system-level metric evaluates more test set outputs, its pairwise predictions converge to the main predictions. Refer to Figures 3 and 4.

Figure 4: Average agreement of the main prediction with metric predictions evaluated on varying test set sizes in SummEval. The main predictions were derived from all of our data. Each point was estimated with 10K bootstrap trials. As the size of the test set increases, we see that the agreement monotonically increases. Note that all metrics are means of their segment-level scores.

C Derivations for the perfect annotator variance

Table 3: Step-by-step derivation for the efficiency ratio r (fourth row) of the perfect annotator estimator for WMT16-19 as defined in §4.1. Square roots are taken so that values are in terms of the original units (standard deviations; judgments range from 0-100). These were calculated on to-English data only.

                       Expert   Turker
Var(H(x))               0.717    0.745
E[Var(H(x)|x)]          0.293    0.475
Var(P(x))               0.655    0.574
Var(H(x))/Var(P(x))     1.201    1.686

Table 4: Step-by-step derivation for the efficiency ratio r (fourth row) of the perfect annotator estimator for SummEval as defined in §4.1. Square roots are taken so that values are in terms of the original units (standard deviations; judgments range from 1-5). Note that there is little agreement between experts and turkers at the system level.
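The table's arithmetic can be checked directly from the reported values (expert column shown; recall the first three rows are standard deviations):

```python
import numpy as np

var_h = 0.717 ** 2       # Var(H(x)), squaring the reported sqrt
within = 0.293 ** 2      # E[Var(H(x)|x)]
var_p = var_h - within   # Var(P(x)) by eq. (16)

print(np.sqrt(var_p))    # ~0.655, matching the Var(P(x)) row
print(var_h / var_p)     # ~1.201, the efficiency ratio r
```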

D SummEval analysis results
The main analyses in §5 and §6 are presented for SummEval here. When comparing expert humans to metrics, no metric comes close to expert performance at any number of expert judgments. For the power analysis, small differences are also hard to detect, similar to the findings in WMT. Note that while the perfect expert requires relatively fewer judgments than the perfect crowdworker in WMT, judgments from experts are likely to be much more expensive.

Table 5: Power analysis for the number of judgments needed from the perfect expert to give a pairwise judgment between two systems at .9 accuracy (α = 0.05, β = 0.95) under t-test assumptions (normality, equal variance) in SummEval. SummEval ratings are on a 1-5 scale, and the true segment quality standard deviation was 0.655. Darker cells indicate less feasible experiments, and the colors are set on a log scale.