Better than Average: Paired Evaluation of NLP Systems

Evaluation in NLP is usually done by comparing the scores of competing systems independently averaged over a common set of test instances. In this work, we question the use of averages for aggregating evaluation scores into a final number used to decide which system is best, since the average, as well as alternatives such as the median, ignores the pairing arising from the fact that systems are evaluated on the same test instances. We illustrate the importance of taking the instance-level pairing of evaluation scores into account and demonstrate, both theoretically and empirically, the advantages of aggregation methods based on pairwise comparisons, such as the Bradley–Terry (BT) model, a mechanism based on the estimated probability that a given system scores better than another on the test set. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics, we show that the choice of aggregation mechanism matters and yields different conclusions as to which systems are state of the art in about 30% of the setups. To facilitate the adoption of pairwise evaluation, we release a practical tool for performing the full analysis of evaluation scores with the mean, median, BT, and two variants of BT (Elo and TrueSkill), alongside functionality for appropriate statistical testing.


Introduction
Research is driven by evaluation results, with attention and resources being focused on methods identified as state of the art (SotA). The proper design of evaluation methodology is thus crucial to ensure progress in the field. In NLP, evaluation usually consists in comparing the averaged scores of competing systems over a common set of test instances. Indeed, averaging scores independently for each system and declaring the one with the highest average to be best is particularly simple, well understood, and mirrors the expected risk minimization paradigm used to train systems. Here, we critically assess the specific choice of the average to aggregate evaluation scores. In particular, we emphasize that there is a natural instance-level pairing between the evaluation scores of systems, which aggregation mechanisms such as the mean or median fail to take into account: as they produce a score for each system independently, systems that have the same set of scores (but potentially in different order) cannot be distinguished.

Figure 1: Evaluation scores of systems A, B, and C for five test instances. All systems have the same mean. C is better than A on all instances but one, so BT declares C > A. Also, B is better than A on all instances but one, so BT declares B > A, whereas the median of A is greater, and the means are the same. Overall, mean and median fail to capture the complex instance-level pairing.
Consider the three systems A, B, and C compared on five test instances in Fig. 1. Despite a complex pairing structure, they all have the same mean score across test instances. Moreover, even though B is better than A on all test instances but one, the median of A is greater than the median of B.
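To make the example concrete, here is a small sketch with hypothetical scores (chosen by us to satisfy the properties described for Fig. 1, not the figure's actual values): all three systems share the same mean, the medians would rank A first, yet instance-level comparisons rank both B and C above A.

```python
import statistics

# Hypothetical scores chosen by us to mirror the structure described in
# Fig. 1 (NOT the paper's actual values): equal means, A ahead on medians,
# but B and C each beat A on four of five instances.
scores = {
    "A": [10, 10, 10, 0, 0],
    "B": [4.5, 12, 11, 2, 0.5],
    "C": [7, 11, 10.5, 1.25, 0.25],
}

def wins(x, y):
    """Number of instances on which x scores strictly higher than y."""
    return sum(a > b for a, b in zip(x, y))

# All three systems have exactly the same mean score ...
assert all(statistics.mean(s) == 6.0 for s in scores.values())
# ... the medians rank A first (A: 10, B: 4.5, C: 7) ...
print({k: statistics.median(v) for k, v in scores.items()})
# ... but instance-level comparisons tell a different story:
print(wins(scores["C"], scores["A"]))  # 4 -> BT declares C > A
print(wins(scores["B"], scores["A"]))  # 4 -> BT declares B > A
print(wins(scores["B"], scores["C"]))  # 4 -> BT declares B > C
```

Any aggregation that looks only at each system's score distribution in isolation cannot see these pairwise relations.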
In this work, we discuss an alternative aggregation mechanism: the Bradley-Terry (BT) model (Bradley and Terry, 1952). BT compares systems for each test instance and estimates the latent strength of systems based on how frequently one system scores higher than another. Such paired mechanisms have already been successfully used to aggregate human judgments (Novikova et al., 2018; Sedoc and Ungar, 2020); for example, WMT evaluation protocols regularly employ TrueSkill (Herbrich et al., 2007), a Bayesian variant of BT (Sakaguchi et al., 2014).
Contributions. We contribute the first comprehensive analysis of the BT model (especially vis-à-vis mean and median) as an aggregation mechanism for comparing system scores in NLP.
(i) We illustrate the importance of accounting for instance-level pairing and discuss the conditions under which the mean, median, and BT disagree about the ordering of systems. In Sec. 3, we draw parallels with the field of statistical testing, where paired statistical tests are recommended when comparing paired variables. Thus, we argue that paired aggregation mechanisms such as BT are more robust alternatives to the mean and median. We support this argument with simulations in Sec. 4.
(ii) We show that the differences between mean, median, and BT matter in practice. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics, different aggregation mechanisms yield different conclusions as to which systems are SotA in about 30% of the setups (Sec. 5). These results hold when replacing BT by the Elo (Elo, 1978) and TrueSkill variants.
(iii) We discuss further advantages and potential limitations of BT, alongside possible resolutions, in Sec. 7.
(iv) We recommend replacing the mean by BT in future evaluations of NLP systems. To ease the adoption of more robust aggregation mechanisms, we release Pairformance, a practical tool for performing full analyses of evaluation scores with mean, median, BT, and two variants of BT (Elo and TrueSkill). The tool reports paired evaluation results alongside appropriate statistical testing for all five aggregation mechanisms, and offers various visualization functionalities to elucidate the pairing structure between system scores.
Code and data for replicating our analyses and experiments are available at https://github.com/epfl-dlab/BT-eval, and the Pairformance tool at https://github.com/epfl-dlab/pairformance.


Aggregation of evaluation results

In this section, we briefly present the three aggregation mechanisms we consider.

Terminology
A standard evaluation setup typically consists of four elements:
1. At least two systems, A and B, to compare, with latent strengths λ_A and λ_B that we aim to estimate.
2. A test set T = {(x_l, y_l) : l = 1, ..., n} consisting of n test instances, where x_l is the input and y_l is the ground-truth target output.
3. An evaluation metric M for scoring system outputs based on target outputs y_l, resulting in the sequence of evaluation scores M_A = (M(A(x_l), y_l) : l = 1, ..., n) for system A.
4. An aggregation mechanism Θ that decides whether system A is better than B based on the evaluation scores of the two systems.
We use Θ_{T,M}(A, B) = Θ(M_A, M_B) to denote the comparison mechanism between A and B on the test set T with evaluation metric M. Here, Θ outputs its guess about which system is best (or declares the comparison inconclusive if the difference is not statistically significant). For simplicity, we drop the dependency on T and M in the notation, simply writing Θ(A, B).
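This setup can be summarized in code; the names and types below are our own illustrative choices, not part of the paper's notation:

```python
from typing import Callable, Sequence

# Illustrative encoding of the setup above (names are ours, not the
# paper's): an aggregation mechanism Theta consumes the two paired score
# sequences M_A and M_B and outputs its guess about the better system.
Scores = Sequence[float]
Aggregator = Callable[[Scores, Scores], str]  # returns "A", "B", or "tie"

def theta_mean(m_a: Scores, m_b: Scores) -> str:
    """The standard mechanism: compare independently averaged scores."""
    mean_a, mean_b = sum(m_a) / len(m_a), sum(m_b) / len(m_b)
    if mean_a == mean_b:
        return "tie"
    return "A" if mean_a > mean_b else "B"

print(theta_mean([0.4, 0.6, 0.5], [0.3, 0.5, 0.4]))  # "A"
```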
For example in text summarization, x l is a source document from the test set, y l its corresponding reference summary, and M might be ROUGE (Lin, 2004). The decision mechanism Θ usually compares the individual systems' mean evaluation scores, where the system with the highest mean score (here mean ROUGE score) is declared better.

Consistent evaluation result.
We say that the outcome of such an evaluation is consistent if it recovers the ordering of systems implied by the inherent strengths of the systems, i.e., if Θ(A, B) declares A better than B exactly when λ_A > λ_B.

Probabilistic model. As commonly done in the literature on statistical testing, we view the evaluation scores of a system A as n indexed random variables X_A^(l), l = 1, ..., n, where n is the size of the test set. Note that this sequence of random variables is not necessarily i.i.d. Furthermore, even though systems A and B are independent, their evaluation scores are not, since there is an instance-level pairing. Intuitively, knowing the score of A on an instance (x_l, y_l) can provide information about the expected performance of B. For example, if A scores highly because (x_l, y_l) is an easy instance, one might expect B to also score highly.

Aggregation mechanisms
We now introduce three aggregation mechanisms Θ. We investigate their properties in subsequent sections.
Mean. This is the current standard: the system with the highest average score is declared the strongest. We denote this aggregation mechanism as MEAN. The average score of system A is computed as E_A = (1/n) Σ_{l=1}^n X_A^(l).

Median. The median is an interesting alternative to the mean because it is robust to outliers. Here, the system with the highest median score is declared to be the strongest. The median score M_A of a system A is the central value in the sorted list of evaluation scores of A. We denote this aggregation mechanism as MEDIAN.
Bradley-Terry. The third option examined here is the Bradley-Terry (BT) model (Bradley and Terry, 1952). While MEAN and MEDIAN compute scores for systems A and B independently, BT is a function of the joint random variable X^(l) = (X_A^(l), X_B^(l)). BT estimates the relative strengths λ̂_A and λ̂_B of the two systems A and B by comparing the evaluation scores for each test instance, modeling P(A > B) = λ_A / (λ_A + λ_B). Intuitively, P(A > B) is the probability that, for any given test instance, A scores higher than B. The BT model chooses λ̂_A and λ̂_B in order to best explain the observations. The system with the highest λ̂ is declared strongest. When considering only two systems, the latent strength λ̂_A is the number of instances for which A scores better than B (and similarly for λ̂_B). When the number of systems is greater than two, BT solves an iterative optimization algorithm that is guaranteed to converge to a unique solution (Bradley and Terry, 1952). We give details about BT and its computation in the general case in Appendix E.
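For two systems, the BT estimate thus reduces to counting wins, so P(A > B) can be estimated directly. A minimal sketch (ours; how ties are handled is our assumption, since the text does not specify it here):

```python
def bt_two_systems(m_a, m_b):
    """Closed-form BT estimate for two systems: strengths are win counts.

    A sketch of the two-system case described above; tie handling is our
    assumption (tied instances are simply ignored).
    """
    w_a = sum(a > b for a, b in zip(m_a, m_b))  # instances where A wins
    w_b = sum(b > a for a, b in zip(m_a, m_b))  # instances where B wins
    decided = w_a + w_b
    p_a_beats_b = w_a / decided if decided else 0.5
    return w_a, w_b, p_a_beats_b

# A wins on 3 of the 4 decided instances -> estimated P(A > B) = 0.75.
print(bt_two_systems([0.9, 0.8, 0.7, 0.2], [0.5, 0.6, 0.4, 0.6]))
```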
We denote as BT the decision mechanism based on the BT model. While it is much less common than MEAN and MEDIAN, we will see below that BT satisfies interesting properties making it a more robust alternative.
Proposition 1. Assuming λ_A > λ_B, each aggregation mechanism is consistent under the following condition: MEAN is consistent if and only if E_A − E_B > 0; MEDIAN is consistent if and only if M_A − M_B > 0; and BT is consistent if and only if M_{A−B} > 0, where E_S and M_S are the mean and median of the evaluation scores of system S, and M_{A−B} is the median of the differences between the evaluation scores of A and B. Note that E_S, M_S, and M_{A−B} are all random variables.
The proof is given in Appendix B. Note that, whereas the expectation is linear (E_{A−B} = E_A − E_B), the median is not: in general, M_{A−B} ≠ M_A − M_B.

Robustness to outliers. The mean is not robust to outliers: E_{A−B} can be swayed above or below the threshold of 0 by a small number of test instances for which the difference between system scores is large. On the contrary, the median is a robust statistic that cannot be easily influenced by outliers. Similarly, BT is robust to outliers because its decision is based on the median of differences M_{A−B}.

Importance of pairing. The critical difference between BT, MEAN, and MEDIAN is that only BT preserves the pairing information. Both MEAN and MEDIAN compute a statistic from the (unordered) set of scores X_A^(l) and X_B^(l) independently and then compare the aggregate statistics, losing the pairing structure. If the pairing actually does not matter, any permutation of the indices of system scores leaves the distribution of paired evaluation scores unchanged. This happens, for example, when the sequences X_A^(l) and X_B^(l) are i.i.d. and independent of each other. However, in the general case, the pairing matters. One particular example is when there exist different types of test instances and systems behave differently for different types, e.g., when there are easy instances on which all systems have higher scores. For example, consider the three systems and their evaluation scores on five test instances in Fig. 1. System A is worse than C on all instances but one, so C > A according to BT, yet the median of A is greater than the median of C (10 vs. 7). At the same time, B outperforms C on all instances but one, so B > C according to BT. For MEDIAN and MEAN, which ignore the pairing, A and B are completely equivalent, even though there is a clear difference regarding which system is more likely to be the best. This difference is revealed in the pairing structure. In general, any mechanism ignoring the pairing cannot capture the difference between A and B.
Choosing an aggregation mechanism. In Prop. 1, we stated the conditions for each mechanism to be consistent. Choosing an aggregation mechanism for a specific evaluation setup boils down to deciding what condition is more likely to hold in the setup. Note that none of the conditions implies any other condition in Prop. 1.
When comparing BT against MEAN (or MEDIAN), there are three possible scenarios: (i) BT agrees with MEAN (or MEDIAN), (ii) BT is consistent but MEAN (or MEDIAN) is not, and (iii) MEAN (or MEDIAN) is consistent but BT is not.
In case (i), it does not matter whether we use BT or MEAN (or MEDIAN).
In case (ii), for most instances, the better system has a higher score than the worse system, but MEAN (or MEDIAN) fails. For example, MEAN may be swayed by outliers, and MEDIAN may be swayed by jumps in score lists as in the example above.
In case (iii), for most instances, the better system has a lower score than the worse system, yet particular variations in the marginals make the MEAN or MEDIAN get the ordering correct. This is a very peculiar scenario: for MEAN, it implies that on the few instances on which the better system did better, the difference between evaluation scores was large enough to lift the mean of the better system above the other. We argue that if one really believes that the evaluation setup is likely to be in case (iii), then one does not trust the evaluation setup in the first place. It corresponds to assuming that the observed scores are inconsistent for the majority of test instances. If this is the case, one should rather improve the evaluation setup (e.g., metric, test set) in order to be more representative of the phenomena that one desires to capture.
Overall, the condition making BT consistent appears to be the most natural one. Trusting MEAN or MEDIAN more than BT implies holding an unintuitive belief about the evaluation setup, namely that the better system does worse than the worse system on a majority of test instances.
From another perspective, the random variables E_A − E_B (MEAN) and M_A − M_B (MEDIAN) are less likely to be (correctly) greater than zero in the presence of (i) complex pairing structures or (ii) outliers. The variable M_{A−B} (BT), on the contrary, is not affected by complex pairings or outliers.

Fig. 2 summarizes the problem of ignoring the pairing and offers a graphical criterion for understanding the decisions made by MEAN, MEDIAN, and BT. In each plot, the densities are estimated by placing test instances at coordinates given by the evaluation scores of the two systems. The evaluation scores of A (green) are on the x-axis, and the evaluation scores of B (blue) on the y-axis. We also plot the marginal distributions of evaluation scores, from which we can read off means and medians. When the mean of X_B^(l) is greater than that of X_A^(l), the two extended lines representing the means meet in the upper triangle (above the line X_A = X_B), and analogously for the median. But since mean and median are functions of the marginals only, they completely ignore the pairing. Fig. 2 illustrates this by depicting three completely different pairing structures where the marginals (and thus the means and medians) of A and B remain unchanged. (In Appendix A.1, we explain how to generate infinitely many such examples.) On the contrary, BT, being a property of the pairing (the 2D density), predicts that B is better than A when there is more mass in the upper triangle, i.e., more instances for which B scores higher than A. In the middle figure, the pairing indicates that A is better than B, in disagreement with the decisions of MEAN and MEDIAN.

Connection with statistical testing
The above discussion about the differences between MEAN, MEDIAN, and BT has interesting parallels with statistical testing.
When comparing the means of two systems over the same test set, the recommended statistical test is the paired t-test (Fisher, 1935). When comparing medians instead of means, the appropriate test is the sign test, which measures whether the median of the difference is significantly different from zero. Interestingly, the statistic of the sign test is precisely the one in the condition for BT to be consistent (see Prop. 1). Wilcoxon's signed-rank test (Wilcoxon, 1945) is often used as an alternative to the sign test because it has more statistical power (at the cost of making more assumptions). However, Divine et al. (2018) showed that Wilcoxon's signed-rank test does not always properly account for the pairing of data, unlike the sign test.

Figure 2: These 2D plots represent the distribution of test instances with coordinates given by the scores of the two systems being compared, i.e., the x-axis is the score X_A^(l) of system A on some test instance (x_l, y_l), and the y-axis is the score X_B^(l) of system B on the same instance. While the three plots represent different instance-level performances of A and B, the marginal (unpaired) distributions of the scores of A and B remain unchanged. From such 2D plots, not only do we see the global structure of the pairing between the scores of A and B, we can also read off the decisions of MEAN, MEDIAN, and BT based on simple geometrical criteria: (i) if the prolongations of the means intersect above the X_A = X_B line, then MEAN predicts that A is better; (ii) if the prolongations of the medians intersect above the X_A = X_B line, then MEDIAN predicts that A is better; (iii) if there is more mass in the upper-left triangle, then BT predicts that system A is better. The latter case corresponds to most of the test instances being located in the upper-left triangle (A > B). The half-space with more mass is shaded.
When performing statistical testing, it seems obvious that we should use the paired version of tests when the data is naturally paired (Rankel et al., 2011). Even works discussing statistical testing in NLP recommend Wilcoxon's signed-rank test (Graham, 2015; Owczarzak et al., 2012; Dror et al., 2018). Yet, to obtain aggregated scores for systems, the community still mostly uses aggregation mechanisms that ignore the pairing, such as MEAN. MEDIAN is the outlier-resistant version of MEAN, and BT is the paired variant of MEDIAN. Whenever one recommends a paired test of medians, such as the sign test or Wilcoxon's signed-rank test, to obtain p-values, one should use BT to compare system scores.
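All three tests are readily available in SciPy; the sign test can be implemented as a binomial test on the sign of the paired differences (using scipy.stats.binomtest, available in SciPy ≥ 1.7). A sketch on synthetic paired scores (values are illustrative, not from the paper):

```python
import numpy as np
from scipy import stats

# Synthetic paired scores: system A is slightly stronger than B on
# every instance (illustrative data, not from the paper's experiments).
rng = np.random.default_rng(0)
a = rng.normal(0.55, 0.10, size=200)      # scores of system A
b = a - rng.normal(0.03, 0.05, size=200)  # paired scores of system B

# Paired t-test: compares means while accounting for the pairing.
t_p = stats.ttest_rel(a, b).pvalue

# Wilcoxon signed-rank test: paired, rank-based alternative.
w_p = stats.wilcoxon(a, b).pvalue

# Sign test: a binomial test on how often A beats B; its statistic is
# exactly the quantity in BT's consistency condition (M_{A-B} > 0).
n_wins = int(np.sum(a > b))
n_decided = int(np.sum(a != b))
s_p = stats.binomtest(n_wins, n_decided, p=0.5).pvalue

print(t_p, w_p, s_p)  # all three are far below 0.05 on this data
```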

Simulations with synthetic data
Next, we perform simulations to extend the analysis of the previous section to (i) N > 2 systems, (ii) finitely many test samples, and (iii) a practical implementation of BT (for N > 2 systems, BT is an iterative optimization algorithm, as discussed in Appendix E).
We synthesize evaluation scores with various properties, starting from systems of predefined implicit strengths λ_i. To create situations where the pairing of evaluation scores matters, we introduce multiple test instance types. For each type, systems perform differently but still have the same relative strength (P(A > B)), differing only by an added offset. For example, the evaluation scores obtained by A and B could be sampled from N(λ_A, σ) and N(λ_B, σ) for one test instance type, and from N(λ_A + ε, σ) and N(λ_B + ε, σ) for another type, with ε being the offset. We sample evaluation setups by varying the following properties: the number of systems, the number of test instances, the percentage of outliers, the number of test instance types, and the level of noise. This results in 2,880 simulated evaluation setups. In Appendix A.2, we give the detailed algorithm and parameters used to generate the data.
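A minimal sketch of this generation process (parameter names are ours; the exact algorithm is given in Appendix A.2):

```python
import numpy as np

rng = np.random.default_rng(42)

def synth_scores(strengths, n_instances, n_types=3, max_offset=5.0, sigma=1.0):
    """Sample paired evaluation scores with instance types.

    A sketch of the generation process described above (parameter names
    are ours). Each instance gets a type-specific offset that shifts ALL
    systems equally, so relative strengths are preserved, but the
    pairing of scores now carries real structure.
    """
    types = rng.integers(0, n_types, size=n_instances)
    offsets = rng.uniform(0.0, max_offset, size=n_types)
    return np.stack([
        rng.normal(lam + offsets[types], sigma) for lam in strengths
    ])  # shape: (n_systems, n_instances)

scores = synth_scores(strengths=[0.0, 0.5, 1.0], n_instances=1000)
print(scores.shape)  # (3, 1000)
```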
In Fig. 3, we report Kendall's τ between the latent scores λ_i and the aggregated scores estimated by MEAN, MEDIAN, and BT. When the evaluation setup does not present any difficulty (Fig. 3(a, b)), all aggregation mechanisms work equally well (within each other's 95% error bounds), improving with more samples (Fig. 3(b)) and deteriorating with more systems (Fig. 3(a)). Unsurprisingly, MEAN fails in the presence of outliers, whereas MEDIAN and BT are unaffected (Fig. 3(c, e, f)).
When several types of test instances are considered, MEDIAN begins to fail (Fig. 3(d)), which is made worse when outliers are also present (Fig. 3(f)). Overall, BT is more robust and does not fail when the pairing matters (Fig. 3(g, h)).

Analysis of empirical data
In this section, we perform large-scale experiments using real evaluation scores from four NLG tasks. For summarization, we use the TAC-08,

Comparison of BT, MEAN, and MEDIAN
In Table 1, we report the disagreement between aggregation mechanisms over all the data with three measures: the percentage of pairs ranked in a different order (a rescaled version of Kendall's τ), the percentage of setups where the state-of-the-art (SotA) systems differ, and the percentage of setups where the top 3 systems differ (compared as sets). A significant fraction of pairs of systems (about 10%) are ranked differently by different mechanisms. More importantly, top systems are often different (in about 40% of setups for top 1 and 50% for top 3). We conclude that the choice of aggregation mechanism has a real impact on the evaluation outcome. The observed disagreement between the three aggregation mechanisms implies that we are not in the cases depicted by Fig. 3(a) and Fig. 3(b), i.e., the pairing matters and there are outliers in real data. In the next paragraphs, we break down the disagreement per evaluation metric, task, and test set size. Detailed results are provided in Appendix C.
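The first disagreement measure, the percentage of system pairs ordered differently by two mechanisms, can be computed as follows (our helper, not the paper's code; the aggregate scores are illustrative):

```python
from itertools import combinations

def pct_pairs_flipped(scores_x, scores_y):
    """Percentage of system pairs ordered differently by two aggregation
    mechanisms (a rescaling of Kendall's tau). Inputs map system names
    to their aggregate score under each mechanism; this helper is ours,
    not the paper's implementation.
    """
    systems = list(scores_x)
    flipped = sum(
        (scores_x[a] - scores_x[b]) * (scores_y[a] - scores_y[b]) < 0
        for a, b in combinations(systems, 2)
    )
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return 100.0 * flipped / n_pairs

# Illustrative aggregate scores: BT flips the order of s1 and s2.
mean_scores = {"s1": 0.41, "s2": 0.39, "s3": 0.35}
bt_scores = {"s1": 0.45, "s2": 0.52, "s3": 0.30}
print(pct_pairs_flipped(mean_scores, bt_scores))  # 1 of 3 pairs flipped
```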
Which metrics are impacted most? We report in Fig. 4(a) the percentage of disagreement between aggregation mechanisms per metric, averaged over datasets, when subsampling test sets of different sizes uniformly (see Appendix A.3 for details). While most metrics are available for all four tasks, METEOR and CIDEr are only available for the captioning task. Therefore, the observed disagreements for these metrics may be a feature of the task instead of the metrics.

Which tasks are impacted most? Fig. 4(b) summarizes an analysis as above, but across tasks instead of metrics. Again, to control for the fact that some tasks may have larger datasets, we subsample uniformly from various test set sizes. The results are averaged over evaluation metrics. Machine translation and summarization suffer the least, while dialogue and image captioning display larger disagreement between aggregation mechanisms. This suggests important future research directions to improve the evaluation setups in these tasks.
Importance of dataset size. In Fig. 4(c), we report disagreement across test set sizes, while averaging over datasets and evaluation metrics. It is reassuring to observe that with larger test sets, the different mechanisms tend to agree more, such that it matters less which one is actually chosen. However, for MEAN vs. BT and MEDIAN vs. BT, the disagreement does not continue to decrease below 10% with more test instances. For MEAN vs. MEDIAN, the disagreement is lower but exhibits the same behavior, never falling below a certain threshold.
Different perspectives on uncertainty. In standard evaluation setups, not only are system scores reported, but also whether the differences are statistically significant (Dror et al., 2018). Therefore, we ask how often differences that are statistically significant for one test are also statistically significant for another. The details of this experiment are presented in Appendix D and show, perhaps unsurprisingly, different behavior for different tests.
In particular, the paired t-test is the one that most often finds differences to be significant (for 41% of pairs); Mood's test, an unpaired test comparing medians, finds significance for only 21% of pairs; and the sign test and Wilcoxon's signed-rank test (related to BT) are in between (35% and 40% of pairs, respectively).

Sources of disagreement.
Based on the analysis of Sec. 3, we know that the difference between MEAN and MEDIAN is due to the presence of statistical outliers, while the difference between MEDIAN and BT is due to the presence of different test instance types (Fig. 3). With real NLP datasets, we observe both kinds of disagreement in Fig. 4, suggesting that outliers and instance-type effects are both at play.

Figure 4: (a) Disagreement per evaluation metric averaged over tasks and uniformly subsampled test set sizes.

Discussion
We briefly discuss some possible questions raised by the use of BT-like mechanisms, with more details provided in Appendices E, F, G, and H.
Extension to other evaluation setups. The experiments of Sec. 5 focus on reference-based NLG evaluation metrics. However, the arguments laid out throughout the paper apply beyond this setup. Any comparison of systems based on score aggregation can suffer from outliers and complex pairing structures (e.g., Fig. 2). Future work should replicate our experimental setup for reference-free NLG evaluation (Zhao et al., 2020), classification, or regression tasks.
Type imbalance. Imagine a test set with a majority of easy instances and few hard ones. A system A could perform slightly worse than B on easy instances but much better on hard ones, and will be declared worse by BT. If one views this decision as problematic, then one should probably acknowledge that the test set is not representative of what should be measured. If hard instances matter more, there should be a majority of them in the test set.
Hoping that MEAN will be swayed to output the intuitive ordering of systems from a minority of test instances is a peculiar expectation to have about the evaluation setup. To diagnose such pathological cases, our tool, Pairformance, offers the possibility to view pairwise plots (as in Fig. 2) and histograms of score differences. More generally, better aggregation mechanisms such as BT do not solve all potential problems of evaluation methodologies.
Other aspects (such as choosing evaluation metrics or meaningful, representative, and large test sets) are all independent of the choice of aggregation mechanism, but also critical to the quality of the evaluation.
Transitivity. BT is not computed independently for each system, and it can happen that adding or removing a baseline impacts the scores of other systems. We explain this phenomenon in Appendix F and show that it is rarely a problem in real data. More generally, we discuss the connection with Arrow's impossibility theorem in the context of the aggregation of social preferences (Arrow, 1950). The Pairformance tool gets around this difficulty by offering the possibility of analyzing each pair of systems independently.
Relaxing assumptions. BT assumes that the relative strengths of systems remain constant across test instances. This might not always be true, especially when some systems are crafted for some specific kind of instances but perform badly on others. In such cases, BT still produces meaningful and easily interpretable results but fails to capture the latent structure of system strengths. Several refinements of BT are possible; e.g., item response theory extends BT by modeling instance difficulty, and Elo and TrueSkill allow system strengths to be stochastic and vary across instances. These refinements come at the cost of introducing new parameters, and it remains unclear how to choose these parameters in practice. Future work should investigate systematic ways to choose these parameters.
Tool description. We release Pairformance, a tool for performing full diagnostic analyses based on an evaluation dataframe made of the evaluation scores of systems and baselines. It can perform the analysis based on MEAN, MEDIAN, BT, Elo, and TrueSkill. For each aggregation technique, it outputs a full pairwise analysis of all pairs of systems. For MEAN and MEDIAN, it compares score differences for pairs of systems. For BT, Elo, and TrueSkill, it estimates the probability that one system is better than another. All analyses are accompanied by appropriate statistical testing. See Fig. 5 for an example based on the BT mechanism. Furthermore, the tool can plot the histogram of paired score differences X_A^(l) − X_B^(l), enabling the identification of pathological patterns such as those discussed above.

Conclusion
We performed a critical assessment of the standard NLP evaluation methodology based on averaged scores, which ignores the natural instance-level pairing of evaluation scores when comparing systems. We showed the importance of the pairing and demonstrated the advantages of paired mechanisms such as Bradley-Terry (BT) over more standard aggregation schemes such as the mean or median. The choice of aggregation mechanism matters in real evaluation setups, and we therefore recommend BT as a robust aggregation mechanism. To facilitate adoption, we release Pairformance, a new tool to perform full analyses of system scores using BT and two of its variants, Elo and TrueSkill.

A Reproducibility
In this section, we give additional details to ensure the reproducibility of our experiments. Furthermore, the code and data to reproduce each figure and table of the main paper are available at: https://github.com/epfl-dlab/BT-eval.

A.1 Pairing examples
It is straightforward to generate examples where the marginal distributions of the evaluation scores of two systems remain unchanged even when the pairing varies.
To do so, one can define k types of test instances. For each type t_i, a system S has a distribution of scores for this type: N(λ_{t_i,S}, 1). So, for instances of type t_i, the system S has score λ_{t_i,S} in expectation, with a variance of σ² = 1. Similarly, another system B can have different λ_{t_i,B} parameters. An example is given in Table 2. Now, observe that permuting the columns of S without changing the row B leaves the marginal distributions of S and B unchanged but changes the pairing. Then, one can simply iterate over all permutations of the row S to obtain many different pairings with fixed marginal distributions.
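This construction is easy to verify in code: permuting one system's scores leaves its mean and median untouched while changing the number of instances on which it beats the other system (the score values below are our own illustration):

```python
import itertools
import statistics

# Verifying the construction above with illustrative values (ours):
# permuting the scores of S changes the pairing but not S's marginals.
s = [0.9, 0.7, 0.5, 0.3]   # scores of system S, one per test instance
b = [0.8, 0.6, 0.4, 0.2]   # scores of system B, kept fixed

for perm in itertools.permutations(s):
    # Mean and median of S are invariant under any permutation ...
    assert statistics.mean(perm) == statistics.mean(s)
    assert statistics.median(perm) == statistics.median(s)

# ... but the number of instances on which S beats B is not:
win_counts = {
    sum(x > y for x, y in zip(perm, b))
    for perm in itertools.permutations(s)
}
print(sorted(win_counts))  # [1, 2, 3, 4]: the pairing alone decides
```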

A.2 Simulation
We discuss the synthetic data and experiments depicted in Fig. 3.
To introduce pairing issues, we create a variable number of test instance types, N_types. For each test type, each system has a different distribution of scores. On test type t_i, the system s_j has a normal distribution of scores, N(λ_{i,j}, σ²), where we fix σ² = 1 throughout our experiments. For each system, the λ_{i,j} are sampled uniformly from [0, 1]. Depending on the values of λ_{i,j}, the score distribution of system s_j can become multimodal. When there is only one test type, the score of each system s_j is a normal N(λ_j, σ²). In that case, the pairing can be ignored, and MEAN and MEDIAN are expected to work well.
For outliers, we define f as the fraction of test instances on which systems' scores are not drawn from their score distributions. For such instances, we first draw the scores for each system according to its distribution and then perform a random permutation, so that each system receives a score that was not sampled from its own score distribution.
Then, we vary the number of systems present in the evaluation, N_sys, and the number of test instances, M. Each choice of N_types, f, N_sys, and M gives a dataframe corresponding to an evaluation setup on which we can compare MEAN, MEDIAN, and BT against the true latent strengths of systems λ_{i,j}. The evaluation measure (the y-axis in Fig. 3) is then Kendall's τ between the ordering resulting from MEAN, MEDIAN, or BT and the ordering resulting from the λ_{i,j}.

A.4 Implementations
We implement BT with SciPy and NumPy. For the statistical tests, we use the default implementations from SciPy. For Elo, we implement a wrapper around existing code: https://github.com/ddm7018/Elo. Similarly, for TrueSkill, we implement a wrapper around existing code: https://pypi.org/project/trueskill/.

B Proof of Proposition 1
Proof. We observe that the cases of MEAN and MEDIAN are direct by definition.

M_{A−B} > 0 is equivalent to saying that, for more than 50% of instances, X_A^(l) > X_B^(l), i.e., A is better than B on more than 50% of instances. On the other hand, BT correctly declares A better than B ⟺ P(A > B) > P(B > A) ⟺ P(A > B) > 1/2, i.e., A is better than B on more than 50% of instances. So, BT is consistent ⟺ A is better than B on more than 50% of instances ⟺ M_{A−B} > 0.

C Disagreement breakdown
Compared to the experiments in the main paper, we provide a more detailed breakdown of the disagreements in Table 3.

D Different view on uncertainty
As argued in the main paper (Sec. 3.2), the choice of aggregation mechanism bears strong similarities to the choice of statistical test. Thus, we measure in how many setups differences between systems that are statistically significant according to one test are also significant according to another.
We compare the paired t-test (usually used to compare means), Mood's median test, and the sign test (consistent with BT). We also add the Wilcoxon signed-rank test, as it has often been recommended in previous work (Owczarzak et al., 2012; Dror et al., 2018).
In Fig. 6, we plot the frequency with which test j yields a significant difference among the pairs of systems for which test i has already yielded a significant difference. The diagonal depicts the overall percentage of pairs of systems for which each test finds a significant difference. Note that the matrix is not symmetric.
Interestingly, when Mood's median test declares the difference between two systems significant, the paired t-test agrees 98% of the time and the sign test agrees 89% of the time. Mood's median test is thus the most restrictive, finding significant differences less often than the other two. In comparison, the sign test and the Wilcoxon signed-rank test find significant differences between systems much more frequently. In general, the paired t-test is the one finding differences the most frequently.
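All four tests are available in (or easily built from) scipy, which is what our tool relies on. The following sketch runs them on one pair of paired score vectors; the synthetic data and the construction of the sign test from a binomial test are our own illustrative choices (`binomtest` requires a recent SciPy):

```python
# Running the four paired significance tests on one pair of systems.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.55, 0.2, size=200)   # per-instance scores of system A
b = rng.normal(0.50, 0.2, size=200)   # per-instance scores of system B

p_ttest = stats.ttest_rel(a, b).pvalue     # paired t-test (compares means)
p_mood = stats.median_test(a, b)[1]        # Mood's median test
p_wilcoxon = stats.wilcoxon(a, b).pvalue   # Wilcoxon signed-rank test
n_wins = int((a > b).sum())                # sign test: wins of A vs. Binomial(n, 1/2)
p_sign = stats.binomtest(n_wins, n=len(a), p=0.5).pvalue
print(p_ttest, p_mood, p_wilcoxon, p_sign)
```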

E Details about the Bradley-Terry model
Given a pair of systems S_i and S_j, the Bradley-Terry model estimates the probability p_{i,j} that system S_i is better than system S_j from their relative strengths: p_{i,j} = λ_i / (λ_i + λ_j). BT estimates the parameters λ_i for each of the n systems from the observed evaluation results. We denote by ω_{i,j} the number of instances for which S_i scores higher than S_j. Note that, in our setup, there is one comparison per test instance. In the main paper, we said that the solutions for λ̂ are found in closed form for n = 2. When the number of systems is greater than 2, the parameters are estimated iteratively.
Figure 6: In this matrix, the cell in row i and column j indicates the frequency with which test j finds a difference significant among the pairs of systems for which test i has found the difference significant. For example, when Mood's median test finds a significant difference between a pair, 98% of the time the paired t-test also finds the difference significant.
Denote by W_i the number of comparisons in which system i is better: W_i = Σ_j ω_{i,j}. Then, the algorithm iteratively performs the following two updates at step t (the standard minorization-maximization updates for BT): λ_i^{(t+1)} = W_i / Σ_{j≠i} [(ω_{i,j} + ω_{j,i}) / (λ_i^{(t)} + λ_j^{(t)})], followed by the normalization λ^{(t+1)} ← λ^{(t+1)} / Σ_k λ_k^{(t+1)}. It can be shown that, starting from a random λ, this algorithm improves the log-likelihood at every iteration and converges to a unique maximum. For the practical implementation, only a threshold defining when to stop has to be decided. We stop iterating at step t when the new parameter vector remains close to the previous one: ‖λ^{(t+1)} − λ^{(t)}‖₂ < ε. Throughout our experiments, we always set ε = 1·10⁻⁹.
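A minimal numpy sketch of this iteration is given below. It is our own reconstruction rather than the released implementation; in particular, counting ties as half wins is an assumption we make here for completeness:

```python
# Minimal MM iteration for fitting Bradley-Terry strengths from paired scores.
# Ties are counted as half wins (our choice); eps matches the paper's threshold.
import numpy as np

def fit_bt(scores, eps=1e-9, max_iter=10_000):
    """scores: (M, n) array; scores[l, i] is the score of system i on instance l."""
    m, n = scores.shape
    wins = np.zeros((n, n))  # wins[i, j] = omega_{i,j}
    for i in range(n):
        for j in range(n):
            if i != j:
                wins[i, j] = ((scores[:, i] > scores[:, j]).sum()
                              + 0.5 * (scores[:, i] == scores[:, j]).sum())
    games = wins + wins.T        # total comparisons between each pair
    W = wins.sum(axis=1)         # W_i: total wins of system i
    lam = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # MM update: lam_i <- W_i / sum_j games_ij / (lam_i + lam_j)
        denom = (games / (lam[:, None] + lam[None, :])).sum(axis=1)
        new = W / denom
        new /= new.sum()         # normalization step
        if np.linalg.norm(new - lam) < eps:
            return new
        lam = new
    return lam

scores = np.array([[1, 2], [2, 3], [3, 1], [0, 1], [2, 4]], float)
lam = fit_bt(scores)
print(lam)  # the second system wins 4 of 5 comparisons
```

For n = 2 this recovers the closed-form solution, since the fixed point is λ_1/(λ_1 + λ_2) = ω_{1,2}/(ω_{1,2} + ω_{2,1}).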

F Transitivity with BT and Arrow's theorem
One possibly counter-intuitive behaviour of BT is that adding or removing a baseline can impact the scores and ordering of the other systems. For example, consider two systems A and B with the following scores: M_A = [1, 2, 3] and M_B = [2, 3, 1]. Then, BT identifies system B as better, with a relative strength of 2/3. Now suppose another system C is added with scores M_C = [3, 2, 1]; running BT on these 3 systems together gives the result that all systems have equal strength, so B is no longer seen as better than A.
We searched for triples of systems exhibiting this pattern in our data and could not find any as long as more than 10 test instances were used.
Can we hope to fix this weakness? Arrow's impossibility theorem says no (Arrow, 1950). Our setup closely matches the problem of aggregating social preferences from voters. In this context, Arrow (1950) proved that no aggregation mechanism with at least 2 voters and 3 alternatives can simultaneously meet the 3 following criteria: (i) unanimity: if every voter prefers X over Y, then the aggregation ranks X above Y; (ii) independence of irrelevant alternatives (IIA): the aggregated preference between X and Y should remain unchanged if voter preferences between other pairs change; and (iii) no dictator: the outcome is not decided by a single voter. In our framework, voters are test instances and preferences are given by the evaluation metrics. BT can fail on the second criterion, and MEAN and MEDIAN can be dictatorial (as seen in the paper). A way around this problem is to stick to pairwise comparisons of systems (n = 2) and use BT; in that case, there is no possibility for BT to fail on IIA.

G Variants of BT: Elo and TrueSkill
BT has been extended in various ways. We discuss here two important variants that we incorporate in our analysis tool: Elo and TrueSkill.

G.1 Elo ratings
The Elo rating (Elo, 1978) is a variant of BT with an online update rule, i.e., the rating of a system (player) is updated as new test instances (new games) arrive. Like BT, Elo computes the probability that system S_i beats system S_j. When the t-th test instance arrives, system S_i receives the score s_i and system S_j receives the score s_j; let δ_{i,j} be 1 if s_i > s_j, 0 if s_i < s_j, and 1/2 in case of a tie. We update the rating based on this observed outcome: R_i ← R_i + K (δ_{i,j} − Q_i/(Q_i + Q_j)), where K is a parameter that has to be chosen, R_i is the rating of system S_i, and Q_i = 10^{R_i/400} plays a role analogous to λ_i in BT. K controls how much each new instance can change the ratings. It can be shown that, implicitly, Elo corresponds to a version of BT where the strength of each system is represented by a normal distribution λ_i + ε_i, with ε_i ∼ N(0, σ²) and a variance σ² shared by all players (Elo, 1978). In our implementation, we provide the user with the ability to choose K and set it to 20 by default.
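One Elo step can be sketched in a few lines. This is an illustrative standalone implementation under the conventional chess scaling Q_i = 10^{R_i/400} (the base and scale are conventional choices, not prescribed above), not the wrapped third-party code our tool uses:

```python
# One Elo rating update for a single test instance (one "game").
def elo_update(r_i, r_j, outcome, k=20.0):
    """outcome: 1.0 if system i wins, 0.0 if it loses, 0.5 for a tie."""
    q_i, q_j = 10 ** (r_i / 400), 10 ** (r_j / 400)
    expected = q_i / (q_i + q_j)      # BT-style win probability for system i
    delta = k * (outcome - expected)  # K caps the per-instance rating change
    return r_i + delta, r_j - delta

r_a, r_b = 1000.0, 1000.0
r_a, r_b = elo_update(r_a, r_b, 1.0)  # A beats B on one test instance
print(r_a, r_b)  # 1010.0 990.0
```

With equal ratings the expected score is 1/2, so a single win moves each rating by K/2 = 10 points.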

G.2 TrueSkill
TrueSkill (Herbrich et al., 2007) is a Bayesian variant of the Elo rating system. It also updates the ratings of systems online, i.e., ratings change as new test instances arrive. The strength of a system S_i is now represented by a normal distribution N(λ_i, σ_i²); in contrast to Elo, each system has its own variance. The update follows Bayes' rule but is intractable in general, so message-passing approximations are typically employed.

H Comparison of Elo, TrueSkill, and BT
We repeat the experiments of Table 1 from the main paper, replacing BT with Elo and TrueSkill using their default parameters. The results are shown in Table 4. With Elo and TrueSkill, the same conclusions as in the main paper hold, i.e., paired aggregation mechanisms exhibit significant disagreement with MEAN and MEDIAN. Some discrepancies between BT, Elo, and TrueSkill remain, which calls for further investigation into which one to choose.