The Glass Ceiling of Automatic Evaluation in Natural Language Generation

Automatic evaluation metrics capable of replacing human judgments are critical to allowing fast development of new methods. Thus, numerous research efforts have focused on crafting such metrics. In this work, we take a step back and analyze recent progress by comparing the whole body of existing automatic metrics with human metrics. As metrics are used based on how they rank systems, we compare metrics in the space of system rankings. Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans. Automatic metrics are not complementary and rank systems similarly. Strikingly, human metrics predict each other much better than the combination of all automatic metrics used to predict a human metric. This is surprising because human metrics are often designed to be independent, to capture different aspects of quality, e.g., content fidelity or readability. We provide a discussion of these findings and recommendations for future work in the field of evaluation.


Introduction
Crafting automatic evaluation metrics (AEM) able to replace human judgments is critical to guide progress in natural language generation (NLG), as such automatic metrics allow for cheap, fast, and large-scale development of new ideas. The NLG field is thus heavily influenced by the set of AEM used to decide which systems are valuable. Therefore, a large body of work has focused on improving the ability of AEM to predict human judgments.
Human judgment data is typically employed to decide which metric to select, based on correlation analysis with human annotations (Owczarzak et al., 2012; Graham, 2015). In this work, we take a step back and investigate the relationship between existing AEM and human judgments globally. We do not make metric recommendations but reflect upon the global progress in the field of automatic evaluation. Our work is motivated by the findings of Fig. 1. It depicts the improvement over time, as new metrics were introduced, in the ability to fit human judgments when using all existing metrics as features. The fit is measured by the correlation with humans of a trained classifier in a 5-fold cross-validation setup. Surprisingly, we observe small marginal improvements and little progress over the years. Recent works emphasized the importance of viewing metrics in terms of how they rank systems instead of just comparing score values (Novikova et al., 2018; Peyrard et al., 2021; Colombo et al., 2022a). Indeed, not only is ranking a more robust framework of comparison, it is also more aligned with the way metrics are used: identifying and selecting the "best" system. Thus, we perform our analysis in the space of rankings, i.e., how do metrics rank systems? By analyzing 9 datasets covering 4 tasks and 270k scores, we make the following observations. Findings. (i) Automatic metrics are much more similar to each other, in terms of how they rank systems, than they are to human metrics. This means that AEM, even the more recent transformer-based ones, behave in practice much like the older ones (ROUGE and BLEU). (ii) This lack of complementarity results in the inability to fit human judgments even when all these metrics are taken together as features for a classifier predicting humans.
(iii) Quite surprisingly, different human dimensions -- different annotation guidelines such as readability or content fidelity -- are very predictive of each other, whereas AEM are much less predictive of humans. This finding is striking because human metrics are designed to capture different and independent aspects of quality, whereas AEM have been selected precisely for their ability to match humans. We would expect human metrics to be uncorrelated and automatic metrics to be highly correlated with humans, but we observe the opposite. First, it casts serious doubt on the ability of AEM to replace human judgments. Second, the correlation between independent human annotations of quality hints at some latent, inherent goodness of systems: good systems are good across different aspects, whereas bad systems are bad across all aspects.
Our findings have several consequences that can inform future research. Newly introduced metrics are not complementary to previous ones, resulting in small global improvements. As a way forward, we propose that research, instead of crafting metrics that merely maximize correlation with humans, focus on metrics that are also explicitly complementary to the set of existing metrics. This would enforce maximal marginal gain and ensure that the field, as a whole, makes progress towards capturing the complexity of human annotations.
For practitioners, it is common practice to report several AEM in the hope of getting a better view of system performance. However, reporting several metrics that all produce similar rankings does not bring useful additional information. With our proposal, reporting a set of complementary metrics would better serve the intended purpose.
To help research build upon our work and use our measure of complementarity, we make our code available at github.

Methodology
Terminology. Let X be the space of possible outputs for an NLG task. An NLG metric is a function m : X × X → R+ which, from a given textual candidate C ∈ X and a corresponding reference R ∈ X, computes a score m(C, R) reflecting the properties that C should satisfy (e.g., fluency, fidelity, ...). Of course, it is illusory to summarize subtle semantic properties with a single scalar, and one rather seeks metrics that are able to discriminate between different systems. In fact, crafted AEM are evaluated by comparison to human judgments: one usually computes rank correlations such as Kendall's τ, with higher correlations indicating that the AEM is a better replacement for the human metric.
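As a minimal illustration of this comparison, Kendall's τ can be computed over per-system scores with SciPy; the score values below are invented for illustration:

```python
from scipy.stats import kendalltau

# Hypothetical per-system scores (invented for illustration): one automatic
# metric and one human metric over the same five systems.
aem_scores = [0.42, 0.38, 0.51, 0.47, 0.33]
human_scores = [3.1, 2.8, 4.2, 3.9, 2.5]

# Kendall's tau compares the two induced orderings of the systems.
tau, p_value = kendalltau(aem_scores, human_scores)
print(f"Kendall's tau = {tau:.2f}")  # 1.00 here: both metrics order systems identically
```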
Encoding metrics with rankings. Since the usage of NLG metrics is to rank systems, we choose to represent an NLG metric, automatic or human, by the ranking it induces on a set of systems or of utterances. More formally, for N ≥ 1 NLG systems evaluated on a dataset made of K ≥ 1 utterances, there exists a natural ranking representation of m: each utterance k ∈ {1, . . . , K} induces a ranking σ^m_k ∈ R^N of the N systems, where σ^m_k(n) is the rank of system n ∈ {1, . . . , N}. The representation of m, noted σ^{m,S}, is the sum of these rankings over the utterances:

σ^{m,S} = Σ_{k=1}^{K} σ^m_k. (1)

We call this the System level representation. Symmetrically, each system n ∈ {1, . . . , N} induces a ranking σ^m_n ∈ R^K of the K utterances, where σ^m_n(k) is the rank of utterance k. The Utterance level representation of m is the sum of these rankings over the systems:

σ^{m,U} = Σ_{n=1}^{N} σ^m_n. (2)

Using the space of rankings has been shown to be more robust than using raw scores, as it is less sensitive to outliers and statistical variations (Novikova et al., 2017; Peyrard et al., 2021; Colombo et al., 2022a). Furthermore, this representation is closely tied to Borda counts, which enjoy good theoretical properties: the ranking induced by σ^{m,S} is a 5-approximation of the Kemeny consensus, which is a good notion of average in the symmetric group (Kemeny, 1959; Young and Levenglick, 1978; Coppersmith et al., 2006). It is moreover the fastest approximation of the Kemeny consensus, whose exact computation is NP-hard (Ali and Meilȃ, 2012).
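The System level representation can be sketched as follows; the score matrix is synthetic, and higher-is-better is an assumption made for the sketch:

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical score matrix for one metric m: K utterances x N systems.
K, N = 4, 3
rng = np.random.default_rng(0)
scores = rng.random((K, N))

# Each utterance k induces a ranking sigma_k of the N systems
# (rank 1 = best, assuming higher scores are better).
per_utterance_ranks = np.array([rankdata(-scores[k]) for k in range(K)])

# System level representation: sum of the rankings over utterances (Borda-style).
system_repr = per_utterance_ranks.sum(axis=0)
print(system_repr)  # one aggregate rank value per system
```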
Complementarity. We measure the complementarity between two metrics -- human or automatic -- by the average over utterances of the distance between their rankings of systems. Formally, for two metrics m_0 and m_1, complementarity is given by:

C(m_0, m_1) = (1/K) Σ_{k=1}^{K} d_τ(σ^{m_0}_k, σ^{m_1}_k), (3)

where d_τ is the normalized Kendall distance between the rank vectors. It is related to Kendall's rank correlation τ by d_τ = (1 − τ)/2. Similarly, we define the complementarity between a metric m_0 and a set of other metrics m := {m_i}_{i=1,...,l} as the average pairwise complementarity:

C(m_0, m) = (1/l) Σ_{i=1}^{l} C(m_0, m_i). (4)

Complementarity measures the extent to which a metric ranks systems differently from another metric or from a set of other metrics. Whether comparing two metrics or a metric with a set, it is a number between 0 and 1, where 0 indicates that the metrics rank systems in the exact same order and 1 indicates the exact opposite order. In between, it counts the number of inversions between the two rank lists, normalized by the number of possible pairs of systems.
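A minimal implementation of this complementarity measure, on toy rank values invented for illustration:

```python
import itertools
import numpy as np

def kendall_distance(r0, r1):
    """Normalized Kendall distance between two rank vectors:
    the fraction of discordant pairs (0 = same order, 1 = reversed)."""
    n = len(r0)
    discordant = sum(
        (r0[i] - r0[j]) * (r1[i] - r1[j]) < 0
        for i, j in itertools.combinations(range(n), 2)
    )
    return discordant / (n * (n - 1) / 2)

def complementarity(ranks_m0, ranks_m1):
    """Average over utterances of the Kendall distance between the
    system rankings of two metrics; inputs have shape (K, N)."""
    return float(np.mean([kendall_distance(a, b)
                          for a, b in zip(ranks_m0, ranks_m1)]))

# Toy rankings of N=3 systems on K=2 utterances (invented values).
m0 = np.array([[1, 2, 3], [1, 3, 2]])
m1 = np.array([[1, 2, 3], [3, 1, 2]])
c = complementarity(m0, m1)
print(c)  # 0.5: identical order on the first utterance, fully discordant on the second
```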

Dataset description
To ensure a wide coverage of NLG, we focus on four different problems, i.e., dialogue generation (using PersonaChat (PC) and TopicalChat (TC) (Mehri and Eskenazi, 2020)), image description

Figure 2: Complementarity. For each dataset, the pairwise complementarity between each pair of metrics, human and automatic, as computed by Eq. 3. These matrix plots are symmetric by design; metrics are ordered with the human ones first and the automatic ones after, and the red lines trace the limit between human metrics and AEM.

(Ng and Abrecht, 2015)), BERTScore (Zhang et al., 2019), MoverScore (Zhao et al., 2019). For MLQE, we solely consider several versions of BERTScore, MoverScore, and ContrastScore. The human evaluation criteria are specific to each dataset and will be identified by the prefix H:. Overall, our final datasets gather over 270k scores.

Experiments
Finding 1: Automatic metrics are much more similar to each other than they are to human metrics. In Fig. 2, we report the pairwise complementarity between each pair of metrics, as computed by Eq. 3, for both human metrics and AEM. When aggregated over pairs and over datasets, we obtain an average complementarity between: (i) two human metrics of .16 ± .01, (ii) two AEM of .20 ± .01, and (iii) a human and an automatic metric of .35 ± .02. Importantly, across datasets we observe low complementarity, i.e., strong similarity, between AEM, low complementarity between human metrics, but high complementarity, i.e., low similarity, between automatic and human metrics.
We draw two conclusions from this analysis: (i) AEM rank systems similarly to each other but (ii) differently from humans. There are some nuances across datasets. The effect described above is particularly strong in the Dialog, MLQE and SUM-Eval datasets. In particular, we notice that the TAC datasets, from the summarization task, have lower complementarity in general, meaning that all metrics, human and automatic, are more similar there. Indeed, many works have relied on these datasets to develop new metrics. Interestingly, the more recent REAL-SUM and SUM-Eval reveal much lower metric similarity.
Finding 2: Automatic metrics, even all combined, do not explain human metrics. If AEM are rather different from human metrics, we might wonder whether it is possible to get a good approximation of human judgments by combining existing AEM together. To account for possible correlations, we rely on XGBoost regressors with 5-fold cross-validation to predict human judgments. The training is performed on three different feature spaces: (i) AEM only, (ii) other human metrics only, and (iii) both sets of metrics combined. We compute the Kendall's τ between predictions and ground truths and report the results in Fig. 3. The plot confirms that AEM struggle to capture the subtleties of human judgment: correlations rarely exceed .4 on held-out data. In contrast, human metrics are much more predictive of each other, even though they are often supposed to capture different concepts. Finally, it is worth noting that adding AEM to human ones brings no marginal improvement in predictive power. These findings cast a shadow over recent progress in the field. In the next section, we discuss the implications and make a proposal for future work.
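The prediction setup can be sketched as follows on synthetic data; scikit-learn's GradientBoostingRegressor is used here as a stand-in for the paper's XGBoost regressor, and all feature values are invented:

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data: rows are scored items, columns are AEM scores,
# and the target is one human metric (weakly related to the first feature).
rng = np.random.default_rng(0)
X_aem = rng.random((200, 8))                      # 8 hypothetical automatic metrics
y_human = X_aem[:, 0] + rng.normal(0, 0.5, 200)   # noisy "human" scores

# 5-fold cross-validated predictions of the human metric from the AEM features.
model = GradientBoostingRegressor(random_state=0)
preds = cross_val_predict(model, X_aem, y_human, cv=5)

# Held-out agreement between predictions and the human metric.
tau, _ = kendalltau(preds, y_human)
print(f"held-out Kendall tau: {tau:.2f}")
```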

Discussion
Our analysis reveals that automatic metrics are not complementary, and recent automatic metrics actually capture the same properties of human judgments as older ones. Furthermore, the existing metrics are not strong predictors of human judgments. Quite surprisingly, human metrics, which are often designed to be independent of each other, end up being more predictive of each other than automatic metrics are. This mutual predictability of human metrics can be explained by the available datasets: when a system is good at extracting content, it is also often good at making the content readable; when a system is bad, it is often bad across the board in all human metrics. However, the fact that automatic metrics are less predictive than other human dimensions casts a shadow over recent progress in the field. It shows that the current strategy of crafting metrics with slightly better correlation than baselines against one of the human metrics has reached its limit, and some qualitative change is needed.
A promising strategy to address the limitations of automatic metrics is to report several of them, hoping that they will together give a more robust overview of system performance. However, this makes sense only if automatic metrics measure different aspects of human judgments, i.e., if they are complementary. In this work, we have seen that metrics are in fact not complementary, as they produce similar rankings of systems.
Proposition for future work. To foster meaningful progress in the field of automatic evaluation, we propose that future research craft new metrics not only to maximize correlation with human judgments but also to minimize similarity with the body of existing automatic metrics. This would ensure that the field progresses as a whole by focusing on capturing aspects of human judgments that are not already captured by existing metrics. Furthermore, reporting several metrics that have been demonstrated to be complementary could again become a valid heuristic to get a robust overview of model performance. In practice, researchers could re-use our code and analysis to enforce complementarity by, for example, requiring new metrics to have high complementarity with the existing ones, as measured by Eq. 3. Even though we have considered a representative set of automatic evaluation metrics, new ones are constantly introduced and could be added to such an analysis. Similarly, new datasets could be added to the analysis and impact the results. In an effort to make our findings relevant in the long run, we release an easy-to-use code base to replicate our analysis with new metrics and datasets.
Like the majority of analyses of automatic evaluation metrics, ours relies on the assumption that human judgments are valid and meaningful. However, some works have questioned the quality of human judgments in standard datasets.

A.1 Utterance level Representation
In the main paper, we focus on the System level representation. Each utterance k ∈ {1, . . . , K} induces a ranking σ^m_k ∈ R^N of the N systems, where σ^m_k(n) is the rank of system n. The System level representation of m is the sum of rankings over the utterances:

σ^{m,S} = Σ_{k=1}^{K} σ^m_k. (5)

In this supplementary material, we also provide an analysis at the Utterance level. Each system n ∈ {1, . . . , N} induces a ranking σ^m_n ∈ R^K of the K utterances, where σ^m_n(k) is the rank of utterance k. The Utterance level representation of m is the sum of rankings over the systems:

σ^{m,U} = Σ_{n=1}^{N} σ^m_n. (6)

A.2 A remark on the rank representations
For a given family of l ≥ 1 objects, the formal mathematical object describing a ranking is a permutation σ ∈ S_l which describes how the objects must be interchanged to be ordered. The set of permutations is a group in which the notion of mean is not straightforward, since the addition of two permutations is not a well-defined object. For a given family σ_1, . . . , σ_p, the classical surrogate is the Kemeny consensus, defined by:

σ* = argmin_{σ ∈ S_l} Σ_{i=1}^{p} d(σ, σ_i),

where d is the Kendall distance, given by the number of pairwise disagreements:

d(σ, σ') = |{(i, j) : i < j, (σ(i) − σ(j))(σ'(i) − σ'(j)) < 0}|.

However, computing a Kemeny consensus is an NP-hard problem (Bartholdi et al., 1989; Dwork et al., 2001). It turns out that the Borda count, defined as the sum of ranks induced by the permutations, is a very good approximation of the Kemeny consensus (Ali and Meilȃ, 2012), justifying our choices (5) and (6).
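Under these definitions, a brute-force Kemeny consensus and its Borda approximation can be sketched as follows, on toy rankings invented for illustration:

```python
import itertools

def kendall_distance(p, q):
    """Number of pairwise disagreements between two rank vectors."""
    n = len(p)
    return sum((p[i] - p[j]) * (q[i] - q[j]) < 0
               for i, j in itertools.combinations(range(n), 2))

def kemeny_consensus(rankings):
    """Brute-force Kemeny consensus: the rank vector minimizing the total
    Kendall distance. Exponential in the number of items; tiny l only."""
    n = len(rankings[0])
    best = min(itertools.permutations(range(1, n + 1)),
               key=lambda s: sum(kendall_distance(s, r) for r in rankings))
    return list(best)

def borda(rankings):
    """Borda count: order items by their summed ranks."""
    sums = [sum(r[i] for r in rankings) for i in range(len(rankings[0]))]
    order = sorted(range(len(sums)), key=lambda i: sums[i])
    ranks = [0] * len(sums)
    for pos, i in enumerate(order):
        ranks[i] = pos + 1
    return ranks

# Three toy rankings of three items: both methods agree on the consensus here.
rankings = [[1, 2, 3], [1, 3, 2], [2, 1, 3]]
print(kemeny_consensus(rankings), borda(rankings))
```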

B Extending Finding 1 using clustering analysis
In this section, we want to obtain a visual and interpretable representation of both automatic and human metrics to better understand their relationships. Formally, we study the abstract space of metrics when encoded at the System or Utterance level. We ask the following two questions:
• What is the effective dimension of this space?
• Are there clusters of metrics?

B.1 Representing the metrics in a 2D space
In Figure 4a and Figure 4b, we report the variance analysis given by a PCA (Jolliffe and Cadima, 2016) for each dataset at the System and Utterance levels, respectively. Analysis: We observe that only a few components (fewer than 6) are needed to explain over 80% of the variance. This behavior is typical of all considered datasets and can be observed when studying the ranks at both the System and Utterance levels.
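This variance analysis can be sketched as follows on synthetic rank data (all values invented; each row stands for a metric, each column for a system-level rank feature):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 12 "metrics" whose 20-dimensional rank representations
# are generated from only 3 latent directions plus small noise, mimicking
# a redundant family of metrics.
rng = np.random.default_rng(0)
base = rng.random((12, 3))
metrics = base @ rng.random((3, 20)) + 0.05 * rng.random((12, 20))

# Count how many principal components are needed for 80% of the variance.
pca = PCA().fit(metrics)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.80) + 1)
print(f"{n_components} components explain 80% of the variance")
```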
Takeaways: Automatic and human metrics present in our datasets can be represented in a low-dimensional space. This confirms the low complementarity already observed in the main paper: the effective dimension of the space of metrics is small. We will use the first two components in the next experiments to represent the metrics in a 2D space.

In Figure 5 and Figure 6, we represent all the considered metrics (both human and automatic) in the 2D space corresponding to the first two components of the PCA. We cluster the metrics with the Louvain algorithm (Blondel et al., 2008) applied to the similarity matrix between metrics. Analysis: From the figures, we observe a low number of clusters, i.e., two in most cases and at most three in the case of utterance level representations. When using the system level representation, the human metrics form their own cluster in all configurations except for FLICKR, where H:overall is in the same cluster as JS 2. We observe a similar trend when studying the utterance level representation: human metrics often belong to the same cluster, which contains a low number of automatic metrics. It is also worth noting that in most figures, human metrics are isolated.
Takeaways: This experiment further validates Finding 1: automatic metrics are much more similar to each other than they are to human metrics. The proposed procedure could be used in the future to characterize newly introduced metrics and obtain visual representations of the space of metrics.
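The clustering step can be sketched as follows; the similarity matrix is invented, and spectral clustering on the precomputed similarities is used here as a stand-in for the Louvain algorithm used in the paper:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Invented 4x4 similarity matrix between metrics (1 - complementarity):
# the two human metrics agree with each other, as do the two automatic ones.
names = ["H:fluency", "H:fidelity", "BLEU", "ROUGE"]
sim = np.array([
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.3, 0.2, 0.8, 1.0],
])

# Cluster directly on the precomputed similarity matrix.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(sim)
clusters = {name: int(label) for name, label in zip(names, labels)}
print(clusters)  # human metrics land in one cluster, automatic ones in the other
```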

C Further results for Finding 2
In this section, we provide further experiments that validate Finding 2 and provide a method for future research to better understand newly introduced metrics. Specifically, we aim to answer the following research questions:
• In Finding 1 we showed that human metrics carry different information than automatic metrics. How can we measure the amount of information missing from the automatic metrics?
• Which metric or group of metrics is the most useful to predict a given human metric?

C.1 Measuring the information missing in automatic metrics
In this subsection, we extend the results of Figure 3. We measure the ratio between the MSE of a linear regression trained with automatic metrics together with human metrics and that of a linear regression trained only with automatic metrics, for varying regularization coefficients. For each dataset, we provide the mean and variance corresponding to the prediction of the available human metrics. When only one human metric is available, the dataset is not considered.
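This ratio experiment can be sketched as follows on synthetic data (all values invented; an L1-regularized linear model stands in for the paper's setup):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (all values invented): the target human metric
# depends strongly on another human metric that the automatic features miss.
rng = np.random.default_rng(0)
X_aem = rng.random((300, 6))                  # "automatic metric" features
h_other = rng.random((300, 1))                # another "human metric"
y = 0.3 * X_aem[:, 0] + 2.0 * h_other[:, 0] + rng.normal(0, 0.1, 300)

def cv_mse(X, alpha):
    # 5-fold cross-validated mean squared error of an L1-regularized regression.
    scores = cross_val_score(Lasso(alpha=alpha), X, y,
                             scoring="neg_mean_squared_error", cv=5)
    return -scores.mean()

# Ratio of the (automatic + human) model's error to the automatic-only error.
ratios = {a: cv_mse(np.hstack([X_aem, h_other]), a) / cv_mse(X_aem, a)
          for a in (0.01, 0.1, 1.0)}
print(ratios)  # well below 1 for small alpha; exactly 1 once all coefficients are zeroed
```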
Observations: From Figure 7, we observe a strong decrease in error when adding human metrics to predict another human metric. When α increases, all the coefficients are set to 0, and the relative MSE is thus 0. It is worth noting that these observations hold for both the system and utterance level representations. When looking at the details per dataset, we observe a similar trend for all human metrics.

Takeaways: When predicting a specific human metric, other human metrics contain useful predictive information that is not present in the automatic metrics.

Figure 7: Human metrics contain useful information that is not in automatic metrics for predicting other human metrics. We report the ratio between the MSE of a linear regression trained with automatic metrics together with human metrics and that of a linear regression trained only with automatic metrics; panel (d) gives the details for each dataset when using the utterance level representation. For each dataset, we provide the mean and variance corresponding to the prediction of the available human metrics.
C.2 Which metrics are the most useful to predict human judgment at the System level?
For this experiment, we rely on a Lasso regression and denote by α the multiplier of the L1 term. For several values of α (x-axis), we report each metric's weights (y-axis) in Figures 8 and 9. Observations: When increasing the weight of the L1 penalization term, we observe that the regression weights of the human metrics are the last to be set to 0: human metrics contain the most relevant information. It is worth noting that this phenomenon is generic across datasets and human criteria. Takeaways: Human metrics are the most useful metrics when predicting other metrics.

Figure 9: Human metrics are the most useful metrics when predicting other metrics. Regression weights (y-axis) obtained by each metric when training a Lasso regression to predict a human metric for different regularization coefficients (x-axis) on the system level representation of the metrics.
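This regularization-path experiment can be sketched as follows on synthetic data (all values invented; column 0 plays the role of a highly informative human metric among weakly informative automatic ones):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in: one strongly informative "human" feature (column 0)
# among several weakly informative "automatic" ones.
rng = np.random.default_rng(0)
X = rng.random((300, 5))
y = 2.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.1, 300)

# Regularization path: refit the Lasso for increasing alpha and watch
# which coefficients survive the L1 penalty the longest.
alphas = [0.001, 0.01, 0.05, 0.1]
path = {a: Lasso(alpha=a).fit(X, y).coef_ for a in alphas}
for a in alphas:
    print(f"alpha={a}: weights = {np.round(path[a], 2)}")
# At the largest alpha, only the informative feature keeps a nonzero weight.
```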