Re-Examining Summarization Evaluation across Multiple Quality Criteria



Introduction
The broad interest in text generation has triggered significant progress in the development of respective automatic evaluation metrics, complementing costly manual evaluation. To assess and rank automatic metrics, the conventional setup compares the ranking of various systems according to metric scores against their ranking according to human scores, where a high correlation between the two indicates an effective metric. In some cases, a task requires evaluation along multiple quality criteria (QCs), as is prominently the case for text summarization, where common criteria include coherence, fluency, faithfulness and saliency. In such cases, comparing metric scores to human scores separately for each criterion should supposedly determine which metrics are most suitable for evaluating each respective criterion.

Table 1: Correlations (Kendall's τ) between metric and human scores for each QC, as taken from Fabbri et al. (2021). Only the best metric in each QC is presented.
To enable such a metric evaluation procedure, some annotation efforts have been conducted (e.g., Bhandari et al., 2020; Fabbri et al., 2021), where system summaries were manually scored according to several QCs. This yielded a "gold ranking" of systems per QC, to which metric rankings can be compared. The recent SummEval benchmark (Fabbri et al., 2021) has attracted interest due to its thorough data collection of 1,700 annotated system summaries over 17 systems. All summaries were rated on 4 different QCs: Relevance, Consistency, Fluency and Coherence (clarified in §2). Within this benchmark, the majority of the evaluated metrics were primarily intended to assess Relevance, with a smaller subset comprising statistical analysis-based metrics, such as summary length, which were not specifically designed to measure any of the four aforementioned QCs. In our work, we focus on the major group of metrics that were designed to measure Relevance. Table 1 presents correlations between the best evaluation metrics in each QC and the SummEval human scores. The full table can be found in Table 6 in Appendix A.1.
Surprisingly, a closer examination of metric performance in SummEval reveals that many of the high correlations are unlikely to reflect a true fit between the best performing metrics and the respective QC. For example, ROUGE-1 (Lin, 2004), which measures the lexical overlap of unigrams between the system and reference summaries, appears to perform well as a fluency metric. Yet fluency is typically not determined by unigrams alone, and should not depend on the reference summary at all. This raises two questions: (1) what caused these high correlations, and (2) when can correlations be trusted?
In this paper we address these two questions with thorough analyses, providing two contributions. First, we point out that the conventional scheme for metric ranking is problematic in the case of multiple quality criteria. Specifically, we found that even the best metrics for each QC fail to penalize system summaries containing corruptions that should adversely affect that QC, suggesting that these metrics are ineffective at measuring the intended QC. In addition, we show that a multi-QC setup like SummEval induces spurious correlations, caused by high performance correlations across criteria. Second, we suggest a method for detecting metric-to-human correlation scores that are suspected as spurious, by removing the effect of a confounding variable, which reveals a performance degradation in many cases. This provides a first step toward coping with this obstacle, while calling for further research on metric evaluation in the multi-QC case.

Background and Related Work
A high-quality summary should satisfy several requirements, including preserving the salient information of the source, being faithful to the source, and being coherent. The DUC benchmarks (NIST, 2002) made an early attempt at capturing these requirements with a human evaluation procedure that assessed several readability qualities as well as content responsiveness (Dang, 2005). Later, additional benchmarks focused separately on content quality (e.g., Bhandari et al. (2020), TAC (NIST, 2008)) or linguistic quality (Chaganty et al., 2018). Kryscinski et al. (2019) laid out four Quality Criteria (QCs) on which to assess summary quality, which were reinforced in the SummEval summary evaluation benchmark (Fabbri et al., 2021), annotated over the CNN/DailyMail (Nallapati et al., 2016) summarization dataset.
Described briefly, the four QCs are: Relevance, measuring the importance of summary content with respect to the source; Consistency, measuring the faithfulness of the summary content to the source; Fluency, measuring the linguistic quality of individual sentences in the summary; and Coherence, measuring the quality of the collective structure of the sentences in the summary. A summarization system is rated for each QC by averaging human scores (on a 1-to-5 scale) over all input instances. Thus, multiple systems can be scored and ranked separately for each QC in accordance with the respective human scores.
To assess the effectiveness of an automatic evaluation metric, it is first applied to the outputs of several summarization systems, yielding a score per system (averaged over all instances). The metric's performance for a certain QC is then measured as the correlation between these metric scores and the corresponding human scores for that QC. This meta-evaluation procedure makes it possible to rank multiple evaluation metrics separately for each QC, as illustrated in Table 1. In our work, we question whether metrics ranked highly on a particular QC are indeed suitable for evaluating that QC, claiming that such correlations with human scores are not necessarily indicative in the multi-QC setting.
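As an illustration, the system-level variant of this procedure can be sketched as follows. This is our own minimal sketch (function names and toy data are ours, not SummEval's code; the actual benchmark uses 17 systems with 100 instances each):

```python
# Sketch of system-level meta-evaluation: average per-instance scores per
# system, then correlate metric rankings with human rankings (Kendall's tau).
import numpy as np
from scipy.stats import kendalltau

def system_level_correlation(metric_scores, human_scores):
    """metric_scores, human_scores: arrays of shape (n_systems, n_instances).
    Average over instances per system, then compute Kendall's tau."""
    metric_per_system = np.mean(metric_scores, axis=1)
    human_per_system = np.mean(human_scores, axis=1)
    tau, _ = kendalltau(metric_per_system, human_per_system)
    return tau

# Toy example: 3 systems, 4 instances each.
metric = np.array([[0.30, 0.35, 0.32, 0.31],
                   [0.50, 0.55, 0.52, 0.51],
                   [0.40, 0.45, 0.42, 0.41]])
human = np.array([[2, 3, 2, 3],
                  [5, 4, 5, 4],
                  [3, 4, 3, 4]])
print(system_level_correlation(metric, human))  # 1.0: metric ranks systems exactly as humans do
```

A metric is then deemed effective for a QC when this correlation against that QC's human scores is high.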
Some studies have examined other aspects of this meta-evaluation scheme. Peyrard (2019) and Bhandari et al. (2020) showed that meta-evaluation using old systems or datasets, like those of DUC and TAC, yields erroneous trends on modern summarization benchmarks. Deutsch et al. (2021) investigated the preciseness of correlations between metrics and human annotations in meta-evaluation benchmarks, and proposed approaches to improve the level of confidence. Finally, Deutsch et al. (2022) discussed ways of improving the reliability of system-level correlations in meta-evaluation.

Metric Robustness to Summary Corruptions
In this section, we investigate whether a metric that was found to correlate strongly with human scores for a certain QC in SummEval reliably measures that specific QC. In particular, we suggest that the performance of metrics that were designed to measure Relevance on other QCs (e.g., using ROUGE to measure Fluency) may be questionable. To examine this, we artificially corrupt SummEval (Fabbri et al., 2021) system summaries in various forms, each specifically designed to degrade a single QC. Each corruption is expected to have a large impact on the corresponding QC measurements, while having minimal effect on other QCs. Accordingly, we may conclude that a metric that fails to penalize corrupted summaries does not actually measure the specific QC for which the corruption was introduced.
In what follows, we experiment with the following corruptions of the system summaries, one per QC (except for Relevance, which is the QC most metrics were designed to capture). Fluency: all verbs are replaced with their lemma form, resulting in ungrammatical sentences. Coherence: sentences are randomly shuffled to disrupt the structure of the summary; this corruption is inspired by the Shuffle Test (Barzilay and Lapata, 2008), used to evaluate whether models can detect incoherent text. Consistency: all PERSON named entities (identified with SpaCy NER; Honnibal et al., 2020) are replaced with different PERSONs from the source document; this is in fact a common factual mistake of models (Pagnoni et al., 2021). An example of each corruption type can be found in Table 9 in the Appendix.
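The three corruptions can be sketched roughly as follows. This is our own simplified illustration: the paper's implementation relies on spaCy for verb lemmatization and PERSON NER, which the toy lookup tables below merely stand in for.

```python
# Minimal sketches of the three QC-specific corruptions. Toy dictionaries
# stand in for spaCy lemmatization and NER; only the shuffle is faithful.
import random

def corrupt_coherence(summary, seed=0):
    """Coherence corruption: randomly shuffle sentence order (Shuffle Test)."""
    sentences = [s.strip() for s in summary.split('.') if s.strip()]
    random.Random(seed).shuffle(sentences)
    return '. '.join(sentences) + '.'

def corrupt_fluency(summary, verb_lemmas):
    """Fluency corruption: replace each inflected verb with its lemma."""
    return ' '.join(verb_lemmas.get(tok, tok) for tok in summary.split())

def corrupt_consistency(summary, person_map):
    """Consistency corruption: swap each PERSON entity for another person
    mentioned in the source document."""
    for name, other in person_map.items():
        summary = summary.replace(name, other)
    return summary

summary = "Alice won the race. She set a new record. The crowd cheered."
print(corrupt_coherence(summary))
print(corrupt_fluency(summary, {"won": "win", "cheered": "cheer"}))
print(corrupt_consistency(summary, {"Alice": "Bob"}))
```

Each corruption leaves the summary's content words largely intact, which is precisely why lexical-overlap metrics may fail to penalize it.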
To test the meta-evaluation of metrics presented in SummEval, we examine the sensitivity of the best metric for each QC, according to the SummEval results (from Table 1), to the QC-specific corruption. More specifically, for each QC, we ranked all systems with the best metric, corrupted each system in turn, and examined whether that system's ranking was subsequently downgraded below other systems that were initially scored significantly lower. We found that none of the corrupted systems were downgraded in such relative rankings with the Coherence and Fluency corruptions, while only two (out of 17) were downgraded with the Consistency corruption. Assuming our drastic corruptions should have caused a ranking downgrade, these results indicate that the top performing metrics for Coherence, Fluency and Consistency were mostly unable to penalize the corrupted system summaries, suggesting they are not sufficiently reliable at measuring these QCs.
To validate the aforementioned assumption, that our corruptions should actually cause a ranking downgrade, we manually annotated a sample of 9 systems that contain corrupted summaries and found that all system rankings were downgraded after corruption (more details can be found in Appendix B). Since the best automatic metric did not reflect this change in ranking, we conjecture that SummEval meta-evaluation scores, even when they appear high, are not reliable. In the next section, we investigate the possibility that this is caused by spurious correlations, and propose a possible means to identify them.

Analysis of Spurious Correlations
A possible explanation for the contradictory finding that "high performing" metrics fail to penalize corrupted summaries is the high correlation observed between human scores of different QCs (termed correlation_human), as seen in the system-level scores in Table 2 (left-most figure in each cell). As a result, high correlations between metrics and human scores on all QCs (termed correlation_metric) are usually due to a high correlation_metric with one confounding QC, combined with the strong correlation_human between this confounding QC and the remaining QCs. As all of the aforementioned best metrics were initially designed to measure Relevance, we conjecture that the Relevance QC acts as a main confounding variable, making the other correlations_metric spurious. Although spurious correlations_metric can be found in different setups, they are more likely in multi-QC setups, like summarization. In such a setup, models are typically optimized to excel in all QCs, resulting in systems that improve across all QCs and consequently yield high correlations_human among these QCs.
Next, we suggest a systematic method to detect a confounding QC that undermines correlations_metric with other QCs. To that end, we propose to remove the effect of each QC, in turn, on the remaining QCs by annulling the correlations_human between them. To do this, we calculate correlations_metric over small subsets (i.e., buckets) of instances in which the annulled QC has low variance. We show that for most metrics, when the correlation_human to Relevance is annulled, the correlations_metric to Fluency, Coherence and Consistency drop drastically, while the correlation_metric to Relevance is immune to similar manipulations of the other three QCs. This suggests that Relevance is indeed a confounding factor for the other QCs.
To compute correlations_metric within buckets, we first note that the original SummEval correlations_metric were calculated at the system level. That is, the 100 instance scores per system are averaged to produce 17 data points for correlation. However, dividing these 17 data points into buckets results in statistically unreliable correlation scores inside each bucket, due to the limited number of data points in it. To address this, we utilized the instance-level correlation approach (Bojar et al., 2017; Freitag et al., 2021), which incorporates all 1,700 instance scores as individual data points, without averaging per system. Dividing the entire 1,700 instances into buckets ensures a sufficient number of data points within each bucket. Since confounding factors are inherent to the data, identifying a confounding variable at the instance level implies its presence at the system level as well.
Inspired by stratification analysis (Mantel and Haenszel, 1959), in order to remove the effect of a particular QC (termed the anchor QC) and assess its potential to be a confounding factor, we divide the system summary instances, which are associated with human score tuples of <Relevance, Consistency, Fluency, Coherence>, into buckets. Each bucket contains tuples with roughly equal scores for the anchor QC. Since scores are on a 1-to-5 scale, we use 5 buckets. As an example, if we anchor Relevance, the first bucket will contain tuples with Relevance ≈ 1. Accordingly, the correlation_human inside each bucket between the anchor QC's human scores and each other QC degrades substantially. As Table 2 shows, averaging these low correlations_human over all 5 buckets, weighted by bucket size, results in a "bucketing" value in each cell that reduces the initial instance-level correlations_human between QCs by 2-5 times.
Next, we used this approach to calculate the correlations_metric inside each bucket, thus neutralizing the effect of the anchor QC. Again, the five bucket correlations_metric are averaged, weighted by bucket size. Finally, to measure whether the obtained bucketing value has changed significantly with respect to the original instance-level correlation_metric, we calculate the absolute relative difference between the two scores. A high relative difference means that the correlation_metric has changed significantly after removing the anchor QC. This undermines the reliability of the original correlation_metric scores, and suggests the anchor QC as a confounding factor. While our work does not provide a specific threshold for detecting a spurious correlation_metric based on relative difference, this process does alert to potential unreliability when the relative difference is relatively high.
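The bucketing computation and the relative-difference flag can be sketched as follows. This is our own simplified sketch under our own naming (the toy data is constructed, not from SummEval), but it illustrates the mechanism: a metric can correlate with a target QC overall yet anti-correlate with it once the anchor QC is held fixed.

```python
# Sketch of the bucketing analysis: within each bucket the anchor QC is
# (roughly) constant, so its confounding effect on the correlation is removed.
import numpy as np
from scipy.stats import kendalltau

def bucketed_correlation(metric_scores, target_human, anchor_human):
    """Average of within-bucket Kendall's tau, weighted by bucket size.
    Buckets group instances whose rounded anchor-QC score is equal."""
    buckets = np.round(anchor_human).astype(int)
    weighted_sum, n = 0.0, 0
    for b in np.unique(buckets):
        idx = buckets == b
        tau, _ = kendalltau(metric_scores[idx], target_human[idx])
        if np.isnan(tau):  # skip degenerate buckets (e.g., constant scores)
            continue
        weighted_sum += tau * idx.sum()
        n += idx.sum()
    return weighted_sum / n if n else float('nan')

def relative_difference(original_tau, bucketed_tau):
    """Absolute relative difference used to flag a suspect correlation."""
    return abs(abs(original_tau) - abs(bucketed_tau)) / abs(original_tau)

# Toy confounded data: the metric tracks the anchor QC across buckets, but is
# anti-correlated with the target QC once the anchor is held fixed.
anchor = np.repeat([1, 2, 3], 4).astype(float)
metric = anchor * 0.1 + np.tile([0.01, 0.02, 0.03, 0.04], 3)
target = anchor + np.tile([0.04, 0.03, 0.02, 0.01], 3)
original, _ = kendalltau(metric, target)
print(original)                                      # positive overall
print(bucketed_correlation(metric, target, anchor))  # -1.0 inside every bucket
```

Here the large gap between the original and bucketed values would flag the anchor QC as a likely confounder.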
The relative difference scores for each anchor QC and each metric are shown in Appendix A.3. To summarize these detailed findings, we focused on the majority of metrics that were designed to measure Relevance. For each anchor and evaluated QC, we computed the median relative difference over all metrics. As can be seen in Table 3, the largest relative differences occur when Relevance serves as the anchor QC, as presented in the blue column. This means that the original correlations_metric to Coherence, Fluency and Consistency were strongly affected by Relevance as a strong confounding variable. However, when Relevance serves as the evaluated QC, as demonstrated in the yellow row, the relative differences are quite low, regardless of the anchor QC. This means that other QCs are probably not confounding variables for Relevance.
We also used the bucketing analysis to evaluate two other metric groups that roughly estimate other QCs, and observed the same phenomenon. The first group contains metrics that measure the percentage of repeated n-grams in the system summary. As a summary with repeated information is less coherent, these metrics are more suitable as rough estimates of Coherence. Accordingly, Table 4 shows high relative differences when Coherence functions as the anchor (marked as a blue column), meaning that when Coherence is neutralized, the correlations_metric to other QCs change dramatically. On the other hand, when other QCs function as anchors, the correlation_metric with Coherence is almost unchanged, as expressed by the low relative differences (marked as the yellow row). Overall, this analysis suggests that Coherence is the confounding factor for this metric group and that the original correlations_metric are spurious.
The second group contains metrics measuring the percentage of novel n-grams in the summary, i.e., those not found in the input document. These metrics capture abstractiveness, making them potentially useful as rough (negative) estimates of Consistency and Fluency, due to their ability to identify extractive summaries, which inherently possess consistency and fluency. Accordingly, we show the same phenomenon in Table 5, where bucketing by Consistency or Fluency as anchors yields high relative differences (blue columns), while bucketing by other QCs leaves low relative differences to Consistency and Fluency (yellow rows). As in this case we found two confounding factors, we conjecture that there is another, unmeasured human QC that assesses abstractiveness directly and eventually influences Consistency and Fluency. In such a case, this abstractiveness QC would function as the confounding factor.
Overall, this analysis points out the problems with the conventional scheme of metric evaluation in a multi-QC setting. We found that, except for the QC that the metrics were designed to measure, most of the correlations_metric to other QCs are spurious. Further exploration of adjusting this analysis for other scenarios, such as cases involving two confounding factors, is left for future work.
It is worth noting that spurious correlations can alternatively be detected with the partial correlation approach (Whittaker, 2009). The confounding variable is neutralized by calculating the correlation between the residuals resulting from linearly regressing each of the variables on the confounding variable. The partial correlation scores in our setting indeed display the same trend as our bucketing method, detecting Relevance as the confounding variable for most metrics. In contrast to the partial correlation approach, bucketing has the advantages of being more interpretable and of not assuming a linear dependency between variables. See Appendix C for an analysis and a comparative discussion of the two methods.

Conclusion
We challenged the conventional manner of evaluating and ranking summarization metrics according to correlation with human scores in the multi-QC setting. We found that human ratings of recent state-of-the-art systems tend to correlate across different QCs, which leads to unreliable metric performance scores. To demonstrate this, we showed that metrics that allegedly measure a QC decrease their scores only negligibly for summaries corrupted with respect to that QC. To cope with this obstacle, we proposed a bucketing method that removes the effect of the confounding variable and detects unreliable correlations. While this work mostly highlights the problem, we strongly encourage the development of additional approaches to tackle it.

Limitations
This study highlights a phenomenon that occurs when assessing summarization metrics across varying quality criteria. The findings are empirically shown only on SummEval, which is a relatively large-scale and high-quality meta-evaluation benchmark; no other major benchmark currently enables a similar analysis. Nevertheless, the findings would be further strengthened if they could be examined on additional benchmarks. Additionally, although our analysis offers strong empirical evidence that the Relevance QC is the confounding variable for most metrics in the SummEval setting, there could be other external factors that cause the strong correlations among the QCs.
We also rely, to a certain degree, on logical intuition and an understanding of the proposed metrics in order to convince the reader of our findings. For example, it is very reasonable to assume that certain summarization metrics do not actually have the ability to measure a specific QC. In the case of ROUGE-1, there should be no true relationship between the number of unigrams overlapping with another text and the Fluency of the evaluated text. Any corresponding chance correlation is presumably not due to a direct intent of the metric.

A.1 Running Systems and Metrics
All of our experiments were based on the resources provided by the SummEval benchmark (in the GitHub repository at https://github.com/Yale-LILY/SummEval). System summaries and human scores were present in the repository. We ran the metrics on summaries using the code provided in the repository, with a few minor adaptations. The original correlation table of SummEval is presented in Table 6.

A.2 Corruptions
An example of all corruption types is presented in Table 9.

A.3 Bucketing
To map the 1-5 scores into 5 buckets, we rounded the scores to the nearest integer. All metrics' bucketing correlations for each anchor are presented in Tables 10, 11, 12 and 13. To compute the absolute relative difference between the original correlations_metric and the bucketing correlations_metric, we first took the absolute value of each correlation score. This allowed us to assess the metric's ability to capture the human scores, whether in a positive or negative relationship. Next, we calculated the absolute difference between the two correlation values. A high absolute difference indicates a significant modification of the original correlation after bucketing, either in an upward or downward direction, highlighting the unreliability and spurious nature of the original correlation. While the majority of cases showed a positive difference, indicating that the original correlation was higher than the bucketing correlation, there were rare instances where the difference was negative. A negative difference implies that the original correlation was initially low but experienced a significant increase after bucketing.

B Manual Annotation for Corrupted Systems
In Section 3, we aim to validate the assumption that substantially corrupted systems should be penalized in human ranking. To support this claim, we conducted a manual annotation process on a randomly selected subset of 20 documents out of 100. Specifically, for each QC, we chose three corrupted systems (with 20 documents) that were not identified as degraded by the best automatic metric, as described in Section 3. These systems were then annotated against sampled lower-ranked systems, ensuring a ranking difference of at least 6 places (above one-third of the number of systems) based on the best automatic metric. Additionally, we confirmed that the corrupted systems, when not corrupted, achieved higher rankings than the lower-ranked systems according to the SummEval manual annotation. Finally, for each QC, we annotated 3 pairs of systems, each consisting of one corrupted system (corrupted with the QC-specific corruption) and one lower-ranked uncorrupted system. During the annotation process, we compared each system pair in terms of the relevant QC at the instance level, and aggregated the results to identify the system with more preferred instances. Specifically, for a given pair of systems, the annotator receives two system summaries (one per system) of the same document, and should select the better summary in terms of the corrupted QC. After annotating all pairs, the system with the most preferred summaries is considered better than the other system.

Table 6: SummEval Kendall's τ correlations between metrics and human annotations for each QC (taken from Fabbri et al., 2021). ^ denotes reference-free metrics. The most-correlated metric in each column is in bold.
For the annotation of Coherence, one of the authors annotated all 3 system pairs. However, since the Fluency and Consistency corruptions (lemmatizing all verbs and replacing all PERSON entities with others) can be easily noticed by the authors of the paper, we used Amazon Mechanical Turk workers for their annotation. We used workers from a list of 90 pre-selected workers from English-speaking countries, who had produced high-quality work in other NLP-related tasks we have conducted in the past. Each pair of summaries was annotated by three workers, and the majority vote was taken as the final score. Table 7 presents the rate at which uncorrupted low-ranked system summaries were preferred over the corrupted summaries in each system pair. As can be seen clearly, in all pairs the uncorrupted summaries were preferred 50 percent of the time or more, although they were ranked lower prior to corruption. This indicates that the corrupted systems indeed should be downgraded in their ranking, and therefore that our corruptions are effective in degrading the corrupted system's ranking.

C Partial Correlation
When examining the statistics literature for methods for detecting spurious correlations, we focused on the two prominent approaches: partial correlation (Whittaker, 2009) and data stratification (Mantel and Haenszel, 1959). Our bucketing method, presented in Section 4, is a form of stratification.
Methodologically, we identified two advantages of the bucketing approach over partial correlation. First, bucketing arranges the data into sets (buckets), where the confounding variable is neutralized within each set. This allows for further interpretability and analysis of the data, e.g., examining correlations and other statistics within each set, whereas the partial correlation method only provides a final score, without any further interpretation of the score or of the data. In other words, while partial correlation provides a debiased correlation score, the bucketing method provides debiased data. Second, partial correlation is based on linear regression between the assessed variables and the confounding variable, and hence assumes a potentially linear relationship between these variables. Bucketing, on the other hand, allows the neutralization of any type of spurious correlation, making the method more robust for future extensions (even if in the current work we measured only linear correlations within each bucket).
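As a reference point, the partial correlation baseline can be sketched as follows. This is our own minimal implementation with constructed toy data (libraries such as pingouin provide equivalent routines); it regresses both variables on the confounder and correlates the residuals, as described above.

```python
# Minimal sketch of partial correlation: regress both variables on the
# confounder (with an intercept) and correlate the residuals.
import numpy as np

def partial_correlation(x, y, confounder):
    """Pearson correlation between the residuals of x and y after linearly
    regressing each on the confounder."""
    design = np.column_stack([np.ones_like(confounder), confounder])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

# Toy data: x and y both track the confounder c, plus independent components.
c = np.arange(20, dtype=float)
x = c + np.tile([1.0, -1.0], 10)
y = c + np.tile([1.0, 1.0, -1.0, -1.0], 5)
print(np.corrcoef(x, y)[0, 1])       # high raw correlation, driven by c
print(partial_correlation(x, y, c))  # near zero once c is partialled out
```

Unlike bucketing, this yields only the final debiased score, with no per-stratum data to inspect.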
Empirically, the partial correlation scores show the same trend as the bucketing scores. Table 8 displays the median (with the 25th and 75th percentiles) differences between the original and partial correlations_metric for the group of metrics designed to measure Relevance, using the same format as described in Section 4 for the bucketing method. As with the bucketing method, we observe significant differences when Relevance serves as the anchor QC (the column marked in blue), while minimal differences are observed for Relevance as an evaluated QC (yellow row). This indicates the Relevance QC's role as a potential confounding factor.

D Licenses
All system summaries, reference summaries and human-annotated ratings were taken from the SummEval repository under the MIT license. Some of the reference summaries are originally from the CNN/DailyMail dataset. The documents corresponding to the summaries, also from CNN/DailyMail, were retrieved via the Huggingface distribution. All CNN/DailyMail data is released under the Apache-2.0 license.

Table 2 :
Kendall's τ correlations between human QC ratings at the system level / instance level / after bucketing. The columns serve as anchors solely for bucketing. Applying bucketing diminishes the correlation between QCs.

Table 3 :
Median (25th-75th percentile) of the absolute relative difference between the original correlation and the bucketing correlation, over metrics that were designed to measure Relevance.

Table 4 :
Median (25th-75th percentile) of the absolute relative difference between the original correlation and the bucketing correlation, over metrics that roughly measure Coherence.

Table 7 :
Percent of the uncorrupted system summaries that were manually preferred in each pair of systems. Note that the system pairs differ for each QC; therefore the columns are not comparable.

Table 8 :
Median (25th-75th percentile) of the absolute relative difference between the original correlation and the partial correlation, over metrics that were designed to measure Relevance.

Table 9 :
An example of a system summary with all corruption types. A replaced PERSON named entity is marked in red. A lemmatized verb is marked in blue. Capital letters were inserted manually to facilitate reading.

Table 10 :
Absolute relative difference between original performance and bucketing performance anchored by Relevance.

Table 11 :
Absolute relative difference between original performance and bucketing performance anchored by Coherence.

Table 12 :
Absolute relative difference between original performance and bucketing performance anchored by Consistency.

Table 13 :
Absolute relative difference between original performance and bucketing performance anchored by Fluency.