Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation

Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics. In this work, we develop strong-performing automatic metrics for reference-based summarization evaluation, based on a two-stage evaluation pipeline that first extracts basic information units from one text sequence and then checks the extracted units in another sequence. The metrics we developed include two-stage metrics that can provide high interpretability at both the fine-grained unit level and summary level, and one-stage metrics that achieve a balance between efficiency and interpretability. We make the developed tools publicly available at https://github.com/Yale-LILY/AutoACU.


Introduction
Automatic evaluation is an integral part of scaling natural language generation (NLG) system development and evaluation. While neural models have seen great success in NLG systems, their adoption in automatic metric development has been much slower, and classic metrics such as ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002) are still used more often than neural ones (Sellam et al., 2020; Zhang* et al., 2020; Yuan et al., 2021). Compared to neural systems, we argue that neural metrics face unique requirements for adoption: (1) interpretability: the metric scores should be interpretable and provide intuitive insights into system performance and output quality; (2) evaluation efficiency: ideally, automatic metrics should introduce only a small computational overhead, since it should be possible to use them on the fly for system development and fine-tuning. In this work, we aim to design neural metrics that better meet these requirements, which we believe will facilitate their adoption.
Our method focuses on a two-step decomposition of text sequence comparison, following related work (Bhandari et al., 2020; Zhang and Bansal, 2021; Liu et al., 2022a): first dissecting the information in one text sequence into multiple simple facts, and then checking the presence of these facts in another text sequence. Specifically, our automatic evaluation pipeline mirrors the human evaluation protocol of Liu et al. (2022a), which uses atomic content units, or ACUs, as the simple facts for comparing text sequences. We believe such a two-stage automatic evaluation is more interpretable and transparent. In particular, at the ACU level, the evaluation result indicates the presence or absence of an extracted information unit; at the summary level, the aggregated ACU score represents the percentage of information overlap from one text sequence to another. In contrast, it can be difficult to interpret the results of certain neural metrics such as BERTScore (Zhang* et al., 2020) or BARTScore (Yuan et al., 2021). For example, BARTScore assigns the normalized log-likelihood as the text similarity score, which is non-positive and non-linear, making it difficult to judge system output quality from the metric score alone.
Despite their advantages, such two-stage metrics (Deutsch et al., 2021; Zhang and Bansal, 2021; Fabbri et al., 2022) can be much slower to run, sometimes even slower than the evaluated systems. Therefore, apart from the two-stage evaluation method, we also propose a more efficient one-stage metric that directly predicts the aggregated summary-level scores by training on the recently proposed RoSE (Liu et al., 2022a) benchmark. The one-stage metric retains the summary-level interpretability of the two-stage evaluation, striking a balance between efficiency and interpretability. We also explore using the two-stage evaluation as pre-training for the one-stage metric, which further improves performance.
Our contributions can be summarized as follows: (1) a fine-grained two-stage automatic metric for reference-based summarization evaluation, which provides high interpretability; (2) an efficient one-stage automatic metric, which offers a balance between interpretability and evaluation efficiency; (3) both types of metrics achieve state-of-the-art performance on text summarization evaluation, and we make them publicly available as an easy-to-use Python package.
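To make the intended workflow concrete, below is a minimal usage sketch of the released package. The package name (autoacu) and the A3CU class follow the repository's naming, but the exact constructor arguments and return values shown here are assumptions and should be checked against the repository README.

```python
# Minimal usage sketch; API details are assumptions, see
# https://github.com/Yale-LILY/AutoACU for the actual interface.
from autoacu import A3CU  # one-stage metric; A2CU is the two-stage variant

a3cu = A3CU(device=0)  # assumed: load the scoring model onto GPU 0

candidates = ["A Texas man was shot by police on Sunday."]
references = ["Police fatally shot a Texas man on Sunday morning."]

# Assumed to return parallel lists of recall, precision, and F1 scores.
recall, precision, f1 = a3cu.score(references=references, candidates=candidates)
print(f"F1: {f1[0]:.3f}")
```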

Preliminaries
Information Similarity for Summarization Evaluation The most common and important evaluation method for summarization systems is assessing the similarity between system-generated summaries and reference summaries. Since there is no widely accepted definition of such similarity in related work (Nenkova and Passonneau, 2004; Lin, 2004; Bhandari et al., 2020; Fabbri et al., 2021; Zhang and Bansal, 2021), we refer to it as information similarity in this work: two completely similar text sequences should convey exactly the same information to the users, following the suggestion of Deutsch and Roth (2021).

Automatic Metrics of Information Similarity
Traditional automatic metrics of information similarity compare the lexical overlap of two text sequences; examples include ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002). In contrast, a family of neural automatic metrics (e.g., BERTScore (Zhang* et al., 2020), MoverScore (Zhao et al., 2019), BARTScore (Yuan et al., 2021)) focuses more on semantic similarity by leveraging pre-trained language models. Neural metrics can also be pre-trained on pseudo training signals (Sellam et al., 2020; Zhong et al., 2022) or related corpora (Yuan et al., 2021; Gao et al., 2021), or fine-tuned in a supervised manner (Rei et al., 2020). Apart from these one-stage metrics, related work on two-stage metrics proposes decomposing the evaluation process into finer-grained sub-tasks, such as the QA-based QAEval (Deutsch et al., 2021) metric and the Lite3Pyramid (Zhang and Bansal, 2021) metric, which automates the evaluation process of the LitePyramid (Shapira et al., 2019) protocol.

Evaluation of Automatic Metrics To evaluate information similarity metrics for text summarization, several human evaluation benchmarks (Bhandari et al., 2020; Fabbri et al., 2021; Zhang and Bansal, 2021; Liu et al., 2022a) have been collected, containing system-generated summaries and their human evaluation scores. Automatic metric performance is measured by the correlation between the automatic metric scores and the human evaluation scores of the system-generated summaries; we provide detailed definitions in Appendix B.

Methods
We first describe our two-stage decomposition of automatic information similarity evaluation: (1) extracting fine-grained content units from one text sequence; (2) checking the existence of the extracted units in another text sequence. We then introduce methods for training a one-stage automatic metric for information similarity and for using the two-stage decomposition as pre-training.

Two-Stage Evaluation
Content Unit Extraction A (long) text sequence can contain more than one fact, or simple information unit. Therefore, we follow Liu et al. (2022a) in using Atomic Content Units (ACUs) to refer to the basic information units. We formulate automatic ACU extraction as a sequence-to-sequence (Seq2Seq) problem (Sutskever et al., 2014):

$$\mathbf{A} = g(S), \quad (1)$$

where $S$ is the input text sequence, $\mathbf{A}$ is a concatenation of a set of ACUs $\mathcal{A}$ generated as a sequence, and $g$ is a Seq2Seq model for ACU extraction.
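A minimal sketch of this extraction step with the HuggingFace transformers API is shown below; the checkpoint name and the newline separator between generated ACUs are assumptions for illustration, since any Seq2Seq model fine-tuned on the human-written ACUs in RoSE could serve as g.

```python
# Sketch of ACU extraction as Seq2Seq generation (Eq. 1).
# The checkpoint name and the "\n" ACU separator are assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

ckpt = "Yale-LILY/a2cu-generator"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
generator = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

def extract_acus(text: str) -> list[str]:
    """Generate the concatenated ACU sequence A = g(S) and split it into units."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = generator.generate(**inputs, num_beams=4, max_length=256)
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return [acu.strip() for acu in decoded.split("\n") if acu.strip()]
```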
Content Unit Checking Having extracted a set of ACUs $\mathcal{A}$ from one text sequence $S_1$, we use a Natural Language Inference (NLI) (Gururangan et al., 2018) model $f$ to check whether the information in an extracted ACU $a$ is conveyed by another text sequence $S_2$:

$$l_a = f(S_2, a), \quad (2)$$

where $l_a$ is the label of $a$ assigned by the model $f$. In addition to this standard NLI setting, which views $S_2$ as the premise and $a$ as the hypothesis, we also explore adding $S_1$ to the model input as context:

$$l_a = f(S_1, S_2, a). \quad (3)$$

We use BERT (Devlin et al., 2019) as the NLI model architecture and follow its input format for the standard setting (Eq. 2). For the extended setting (Eq. 3), the input is a concatenation of $S_1$, $S_2$, and $a$.
Based on these two stages, we can define a recall information similarity score of $S_2$ w.r.t. $S_1$ as the fraction of extracted ACUs that are present in $S_2$:

$$r(S_2 \mid S_1) = \frac{1}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} l_a. \quad (4)$$
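The sketch below illustrates the checking stage (Eq. 2) and the recall aggregation (Eq. 4) with an off-the-shelf NLI classifier; the checkpoint choice and the entailment label index are assumptions that should be verified against the model's configuration (the paper instead fine-tunes an NLI model on RoSE).

```python
# Sketch of ACU checking (Eq. 2) and recall aggregation (Eq. 4).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_ckpt = "microsoft/deberta-large-mnli"  # example off-the-shelf NLI model
tokenizer = AutoTokenizer.from_pretrained(nli_ckpt)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_ckpt)
ENTAILMENT = 2  # assumed label index; check nli_model.config.id2label

def check_acu(s2: str, acu: str) -> int:
    """Return l_a = 1 if S2 (premise) entails the ACU a (hypothesis), else 0."""
    inputs = tokenizer(s2, acu, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    return int(logits.argmax(dim=-1).item() == ENTAILMENT)

def recall_score(s2: str, acus: list[str]) -> float:
    """Eq. 4: fraction of ACUs extracted from S1 that are present in S2."""
    return sum(check_acu(s2, a) for a in acus) / len(acus)
```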

One-Stage Metric and Its Pre-Training
To improve evaluation efficiency, we propose using our two-stage approach to generate pre-training data for a more lightweight, one-stage metric. Specifically, we use BERT as the backbone of a scoring model $h$ that approximates the two-stage recall score (Eq. 4):

$$\hat{r}(S_2 \mid S_1) = h(S_1, S_2). \quad (5)$$

The model is trained with the mean squared error between the target score and the predicted score. The input format follows Devlin et al. (2019), and a single linear layer is introduced to map the hidden representation of "[SEP]" to the predicted numerical score. We can also define an F1 score by combining the two directional scores:

$$f = \frac{2 \cdot r \cdot p}{r + p}, \quad (6)$$

where $r = h(S_1, S_2)$ is the recall score and $p = h(S_2, S_1)$ is the precision score.
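A minimal sketch of the scoring model is shown below, assuming the hidden state is taken from the final "[SEP]" token of the paired input; which "[SEP]" position feeds the head is our assumption, as is the argument order of h in the F1 computation.

```python
# Sketch of the one-stage scoring model h (Eq. 5) and the F1 combination (Eq. 6).
import torch
import torch.nn as nn
from transformers import AutoModel

class OneStageScorer(nn.Module):
    def __init__(self, ckpt: str = "bert-large-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(ckpt)
        # Single linear layer mapping a hidden state to a scalar score.
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Input is the BERT pair format: [CLS] S1 [SEP] S2 [SEP].
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        # Position of the last non-padding token, i.e., the final "[SEP]".
        sep_pos = attention_mask.sum(dim=1) - 1
        sep_hidden = hidden[torch.arange(hidden.size(0)), sep_pos]
        return self.head(sep_hidden).squeeze(-1)  # predicted similarity score

loss_fn = nn.MSELoss()  # trained against the two-stage target scores

def f1(r: float, p: float) -> float:
    """Eq. 6, with r = h(S1, S2) and p = h(S2, S1)."""
    return 2 * r * p / (r + p) if (r + p) > 0 else 0.0
```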
Pre-training Corpora As shown by Sellam et al. (2020), the robustness of automatic metrics can be improved by pre-training on synthetic data. We further extend this approach following the findings of related work (Liu and Liu, 2021; Liu et al., 2022b), which show that pre-trained summarization models such as BART can generate diverse, high-quality candidate summaries, and that summarization models can benefit from contrastive learning with the generated candidates.

In a similar spirit, we construct the pre-training corpora from existing summarization datasets such as CNN/DailyMail (Nallapati et al., 2016): for each data example, we generate multiple candidate summaries using a fine-tuned summarization model. The generated summaries are then scored by the two-stage evaluation method (Eq. 4), and these scores are used for pre-training the one-stage model $h$, as sketched below.
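In this sketch of the corpus construction, the generation hyperparameters and the use of plain beam search are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of pre-training data construction: generate multiple candidate
# summaries per article, then label each with the two-stage score (Eq. 4).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

ckpt = "facebook/bart-large-cnn"  # BART fine-tuned on CNN/DailyMail
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

def candidate_summaries(article: str, n: int = 12) -> list[str]:
    """Return n candidate summaries for one article via beam search."""
    inputs = tokenizer(article, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs, num_beams=n, num_return_sequences=n, max_length=128
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Each (reference, candidate) pair would then be labeled with the two-stage
# pipeline, e.g., recall_score(candidate, extract_acus(reference)).
```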

Experimental Settings
Datasets We mainly use a recently introduced summarization evaluation benchmark, RoSE (Liu et al., 2022a), for automatic metric development and evaluation. It contains human evaluation recall scores of system-generated summaries against the reference summaries w.r.t. information similarity on three summarization datasets: CNN/DailyMail (CNNDM) (Nallapati et al., 2016), XSum (Narayan et al., 2018), and SamSum (Gliwa et al., 2019). The human evaluation is conducted following the ACU evaluation protocol (Liu et al., 2022a), and the dataset also provides human-written ACUs for the reference summaries and the associated results (ACU labels) of checking the ACUs against system-generated summaries. The dataset statistics are in Tab. 1.
Baseline Metrics We compare our methods with related automatic metrics for text sequence comparison. Since the RoSE benchmark provides recall information similarity scores, only metrics with recall scores are compared, and their recall score variants are used for comparison. ROUGE (Lin, 2004) compares two text sequences by the n-gram overlap between them. We report the performance of its ROUGE-1/2/L variants.
BARTScore (Yuan et al., 2021) interprets the similarity of text sequence x to y as the probability of x given y predicted by a pre-trained language model such as BART (Lewis et al., 2020). We report its variants pre-trained on CNNDM (BARTScore-C) and ParaBank2 (Hu et al., 2019) (BARTScore-P).
QAEval (Deutsch et al., 2021) measures information similarity by question answering accuracy. We report both its exact match score (QAEval-EM) and F1 score (QAEval-F1).
Lite3Pyramid (Zhang and Bansal, 2021) introduces an approach similar to our two-stage evaluation but uses a semantic role labeling (He et al., 2017) model to extract content units. We report its variants that use the predicted label (Lite3Pyramid-L) and the predicted probability (Lite3Pyramid-P).

Table 2 :
Kendall's correlation between the automatic metric scores and human evaluation scores on the RoSE dataset. The correlation is calculated at both the system level and the summary level. We use the recall score of the automatic metrics. A2CU is our two-stage evaluation method; A3CU is our one-stage metric. †: significantly (p < 0.05) better than the best baseline.

Implementation Details For the ACU extraction model (Eq. 1), we fine-tune a Seq2Seq model on the human-written ACUs provided in RoSE. As for the NLI model for ACU checking (Eq. 2 & 3), we use a pre-trained DeBERTa (He et al., 2021) NLI model as the starting point and further fine-tune it on the RoSE dataset with the available gold-standard ACU labels. We name the two-stage method A2CU (AutoACU); it has three variants: A2CU-P uses the pre-trained NLI model, A2CU-F uses the fine-tuned NLI model (Eq. 2), and A2CU-FC uses the fine-tuned NLI model taking the source text as part of the input (Eq. 3).
For the pre-training of the one-stage metric, we use summaries generated by a pre-trained BART model on the CNNDM dataset. For each data example, 12 candidate summaries are scored by the two-stage evaluation method and used to pre-train the one-stage metric. After pre-training, the one-stage metric can be further fine-tuned on the gold-standard scores. We name the one-stage metric A3CU (Accelerated AutoACU); it has three variants: A3CU-F is directly fine-tuned on RoSE, A3CU-P is pre-trained with A2CU only, and A3CU-PF is first pre-trained and then fine-tuned.
We note that all the training on the RoSE dataset is performed on the validation split of CNNDM, which is further split for training and validation.

Results
We report the results in Tab. 2. Kendall's correlation coefficients are used to evaluate metric performance at both the system and summary levels, which shows the following: (1) Both our two-stage and one-stage metrics outperform the baseline methods across the three datasets.
(2) The improvement of our metrics is more significant at the summary level than at the system level. We additionally evaluate the metrics on two related benchmarks, the STS benchmark (Cer et al., 2017) and the WMT19 Metrics Shared Task DARR benchmark (Ma et al., 2019), with results reported in Appendix C.

Conclusions
We develop high-performing automatic metrics for reference-based summarization evaluation, including two-stage metrics that provide fine-grained interpretability and one-stage metrics that balance efficiency and interpretability. Furthermore, we show that the two-stage metric can be used to effectively pre-train the one-stage metric, helping to mitigate data scarcity in automatic metric development.

B Correlation Calculation
To evaluate the performance of automatic metrics, the human evaluation result on the same evaluation target is considered the gold standard, and metric performance is measured by the correlation between the human evaluation scores and the automatic metric scores. For text summarization metrics, such correlations can be calculated at the system level and at the summary level. More specifically, given $n$ input articles and $m$ summarization systems, the human evaluation and an automatic metric yield two $n$-row, $m$-column score matrices $H$ and $M$, respectively. The summary-level correlation is an average of sample-wise correlations:

$$\rho_{\text{summ}} = \frac{1}{n} \sum_{i=1}^{n} C(H_i, M_i),$$

where $H_i$, $M_i$ are the evaluation results on the $i$-th data sample and $C$ is a function calculating a correlation coefficient (e.g., the Pearson correlation coefficient). In contrast, the system-level correlation is calculated on the aggregated system scores:

$$\rho_{\text{sys}} = C(\bar{H}, \bar{M}),$$

where $\bar{H}$ and $\bar{M}$ contain $m$ entries, which are the average system scores across the $n$ data samples, e.g., $\bar{H}_0 = \sum_{i} H_{i,0} / n$.
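As a concrete reference, below is a sketch of both correlation levels over the n-by-m score matrices, using Kendall's tau from SciPy as the correlation function C:

```python
# Sketch of summary-level and system-level correlation over n x m matrices.
import numpy as np
from scipy.stats import kendalltau

def summary_level(H: np.ndarray, M: np.ndarray) -> float:
    """Average of the per-sample (row-wise) correlations across the n inputs."""
    taus = [kendalltau(H[i], M[i])[0] for i in range(H.shape[0])]
    return float(np.mean(taus))

def system_level(H: np.ndarray, M: np.ndarray) -> float:
    """Correlation of the m per-system scores averaged over the n inputs."""
    tau, _ = kendalltau(H.mean(axis=0), M.mean(axis=0))
    return float(tau)
```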

C Evaluation on Related Information Similarity Benchmark
Apart from the RoSE benchmark, we evaluate metric performance on two related benchmarks for assessing the information similarity between two text sequences: the STS (Semantic Textual Similarity) benchmark (Cer et al., 2017) and the human evaluation results from the WMT19 Metrics Shared Task (Ma et al., 2019).
STS Benchmark The STS benchmark contains English sentence pairs with human-annotated semantic similarity scores (Cer et al., 2017), and has an evaluation target similar to information similarity for text summarization evaluation. On this benchmark, semantic similarity metrics are evaluated by the correlation between the predicted similarity scores and the reference scores. Following previous work (Gao et al., 2021; Reimers et al., 2016), we report Spearman's correlation coefficients on the test split of the benchmark, which contains around 1.5k data examples. Since the metrics we evaluated previously are not designed specifically for the STS task, we also compare against SimCSE (Gao et al., 2021), which achieves strong performance on this task.
WMT19 DARR Benchmark This benchmark contains human-annotated scores of system-generated translations (Ma et al., 2019). We use only the to-English part of the benchmark, which results from reference-based direct assessment (DA) of translation quality. The human-annotated DA scores are transformed into relative rankings between two translations, i.e., the DARR scores. We follow the evaluation setting of Ma et al. (2019), using a Kendall's Tau-like correlation (a sketch is given below) to evaluate the segment-level performance of automatic metrics. The benchmark contains translations from seven languages into English, with around 21k translated sentence pairs.

Apart from these two benchmarks, we also report metric performance under the normalized ACU scores in RoSE, which are decorrelated from summary length to evaluate F1-based metrics (Liu et al., 2022a). We note that, unlike in §4.2, all the metrics compared here are F1-based. In particular, for A3CU we follow Eq. 6 to calculate the F1 scores. The two-stage metrics, including A2CU, are not reported because of their lower efficiency.
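For reference, a sketch of the Kendall's Tau-like correlation used above; the official WMT script's handling of metric-score ties is not reproduced here, so the tie handling is an assumption.

```python
# Sketch of the Tau-like correlation over DARR pairs:
# tau = (concordant - discordant) / (concordant + discordant).
def darr_tau(pairs: list[tuple[float, float]]) -> float:
    """Each pair holds the metric scores of the (human-preferred, dispreferred)
    translations of the same source; ties are ignored here (an assumption)."""
    concordant = sum(1 for better, worse in pairs if better > worse)
    discordant = sum(1 for better, worse in pairs if better < worse)
    return (concordant - discordant) / (concordant + discordant)
```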
The results in Tab. 4 show the following: (1) While the compared metrics have a similar evaluation target, i.e., information similarity between text sequences, their performance varies across benchmarks, and no single metric consistently outperforms the others. A similar finding in Ma et al. (2019) shows that Pearson's correlation between human-annotated cross-lingual STS scores and machine translation quality estimation scores is only 0.41.
(2) On the STS benchmark, A3CU outperforms all metrics other than SimCSE, which is specifically designed for the STS task. We believe this results from A3CU's better interpretability thanks to its underlying evaluation process: A3CU's computed scores directly indicate the information overlap between text sequences, which is close to the definition of the STS task.
(3) On the WMT benchmark, A3CU fails to outperform BERTScore and BARTScore. We hypothesize this is because A3CU is relatively insensitive to minor differences between candidate translations when implicitly comparing them against the same reference translation. While both the RoSE and WMT19 DARR benchmarks contain reference-based human annotations of candidate output quality, they have different data distributions: system-generated translations of the same source sentence in the WMT benchmark are more similar to each other than system-generated summaries of the same source article. We visualize this discrepancy in Fig. 2, which shows the similarity (as evaluated by ROUGE-1) between different candidate outputs of the same example on the WMT19 DARR benchmark and the XSum test split of the RoSE benchmark.
As illustrated by the figure, candidates in the WMT19 benchmark are more similar, which can lead to performance differences when the same automatic metric is evaluated on these two benchmarks.

Figure 1 :
Figure 1: Example of two-stage automatic summarization evaluation based on the Atomic Content Unit (ACU) protocol. In the first stage, an automatic ACU extraction model dissects the information in one text sequence into ACUs. In the second stage, an automatic ACU checking (matching) model checks the presence of the extracted ACUs against another text sequence.

Table 1 :
Statistics of the RoSE dataset. #Doc. is the number of input documents. #Sys. is the number of evaluated summarization systems. #ACU is the total number of written ACUs on the reference summaries. #Summ. is the number of summary-level annotations.

Table 3 :
Performance comparison of different variants of the A3CU metric fine-tuned on RoSE. A3CU is based on the BERT-large model and pre-trained with the two-stage evaluation method. A3CU-B is its counterpart based on the BERT-base model; A3CU-R is the counterpart pre-trained to predict ROUGE scores.

Table 4 :
Metric performance on related benchmarks. STS is the STS benchmark with Spearman's correlation coefficients. WMT is the WMT19 DARR benchmark with Kendall's Tau-like correlations. CNNDM, XSum, and SamSum correspond to Kendall's correlation coefficients based on the normalized ACU scores on the RoSE benchmark for the different test splits. All correlations are calculated at the segment level. We use the F1 score of the automatic metrics.

Figure 2 :
Figure 2: Comparison of candidate output similarity in the WMT19 DARR benchmark and the XSum test split of the RoSE benchmark. The similarity of candidate output pairs of the same example is evaluated by ROUGE-1.