A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy

In recent years, reference-based and supervised summarization evaluation metrics have been widely explored. However, collecting human-annotated references and ratings are costly and time-consuming. To avoid these limitations, we propose a training-free and reference-free summarization evaluation metric. Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score. The relevance score is computed between the pseudo reference built from the source document and the given summary, where the pseudo reference content is weighted by the sentence centrality to provide importance guidance. Besides an F_1-based relevance score, we also design an F_\beta-based variant that pays more attention to the recall score. As for the redundancy score of the summary, we compute a self-masked similarity score with the summary itself to evaluate the redundant information in the summary. Finally, we combine the relevance and redundancy scores to produce the final evaluation score of the given summary. Extensive experiments show that our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation. The source code is released at https://github.com/Chen-Wang-CUHK/Training-Free-and-Ref-Free-Summ-Evaluation.


Introduction
Text summarization systems have been developed rapidly due to the appearance of sequence-tosequence frameworks (Sutskever et al., 2014;Bahdanau et al., 2015;See et al., 2017;, transformer architectures (Vaswani et al., 2017) and large-scale pre-training models (Devlin et al., 2019;. How to accurately * This work was mainly done when Wang Chen was an intern at Tencent AI Lab. evaluate the summaries generated from these systems also attracts more and more attention in this research area. One of the most accurate evaluation methods is human evaluation. However, human evaluation is expensive, time-consuming, and nonreproducible. Thus, it is necessary to develop automatic evaluation metrics for text summarization systems. Existing automatic summarization evaluation metrics can be roughly categorized into two groups: reference-based metrics and reference-free metrics. In this work, we focus on reference-free metrics. Reference-free summarization evaluation metrics have been developed in parallel in multidocument summarization and single-document summarization. The SOTA reference-free method for multi-document summarization evaluation, SU-PERT , predicts a relevance score for each (document, summary) pair to estimate the informativeness of the summary and then averages all the scores from multiple documents as the final evaluation score. For each pair, SUPERT employs the top-ranked sentences which are ranked by the position or centrality as a pseudo reference of the document and then applies BERTScore  to produce a relevance score between the pseudo reference and the given summary. The SOTA single-document summarization referencefree evaluation metric, LS Score , combines a learned linguistic scorer for the summary and a cosine similarity scorer for the (document, summary) pair to produce the final score.
Although SUPERT and LS Score achieve the SOTA performance on their own areas respectively, they still have several drawbacks. For example, SUPERT only considers the relevance score between the document and the summary while ignoring the other aspects such as how much redundant information is contained in the summary. Besides, SUPERT assumes that all pseudo reference sen-tences are equally-important. However, in the real world, the key information of a document is unevenly distributed over sentences. Therefore, such an assumption may introduce extra noise for the evaluation. Note that although SUPERT may employ sentence centrality to select document sentences as a pseudo reference, they ignore the sentence centrality after the selection and still treat the selected sentences equally-important. As for LS Score, although it does not require a reference during the evaluation of a summary, it requires a large-scale training dataset with reference summaries to train the linguistic scorer. Besides the intrinsic drawbacks in these SOTA methods, to our best knowledge, there is no reference-free evaluation metric showing that it can achieve the SOTA performance on both multi-document and singledocument summarization.
To solve the above limitations, based on SU-PERT, we propose a novel training-free and reference-free metric for both multiple and single document summarization evaluation. Our metric is composed of a centrality-weighted relevance score and a self-referenced redundancy score.
For the relevance score which is employed to estimate the informativeness of the summary, we incorporate the following new features. First, unlike previous work which only utilizes the tokenlevel representations, motivated by , we engage a hybrid way that contains both token-level representations and sentence-level representations to encode the document and the summary. The purpose of the hybrid representation is to enable our method to consider richer mapping styles (i.e., token-to-token, sentence-to-token, and sentence-to-sentence) and help to produce a more comprehensive evaluation score. Second, we utilize the sentence centrality computed from sentence-level representations of the source document to produce the importance weights of the pseudo reference sentences and tokens. Based on the weights, we compute a weighted relevance score that is more precise by considering the relative importance. Third, besides the F 1 version of our relevance score, we also propose an adaptive F β version where recall is considered β times as important as precision. β is computed based on the length ratio between the pseudo reference and the given summary. The motivation is to punish the short summary that can easily get high precision while covering very limited important information in the pseudo reference (i.e., low recall).
To measure the redundancy of a summary, we design a simple but effective self-referenced similarity score. If a summary contains much redundant information, there must exist plenty of semantically similar tokens or sentences. Based on this assumption, we use the summary itself as the reference and input a (summary, summary) pair into a selfmasked BERTScore to produce a redundancy score that evaluates the averaged degree of semantic similarity of each token or sentence with other tokens or sentences.
After obtaining the centrality-weighted relevance score and the self-referenced redundancy score, we combine them to predict the final evaluation score. Depending on either F 1 or F β is applied in our relevance score, we propose two variants of our method: the F 1 -based version and the F β -based version. Extensive experiments are conducted on both multi-document and single-document summarization datasets. The results show that our F 1based method already outperforms all the SOTA baselines on all datasets. Moreover, our F β -based method can further improve the performance on multi-document summarization datasets.
Our contributions are summarized as follows: (1) A novel training-free and reference-free summarization evaluation metric which considers both relevance and redundancy; (2) A centrality-weighted relevance score that effectively utilizes the sentence centrality of the documents to provide importance guidance for the pseudo reference tokens and sentences. Besides the F 1 version, we also develop an F β based relevance score which pays more attention to recall; (3) A self-referenced redundancy score that utilizes a self-masked BERTScore to detect the duplicated information of the given summary; (4) To the best of our knowledge, we are the first evaluation metric that can achieve SOTA performance on both multiple and single document summarization under the reference-free setting.

Preliminary
Notations. We denote vectors as bold lowercase characters and matrices as bold uppercase characters. The characters that are not bold are used to denote scalars. Calligraphy uppercase characters are utilized to represent sets.
Problem Definition. We formally define the reference-free summarization evaluation problem as follows. Give a set of documents D = {d 1 , d 2 , ..., d K } and a generated summary x, the goal is to predict a score to represent the overall quality of the summary. K = 1 and K > 1 indicate single-document and multi-document summarization respectively.

Our Methodology
The overall framework is illustrated in Figure 1. Our final evaluation score of a summary consists of an averaged centrality-weighted relevance score and a self-referenced redundancy score. Both scores are calculated on a semantic-level instead of utilizing n-gram overlapping. The averaged relevance score is computed from the relevance score between the summary and each document in the document set. The redundancy score is calculated based on the summary itself.

Centrality-weighted Relevance Score
Our relevance score aims to estimate the informativeness of the given summary. We first encode each document in the document set and the summary into hidden representations. Then, for each document, we select essential sentences by centrality to build a pseudo reference. Next, we compute a centrality-weighted relevance score between the summary and each pseudo reference. Finally, we average all the relevance scores as the final relevance score of the summary. We use the k-th document d k and a summary x as an example to show the workflow.
Encoding. Following SUPERT , we first split the document d k and the summary x into sentences. Then, the pre-trained SBERT 1 is employed to encode the tokens of each sentence into token-level contextual hidden representations. We also apply max-pooling on all the tokens of a sentence to obtain the sentence-level hidden representation. Following previous work, when utilizing the token-level representations to compute the relevance and redundancy scores, we will filter out the non-informative tokens such as stop-words to improve the efficiency.
Building Pseudo Reference. We do not choose all the document sentences of d k to evaluate the relevance of the summary. Because the whole document usually contains plenty of unimportant sentences which may introduce extra noise for the relevance evaluation. Thus, we select important document sentences to build a pseudo reference r for the evaluation. The sentence selection is based on the centrality of each sentence, which is computed by the unsupervised algorithm, PacSum (Zheng and Lapata, 2019), using the sentence-level representation. After obtaining the centrality scores of all sentences of the document, we choose the top-M 2 sentences as the pseudo reference. Besides, we normalize the centrality scores to [0, 1] and denote the normalized centrality scores of the selected sen-tences asā s = [ā s 1 ,ā s 2 , ...,ā s M ] whereā s i ∈ [0, 1] and the superscript s means sentence-level. We denote the pseudo reference building process as PacSumTopM.
Computing Relevance Score with One Pseudo Reference. Instead of only using token-level representations, we also leverage the sentence-level representations to provide multi-level information. The hybrid representations of the summary x and the pseudo reference r are denoted as follows: where n and N (m and M ) are the token number and sentence number of the summary (pseudo reference). w and s represent the token and sentence hidden representations respectively. Besides the hybrid representations, we also introduce a centrality weighting scheme to weight the tokens and sentences of the pseudo reference, which is different from previous work that either treats them equally or uses the surface statistics like IDF as the weights. Based on the centrality scores of the selected pseudo reference sentences i.e.,ā s = [ā s 1 ,ā s 2 , ...,ā s M ], we assign the weights of the pseudo reference tokens as follows: whereā i:w j ∈s i indicates the token w j inherits the centrality score from its sentence s i . Since we have already removed the non-informative tokens in the token-level representations of each sentence, the remaining tokens capture the key information of the sentence and consequently it is reasonable to perform such a weight inheritance. Next, we combine token weightsā w and sentence weightsā s to get the final normalized centrality-based weights of the hybrid representations: where "[·; ·]" represents concatenation. Based on the hybrid representations (i.e., X and R k ) and the centrality-based weights of the pseudo reference tokens and sentences (i.e., a), we compute the relevance score between the summary and the pseudo reference by a weighted BERTScore . For brevity, we denote the j-th element of X as x j , the i-th element of R k as r i , and the i-th element of a as a i : where "Sim" denotes the cosine similarity and |X| equals to n + N . Recall, P recision, and F 1 are in the range of [-1, 1]. Besides the F 1 version, we also propose an adaptive F β version of relevance score as follows: where |R k | = m+M , |X| = n+N , and γ is a positive integer hyper-parameter. In our experiments, γ is set as 2 after fine-tuning on the validation dataset and is fixed for all the testing datasets. The physical meaning of β is that the Recall score is considered β times as important as the P recision score. In summarization evaluation, the coverage of the key information is always the most important quality indicator of the summary. Thus, we set the lower bound of β as 1. On the other hand, the metric should not only evaluate the key information coverage, containing less unimportant content in the summary should also be considered. Therefore, we set the upper bound of β as √ 2. As shown in Eq.12, within the range of [1, √ 2], β adaptively changes according to the ratio between |R k | and |X|. The intuition comes from that a longer pseudo reference implies more key information needs to be covered by the summary. Besides, a shorter summary can easily get high precision but covers very limited important information in the pseudo reference. Thus, we give Recall a higher weight to punish such short summaries when the pseudo reference is long.
Final Averaged Relevance Score. After computing the centrality-weighted relevance score between the summary and the pseudo reference of each source document, we employ the average as the final relevance score of the summary: score rel = mean([F 1 * , ..., F k * , ..., F K * ]), (13) where * is 1 for the F 1 variant and β for the F β variant. The superscript k indicates the F * score is computed with the k-th document. Note that score rel ∈ [−1, 1] and higher is better.

Self-referenced Redundancy Score
In this section, we introduce our self-referenced redundancy score. We engage the summary itself as the reference to evaluate the degree of the semantic similarity between each summary token or sentence with the other tokens or sentences. The averaged semantic similarity degree is used as the redundancy score. The computation is based on a self-masked BERTScore as follows: where "j : i = j" means we do not consider the similarity between x i and itself, i.e, self-masked.
Because of the symmetric property, the F 1 , precision, and recall scores are equal with each other. This is also the reason that we use precision in Eq.14 as the final redundancy score. Note that score red ∈ [−1, 1] and lower is better.

Final Evaluation Score
After obtaining the relevance score and the redundancy score, we apply a linear combination to produce the final evaluation score of the summary based on the document set: where 0 < λ ≤ 1 is a hyper-parameter to scale the redundancy score and score ∈ [−1, 1].
Higher score means better summary quality. In our experiments, after fine-tuning on the validation set, λ is set as 0.6 and is fixed for all the testing datasets. We denote the variants of our final method as Ours(F β )-PacSumTopM and Ours(F 1 )-PacSumTopM depending on whether the adaptive F β is employed.  Table 1: Statistics of datasets. "Valid." and "Test." indicate the dataset is used for validation and testing, respectively. "|T opic|" is the number of topics. Under each topic, a set of documents is given and summaries are from different systems associating with humanannotated quality scores. "|Set|" is the number of documents in the document set. "Ave.S" and "Ave.T" represent the averaged sentence number and token number per document or summary. Note that the token number is counted after the tokenization. "|Systems|" denotes the number of summarization systems in the dataset. For TAC datasets, we compute correlation coefficients between predicted scores of an evaluation method and the annotated Pyramid scores of summaries to measure the effectiveness of the method. Following Gao et al. (2020), a correlation is computed for each topic. Then, the averaged correlation from all the topics is engaged as the final correlation of the method with human ratings.
For CNNDM dataset, correlations are calculated with the human scores in three dimensions including Overall, Grammar, and Redundancy. Following , the correlation is computed between predicted scores of the 499 × 4 = 1996 (document, summary) pairs with corresponding human ratings.

Baselines
In this section, we briefly introduce our baselines.
We choose TF-IDF, JS (Louis and Nenkova, 2013), and REPEAR  as traditional reference-free baselines. All these traditional baselines do not build pseudo references and    directly utilize the full content of the documents. For fairness, we also show the performance of our methods without building pseudo reference. We denote them as Ours(F 1 )-All and Ours(F β )-All since they use the whole document as a reference.
We also extend several popular referencebased methods as baselines. We adapt ROUGE-1/2/L (Lin, 2004), MoverScore , and S+WMS  into the referencefree scenario via building the pseudo reference with the PacSumTopM method. We add the suffix "-PacSumTopM" to these baseline names to indicate the pseudo reference building process.
Besides, the SOTA reference-free summary evaluation metrics are also selected as our strong baselines, including C-ELMO/C-SBERT (Sun and Nenkova, 2019), SUPERT/SUPERT-IDF , and LS Score . C-ELMO (C-SBERT) encodes the document and the summary using the pre-trained ELMO (SBERT) and then computes their cosine similarity. SUPERT-IDF is an extension of SUPERT, which utilizes the inverse document frequency (IDF) as the importance weight of each token. For fair comparisons, we also apply the same pseudo reference building process i.e., PacSumTopM, to C-ELMO/C-SBERT/SUPERT/SUPERT-IDF and add the suffix "-PacSumTopM" to the their names.

Main Results
The main experimental results on multi-document summarization datasets are shown in Table 2. We find that our F 1 version (i.e., Ours(F 1 )-PacSumTopM) already consistently outperforms all the baselines, which indicates the effectiveness of our centrality-weighted relevance score and our self-referenced redundancy score. The results also and Ours(F 1 ) on TAC-2011 for different |Set| and |Systems|. Positive gaps mean our F β can improve the performance while negative gaps indicate our F β degrades the performance. When changing one of them, the other is fixed. "all" means the full size is applied, i.e., 10 for |Set| and 50 for |Systems|.
demonstrate that our F β version can further improve the performance of multi-document summarization evaluation. By comparing Ours(F β )-PacSumTopM and Ours(F β )-All, we see that the pseudo reference building process can significantly improve the performance. This is also the reason why we apply the same pseudo reference building process into SOTA baselines for fair comparisons.
In the remaining part of this paper, we omit the suffix "-PacSumTopM" for simplicity when we mention a method. We also test our methods on the single-document summarization dataset without further fine-tuning the hyper-parameters. The main results are displayed in Table 3. We note that our F 1 version still outperforms all the baselines, which manifests the high generalization ability of our F 1 -based method. One interesting finding is that the performance significantly drops after incorporating the F β score.
To study the reason for the performance degradation on CNNDM after incorporating F β , we compare CNNDM and TAC datasets first. From Table 1, we note the main differences between them are the size of the document set for each topic (i.e., |Set|) and the number of the summarization systems (i.e., |Systems|). CNNDM has much smaller |Set| and |Systems|. We use the TAC-2011 dataset as an example to investigate whether our F β is unsuitable for smaller |Set| and |Systems|. We change |Set| and |Systems| respectively and report the gap of Spearman's ρ between Ours(F β ) and Ours(F 1 ) in Figure 2. From the results, we observe that our F β  Table 4: Spearman's ρ of incorporating the centrality weighting and redundancy score into MoverScore based framework. "+Both" means these two features are simultaneously applied.
can consistently improve the performance for different |Set|. For the single-document summarization setting, i.e., |Set|=1, it still obtains a positive gap. Nevertheless, when the |Systems| is small such as 4, applying our F β leads to a dramatic performance dropping. From Table 1, we also see that CNNDM and TAC-2011 have different summary lengths (73.2 for CNNDM and 120.9 for TAC-2011). However, when we limit the |Systems| of TAC-2011 to smaller numbers, the average length of generated summaries is still around 120, which indicates the performance degeneration is indeed from the change of system numbers. Therefore, we suggest using Ours(F β ) when |Systems| is large like 12 and employing Ours(F 1 ) when |Systems| is small like 4.

Ablation Study
For better understanding the contributions of our proposed components, we conduct ablation studies on the best-performed method on each dataset, i.e., Ours(F β ) for the multi-document summarization datasets and Ours(F 1 ) for the single-document summarization dataset. We display results of the rank-based Spearman's ρ in Figure 3. As shown in the figure, after removing one of the three components (i.e., the centrality weighting, the hybrid representation, and the redundancy score), the performance of our methods become worse in most cases. This finding demonstrates the effectiveness of our proposed components. Besides, we also note that removing the redundancy score significantly degrades the performance on the redundancy evaluation on CNNDM, which indicates our redundancy score effectively captures the redundancy degree of the summaries.

Apply Centrality Weighting and
Redundancy Score into MoverScore Besides basing on BERTScore, we also study whether our key features i.e., the centrality weighting and redundancy score, can work well in a : Ablation studies for Ours(F β ) on TAC datasets and Ours(F 1 ) on CNNDM. "-CentralityW." means that we remove the centrality weighting when computing relevance scores. "-HybridR." represents we only utilize the token-level representations when calculating relevance and redundancy scores. "-Redundancy" indicates we omit the redundancy score.
MoverScore based framework (i.e., the relevance and redundancy scores are computed using Mover-Score). Note that our F β is not applicable to Mover-Score since it is not an F -measure. The results are listed in Table 4. We find that these two features significantly improve the performance of the original MoverScore on single-document summarization evaluation while degrading the performance dramatically on multi-document summarization evaluation. On CNNDM, the enhanced Mover-Score even outperforms Ours(F 1 ) on the "Overall" and "Redundancy" aspects, which indicates Mover-Score is a promising basis for our proposed new features. We leave solving the performance dropping of the enhanced MoverScore on multi-document setting as future work.

Robustness Analysis
We investigate the robustness of our method on the following factors and report the experimental results on the validation dataset (i.e., TAC-2010) in Figure 4: (1) the hyper-parameter λ for scaling the redundancy score; (2) the hyper-parameter γ in F β ; (3) the number of selected sentences for pseudo reference i.e., M ; (4) different pre-trained contextual encoding models including BERT-base 5 , BERTlarge 6 , RoBERTa-base 7 , and RoBERTa-large 8 . Since both Spearman's ρ and Kendall's τ are rank-based correlation coefficients, we omit Kendall's τ for simplicity. From this figure, we observe that the performance of our method is relatively stable for different λ and γ. We also find that a small M leads to lower correlations because much important information may be abandoned when building the pseudo references. But a large M will also degenerate the correlations since more noises are introduced. Thus, a moderate M is better. As for encoding models, we note that large encoding models obtain better performance than base encoding models. However, large models need more computation resources and time to encode the input text. Note that for our final method, we only fine-tune λ and γ on the TAC-2010 and set them as 0.6 and 2. As for M and encoding models, following the configuration of SUPERT , we directly set M as 12 and employ the BERT-large as the encoding model. All these factors are fixed for all testing datasets.

Performance on Bad/Good Summaries
In this section, we evaluate the ability of our method to distinguish bad and good summaries. The bad and good summaries are selected by human ratings. We use TAC-2011 as an example and choose SUPERT as a strong baseline. The corresponding distributions of the reversed rank for bad and good summaries are illustrated in Figure 5. A smaller (larger) reversed rank represents the summary is assigned with a lower (higher) score. From the figure, we find that compared with SUPERT, Our(F β ) has a better ability to assign bad sum- maries lower scores and good summaries higher scores, which demonstrates the effectiveness of our method again. Moreover, we also note that both SUPERT and Ours(F β ) are good at giving bad summaries lower scores while having difficulty in assigning good summaries higher scores. We leave solving this problem as another future work under the reference-free setting.

Related Work 7 Conclusion
In this paper, we propose a novel training-free and reference-free summarization evaluation metric consisting of a relevance score and a redundancy score. Experiments on multi-document and single-document summarization settings show the effectiveness of our methods. One promising future direction is to solve the performance dropping issue after applying our key features into Mover-Score and the other is to tackle the problem that current metrics struggle to assign higher scores for good summaries.