DocAsRef: An Empirical Study on Repurposing Reference-based Summary Quality Metrics as Reference-free Metrics

Automated summary quality assessment falls into two categories: reference-based and reference-free. Reference-based metrics, historically deemed more accurate due to the additional information provided by human-written references, are limited by their reliance on human input. In this paper, we hypothesize that the comparison methodologies used by some reference-based metrics to evaluate a system summary against its corresponding reference can be effectively adapted to assess it against its source document, thereby transforming these metrics into reference-free ones. Experimental results support this hypothesis. After this reference-free repurposing, the zero-shot BERTScore using the pretrained DeBERTa-large-MNLI model of <0.5B parameters consistently outperforms its original reference-based version across various aspects on the SummEval and Newsroom datasets. It also excels in comparison to most existing reference-free metrics and closely competes with zero-shot summary evaluators based on GPT-3.5.


Introduction
Summarization is an important natural language generation (NLG) task. A problem that goes hand in hand with it is summary evaluation, which quantifies the quality of a summarizer or a system summary it generates. The traditional approach to automated summary quality assessment is reference-based, exemplified by ROUGE (Lin, 2004), BERTScore (Zhang* et al., 2020), and MoverScore (Zhao et al., 2019), which assesses a system summary against one or more human-written reference summaries. (The ground truth remains human evaluation.)
Requiring highly educated human labor, reference summaries are very costly to obtain. Therefore, many reference-free metrics have emerged recently (Scialom et al., 2019; Vasilyev et al., 2020; Bao et al., 2022), which directly compute a score between a system summary and its source document. However, the performance of reference-free metrics has historically lagged behind that of reference-based metrics, because a human-written reference summary serves as a fluent and comprehensive representation of the key facts in the input document and thus gives reference-based metrics an advantage.
Recently, large language models (LLMs) have shown promise in building reference-free summary quality metrics. Metrics based on LLMs like GPT-3.5/4 (Liu et al., 2023; Wang et al., 2023; Gao et al., 2023) have outperformed both reference-free and reference-based baselines. However, LLMs are computationally expensive, and the closed nature of GPT-3+ restricts their usage with legal and reproducibility limitations. A more viable solution that uses much more cost-effective language models is highly desirable.
To build an accurate but efficient metric, we revisit the reference-based metrics and hypothesize that they can be repurposed into reference-free metrics by directly comparing a summary with its source document. After being repurposed, BERTScore outperforms not only its original reference-based version, but also most existing reference-free metrics across the SummEval, Newsroom, and TAC2010 datasets on both semantic and linguistic aspects. Notably, the repurposed BERTScore achieves superior or comparable performance to GPT-3.5-based summarization evaluators. It is worth noting that these results are achieved using foundation models with significantly fewer parameters (<0.5B) compared to GPT-3.5's 175 billion parameters.
We hope this paper can inspire more work on zero-shot summarization or NLG evaluation using cost-effective (e.g., <1B parameters) LMs. Our source code is at https://github.com/SigmaWe/DocAsRef. In summary, the key findings of this paper include:
1. The proposed reference-free repurposing does improve performance for Transformer-based metrics including BERTScore and BLEURT.
2. The repurposed BERTScore significantly outperforms all non-GPT-3.5 baselines that use underlying LMs of similar capacity.
3. With LMs hundreds of times smaller, the repurposed BERTScore further matches the performance of GPT-3.5-based metrics in most cases.

Background: Ref-based and ref-free summary evaluation metrics
A system summary is generated from a source document by a summarizer, which today is usually a neural network model. A corresponding reference is generated from the same document by a human. Metrics for summary evaluation fall into two categories: reference-based (short as ref-based) ones, which are functions comparing a candidate summary and a human-written reference summary, f(system summary, reference); and reference-free (short as ref-free) ones, which are functions that evaluate a candidate summary based solely on the input document, f(system summary, document). Ref-based metrics, such as ROUGE (Lin, 2004), BERTScore (Zhang* et al., 2020), BLEURT (Sellam et al., 2020), and MoverScore (Zhao et al., 2019), historically have an advantage over ref-free ones, such as BLANC (Vasilyev et al., 2020), SummaQA (Scialom et al., 2019), SDC* (Liu et al., 2022), and SueNes (Bao et al., 2022), because the human-written reference summary serves as a fluent and comprehensive representation of the key facts in the input document. Recent GPT-based summary metrics (Gao et al., 2023; Wang et al., 2023; Liu et al., 2023) are all ref-free in nature.

Repurposing ref-based to ref-free
The idea of repurposing ref-based metrics for ref-free evaluation involves leveraging the mechanism employed by these metrics to compare two texts. Although ref-based metrics were originally designed to compare a system summary against a reference summary, we hypothesize that they can still be effective in directly comparing the system summary with the document.
To repurpose a ref-based metric f into a ref-free one, we simply feed the document in lieu of the reference when using f. While the idea of using the document as the reference is not new, the specific approach proposed here, which is straightforward and direct, has not been previously explored. Embracing the principle that simplicity is beautiful in science, we decided to give it a try.
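In code, the repurposing amounts to swapping one argument. The following is a minimal sketch, assuming a generic metric interface f(candidates, references) that returns one score per pair; the names and the toy unigram-recall metric standing in for a real ref-based metric are illustrative, not from our repository.

```python
from typing import Callable, List

Metric = Callable[[List[str], List[str]], List[float]]

def repurpose(ref_based: Metric) -> Metric:
    """Turn a ref-based metric f(summaries, references) into a ref-free
    one by feeding source documents where references would go."""
    def ref_free(summaries: List[str], documents: List[str]) -> List[float]:
        return ref_based(summaries, documents)  # document as reference
    return ref_free

# Toy stand-in metric: fraction of "reference" unigrams found in the candidate.
def toy_recall(cands: List[str], refs: List[str]) -> List[float]:
    out = []
    for c, r in zip(cands, refs):
        c_set, r_set = set(c.lower().split()), set(r.lower().split())
        out.append(len(c_set & r_set) / len(r_set) if r_set else 0.0)
    return out

doc_as_ref_metric = repurpose(toy_recall)
scores = doc_as_ref_metric(["the cat sat"], ["the cat sat on the mat"])
```

The wrapper itself is metric-agnostic, which is why the same repurposing applies uniformly to ROUGE, BERTScore, BLEURT, and MoverScore.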
Remarkably, our simple strategy has yielded good results. Three representative ref-based metrics improve their performance after being repurposed (Table 1). One of them, BERTScore employing generically trained LMs such as RoBERTa-large, performs very close to metrics based on GPT-3.5, which uses hundreds of times more parameters (Tables 2 & 3). This outcome highlights the effectiveness of repurposing ref-based metrics for ref-free evaluation.

Variants of BERTScore
The promising initial results encouraged us to explore modifications to the ref-based metrics for enhanced performance. ROUGE and BLEURT have limited room for tweaking, because ROUGE-1 and ROUGE-2 have been the best among ROUGE's variants over the past two decades, and BLEURT is already finetuned explicitly for summary evaluation. Hence, we focus on refining BERTScore.
The first tweak we applied to BERTScore is trying different small-scale, pretrained language models (LMs). We conducted experiments with three LMs, RoBERTa, DeBERTa, and BART, in both their base versions (around 110M parameters) and large versions (around 400M parameters). Additionally, we explored the variants of these LMs that have been officially fine-tuned on the MNLI dataset. Our hypothesis is that an LM fine-tuned for the MNLI task may be better suited for computing text similarity than generic LMs.
The second tweak we explored is expanding BERTScore to the sentence level by calculating the similarity between sentences instead of tokens. Various similarity measures and sentence weighting schemes were proposed (Appendix B). Unfortunately, they rarely perform better than the original token-level BERTScore.
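To make the sentence-level extension concrete, here is one way such a variant could be computed: cosine similarity between sentence embeddings with uniform sentence weights. The hashed bag-of-words embedder below is only a self-contained stand-in; our actual variants use an LM's sentence embeddings or MNLI entailment confidence (Appendix B).

```python
import math
from typing import List

def embed_sentence(sent: str, dim: int = 64) -> List[float]:
    # Stand-in embedder: hashed bag-of-words. Real variants would use
    # contextual sentence embeddings from an LM.
    v = [0.0] * dim
    for tok in sent.lower().split():
        v[hash(tok) % dim] += 1.0
    return v

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def sentence_bertscore_recall(summary_sents: List[str], doc_sents: List[str]) -> float:
    """Sentence-level analogue of BERTScore-R: each document sentence is
    matched to its most similar summary sentence, then averaged."""
    doc_vecs = [embed_sentence(s) for s in doc_sents]
    sum_vecs = [embed_sentence(s) for s in summary_sents]
    return sum(max(cosine(d, s) for s in sum_vecs) for d in doc_vecs) / len(doc_vecs)
```

Precision is obtained symmetrically by pooling over document sentences for each summary sentence.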

Settings
Because of their exceptional performance and impact, four ref-based metrics are picked as candidates to be repurposed: ROUGE (Lin, 2004), BERTScore (Zhang* et al., 2020), BLEURT (Sellam et al., 2020), and MoverScore (Zhao et al., 2019). ROUGE is the classic metric used in summarization. The other three are widely used as baselines in the field in recent years.
Three multi-facet summarization evaluation datasets with human ratings are used as the test datasets: SummEval (Fabbri et al., 2021), Newsroom (Grusky et al., 2018), and TAC2010 (NIST, 2010). SummEval and Newsroom are for single-document summarization, while TAC2010 is for multi-document summarization. SummEval covers four aspects: CONsistency, RELevance, COHerence, and FLUency. Newsroom covers four aspects: INFormativeness, RELevance, COHerence, and FLUency. TAC2010 reports three scores: Pyramid (Nenkova et al., 2007), linguistic, and overall. Only Set A of TAC2010 is used in this paper, because Set B, "update summarization", does not fit the problem formulation in § 2.1. Measuring how well a summary covers key pieces of information in the source document, the RELevance or Pyramid score is generally considered the most important aspect of a summary. CONsistency is a rising concern recently due to the hallucination issue. Details on the datasets and their aspects can be found in their respective papers.
Underlying language models (LMs). The LMs used in repurposed BERTScore variants are discussed in § 2.3. The default LM is RoBERTa-large. All ref-free baselines involving finetuning (BLANC, SummaQA, and SueNes) share the common initial checkpoint, BERT-base. MoverScore and BLEURT use RoBERTa-large and BLEURT-20 as their LMs, respectively. We did not run the experiments on the baselines but simply copied the numbers from their original papers. For the three GPT-3.5-based baselines, we pick the best results from their papers.
BERTScore is a pairwise comparison metric. Depending on the axis along which max pooling is done, each BERTScore variant yields three scores: P (precision), R (recall), and F (F1). The experiments are carried out on individual RTX 3090 24GB GPUs. For more details, see Appendix A.
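Given a token-level similarity matrix, the three scores differ only in the pooling axis. The sketch below starts from a precomputed matrix with toy numbers; real BERTScore fills it with cosine similarities of contextual token embeddings.

```python
import numpy as np

def bertscore_prf(sim: np.ndarray):
    """sim[i, j]: similarity between summary token i and document (or
    reference) token j. P max-pools over document tokens for each summary
    token, R max-pools over summary tokens for each document token, and
    F is their harmonic mean."""
    precision = sim.max(axis=1).mean()  # best match for each summary token
    recall = sim.max(axis=0).mean()     # best match for each document token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

sim = np.array([[1.0, 0.2, 0.1],
                [0.3, 0.9, 0.2]])  # 2 summary tokens x 3 document tokens
p, r, f = bertscore_prf(sim)
```

Under the repurposing, recall rewards a summary whose tokens cover the document well, which is why we report all three scores.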

Results
Following the trend in recent summary evaluation studies (Peyrard et al., 2017), we report results at the summary level. Spearman's correlation coefficients between metrics' predictions and human-rated ground truth are the performance measure. For the sake of space, we present selected results here, with extended results available in the appendices.
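Summary-level correlation is commonly computed by correlating, per source document, the metric's scores for that document's system summaries against the human ratings, and then averaging the coefficients across documents. A sketch with hypothetical toy numbers:

```python
from scipy.stats import spearmanr

# Hypothetical toy numbers: for each source document, the metric's scores
# and the human ratings of that document's system summaries.
metric_scores = [[0.7, 0.5, 0.9], [0.2, 0.8, 0.6]]
human_ratings = [[3.0, 2.0, 5.0], [1.0, 4.0, 2.0]]

per_doc = []
for m, h in zip(metric_scores, human_ratings):
    rho, _ = spearmanr(m, h)  # Spearman's rho for one document
    per_doc.append(rho)
summary_level = sum(per_doc) / len(per_doc)  # average over documents
```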

Is repurposing useful? Before vs. after
The answer is yes! Although ref-based metrics historically perform better than ref-free metrics, Table 1 shows that the three modern metrics, MoverScore, BERTScore, and BLEURT, improve their performance after being repurposed, on nearly all aspects of all datasets. The lexicon-based ROUGE-1/2/L also improves on some aspects or datasets after being repurposed.
After being repurposed (top half of Table 1), BERTScore outperforms all other metrics across datasets, with only a couple of exceptions. It outperforms MoverScore and BLEURT significantly. While BERTScore underperforms ROUGE on SummEval before repurposing, it turns the tide after.
The ref-based metrics used in their originally designated way perform extremely badly on the Newsroom dataset (bottom half of Table 1 and additional evidence in Appendix D). This is because, in Newsroom, a reference summary can be as short as one sentence. Here, the reliance on reference summaries becomes a weakness of ref-based summary quality metrics. In this case, the original document may be a better comparison target than the reference summary for judging summary quality.

Repurposed BERTScore vs. baselines
Different underlying LMs are used with BERTScore. Due to space limits, here we only report the results using RoBERTa-large and DeBERTa-large, which give the best performance. The results on SummEval are given in Table 2. Repurposed BERTScore outperforms all non-GPT baselines by a significant margin. Additionally, it performs comparably to GPT-3.5-based baselines on the RELevance and COHerence aspects, and it is superior to one of the two GPT-3.5-based approaches on the CONsistency aspect. It should be noted that SummEval is challenging due to its coverage of 23 modern summarizers, many of which exhibit highly similar behavior.

Table 3 reports the results on the Newsroom dataset. Newsroom poses a significant challenge for new metrics since the baselines already perform very well on this dataset, likely because it evaluates only seven systems with distinct performances. Despite the challenges, repurposed BERTScore outperforms, on all aspects, all baselines except SueNes, which is finetuned on data explicitly augmented for the summary evaluation task.
Because the non-GPT baselines (BLANC, SummaQA, and SueNes) use BERT-base as the underlying LM, for a fair comparison, we include BERTScore's results using RoBERTa/DeBERTa/BART-base in Appendix D. Even when using LMs of the same size, BERTScore still outperforms them.
Table 4 shows the results on the TAC2010 dataset, where BERTScore outperforms the baselines on all aspects except linguistic quality. As a multi-document summarization dataset, TAC2010 provides 10 source documents d_1, ..., d_10 for generating a system summary s. We use the average (1/10) Σ_{i∈[1..10]} f(d_i, s) to approximate the score of a summary s given a single-document summarization metric f.
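A sketch of this multi-document adaptation, assuming the aggregation is the average of f(d_i, s) over the source documents; the toy coverage metric here merely stands in for a repurposed single-document metric:

```python
from typing import Callable, List

def multi_doc_score(f: Callable[[str, str], float],
                    documents: List[str], summary: str) -> float:
    """Average a single-document metric f(document, summary) over all
    source documents of a multi-document cluster."""
    return sum(f(d, summary) for d in documents) / len(documents)

# Toy stand-in for f: fraction of document unigrams covered by the summary.
def toy_f(document: str, summary: str) -> float:
    d, s = set(document.lower().split()), set(summary.lower().split())
    return len(d & s) / len(d) if d else 0.0

score = multi_doc_score(toy_f, ["a b c d", "a b"], "a b")
```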

What makes BERTScore powerful
While the results of this paper may sound surprising because the method is very simple, they are entirely explainable. Comparing a summary with a document is theoretically more challenging than comparing it with a reference, because information is sparser in a document than in a reference. This might be why strong NLG evaluation metrics are historically reference-based. However, BERTScore exhibits exceptional performance after being repurposed from ref-based to ref-free. We attribute this to both the contextual embeddings of the underlying LMs and the max-pooling step of BERTScore.
Transformers have the ability to identify important information in a context by attending strongly to important tokens, as learned in pretraining. In other words, the encoder-only Transformers used in BERTScore can identify important tokens and function as implicit summarizers. Extraneous information in a summary causes the summary's context to diverge from that of the original document, resulting in a reduction of semantic similarity, even when comparing the same token in the summary to its counterpart in the document. The max-pooling step of BERTScore further focuses on the alignment of the most semantically proximate token pairs between the document and the summary. Because the document and the summary are independently embedded in BERTScore, the BERTScore can be high only when important information in the document and the summary aligns. On a related note, BERTScore alone has been found very effective in measuring factual inconsistency in summaries (Laban et al., 2022). The IDF part of BERTScore may not play an important role, because the attention mechanism already factors in what IDF does: a stopword or a boilerplate word attends weakly to other tokens. In BERTScore's original paper (Zhang* et al., 2020), IDF makes a very marginal impact on all but one of the datasets/tasks. Table 5 shows our ablation study on the impact of IDF. IDF makes a very small impact, and in many cases it even decreases the performance.
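For reference, the IDF option reweights the per-token max-similarities rather than changing the matching itself. A sketch of IDF-weighted recall over a similarity matrix, with toy numbers and a smoothed IDF in the spirit of BERTScore's (the exact smoothing in the bert-score package may differ):

```python
import math
import numpy as np

def idf_weights(reference_tokens, corpus):
    """Smoothed IDF over a corpus of reference texts (an assumption in
    the spirit of BERTScore's IDF, not its exact formula)."""
    n = len(corpus)
    return np.array([
        math.log((n + 1) / (1 + sum(tok in doc.split() for doc in corpus)))
        for tok in reference_tokens
    ])

def weighted_recall(sim: np.ndarray, w: np.ndarray) -> float:
    """Recall: best match over summary tokens (axis 0) for each document
    token, averaged with IDF weights instead of uniformly."""
    best = sim.max(axis=0)
    return float((w * best).sum() / w.sum())

w = idf_weights(["the", "cat"], corpus=["the cat", "the dog"])
sim = np.array([[0.5, 0.9]])     # 1 summary token x 2 document tokens
score = weighted_recall(sim, w)  # "the" occurs everywhere, so it gets zero weight
```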
The repurposed BERTScore shows relatively robust performance with respect to the choice of the underlying LM. For example, on Newsroom, BERTScore's worst-performing variant on every aspect still outperforms the ChatGPT-based metric. The only aspect on which BERTScore is not stable is the COHerence aspect of SummEval.

Conclusion
In this paper, we explore repurposing summary evaluation metrics that were originally designed or trained for reference-based use as reference-free metrics. The motivation was to reuse their power in comparing texts. Comprehensive experiments on multiple datasets show that four representative metrics generally perform better after the repurposing. The best among them, BERTScore, is further studied with different configurations. The repurposed BERTScore using <0.5B-parameter LMs can outperform all non-GPT baselines significantly and, most of the time, even those based on GPT-3.5.

Forrest Bao also wants to dedicate this paper to the people of Ukraine, who have been courageously fighting for freedom since February 24, 2022.

Limitations
The test sets are all from the news domain, which is the only domain in which human evaluation of system summaries has been done. This limitation is beyond our control.
Unfortunately, our attempt (Appendix B) to expand BERTScore from the token level to the sentence level fails.
Moreover, unlike token-level BERTScore, which remains stable across different LM choices, sentence-level BERTScore is highly sensitive to the selection of LMs. Extended results can be found in the appendices.
BERTScore can have a chunk-level variant. This idea was proposed in REUSE for machine translation (Mukherjee and Shrivastava, 2022). Since we have tried token-level and sentence-level BERTScore, trying chunk-level BERTScore in summarization can be part of the future work.

Since there are many sentence-level BERTScore variants, they are referred to in an A-B-C nomenclature, where A is the similarity measure (Cosine if cosine similarity, MNLI if using entailment confidence from an MNLI-finetuned model), B is the underlying LM, and C, optional, is the sentence weighting method g.

C Our idea in code
We hope this code can help explain what we mean by "repurposing" and also how to directly use the conclusion of this paper. Where conventional approaches plug in human-written references, our proposed idea simply plugs in the source documents instead, yielding the best reference-free summary quality assessor.
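The listing below is a self-contained sketch reconstructed from the surviving fragment of the original listing; the toy `score` function stands in for the real BERTScore call (e.g., the bert-score package) so the example runs anywhere, and only the swap on the `references` argument is the point.

```python
# Toy stand-in for a ref-based scorer such as BERTScore-R, returning one
# dict per (prediction, reference) pair; in practice, call the real
# bert-score package here instead.
def score(predictions, references):
    results = []
    for pred, ref in zip(predictions, references):
        p, r = set(pred.lower().split()), set(ref.lower().split())
        results.append({"recall": len(p & r) / len(r) if r else 0.0})
    return results

predictions = ["this is a summary"]
# references = ["this is a reference"]  # old, ref-based way
references = ["this is the DOC"]        # DocAsRef: document as reference

recall = score(predictions, references)[0]["recall"]
```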

D More comprehensive results
Please refer to Table 6 and Table 7.

E Leadword heuristic
It is common that important information is clustered at the beginning of a document. Hence, Leadword is a simple but effective method to extract important information. SUPERT (Gao et al., 2020) builds pseudo-references by extracting salient sentences and found that Leadword is better than any other simple extractive approach. So we also experiment with limiting the BERTScore-style pairwise comparison to the top-k sentences of the input document. We use top-k slightly differently from its common use in text generation: here k means a ratio rather than an absolute number, because the length of the input document varies a lot.
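A sketch of the Leadword truncation as described above, with k as the fraction of leading sentences kept (the helper name is illustrative):

```python
from typing import List

def leadword(doc_sentences: List[str], k: float) -> List[str]:
    """Keep the leading k-fraction of a document's sentences (at least
    one), before running the BERTScore-style pairwise comparison."""
    n = max(1, round(len(doc_sentences) * k))
    return doc_sentences[:n]

lead = leadword(["s1", "s2", "s3", "s4"], 0.5)  # keeps the first 2 sentences
```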
Is the Leadword heuristic useful? In this study, no repurposed metrics benefit from the Leadword heuristic, unlike the result reported in SUPERT (Gao et al., 2020). Nearly every metric loses performance after using the Leadword heuristic. The shorter the lead is, the greater the performance drop. Investigating the reason is part of our future work.

Table 1 :
Performance before vs. after repurposing for four metrics. Summary-level Spearman's correlation coefficients on the SummEval and Newsroom datasets. Best in each column in bold; 2nd best underlined.

Table 3 :
Summary-level Spearman's correlation coefficients on the Newsroom dataset. Aspect names abbreviated.

Table 5 :
The performance of BERTScore-P with and without IDF: summary-level Spearman's correlation coefficients in comparison. Model size: base. Yellow cells indicate that using IDF is worse than without IDF; green cells indicate the opposite.