Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model

Despite tremendous improvements in natural language generation, summarization models still suffer from unfaithfulness. Previous work evaluates faithfulness either with models trained on other tasks or on in-domain synthetic data, or by prompting a large model such as ChatGPT. This paper proposes to do zero-shot faithfulness evaluation simply with a moderately-sized foundation language model. We introduce a new metric, FFLM, a combination of probability changes based on the intuition that prefixing a piece of text that is consistent with the output will increase the probability of predicting the output. Experiments show that FFLM performs competitively with or even outperforms ChatGPT on both inconsistency detection and faithfulness rating with 24x fewer parameters. FFLM also achieves improvements over other strong baselines.


Introduction
Faithfulness evaluation for text summarization aims at measuring whether the information in a summary is fully covered by and consistent with the source document. Automatic text summarization has achieved remarkable improvements with pre-trained language models (Zhang et al., 2020; Lewis et al., 2020; Liu et al., 2021, 2022a; Zhang et al., 2023) in recent years, especially in terms of fluency and informativeness. However, these neural models tend to generate unfaithful summaries. An effective faithfulness evaluation metric not only helps to deploy summarization systems in real applications but also plays a key role in developing more faithful summarization models, e.g., by filtering training data (Matsumaru et al., 2020) or doing post-hoc corrections (Chaudhury et al., 2022).
Most previous work for faithfulness evaluation either takes advantage of models trained on related tasks for zero-shot evaluation (Goodrich et al., 2019; Falke et al., 2019; Wang et al., 2020) or does weakly-supervised evaluation with synthetic in-domain data (Kryściński et al., 2020). The former requires transferring out-of-box models to the summarization domain (Mishra et al., 2021), which lacks guarantees on the models' performance and suffers from error propagation (Ji et al., 2023). The latter shows poor generalization ability (Laban et al., 2022), as the limited synthetic rules cannot cover the various kinds of hallucinations. Recently, as ChatGPT (OpenAI, 2022) has shown amazing generation abilities on various tasks, researchers have attempted human-like evaluation by designing prompts to query the model in a zero-shot manner (Luo et al., 2023). However, such strong language models are still sensitive to nuances, showing unstable performance with different wordings of prompts (Gao et al., 2023; Chen et al., 2023).
Considering the above weaknesses, we argue that an ideal faithfulness evaluation metric for summarization should be independent of other tasks and dataset-specific expertise, generalize among different benchmarks, and be robust for the same document-summary pair. Zhou et al. (2023) conclude that instruction tuning merely teaches a large language model to produce high-quality output, while almost all of its knowledge has been learned during pre-training. Based on their findings, we wonder: can we get rid of the popular prompting approaches and calculate the faithfulness score simply with a foundation language model, which meets the above expectations?
In this work, we propose a metric named FFLM for zero-shot faithfulness evaluation with a foundation language model. The intuition behind FFLM is that the generation probability of a piece of text increases when a consistent piece of text is prefixed. Following this intuition, we classify different kinds of probability changes into changes with prior probability and changes with conditional probability. The former contains a comparison between the vanilla sequence-to-sequence probability of the summary given the document and the unconditional probability of the summary, and a similar comparison with the positions of the document and the summary exchanged. The latter compares the vanilla sequence-to-sequence probability with another conditional probability obtained by adding a piece of prefix text. Similar intuitions have been considered in previous work (She et al., 2023; Son et al., 2022). The major differences are that their metrics were computed with models fine-tuned on summarization data and that they only consider a single kind of probability change. Our FFLM is based on a foundation language model, and we hypothesize that these different probability changes capture different hallucinations (see Sec. 4.4) and should be considered as a whole.
On top of these three components of probability changes, we introduce a feasible design of FFLM by re-weighting each token and each component to get the final faithfulness score. We did experiments in both the inconsistency detection setting and the faithfulness rating setting for summarization evaluation. The results show the favorable performance of our FFLM across different settings and datasets. Our contributions are as follows: • We propose to do zero-shot faithfulness evaluation based on a foundation language model (Sec. 4.6).
• We introduce a comprehensive evaluation metric, FFLM, by calculating the probability changes of the desired output in different ways (Sec. 2), and verify the rationality of our metric design (Sec. 4.3).
• Experiments on different evaluation settings show that FFLM based on LLaMa with only 7 billion parameters achieves competitive performance with, or even outperforms, ChatGPT on different datasets (Sec. 4.1 and 4.2).

Approach
Given a source document X = {x_1, ..., x_n} and the corresponding summary Y = {y_1, ..., y_m}, the goal of this work is to design a metric FFLM measuring the faithfulness of Y based on the foundation model LM(·). We adopt LM(·) under the teacher-forcing strategy, which provides a sequence of generation probabilities p for a given text, with or without other conditional inputs. We first introduce three probability changes for faithfulness measurement and then propose a feasible design of our comprehensive metric FFLM. The scores proposed by She et al. (2023) and Son et al. (2022) are in Appendix A.
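As a concrete illustration of the teacher-forcing setup, the sketch below extracts per-token log-probabilities from a matrix of next-token logits with numpy. It is a simplified stand-in for running LM(·) (e.g., LLaMa) over the concatenated input and slicing its output logits; the function name and array shapes are our own illustrative choices, not the authors' released code.

```python
import numpy as np

def token_log_probs(logits: np.ndarray, target_ids: np.ndarray) -> np.ndarray:
    """Per-token log-probabilities of target_ids under next-token logits.

    `logits` has shape (seq_len, vocab_size) and is assumed to be already
    shifted so that logits[i] predicts target_ids[i] (teacher forcing).
    """
    # numerically stable log-softmax over the vocabulary
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_norm = np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    log_probs = shifted - log_norm
    # pick the log-probability of each reference token
    return log_probs[np.arange(len(target_ids)), target_ids]
```

With a real model, the conditional variants (summary given document, document given summary, summary with a prefix) differ only in which tokens are fed as context before scoring.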

Faithfulness Measurements via Probability Changes
The intuition is that the generation probability of a piece of text increases when more related and consistent information is provided. On the contrary, the generation probability drops when conditioned on inconsistent information. Accordingly, we consider three different probability changes in two categories, as follows.
Changes with Prior Probability: The prior probability of Y can be estimated by the foundation model LM(·):

p^lm_{y_i} = LM(y_i | y_{<i}), i = 1, ..., m,   (1)

and the sequence-to-sequence probability of Y given X is:

p^s2s_{y_i} = LM(y_i | X, y_{<i}).   (2)

If Y is a faithful summary, the sequence-to-sequence probability p^s2s_Y should be larger than the prior probability p^lm_Y, as conditioning on X provides more information consistent with Y. Therefore, a faithfulness measurement can be defined as:

Δ^prior_Y = (1/m) Σ_{i=1}^{m} (p^s2s_{y_i} − p^lm_{y_i}).   (3)

From another point of view, we expect the generation of Y to rely heavily on X instead of on the parametric knowledge stored in LM(·), which is a main source of hallucinations (Ji et al., 2023).
Similarly, a faithful Y can support the contents in X. Thus, the difference between the sequence-to-sequence probability of X given Y and the prior probability of X is another reasonable measurement:

Δ^prior_X = (1/n) Σ_{j=1}^{n} (p^s2s_{x_j} − p^lm_{x_j}),   (4)

where p^s2s_{x_j} = LM(x_j | Y, x_{<j}) and p^lm_{x_j} = LM(x_j | x_{<j}).

Changes with Conditional Probability: Instead of comparing sequence-to-sequence generation probabilities with prior probabilities, another way is to add more information P besides the input document X, which influences the generation of Y. Following She et al. (2023), we simply set P = Y. In this way, if Y is inconsistent with X, prefixing P provides additional evidence for generating Y, leading to a larger p^pref_Y compared with p^s2s_Y, while a consistent summary gains little. Mathematically, with p^pref_{y_i} = LM(y_i | P, X, y_{<i}), the third measurement is:

Δ^cond_Y = (1/m) Σ_{i=1}^{m} (p^s2s_{y_i} − p^pref_{y_i}).   (5)

We did not consider X and Y reversely here. The main reason is that inputting the sequence [P = X, Y, X] to LM(·) is much more costly and may exceed the max sequence length of most models, since X is much longer than Y, i.e., n ≫ m.

A Feasible Design of FFLM

Goyal et al. (2022) found that high-loss tokens generally correspond to unfaithful contents when training a summarization model. Inspired by this finding and the success of the loss truncation training algorithm (Kang and Hashimoto, 2020), we think that more attention should be paid to such high-loss (or low-probability) tokens when calculating the faithfulness scores. So, instead of simply averaging the probability changes to get the final score for an (X, Y) pair, we adopt two operations. First, we take the logarithm of the probabilities before subtraction, which magnifies changes on low-probability tokens. Second, we re-weight each token based on p^s2s_Y and p^s2s_X correspondingly. For example, the re-weighted version of Δ^prior_Y becomes:

Δ^prior_Y = (1/m) Σ_{i=1}^{m} e^{p^s2s_{y_i}} (log p^s2s_{y_i} − log p^lm_{y_i}),   (6)

and Δ^prior_X and Δ^cond_Y are re-weighted analogously. Finally, FFLM is a combination of these measurements:

FFLM = α·Δ^prior_Y + β·Δ^prior_X + δ·Δ^cond_Y,   (7)

where α, β, and δ are weighting parameters in the range of 0 to 1 with α + β + δ = 1. These three weights can be tuned on a validation set, or set manually as hyper-parameters.
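The score computation above can be sketched in a few lines, assuming the per-token probabilities p^s2s, p^lm, and p^pref have already been extracted from the foundation model. The helper names and array-based interface are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def weighted_delta(p_cond: np.ndarray, p_base: np.ndarray) -> float:
    """Token-weighted log-probability change in the style of Eq. 6.

    p_cond / p_base: per-token probabilities of the same tokens under the
    two conditions being compared (e.g. p^s2s vs. p^lm). Tokens are
    weighted by e^{p_cond}; the log magnifies changes on low-probability
    tokens.
    """
    w = np.exp(p_cond)
    return float(np.mean(w * (np.log(p_cond) - np.log(p_base))))

def fflm_score(p_s2s_y, p_lm_y, p_s2s_x, p_lm_x, p_pref_y,
               alpha=0.25, beta=0.25, delta=0.5) -> float:
    """FFLM as the weighted combination of the three probability changes."""
    d_prior_y = weighted_delta(p_s2s_y, p_lm_y)    # summary vs. its prior
    d_prior_x = weighted_delta(p_s2s_x, p_lm_x)    # document vs. its prior
    d_cond_y = weighted_delta(p_s2s_y, p_pref_y)   # s2s vs. summary-prefixed
    return alpha * d_prior_y + beta * d_prior_x + delta * d_cond_y
```

A faithful summary should yield positive changes (conditioning helps), while an unfaithful one pushes the score negative.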

Experiment Setup
We first present the two evaluation settings considered by previous work for faithfulness evaluation, and then the implementation details of FFLM for each of them.

Inconsistency Detection
Inconsistency detection regards faithfulness evaluation as a binary classification problem. In other words, human annotators or automatic metrics only need to recognize whether the summary is faithful to the document or not.
Datasets: The SUMMAC Benchmark (Laban et al., 2022) consists of six summarization evaluation datasets: CoGenSumm (Falke et al., 2019), SummEval (Fabbri et al., 2021), FRANK (Pagnoni et al., 2021), Polytope (Huang et al., 2020), FactCC (Kryściński et al., 2020), and XSumFaith (Maynez et al., 2020). It standardizes these datasets by mapping their original labels to a binary label and splits each dataset into a validation set and a test set. Most of the original datasets are labeled by three or more annotators, except Polytope and FactCC.
Evaluation Metric: Balanced accuracy (Brodersen et al., 2010) is adopted as the primary evaluation metric, which requires binary labels for computation. For approaches producing continuous scores, a threshold can be selected via the validation set.
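The threshold selection just described can be sketched as follows; this is an illustrative implementation, not the benchmark's official code.

```python
import numpy as np

def balanced_accuracy(labels, preds):
    """Mean of sensitivity (faithful class) and specificity (unfaithful)."""
    labels, preds = np.asarray(labels), np.asarray(preds)
    tpr = (preds[labels == 1] == 1).mean()
    tnr = (preds[labels == 0] == 0).mean()
    return (tpr + tnr) / 2

def select_threshold(scores, labels):
    """Pick the score threshold maximizing balanced accuracy on validation."""
    scores = np.asarray(scores)
    best_t, best_acc = None, -1.0
    for t in np.unique(scores):
        acc = balanced_accuracy(labels, (scores >= t).astype(int))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```

Balanced accuracy averages the per-class recalls, so it is robust to the class imbalance that varies across the six SUMMAC datasets.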
Baselines: We compare with the strong baselines reported on the SUMMAC benchmark, including the QA-based QuestEval, the weakly-supervised classifier FactCC-CLS, and the NLI-based SummaC-ZS and SummaC-Conv (Laban et al., 2022). Besides, we implemented the language modeling-based metric BARTScore (Yuan et al., 2021) and metrics based on probability changes, including CoP (She et al., 2023) and HaRiM (Son et al., 2022). These three metrics were suggested to use a BART model fine-tuned on CNN/DM (Nallapati et al., 2016) for calculation. We also re-implemented the latter two metrics with a foundation language model, LLaMa, following our proposal, for comparison.

Faithfulness Rating
Faithfulness rating defines the evaluation as a Likert-scale scoring problem. Annotators or metrics score each summary according to its faithfulness; generally, the higher, the more faithful. Datasets: Following Son et al. (2022), we experimented on five different datasets: FRANKCNN and FRANKXSUM from Pagnoni et al. (2021), QAGSCNN and QAGSXSUM from Wang et al. (2020), and SummEval (Fabbri et al., 2021). For the first four datasets, human judgments were originally done at the sentence level. The faithfulness rating of the whole summary is collected by majority voting on each summary sentence among annotators and averaging over sentences. SummEval contains human scores in the range of 1 to 5 for the consistency aspect. More details are in Table 1.
Evaluation Metrics: Pearson (γ), Spearman (ρ), and Kendall (τ) correlation coefficients are used to measure the alignment between the faithfulness ratings of annotators and automatic metrics; higher correlations are better. We consider summary-level correlations for all datasets.
Besides, system-level correlations are calculated on SummEval, which contains annotations for 16 extractive or abstractive summarization models.
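For reference, the three coefficients can be sketched in plain numpy as below. In practice, scipy.stats.pearsonr, spearmanr, and kendalltau handle ties and significance testing; this sketch omits tie corrections.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two equal-length score lists."""
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def spearman(a, b):
    # Spearman = Pearson on rank-transformed data (ties ignored here)
    rank = lambda x: np.argsort(np.argsort(x)).astype(float)
    return pearson(rank(a), rank(b))

def kendall(a, b):
    # tau-a: (concordant - discordant) pairs over all pairs
    n, c = len(a), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            c += np.sign((a[i] - a[j]) * (b[i] - b[j]))
    return float(2 * c / (n * (n - 1)))
```

Summary-level correlation treats each (human rating, metric score) pair as one point; system-level correlation first averages scores per summarization model, leaving one point per system.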
Baselines: Rouge-2 F1 (Lin, 2004), Meteor (Banerjee and Lavie, 2005), BLEU (Papineni et al., 2002), and BERTScore F1 (Zhang et al., 2019a) are widely accepted summarization evaluation metrics. We report their best results from Son et al. (2022), calculated between the summary and the source document. QAGS (Wang et al., 2020) is another QA-based metric. The other baselines are the same as those for inconsistency detection.

Implementation Details
We implemented FFLM with the foundation language model LLaMa (Touvron et al., 2023), which comes in different sizes; LLaMa-7B is selected for our main experiments. We add "TL;DR" between the conditional sequence and the target sequence. For inconsistency detection, the weights in Eq. 7 are determined in {0.0, 0.1, ..., 1.0} according to the performance on the corresponding validation set. For faithfulness rating, we set α, β, and δ to 0.25, 0.25, and 0.5 respectively, with the intuition that the former two are from the same category, as introduced in Sec. 2.1. Our experiments are done on a single RTX 3090.
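The validation-set weight search can be sketched as a simple grid enumeration over triples summing to 1. The `metric` callable (e.g., balanced accuracy for inconsistency detection) and the function interface are illustrative assumptions.

```python
import numpy as np
from itertools import product

def tune_weights(deltas, labels, metric):
    """Grid-search alpha, beta, delta over {0.0, 0.1, ..., 1.0}, sum = 1.

    deltas: (n_examples, 3) array of the three probability-change scores
    on a validation set; metric(scores, labels) returns the quantity to
    maximize (e.g. balanced accuracy or a correlation coefficient).
    """
    deltas = np.asarray(deltas, float)
    grid = [round(0.1 * k, 1) for k in range(11)]
    best = (float("-inf"), None)
    for a, b in product(grid, grid):
        d = round(1.0 - a - b, 1)
        if d < 0:
            continue  # triple must sum to 1 with non-negative weights
        scores = deltas @ np.array([a, b, d])
        best = max(best, (metric(scores, labels), (a, b, d)))
    return best  # (best metric value, (alpha, beta, delta))
```

Only 66 valid triples exist at step 0.1, so the search is cheap and each candidate reuses the three cached per-example delta scores.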

Results and Analysis
This section includes the main results for inconsistency detection and faithfulness rating, together with an ablation study, an analysis of error types, and comparisons of different model sizes for our FFLM. We also compare our metric with the prompting approach, with or without instruction tuning, under the same model size.

Performance on Inconsistency Detection
The results on inconsistency detection are in Table 2. Our proposed metric FFLM achieves state-of-the-art performance on 3 datasets, including CoGenSum, SummEval, and FRANK, and outperforms ChatGPT on 5 out of 6 datasets from the SUMMAC benchmark, the exception being XSumFaith.
Both Polytope and FactCC are labeled by a single annotator only. As a result, their labels may not be as convincing as those of the other datasets. Although QuestEval, the best QA-based metric, achieves the top-1 accuracy on Polytope, it performs mediocrely on the rest. The weakly-supervised baseline FactCC-CLS is trained on synthetic data that may have certain similarities with the FactCC dataset. Therefore, FactCC-CLS shows strong performance on the FactCC dataset while being relatively weak on the others, including the datasets in Table 3, in line with the findings of Laban et al. (2022). This also explains why SummaC-Conv shows significant improvements of around 12% over our FFLM.
Concentrating on the metrics based on probability changes, the zero-shot metrics CoP-BART and HaRiM-BART perform reasonably well compared with the previous SOTA SummaC-ZS, showing the potential of using probability changes for faithfulness evaluation. After introducing the foundation language model, their performance does not drop in most cases, indicating that fine-tuning with in-domain data is not necessary. However, which of the two metrics leads is unstable among datasets: HaRiM-LLaMa outperforms CoP-LLaMa on FRANK and Polytope, while on the remaining datasets the opposite is true. FFLM, as a comprehensive metric, successfully achieves improvements over both of them on 5 out of 6 datasets.

Performance on Faithfulness Rating
Summary-level results are in Table 3. The results of ChatGPT, borrowed from Luo et al. (2023), show its inconsistent performance among datasets: it does not exceed previous baselines on FRANKCNN, performs similarly on SummEval, and achieves conspicuous gains on FRANKXSUM. Besides, similar to the above comparison among probability change-based metrics, our FFLM induces performance gains on 4 out of 5 datasets over CoP-LLaMa and HaRiM-LLaMa, especially on datasets sourced from XSum. Unfortunately, FFLM still lags behind ChatGPT, with its 175 billion parameters, on FRANKXSUM, showing ChatGPT's strong ability in dealing with highly abstractive summaries. This is also in line with ChatGPT's favorable performance on XSumFaith in Table 2. Overall, FFLM achieves the best scores on FRANKCNN, SummEval, and QAGSXSUM, and performs competitively on the other datasets.
We also report the system-level results on SummEval in Table 5. FFLM performs similarly to ChatGPT according to the Spearman correlation. To recap, our FFLM generalizes well among different task settings and datasets, showing favorable performance over the baselines. It is built on LLaMa with only 7 billion parameters and performs competitively with, or even outperforms, ChatGPT with 175 billion parameters, making it much more efficient for faithfulness evaluation.

Ablation Study on Metric Designs
We carried out ablation studies of FFLM on faithfulness rating in Table 4. The ablation results on inconsistency detection are in Appendix B.
Ablations on the metric components: We test different combinations of the three probability changes. The results show that Δ^cond_Y is the most powerful component of FFLM. Its combination with Δ^prior_Y ranks first among ablations on both FRANKXSUM and QAGSXSUM. Together with Δ^prior_X, our metric FFLM shows an over 5% increase in Spearman correlation on QAGSCNN and 1.8% on FRANKCNN and SummEval, without much loss on the other two datasets, recording more robust results overall. Moreover, combining different probability changes induces performance gains in most cases, reflecting the necessity of designing a comprehensive metric (more in Sec. 4.4).
Ablations on the metric designs: We use w and log to denote the token-level weights and the logarithm operation introduced in Sec. 2.2. Both operations contribute to the final FFLM, where log is more effective for datasets sourced from XSum, and w for the others.
Ablations on the combination weights: For the faithfulness rating task, where we empirically set the weights α, β, and δ to 0.25, 0.25, and 0.5, we compared this setting with equal weights, i.e., α = β = δ = 1/3. FFLM performs relatively better.

Analysis on Error Types
By taking a look at the correlations between pairs of the metric components in Table 6, we can see that the correlations vary among different datasets. None of the pairs shows a high degree of correlation, indicating that these components may capture unfaithfulness from different aspects.

Table 6: Correlations between pairs of the metric components on faithfulness rating.
To figure out whether different probability changes correlate well with different error types in the generated summaries, we take advantage of the labels in the FRANKCNN and FRANKXSUM datasets. Pagnoni et al. (2021) divide the factual errors in generated summaries into three groups. Semantic frame errors (Sem) include errors on the predicate, entities, and additional information about the circumstance. Discourse errors (Disc) consist of coreference errors and discourse link errors. Content verifiability errors (CVer) are closely related to extrinsic hallucinations (Ji et al., 2023), containing out-of-article errors and grammatical errors. We randomly picked 50 error cases from FRANKCNN and 10 from FRANKXSUM for each error type, and mixed them with the remaining faithful summaries. Spearman correlations averaged over 10 runs are in Fig. 1.
We observe that Δ^cond_Y captures the different errors best, which accords with the ablation results in Table 4. Comparing the scores for each Δ horizontally, we can see that the probability changes with prior probability are good at CVer errors on both datasets, and Δ^cond_Y at Sem and Disc errors. The differences among datasets reflect their different characteristics (Pagnoni et al., 2021). Summaries in FRANKCNN are made up of multiple sentences, resulting in more diverse and challenging situations for Disc errors than in FRANKXSUM with its single-sentence summaries. Thus, Δ^cond_Y increases dramatically from 14.2% on FRANKCNN to 41.7% on FRANKXSUM for Disc errors. FFLM makes further improvements over Δ^cond_Y on both Sem and CVer, showing that combining different probability changes is reasonable and effective in most cases, except for Disc errors.

Performance on Different Model Sizes
To test FFLM's performance with different model sizes, we select LLaMa with 3 billion (3B), 7 billion (7B), and 13 billion (13B) parameters, all trained on the same data volume of 1 trillion tokens, and draw the diagram in Fig. 2 for the faithfulness rating datasets. The scores consistently increase from LLaMa-3B to LLaMa-7B across the five datasets, while the improvements are not consistent for LLaMa-13B. Given a certain amount of data, increasing the number of parameters can enhance the model's language modeling ability and help faithfulness evaluation. On the other hand, when the model size keeps scaling up, more unexpected biases in the pre-training corpus may be memorized and hurt the performance. This has also been pointed out by Ranaldi et al. (2023) and Nadeem et al. (2021).
Accordingly, we think that using larger foundation models may not be the best choice for faithfulness evaluation on summarization, which is also closely related to research investigating the optimal model size and dataset size for training foundation language models (Hoffmann et al., 2022).

Comparisons with Prompting and Instruction-tuning
We compare our metric with prompting and instruction-tuning techniques under the same model size in Table 7 for faithfulness rating. Here, LLaMa-7B is the vanilla foundation language model. Vicuna-7B (Chiang et al., 2023) and Alpaca-7B (Taori et al., 2023) are initialized from LLaMa-7B and instruction-tuned with data collected in different ways. We present the maximum scores for each dataset among different prompts designed by previous works (Chen et al., 2023; Gao et al., 2023; Luo et al., 2023). The detailed prompts for each evaluation task are listed in Appendix C. First, we observe that with models containing 7 billion parameters, FFLM outperforms the prompting approach across different models and datasets. The prompting results here lag dramatically behind the performance of ChatGPT. This leads to the conclusion that the effectiveness of prompting approaches relies heavily on much larger models, while our metric FFLM can be a cheaper alternative with smaller models. Second, instruction tuning is important for improving the prompting approach, but it is not necessary for our FFLM. It enhances the models' ability to understand the instruction templates in the prompts by further tuning on relatively small datasets. However, such manually collected datasets may contain unconscious biases and hurt FFLM's performance.

Faithfulness Evaluation for Summarization
Faithfulness evaluation metrics can be classified into zero-shot ones and weakly-supervised ones. Zero-shot evaluation metrics mainly take advantage of models trained on related natural language tasks. Goodrich et al. (2019) adopted information extraction tools to extract fact tuples from both the source document and the summary; tuple mismatches reflect hallucinations. The intuition behind question-answering-based metrics (Wang et al., 2020; Durmus et al., 2020; Scialom et al., 2021) is that identical answers should be generated when asking the same question of a summary and the corresponding document. Natural language inference also shares commonalities with faithfulness evaluation in that the information in a consistent summary should be entirely entailed by the source document (Falke et al., 2019; Mishra et al., 2021; Laban et al., 2022). However, all of these metrics rely heavily on the domain-transfer ability of out-of-box models and suffer from error propagation.
Instead, weakly-supervised approaches choose to train classifiers by constructing synthetic in-domain data with heuristics designed by experts. Different kinds of inconsistency errors are simulated by perturbing the reference document-summary pairs (Kryściński et al., 2020; Utama et al., 2022; Yin et al., 2021). The limited heuristics make it hard to cover all kinds of errors, and such classifiers show poor generalization ability among datasets (Laban et al., 2022).
As language modeling-based metrics (Egan et al., 2022; Liu et al., 2022b) receive more attention, another small group of work for faithfulness evaluation computes probability changes with models fine-tuned on summarization datasets (She et al., 2023; Son et al., 2022; Xie et al., 2021), which shows a biased preference for abstractive summaries. Building on this line of work, we propose FFLM based on a foundation language model. Our zero-shot metric does not require further training with in-domain or synthetic data and shows strong generalization ability.

Evaluation with Large Language Models
With orders of magnitude more parameters and extensive training on large-scale data, large language models (LLMs) (Brown et al., 2020;Touvron et al., 2023) have exhibited surprising abilities that may not be observed in previous small language models.The strong capability in language comprehension naturally spurs research in exploring LLMs as better automatic evaluators for various text generation systems (Wang et al., 2023).
There have also been attempts at faithfulness evaluation by prompting large models (Luo et al., 2023) with different templates and strategies, such as adding detailed definitions (Gao et al., 2023) and chain-of-thought prompting (Chen et al., 2023). None of these strategies achieves consistent improvements over the original prompt. Besides, neural models are sensitive to the choice of words (Chen et al., 2023), resulting in unstable performance (see Appendix D).
Our FFLM takes advantage of the strong capability of LLMs for faithfulness evaluation in a different way and shows competitive performance while requiring far fewer parameters than the well-known ChatGPT (OpenAI, 2022).

Conclusion
This paper focuses on zero-shot faithfulness evaluation for summarization and introduces a novel evaluation metric, FFLM, which is simply based on a foundation language model. Experiments on both the inconsistency detection benchmark and faithfulness rating datasets show the strong generalization ability of FFLM across task settings and datasets. It also shows favorable performance over strong baselines, including ChatGPT. Using our proposed metric for more fine-grained inconsistency detection and designing more faithful summarization systems are future directions.

Limitations
The main idea of this work is to do faithfulness evaluation based on a foundation language model via a combination of different probability changes. FFLM is a feasible but not perfect metric design. Although it improves over each Δ on almost all of the datasets in Table 4, it fails on discourse-related errors on the FRANKCNN and FRANKXSUM datasets according to Fig. 1. Designing better aggregation metrics based on specific analyses of different error types will be considered in the future.
Besides, in this work, our FFLM only calculates a single score for the whole summary without pinpointing the exact erroneous words or the specific error type. Considering the success of CoP (She et al., 2023) on token-level inconsistency detection and detailed inconsistency category evaluation, we hypothesize that our metric FFLM can also be used in these evaluation scenarios by adjusting the aggregation weights or combining it with the prompting approach.
Moreover, we limit our scope to faithfulness evaluation for text summarization in this paper because the definition of faithfulness evaluation for other generation tasks has some non-trivial differences. For example, chit-chat utterances in dialogue generation (Dziri et al., 2022) are supposed to be acceptable under faithfulness evaluation, instead of being regarded as extrinsic hallucinations. The evaluation for sentence paraphrasing (Zhang et al., 2019b) should be bi-directional, i.e., the first sentence has to be consistent with the second one, and vice versa. We consider transferring FFLM to other tasks, with adjustments, as future work.

Figure 1 :
Figure 1: Spearman correlation (%) of different error types on FRANKCNN and FRANKXSUM. The highest correlation for each Δ is highlighted with a red box.

Table 1 :
Statistics of the datasets. "C" and "X" are short for CNN/DM and XSum, respectively.

Table 2 :
Balanced accuracy (%) on the SUMMAC benchmark. The weakly-supervised baselines FactCC-CLS and SummaC-Conv are trained with synthetic data constructed with human expertise. The best result for each dataset is in bold. Scores of FFLM better than those of the other metrics based on the foundation model are underlined.

Table 4 :
Ablations of FFLM on faithfulness rating. The highest scores are in bold.

Table 5 :
System-level correlations between metrics and human ratings on the SummEval dataset.

Table 7 :
Comparisons with prompting and instruction-tuning techniques under the same model size. The highest correlations are in bold in each column and underlined within each kind of approach.