Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers

Hallucination plagues even frontier LLMs, but how bad is it really for summarizing academic papers? We evaluate Factored Verification, a simple automated method for detecting hallucinations in abstractive summaries. This method sets a new SotA on hallucination detection in the summarization task of the HaluEval benchmark, achieving 76.2% accuracy. We then use this method to estimate how often language models hallucinate when summarizing across multiple academic papers and find 0.62 hallucinations in the average ChatGPT (16k) summary, 0.84 for GPT-4, and 1.55 for Claude 2. We ask models to self-correct using factored critiques and find that this lowers the number of hallucinations to 0.49 for ChatGPT, 0.46 for GPT-4, and 0.95 for Claude 2. The hallucinations we find are often subtle, so we advise caution when using models to synthesize academic papers.


Introduction
Hallucination, the generation of inaccurate or ungrounded information, is a largely unsolved problem for LLMs (Kryściński et al., 2019; Maynez et al., 2020; Ji et al., 2023). This is acceptable for creative use cases such as story generation and brainstorming, but it would be highly problematic if common in academic summarization and Q&A, where factual accuracy is key. How common is hallucination for SotA models when answering questions given the abstracts of multiple scientific papers?
To answer this question, we first construct a simple method for checking hallucination inspired by Kadavath et al. (2022) and Lightman et al. (2023): Given a summary, we automatically decompose it into key claims, assign a model-generated probability to each claim given the relevant sources, and combine these into an overall correctness probability. We validate this method on the hallucination detection benchmark HaluEval and set a new SotA, exceeding the previous chain-of-thought-based method by 10 absolute percentage points using the same language model.
We then apply Factored Verification to detecting hallucination in a real-world scientific summarization task. Given the abstracts of eight papers and a question, the task is to provide a question-relevant summary. We measure hallucination for SotA models including GPT-4 (OpenAI, 2023) and Claude 2 (Bai et al., 2022a), and estimate that the average summary has between 0.62 and 1.57 hallucinations.
Given that we can automatically detect some hallucinations, can we use this knowledge to reduce them? We treat the claim-wise critiques generated by Factored Verification as model-generated advice (Saunders et al., 2022a) and show that we can reduce detected hallucinations for every model we study, although significant hallucination remains.

In the following sections, we first develop and validate Factored Verification, a simple method for using LLMs to detect hallucinations in settings where the relevant source material is provided.

Defining "hallucination"
We call a claim "hallucinated" if it is not backed by the source material provided in context, even if it could be supported by other sources. For example, if the source material discusses the implementation of a public transport policy and the model-generated summary infers that the policy was aimed at addressing sustainability challenges, this is a hallucination unless the source explicitly stated this as the goal of the policy.

Method
Following Lightman et al. (2023), we break each summary into a list of claims and then assign each claim a probability of being correct, both via LLM prompting. The claim decomposition prompt is in Appendix A.1.1.
To compute the likelihood that a single claim is correct, we use a few-shot prompt with GPT-4 base (OpenAI, 2023) and look up the probability of the final Yes token (Appendix A.1.2). For ChatGPT, which does not provide access to token probabilities, we ask the model to verify that each claim is supported using few-shot chain-of-thought prompting (Wei et al., 2022), interpreting the resulting Yes/No answer as a 0/1 probability (Appendix A.1.3).
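As a concrete illustration, the sketch below shows how the Yes-token lookup might be implemented. The prompt template is a stand-in for the actual prompt in Appendix A.1.2, and `get_top_logprobs` is a hypothetical wrapper around a completions-style API that returns next-token log probabilities; neither is taken from the paper.

```python
import math
from typing import Callable, Dict

# Stand-in for the few-shot prompt in Appendix A.1.2 (illustrative wording).
CHECK_PROMPT = (
    "Sources:\n{sources}\n\n"
    "Claim: {claim}\n\n"
    "Is the claim fully supported by the sources? Answer Yes or No.\n"
    "Answer:"
)

def claim_probability(
    claim: str,
    sources: str,
    get_top_logprobs: Callable[[str], Dict[str, float]],
) -> float:
    """P(claim is correct), read off the log probability of the 'Yes' token.

    `get_top_logprobs` is assumed to call a completions-style endpoint and
    return a {token: logprob} dict for the next token after the prompt.
    """
    prompt = CHECK_PROMPT.format(sources=sources, claim=claim)
    logprobs = get_top_logprobs(prompt)  # e.g. {" Yes": -0.11, " No": -2.32}
    logprob_yes = logprobs.get(" Yes", logprobs.get("Yes", float("-inf")))
    return math.exp(logprob_yes)  # exp(-inf) == 0.0 if 'Yes' never appears
```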
Assuming independence of the correctness of claims for simplicity, the probability that the summary is correct is the product of the probabilities of the individual claims:

$$P_{\text{summary}} = \prod_{i} P(\text{claim}_i \mid \text{sources})$$

We classify a summary as hallucinated if $P_{\text{summary}}$ falls below a threshold $\theta$.
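Combining the per-claim probabilities and thresholding is then only a few lines; this is a minimal sketch of the scoring rule as defined above.

```python
import math
from typing import List

def summary_probability(claim_probs: List[float]) -> float:
    """P_summary: product of the per-claim correctness probabilities
    (using the independence assumption described above)."""
    return math.prod(claim_probs)

def is_hallucinated(claim_probs: List[float], theta: float) -> bool:
    """Classify a summary as hallucinated if P_summary falls below theta."""
    return summary_probability(claim_probs) < theta
```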

Dataset
To measure how well Factored Verification works, we use the summarization task of HaluEval, a hallucination benchmark (Li et al., 2023). Each item in this task consists of a document and two summaries, one of which contains a subtle hallucination.
Due to cost constraints, we randomly sample 5% of the dataset as a "training" set and 20% as a test set (1000 and 4000 items respectively). The training set is used only to set the decision threshold θ: we run GPT-4 with the aforementioned claim-likelihood prompt on the training examples and calculate the average of P_summary for faithful and for hallucinated summaries.
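One simple way to turn the two group averages into a threshold is to take their midpoint, as in the sketch below. The paper reports computing the two averages; combining them via the midpoint is our assumption.

```python
from statistics import mean
from typing import Sequence

def pick_threshold(
    faithful: Sequence[float], hallucinated: Sequence[float]
) -> float:
    """Set theta from the training split: average P_summary over faithful
    and over hallucinated summaries, then take the midpoint. (Assumption:
    the paper does not specify how the two averages are combined.)"""
    return (mean(faithful) + mean(hallucinated)) / 2
```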

Results
Factored Verification is SotA for hallucination detection, exceeding prior few-shot and chain-of-thought approaches for both ChatGPT and GPT-4 (Table 1).

Table 1: Hallucination detection accuracy on the HaluEval summarization task, by model (ChatGPT, GPT-4) and method (few-shot, CoT, Factored Verification).

For comparison, we also tested a single-prompt equivalent of Factored Verification, asking ChatGPT to write out each claim and verify it in one go. This achieved an accuracy of 63.3%, below the multi-prompt factored method (71.2%) and comparable to the standard CoT result (61.2%).
We were surprised by GPT-4's low 30.9% accuracy when using the exact few-shot prompt from Li et al. (2023). To check our implementation, we ran the same code with ChatGPT and found an accuracy of 59.9% on our sample, roughly matching the 58.5% reported by Li et al. (2023). Given that GPT-4 is much more competitive in the CoT and Factored Verification settings, we suspect that the Li et al. (2023) few-shot prompt has idiosyncratic limitations.
How often do models hallucinate in scientific summaries?
We now turn to our task of interest, generating summaries based on the abstracts of scientific papers.
We will apply Factored Verification to estimate how frequently models hallucinate in this context.

Task
Given a question and eight abstracts, the task is to generate a question-relevant summary where each claim is cited with the paper that backs it up. Figure 2 shows a simplified example with only two papers.
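To make the setup concrete, here is a minimal sketch of how such a prompt could be assembled. The wording and formatting are our illustration, not the prompt used in the experiments.

```python
from typing import List

def build_summary_prompt(question: str, abstracts: List[str]) -> str:
    """Assemble a question-focused summarization prompt over the abstracts,
    asking for a citation after every claim. Illustrative wording only."""
    numbered = "\n\n".join(
        f"Paper {i + 1}: {abstract}" for i, abstract in enumerate(abstracts)
    )
    return (
        f"Question: {question}\n\n"
        f"{numbered}\n\n"
        "Write a summary that answers the question. After each claim, cite "
        "the paper(s) that support it, e.g. (Paper 3)."
    )
```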
The dataset of queries and abstracts is generated from 200 anonymized search queries and their associated top results from Elicit (Stuhlmüller and Byun, 2023).

Method
We follow the hallucination detection strategy outlined in Section 2.2, with the following modifications to reduce compute cost:
1. We directly treat sentences as claims.
2. We only check claims that have associated citations.
3. When checking each claim, we provide only the abstracts of the cited papers as sources.
We expect these modifications to cause little degradation, given that almost all sentences have citations and the decomposition step from sentence to claim does little work in this setting.
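A sketch of this reduced-cost setup is below. The citation regex and sentence splitter are illustrative stand-ins (real citation formats vary), and the output pairs are meant to feed the claim-checking step sketched earlier.

```python
import re
from typing import Dict, List, Tuple

# Illustrative citation pattern, e.g. "(Benton, 2010)"; real formats vary.
CITATION = re.compile(r"\(([^()]*?\d{4}[a-z]?)\)")

def cited_sentence_checks(
    summary: str, abstracts_by_citation: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Apply the three modifications: treat each sentence as a claim, keep
    only sentences with citations, and pair each one with the abstracts of
    the papers it cites. Returns (sentence, sources) pairs that can be fed
    to claim_probability above."""
    checks = []
    # Naive sentence splitter; adequate for a sketch.
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        cited = CITATION.findall(sentence)
        if not cited:
            continue  # modification 2: skip uncited sentences
        sources = "\n\n".join(
            abstracts_by_citation[c] for c in cited if c in abstracts_by_citation
        )
        checks.append((sentence, sources))
    return checks
```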

Results
We run Factored Verification with ChatGPT, GPT-4, Claude 2, and Claude Instant (OpenAI, 2023; Bai et al., 2022a). Table 2 shows that, for all models, our method reports at least one hallucination in the majority of summaries.
We include additional results that show interactions between ChatGPT and GPT-4 when used as generation, criticism, and judge models in Table 3 in the Appendix.

Interpretation
Based on the 76% accuracy of Factored Verification on HaluEval, we know that there are likely false positives and/or false negatives, so we can't take the reported hallucination rates literally.
We manually inspected about a hundred claims evaluated by GPT-4. When GPT-4 said a claim was supported, we agreed in all cases. When GPT-4 reported an unsupported claim, we agreed 66% of the time. Our best guess for the true hallucination rate is therefore 2/3 of the reported hallucination rate.
Many of the claims we encountered were wrong in subtle ways that we would likely have missed without seeing the GPT-4 critiques, and that we would expect non-expert evaluators to miss, including:
• Stating that a claim is supported by two abstracts when it is only supported by one
• Slightly exaggerating the findings of a paper
• Conflating the purpose of the study with its outcome
• Implying that two independent findings are linked

This augmentation of human evaluation is consistent with prior work by Saunders et al. (2022b), which found that model-generated critiques help humans find flaws in summaries.

Reducing hallucination in scientific summaries with Factored Verification
It is common for LLMs to apparently fail at a task, only to then succeed with better prompting. Can we prompt models using the detected inaccuracies to automatically reduce hallucination in scientific summaries?

Baseline
We ask GPT-4 to self-correct by first identifying false claims in its initial summary, then revising the summary given this correction (prompts in Appendix A.4.1 and A.5). This increased the average number of detected hallucinations from 1.55 to 2.13. Huang et al. (2023a) similarly found that the GPT-4 generation of LLMs struggles to self-correct directly across a variety of reasoning datasets.

Method
To improve on the baseline, we propose to reduce hallucination with Factored Verification in three steps, as illustrated in Figure 1. First, we create claim-wise critiques (true/false judgments and supporting reasoning) analogous to the hallucination detection method above: we ask the model to evaluate whether each sentence is supported by the cited abstracts. We then concatenate the critiques of the unsupported claims to form the factored critique. Finally, we ask the model to revise the summary given that critique.
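Putting the three steps together yields a short loop, sketched below. The `critique_claim` and `revise` callables stand in for the prompts in Appendix A.4.1 and A.5; their exact wording is not reproduced here.

```python
from typing import Callable, List, Tuple

def revise_with_factored_critique(
    summary: str,
    checks: List[Tuple[str, str]],  # (sentence, sources) pairs as above
    critique_claim: Callable[[str, str], Tuple[bool, str]],
    revise: Callable[[str, str], str],
) -> str:
    """Three steps: critique each (sentence, sources) pair, concatenate the
    critiques of unsupported claims into a factored critique, and ask the
    model to revise the summary given that critique."""
    unsupported: List[str] = []
    for sentence, sources in checks:
        supported, reasoning = critique_claim(sentence, sources)
        if not supported:
            unsupported.append(f'- For the sentence "{sentence}": {reasoning}')
    if not unsupported:
        return summary  # nothing flagged, keep the original summary
    factored_critique = "Critique:\n" + "\n".join(unsupported)
    return revise(summary, factored_critique)
```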

Results
Table 2 shows that Factored Verification reduces the number of summaries with reported hallucination by 5.50% to 23.25% (absolute) depending on the model, with ChatGPT being the lowest-hallucination model before critique (0.62 estimated hallucinations per summary) and GPT-4 being the lowest-hallucination model after critique and across all settings (0.46 estimated hallucinations per summary).

Related work
Hallucination is widely known to be a significant problem for LLMs (Luo et al., 2023; Peng et al., 2023; Ji et al., 2023), although to a much lesser extent for abstractive summarization, where the information needed to answer is fully provided (Cao et al., 2022; Huang et al., 2023c).
Various strategies have been proposed to mitigate hallucination. Some strategies aim to prevent its occurrence by checking how familiar models are with instructions (Luo et al., 2023). Others, including our proposed method, focus on the detection and correction of hallucinations post-generation (Cao et al., 2022; Huang et al., 2023b).
Of these, notable strategies are the use of external knowledge and retrieval augmentation, and automated feedback (Shuster et al., 2021; Peng et al., 2023; Zhang et al., 2023). While external knowledge is less relevant here, Factored Verification can be viewed as a kind of automated feedback.
In simultaneous work, Dhuliawala et al. (2023) propose an automated feedback method called Chain-of-Verification, which is effectively the same as ours: (i) draft an initial response, (ii) plan fact-checking questions, (iii) answer the questions independently, and (iv) generate a final revised response. Their evaluation focuses on out-of-context fact checking. Consistent with our results, they find that the factored version of their approach reduces hallucination for question-answering and text generation tasks.

Relevance to AI alignment
To align powerful AI systems, we need to be able to provide accurate feedback and supervision even when systems surpass human-level performance, a property known as scalable oversight (Amodei et al., 2016).
Today, reinforcement learning from human feedback (RLHF) is often used to align LLMs (Christiano et al., 2023; Bai et al., 2022b; Ouyang et al., 2022). However, GPT-4 already surpasses the performance of the average human on many academic tasks (OpenAI, 2023), making it difficult for non-experts to provide effective reward signals. In our attempts to delegate evaluation of academic claims to non-expert contractors, we observed only 38% inter-rater agreement for unsupported claims, a sign of similar difficulties. In the short term this can be solved by using contractors with specialized domain knowledge. However, this won't work if models surpass the capabilities of the best humans.
We have shown that factored critiques let models correct some of their own mistakes without the need for human supervision. If similar approaches can be extended beyond hallucination reduction to richer tasks, they could help us scale supervision in lockstep with future model capabilities.

Discussion
Our main finding is that the absolute rate of hallucination of SotA models like ChatGPT, Claude 2, and GPT-4 is surprisingly high for academic summarization. This is true even with revision using factored critiques, which results in 0.46 to 1.24 estimated hallucinations per summary.
A natural question in this context is whether we can finetune on model-revised summaries, incrementally bootstrapping to more and more accurate summaries: detecting and eliminating the most egregious failure modes first, then more subtle ones with each training iteration.
Overall, despite incredible advances, language models still struggle with accurate summarization in academic contexts. Many mistakes are only clear upon careful inspection of the sources and otherwise look identical to genuine answers. For now, we advise caution in situations where accuracy matters, as we would for human summaries as well.

Figure 1: Factored Verification splits a summary into claims, checks each claim, and then optionally revises the summary to address the claim critiques. Each step is a language model task.

Figure 2: Example of the scientific summarization task, simplified from eight to two papers.

Query: What is the impact of creatine on cognition?

Summary: The papers suggest that creatine supplementation may have a positive impact on cognition. Ling (2009) found that creatine ethyl ester supplementation improved performance on several cognitive tasks, although the results require replication using objective measures of compliance. Benton (2010) found that creatine supplementation improved memory in vegetarians, but not in those who consume meat.

Figure 3: Example of a factored critique generated with GPT-4.

Critique:
- For the sentence "Giuntella (2015) found that immigration reduced waiting times for outpatient referrals, suggesting that policies affecting population dynamics can indirectly impact waiting times and thus healthcare costs." in the summary: The claim accurately reflects the findings of Giuntella (2015) that immigration reduced waiting times for outpatient referrals. However, the claim extends the findings of the study to suggest that policies affecting population dynamics can indirectly impact waiting times and thus healthcare costs. While this may be a reasonable inference, it is not directly supported by the abstract.
- For the sentence "Propper (2002) found that GP fundholders were able to secure shorter waiting times for their patients, suggesting that giving family doctors the ability to choose the hospital where their patients are treated and the means to pay for some services can reduce waiting times." in the summary: The claim is not directly supported by the abstract. The abstract states that the study investigates whether GP fundholders were able to secure shorter waiting times for their patients, but it does not provide the results of this investigation. Therefore, the claim that Propper (2002) found that GP fundholders were able to secure shorter waiting times for their patients is not supported by the abstract.

Table 2: Prevalence of hallucination for models when generating summaries of academic papers, before and after revision with factored critiques. Based on manual inspection of approximately 100 data points, our best guess is that the true prevalence of hallucination ("adjusted") is 2/3 of the rate reported by automated evaluation.