FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via `pip install factscore`.


Introduction
Long-form text generated by large language models (LMs) has been widely used (Brown et al., 2020; Ouyang et al., 2022); nonetheless, evaluating their factual precision, i.e., whether each piece of information conveyed in a generation is factually accurate, remains challenging for two reasons. First, a generation consists of a large number of pieces of information that are a mixture of true and false, making a binary judgment inadequate (Pagnoni et al., 2021). Second, validating every piece of information is time-consuming and costly.

[Figure 1: An overview of FACTSCORE. Given the prompt "Tell me a bio of Bridget Moynahan.", generations from ChatGPT and StableLM are broken into atomic facts (e.g., "Bridget Moynahan is American.", "Bridget Moynahan is an actress.", "She co-created General Hospital with her husband.", "Her husband is Charles Kelly."), each validated against Wikipedia, yielding FACTSCOREs of 66.7% and 10.0%, respectively.]
In this paper, we introduce FACTSCORE (Factual precision in Atomicity Score), a new evaluation of an LM that represents the percentage of atomic facts (pieces of information) supported by a given knowledge source. Computing FACTSCORE involves (1) breaking a generation into a series of atomic facts, i.e., short statements that each contain one piece of information (Nenkova and Passonneau, 2004; Shapira et al., 2019; Zhang and Bansal, 2021; Liu et al., 2022), and (2) assigning a binary label to each atomic fact, allowing a fine-grained evaluation of factual precision. We evaluate FACTSCORE on the task of generating people biographies because generations consist of verifiable statements rather than debatable or subjective ones, and the scope is broad (i.e., covering diverse nationalities, professions, and levels of rarity).
Since human evaluation is costly, we next introduce an automatic evaluation of FACTSCORE through a model that estimates a FACTSCORE for a given LM. Our estimator decomposes generations into atomic facts and validates each based on a given knowledge source, leveraging retrieval from the given knowledge source and strong language models. Our estimator closely approximates FACTSCORE with an error rate of < 2% and can be applied to a range of new LMs at scale with no human effort. Our case study evaluates 6,500 generations from 13 LMs that could have cost $26K, with various findings: GPT-4 (OpenAI, 2023) and ChatGPT are far less factual than humans but are much better than public models, and there is a large variance between public models, with Vicuna (Chiang et al., 2023) and Alpaca (Taori et al., 2023) being some of the best.
In summary, our contributions are as follows.
1. We introduce FACTSCORE, a new evaluation of factual precision of LMs that breaks their generations into atomic facts and validates each against a given knowledge source. Human evaluation reveals that state-of-the-art LMs, with and without search, have low FACTSCOREs.
2. We introduce an automated estimator of FACTSCORE that uses retrieval and a strong language model, approximating the human-judged FACTSCORE with an error rate of less than 2%.
3. We use the estimator to evaluate a large set of recently released LMs at scale and release FACTSCORE for public use (`pip install factscore`). We hope future work extends FACTSCORE to a broader set of generations (e.g., open-ended generation) and further improves the estimator.

Related Work
Factual precision in text generation. Factual precision in text generation has been an active area of research in NLP. Most prior work studies factual precision of models supervised for a specific problem such as dialogue (Shuster et al., 2021), or focuses on question answering with short answers (Kadavath et al., 2022; Kandpal et al., 2022; Mallen et al., 2023; Nori et al., 2023).
More recent work has studied factual precision of text generation beyond short answers. Lee et al. (2022) evaluate factual precision with proxy metrics, e.g., whether named entities in a generation appear in an article on the topic. A series of concurrent work verifies the precision of the citations (attributions) provided by the model (Gao et al., 2022; Liu et al., 2023a; Yue et al., 2023; Gao et al., 2023). Concurrent work by Manakul et al. (2023) automates the identification of factual errors in LM generations without using any knowledge source; we use their method as a baseline estimator in Section 4. In contrast, our work (1) considers much longer text generation from a variety of state-of-the-art LMs with and without search, (2) provides fine-grained evaluation both by human experts and through an automated evaluator that closely approaches humans, and (3) applies it to a large set of LMs at scale.

Fact Verification. Our work is closely related to prior work on fact verification (Thorne et al., 2018; Wadden et al., 2020), where claim sentences are automatically checked against a large knowledge source like Wikipedia or scientific literature. Most literature assumes a single, atomic claim, sometimes modeled with surrounding context (Nakov et al., 2018; Mihaylova et al., 2019; Shaar et al., 2022). There has also been work that verifies a longer sentence or text through decomposition into atomic facts (Fan et al., 2020; Wright et al., 2022; Chen et al., 2022; Kamoi et al., 2023), from which we take inspiration. The primary difference between the fact verification literature and our work is that we focus on long-form model-generated text rather than sentence-level human-written claims.
Model-based Evaluation. Prior work has used learned models to define automated evaluation scores (Zhang et al., 2020; Liu et al., 2023b). This includes model-based evaluation in summarization that considers the consistency between a summary and a source document using QA or NLI (Kryscinski et al., 2020; Wang et al., 2020; Fabbri et al., 2022; Deutsch et al., 2021; Laban et al., 2022). We take inspiration from this work, and evaluate factual precision of LM generations by considering whether pieces of information are supported by a large text corpus.

FACTSCORE: Evaluating Factual Precision of Long-form Text Generation
We introduce FACTSCORE, a new evaluation of an LM that considers the factual precision of atomic facts generated by the LM. We perform human evaluations to calculate FACTSCOREs of state-of-the-art LMs (Section 3.3) and discuss results (Section 3.4). FACTSCORE allows rigorous and fine-grained evaluation of factual precision, but is time-consuming and costly, motivating automatic evaluation in Section 4.

Definition
FACTSCORE is based on two key ideas.
Key idea 1: Atomic fact as a unit. Long-form text consists of many pieces of information that can each be either true or false. Prior work has explored using a sentence as a unit; however, even a single sentence is a mix of supported and unsupported facts, e.g., in 40% of the cases with ChatGPT. Previous and concurrent work either (1) defines an additional label of partial support (Manakul et al., 2023; Liu et al., 2023a), whose definition may be subjective and can lead to low agreement, or (2) takes the strictest definition of support that requires every piece of information to be supported (Rashkin et al., 2021; Gao et al., 2022), which ignores the partial support cases, e.g., assigning 0.0 to both generations in Figure 1 even though the first generation is considerably more accurate than the second.
In this paper, we define an atomic fact as a short sentence conveying one piece of information (examples in Figure 1), similar to summarization content units (Nenkova and Passonneau, 2004). An atomic fact is a more fundamental unit than a sentence for a piece of information and provides a more fine-grained evaluation, e.g., in Figure 1, rating the first generation higher than the second.
Key Idea 2: Factual precision as a function of a given knowledge source. Prior work often considers factual precision as a single global truth (Manakul et al., 2023). In contrast, we adopt the perspective that the truthfulness of a statement should depend on a particular knowledge source that end users consider trustworthy and reliable. Therefore, instead of whether an atomic fact is globally true or false, we consider whether it is supported by a given source of knowledge. This has been used in the fact verification literature (Wadden et al., 2022), where conflict of information between different sources is relatively common.
Definition. Let $M$ be a language model to be evaluated, $X$ a set of prompts, and $C$ a knowledge source. Consider a response $y = M_x$ for $x \in X$, and $A_y$, the list of atomic facts in $y$. The FACTSCORE of $M$ is defined as:

\[ f(y) = \frac{1}{|A_y|} \sum_{a \in A_y} \mathbb{I}\left[a \text{ is supported by } C\right], \qquad \mathrm{FACTSCORE}(M) = \mathbb{E}_{x \in X}\left[\, f(M_x) \mid M_x \text{ responds} \,\right]. \]
"$M_x$ responds" means $M$ did not abstain from responding to the prompt $x$. This definition makes the following assumptions:
1. Whether or not an atomic fact is supported by $C$ is undebatable.
2. Every atomic fact in $A_y$ has an equal weight of importance, following Krishna et al. (2023).
3. Pieces of information in $C$ do not conflict with or overlap each other.
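For illustration, the definition above can be sketched as a short computation. This is a minimal sketch, not the API of the released `factscore` package; the `is_supported` oracle (human judgment or an estimator, Section 4) is abstracted as a callable.

```python
def factscore(responses, knowledge_source, is_supported):
    """Compute FACTSCORE over generations that did not abstain.

    `responses` maps each prompt x to either None (abstention) or the list
    of atomic facts A_y in the response y = M_x. `is_supported(fact, C)`
    returns True iff the fact is supported by the knowledge source C.
    """
    per_response_scores = []
    for facts in responses.values():
        if not facts:  # M abstained from responding
            continue
        supported = sum(is_supported(f, knowledge_source) for f in facts)
        per_response_scores.append(supported / len(facts))  # f(y)
    # FACTSCORE(M): mean of f(M_x) over prompts where M responds
    return sum(per_response_scores) / len(per_response_scores)

# Toy example with a set-membership oracle standing in for validation
kb = {"Bridget Moynahan is an actress.", "Bridget Moynahan is American."}
responses = {
    "p1": ["Bridget Moynahan is an actress.", "Bridget Moynahan is a writer."],
    "p2": None,  # abstained; excluded from the average
    "p3": ["Bridget Moynahan is American."],
}
score = factscore(responses, kb, lambda f, c: f in c)  # (0.5 + 1.0) / 2 = 0.75
```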
In the rest of the paper, we propose to use people biographies as X and Wikipedia as C because they satisfy these assumptions to a reasonable degree (Section 3.3). We discuss in which cases these assumptions hold or may not hold in more detail in the Limitations section. FACTSCORE considers precision but not recall; e.g., a model that abstains from answering too often or generates text with fewer facts may have a higher FACTSCORE, even if these behaviors are not desired. We leave the evaluation of factual recall for future work (more discussion in the Limitations section).

Data
We perform human evaluation of factual precision based on our definition. We prompt the LM_SUBJ (the subject LM under evaluation) to generate people biographies and evaluate them against Wikipedia for the following reasons.
• Biographies are objective (not subjective or debatable) and contain specific (not vague) information, satisfying Assumption 1 in Section 3.1.
• Biographies allow evaluation across diverse nationalities, professions, and levels of rarities.
• Wikipedia offers reasonable coverage of information about people and is reasonably self-consistent, satisfying Assumption 3.
Data collection.We carefully design an annotation pipeline to assign a factual precision to a long-form generation through the following steps.
Step 0: Sampling people entities. We sample 183 people entities from Wikidata who have corresponding Wikipedia pages. We sample entities to annotate from a uniform distribution over categories defined in Appendix A.1.
Step 1: Obtaining generations. We feed the prompt "Tell me a bio of <entity>" to the LM_SUBJ and take the generation as is. We implement rules to identify generations that abstain from answering and filter them out.
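The paper does not list its abstention-detection rules; as an illustration only, such rules might be simple pattern matches over the generation (the patterns below are hypothetical examples, not the ones actually used):

```python
import re

# Hypothetical abstention patterns for illustration; the actual rules
# used in the paper are not specified here.
ABSTAIN_PATTERNS = [
    r"^i('m| am) sorry",
    r"i (do not|don't) have (any )?information",
    r"i (am not|'m not) (familiar|aware)",
    r"(could|can)not find (any )?information",
]

def is_abstention(generation: str) -> bool:
    """Return True if the generation appears to decline to answer."""
    text = generation.strip().lower()
    return any(re.search(p, text) for p in ABSTAIN_PATTERNS)

print(is_abstention("I'm sorry, I don't have information about this person."))  # True
print(is_abstention("Bridget Moynahan is an American actress."))                # False
```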
Step 2: Atomic facts generation. Human annotators break a generation into a series of atomic facts.
To save annotation time, we provide atomic facts broken down by InstructGPT, which human annotators can take and revise. Details are in Appendix A.2.
Step 3: Labeling factual precision & editing. We ask another set of human annotators to assign each atomic fact one of three labels. If the atomic fact is clearly not related to the prompt, and thus should be removed from the bio without a validation step, they assign Irrelevant. If the fact is relevant, they validate the fact based on the English Wikipedia, and label it either Supported or Not-supported.
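Under this labeling scheme, Irrelevant facts are removed before validation, so factual precision is computed over Supported and Not-supported facts only. A minimal sketch (illustrative helper, not part of the released package):

```python
def precision_from_labels(labels):
    """Factual precision of one generation from three-way human labels.

    Irrelevant facts are removed before validation, so precision is the
    fraction of Supported facts among Supported + Not-supported facts.
    """
    relevant = [l for l in labels if l != "Irrelevant"]
    if not relevant:
        return None  # nothing left to validate
    return relevant.count("Supported") / len(relevant)

labels = ["Supported", "Supported", "Not-supported", "Irrelevant"]
print(precision_from_labels(labels))  # 2/3, i.e., 0.666...
```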
We recruit freelancers through Upwork and pay 15-25 USD per hour. Annotation requires extensive effort and time, leading to a cost of $4 per generation. We assign two freelancers to 10% of the data and calculate the agreement rate: 96%, 90%, and 88% for InstructGPT, ChatGPT, and PerplexityAI, respectively. More details are provided in Appendix A.3.

Results
Statistics of the data and results are reported in Table 1.
All LM_SUBJs struggle with factual precision. InstructGPT and ChatGPT achieve FACTSCOREs of 42.5% and 58.3%, respectively. PerplexityAI, which uses a commercial search engine and thus should have a perfect FACTSCORE if it directly copied text from the correct Wikipedia page, attains a FACTSCORE of 71.5%. We provide a qualitative analysis of its error cases in the last paragraph of this section.
ChatGPT and PerplexityAI often abstain from answering which presumably improves their factual precision.InstructGPT rarely abstains from answering, likely because it is not trained to do so.
Irrelevant facts either (a) have dependencies on previous facts in a generation that turn out to be unsupported, or (b) are irrelevant to the prompt independent of other facts in a generation (examples in Appendix A.4). We find that (b) rarely happens with InstructGPT and ChatGPT but happens considerably with PerplexityAI, because PerplexityAI often directly copies search results even if they are largely irrelevant to the input prompt. This is in agreement with concurrent work from Liu et al. (2023a) showing that generative search engines like PerplexityAI copy incorrect search results and generate text that is irrelevant to the input query.

Error rates are higher for rarer entities. There is a notable decrease in FACTSCORE as the rarity of entities increases, consistently across all LM_SUBJs. This is in agreement with Kandpal et al. (2022) and Mallen et al. (2023), who show that short question answering (QA) accuracy is highly correlated with entity frequencies in the pretraining data. However, in contrast to Kandpal et al. (2022) and Mallen et al. (2023), who report that the QA accuracy of models with retrieval is robust to the rarity of entities, the FACTSCORE of PerplexityAI still drops significantly as entities get rarer: a relative drop of 50% and 64% observed at the atomic level and sentence level, respectively.
Error rates are higher for facts mentioned later in the generation. Figure 2 (bottom) reports factual precision over relative positions in a generation. Across all LMs, the later part of the generation has significantly worse precision. This is likely because (a) information mentioned earlier is more frequently mentioned in the pretraining data (e.g., nationality, profession), and (b) error propagation affects the later part of the generation. This also implies that evaluating LMs solely based on short answers may not provide an adequate assessment of their factual precision, as it fails to account for errors that arise in the later stages of generation.
Qualitative analysis of Not-supported. One of the surprising findings in our empirical analysis is that the FACTSCORE of PerplexityAI (71.5%) is lower than expected despite its access to a search engine. To better understand its errors, we categorize 30 random samples whose label is Not-supported (Table 2).
• Single-sentence contradiction: A single sentence from Wikipedia directly contradicts the generation, either at the word level (numbers, dates, or entities) or beyond.
• Page-level contradiction: Errors found after reading the entire page, often because a fact that should have been mentioned in Wikipedia if true is missing, e.g., whether the subject appears in a particular film.
• Subjective: Generation is subjective, often because PerplexityAI copies subjective text from Wikipedia, e.g., directly copying a quote from a journalist without realizing it.
• Fact is irrelevant: Generation is irrelevant to the subject due to a search error.
• Wiki is inconsistent & wrong: In the example, Wikipedia indicates that the subject won one award for the film Kick, but also includes text saying that they won multiple awards for Kick, which is inaccurate and cites a news article that does not support the claim.
• Annotation error: Annotators assign incorrect labels, typically because the information is not mentioned in the subject's Wikipedia page (likely because it is insignificant).
We also find that, although PerplexityAI provides citations to the references, citations have little correlation with factual precision: 36.0% and 37.6% of supported and unsupported sentences have citations, respectively. Together with independent findings from Liu et al. (2023a), this indicates that commercial LMs that incorporate search and provide citations may not be as reliable as expected. More analysis is provided in Appendix A.5.

Estimating FACTSCORE for Automatic Evaluation
Human evaluation of factual precision is costly ($4 per generation) (Bohnet et al., 2022; Krishna et al., 2023) because validating every atomic fact against a large knowledge source is time-consuming, and one generation contains many (26-41) atomic facts. This prevents LM developers and practitioners from evaluating the factual precision of long-form generation from a new LM_SUBJ at scale. In this context, we introduce a model that estimates FACTSCORE. This estimator takes a set of generations and automatically computes a FACTSCORE, and can be applied to any LM_SUBJ. We describe our model (Section 4.1) and demonstrate its accuracy against human evaluation (Section 4.2). FACTSCORE estimated by our model is then used to evaluate twelve LMs (Section 4.3).

Model
Our estimator of FACTSCORE first breaks a generation into a series of atomic facts and then validates each against the given knowledge source. We find that taking atomic facts generated by InstructGPT (as used in data collection in Section 3.3) is effective and close to human decomposition, consistent with findings from prior work (Chen et al., 2022). This section thus focuses on how to validate each atomic fact against a given knowledge source.
The validation is based on zero-shot prompting of an LM, referred to as the LM_EVAL to distinguish it from the LM_SUBJ. Specifically, a prompt, whose construction method differs across four variants, is fed into the LM_EVAL. The prediction is then made by comparing the conditional probabilities of True and False from the LM_EVAL. If the logit values are unavailable (e.g., for commercial LMs like ChatGPT), the prediction is made based on whether the generated text contains True or False. The four variants we consider are as follows.
No-context LM uses "<atomic-fact> True or False?" as a prompt, closely resembling Kadavath et al. (2022).

Retrieve→LM retrieves passages from the given knowledge source and then prompts the LM_EVAL. It first retrieves k passages, constructs the prompt by concatenating the retrieved passages, the given atomic fact, and "True or False?", and feeds it to the LM_EVAL to get the prediction.
Nonparametric Probability (NP) makes a judgment based on a nonparametric likelihood.It masks out each token in the atomic fact, computes its likelihood using a nonparametric masked LM (Min et al., 2023), averages probabilities over all tokens, and makes a prediction based on thresholding.
Retrieve→LM + NP is an ensemble of Retrieve→LM and NP which assigns Supported only if both methods assign Supported.
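The decision logic of these variants can be sketched as follows. This is an illustrative sketch only: `retrieve`, `lm_true_prob`, and `masked_token_prob` are hypothetical stand-ins for the retriever, the LM_EVAL's True/False probabilities, and the nonparametric masked LM, and the whitespace tokenization is a simplification.

```python
def retrieve_lm_supported(fact, retrieve, lm_true_prob, k=5):
    """Retrieve->LM: prompt the LM_EVAL with retrieved passages and
    compare the probabilities of 'True' vs. 'False'."""
    passages = retrieve(fact, k)  # top-k passages from the knowledge source
    prompt = "\n".join(passages) + f"\n{fact} True or False?"
    p_true, p_false = lm_true_prob(prompt)
    return p_true > p_false

def np_supported(fact, masked_token_prob, threshold=0.3):
    """NP: average per-token nonparametric masked-LM probability,
    thresholded (0.3 is the hyperparameter reported in Appendix B.1)."""
    tokens = fact.split()  # simplified tokenization for illustration
    avg = sum(masked_token_prob(tokens, i) for i in range(len(tokens))) / len(tokens)
    return avg >= threshold

def ensemble_supported(fact, retrieve, lm_true_prob, masked_token_prob):
    """Retrieve->LM + NP: Supported only if both methods say Supported."""
    return (retrieve_lm_supported(fact, retrieve, lm_true_prob)
            and np_supported(fact, masked_token_prob))
```

The ensemble is deliberately conservative: an atomic fact counts as Supported only when both the retrieval-augmented LM and the nonparametric likelihood agree, which counteracts the LM's tendency to over-assign Supported (Section 4.2).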

Evaluation of Estimators
Metrics. We report Error Rate (ER), the difference between the ground-truth and the estimated FACTSCORE, as well as whether the estimated FACTSCOREs preserve the ranking between the three LM_SUBJs. Appendix B.2 discusses results with other metrics that consider individual judgments instead of aggregated judgments. We use the data in Section 3.3 as evaluation data.
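One plausible formalization of these two metrics (an illustrative sketch with made-up estimated scores; ER here is taken as the mean absolute difference per LM_SUBJ):

```python
def error_rate(true_scores, est_scores):
    """Mean absolute difference between ground-truth and estimated
    FACTSCOREs across the evaluated LM_SUBJs."""
    diffs = [abs(true_scores[m] - est_scores[m]) for m in true_scores]
    return sum(diffs) / len(diffs)

def preserves_ranking(true_scores, est_scores):
    """True if the estimated scores order the LM_SUBJs identically."""
    rank = lambda scores: sorted(scores, key=scores.get)
    return rank(true_scores) == rank(est_scores)

# Ground truth from Section 3; estimated scores are made up for illustration
true_s = {"InstructGPT": 42.5, "ChatGPT": 58.3, "PerplexityAI": 71.5}
est_s = {"InstructGPT": 44.0, "ChatGPT": 57.0, "PerplexityAI": 73.0}
print(error_rate(true_s, est_s))         # (1.5 + 1.3 + 1.5) / 3, about 1.43
print(preserves_ranking(true_s, est_s))  # True
```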
Results are reported in Table 3.
Retrieval significantly helps. Models that use retrieval are consistently better than No-context LM, which either has a significantly high ER or does not preserve the ranking between the three LM_SUBJs. This is likely because the LM_EVAL has not memorized all factual information about the topic entity and thus benefits from the factual context provided by retrieval. Nonetheless, just using Retrieve→LM may overestimate FACTSCORE, e.g., by up to 17% with Inst-LLAMA, when the LM_SUBJ is InstructGPT or ChatGPT. In this case, ensembling Retrieve→LM and NP reduces the error rate by a significant margin. When the LM_SUBJ is PerplexityAI, single methods (either Retrieve→LM or NP) give a low ER, and ensemble methods have a higher ER due to an underestimation of FACTSCORE.
ChatGPT is not always the best. Our results show that ChatGPT is not necessarily better than Inst-LLAMA. We investigate this further in Appendix B.3. In summary, ChatGPT is better at validating each individual atomic fact. However, most errors from ChatGPT come from incorrectly assigning Supported to unsupported facts, overestimating FACTSCORE. In contrast, LLAMA+NP is biased toward neither overestimation nor underestimation of factual precision, resulting in an aggregated factual precision closer to the ground truth. This is similar to the trade-off between system-level and segment-level correlations in summarization evaluation, which often produce different rankings (Bhandari et al., 2020; Deutsch et al., 2021).
The best estimator depends on the LM_SUBJ. While using retrieval is consistently better than No-context LM, the best variant of the estimator depends on the LM_SUBJ: LLAMA+NP for InstructGPT and ChatGPT, and ChatGPT for PerplexityAI. Nevertheless, both evaluators give a consistently correct ranking between the three LM_SUBJs, and Section 4.3 shows that scores from the two estimators are largely correlated across 10+ LM_SUBJs (0.99 Pearson's r). We recommend that users try both variants of our estimator when evaluating a new LM_SUBJ and report their correlation.
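The recommended correlation check can be done with any standard Pearson's r implementation; a self-contained sketch (the score lists below are made up for illustration, not results from the paper):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two lists of estimated FACTSCOREs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores from two estimator variants over the same LM_SUBJs
llama_np_scores = [40.1, 55.2, 70.3, 30.5]
chatgpt_scores = [42.0, 57.1, 72.8, 31.0]
r = pearson_r(llama_np_scores, chatgpt_scores)  # close to 1 when estimators agree
```

In practice one would use `scipy.stats.pearsonr`; the hand-rolled version above just keeps the example dependency-free.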

Evaluation of New LMs
Our estimator allows evaluating the factual precision of a large set of new LMs at scale with no human effort. As a case study, we evaluate ten new LMs that came out within the two months prior to our experiments (Table 4). These LMs have been evaluated on many benchmarks, but not on the factual precision of long-form generation, since such evaluation is costly. We aim to provide new insights into these LMs by estimating the FACTSCORE of their long-form generations.

Setup
We evaluate 10 recently released LMs as shown in Table 4. Oasst-pythia is Pythia 12B fine-tuned on human-written data collected through Open Assistant. StableLM-tuned-alpha is based on StableLM-base-alpha fine-tuned on the Alpaca data, DataBricks Dolly, the ShareGPT data, the GPT4All data (Anand et al., 2023), and Anthropic HH (Bai et al., 2022). MPT Chat is based on MPT 7B fine-tuned on the ShareGPT data, the Alpaca data, Anthropic HH, HC3 (Guo et al., 2023), and Evol-Instruct. We prompt each LM_SUBJ to generate biographies of 500 human entities as done in Section 3.3, but with no overlap in entities. We additionally include InstructGPT, ChatGPT, and human-written biographies obtained through DBPedia. Human-written biographies were unavailable for 11% of entities, which we treat as abstaining from responding.
See Table 5 for their statistics. In total, we evaluate 6,500 generations from 13 subjects, which would have cost $26K if they had been evaluated by humans.

Results
Figure 3 shows the ranking between the 13 subjects provided by the two best variants of our estimator, whose scores are largely correlated (Pearson's r of 0.99). This evaluation allows a better understanding of these models, including:
• All LMs are substantially less factual than humans, even though the task of writing biographies is fairly easy. This is in contrast to prior work that claims LMs approach human performance, even for complex tasks (Ding et al., 2022; Nori et al., 2023; Lee et al., 2023).
• GPT-4 and ChatGPT are comparable in factual precision.
• Alpaca and Vicuna achieve performance that is very close to each other within the same model size, possibly because they share the same base model and similar training data. Nonetheless, as shown in Table 5, Vicuna generates significantly more atomic facts than Alpaca does (51 vs. 17 per response). Also, Alpaca never abstains from answering, while Vicuna does.
• Within public models, there are large gaps in factual precision even when the model size is similar, e.g., within the 7B models, Alpaca and Vicuna (∼ 40%) are more factual than MPT-Chat (30%) and StableLM (17%).Possible factors include the choice of the base LM, the data, and the training recipe (Hoffmann et al., 2022).
We highlight that this evaluation only considers factual precision, specifically in people biographies. A holistic evaluation of LMs should include other aspects of generations, such as fluency, coherence, relevance, consistency, and creativity, which are out of the scope of this paper.

Conclusion and Future Work
We introduced FACTSCORE, a new evaluation of the factual precision of long-form generation from LMs that breaks a generation down into a series of atomic facts and computes the fraction of facts supported by a given knowledge source.
We first performed an extensive human evaluation, finding that commercial, state-of-the-art LMs, i.e., InstructGPT, ChatGPT, and the search-augmented PerplexityAI, make a substantial number of errors, e.g., ChatGPT has a FACTSCORE of 58%. Since human evaluation is time-consuming and costly, we proposed a model that estimates FACTSCORE, allowing an automatic evaluation of FACTSCORE. We found that our estimator, based on retrieval over a knowledge source and competitive language models, estimates FACTSCORE close to the ground truth, and showcased its application by evaluating a set of recently released LMs whose evaluation would have cost $26K if done by humans, providing new insights about them. Within four months of its initial release, FACTSCORE has actively been used in subsequent work, evaluating the factual precision of recently proposed models (Ye et al., 2023; Sun et al., 2023; Malaviya et al., 2023; Dhuliawala et al., 2023). As future work, we suggest: (1) considering other aspects of factuality, such as recall (coverage of factual information); (2) further improving the estimator for a better approximation of factual precision; and (3) leveraging FACTSCORE to correct model generations (briefly explored in Appendix C).

Limitations
Scope of FACTSCORE. All of our experiments focus on people biographies and Wikipedia, because many LMs can generate biographies with objective and specific facts (rather than subjective and vague ones) and Wikipedia has high coverage for them. FACTSCORE can be applied to a broader domain, e.g., text about recent events, whose knowledge source can be a collection of news articles, or text about scientific findings, whose knowledge source can be a collection of scientific literature. We present a proof of concept in Appendix B.5 and leave further study for future work.
Due to the assumptions made in Section 3.1, FACTSCORE is not applicable when the facts are more nuanced, open-ended, and debatable (Chen et al., 2019; Xu et al., 2023), or with a knowledge source whose text frequently conflicts with itself (Wadden et al., 2022). Moreover, FACTSCORE may not be suitable for human-written text that is nuanced and includes intentional or implicit deception.
Limitations of our estimator. While our estimator closely approximates humans and provides a consistent ranking over a large set of LMs, it is not perfect in individual judgments, and the best variant depends on how close a generation is to human-written text and on its linguistic complexity. Future work can investigate how the distribution of model generations affects the performance of the estimator and further improve the estimator.
Beyond factual precision. FACTSCORE focuses on factual precision, i.e., whether each piece of information in a generation is factually supported by a reliable source of knowledge, which is only one aspect of the broader factuality problem. For instance, FACTSCORE does not consider factual recall: the coverage of information in a generation. FACTSCORE does not penalize a model that abstains from responding too frequently or generates fewer facts, which can be unfair since there is an inherent trade-off between precision and recall. Moreover, the boundary between precision and recall is often blurry; e.g., it is possible that, even if every piece of information in a generation is supported, it misses a significant piece of information that should have been mentioned in order to be considered a correct response to the input prompt (example in Table 6). We leave a more holistic evaluation of factuality for future work, and recommend reporting FACTSCORE together with the % of abstention and the average number of atomic facts (as we did in Section 4.3).

Each HIT includes the annotation steps for one prompt, because we find it saves annotation time in total. 10% of the HITs have two workers assigned in order to calculate the agreement rate; the rest have one worker assigned. The agreement rates are 96%, 90%, and 88% for InstructGPT, ChatGPT, and PerplexityAI, respectively. Appendix A.5 discusses disagreement cases in more detail. The full instructions and the interface are provided in Figure 6 and Figure 7, respectively.

A.4 Examples in annotated data
Table 7 provides examples of the human-annotated data, each atomic fact with an assigned label. Supported and Not-supported respectively indicate that Wikipedia supports the fact and that it does not support the fact (either contradicts it or does not contain any evidence). Irrelevant indicates the fact is irrelevant to the input prompt, which can further be divided into two cases: (1) the fact depends on other facts because it expands on previous facts in a generation, and such other facts are Not-supported, e.g., the first example in Table 7, and (2) the entire sentence is irrelevant to the prompt, independent of other facts in a generation, e.g., the second example in Table 7. The second case is rare with InstructGPT and ChatGPT but happens considerably with PerplexityAI. This is because PerplexityAI often directly copies search results even if they are largely irrelevant to the input prompt. This is in agreement with concurrent work from Liu et al. (2023a) showing that generative search engines like PerplexityAI copy incorrect search results and generate text that is irrelevant to the input query.

A.5 Qualitative Analysis
Analysis of disagreement cases. We analyze the cases where two annotators assigned to the same generation disagree on a precision label for the same atomic fact. A categorization is provided in Table 8. 70% of disagreements are due to inherent debatability over whether or not the fact is supported by a given source of knowledge, not satisfying Assumption 1 in Section 3.1. This happens because there can be multiple interpretations of a fact, it is debatable whether or not a piece of information can be inferred from a piece of text, or the atomic fact is subjective. For instance:
• Gerhard Fischer is an inventor: Gerhard Fischer is widely known as the inventor of a metal detector, and even the title of the Wikipedia article is "Gerhard Fischer (inventor)". However, it later turns out that he did not invent the metal detector; rather, he commercialized it.
• Chadwick Boseman was a producer: Chadwick Boseman is widely known for another profession (acting), and there is no text that mentions him as a producer. However, he produced one music video.
Nonetheless, since our agreement rate is fairly high (91% on average), we believe such cases are rare in our particular domain of people biographies. We include more discussion of other domains where such cases may be more frequent in the Limitations section.
Coverage of English Wikipedia. While factual precision is inherently a function of the knowledge source given as part of the input, a potential concern is whether English Wikipedia has sufficient coverage to serve as the knowledge source for evaluating people biographies. For instance, it is possible that, especially for rare entities, the coverage of information in Wikipedia is not high enough, and LMs may be penalized for generating information that is true even though it is not supported by Wikipedia (i.e., it is supported by other sources on the web).
To quantify this effect, we randomly sample 30 unsupported facts from ChatGPT about people whose frequency categories are either 'rare' or 'very rare', and then validate them against the entire web. We found that 10% (3 out of 30 facts) are in fact supported, even though they are not supported by Wikipedia. One example is [Hibo] Wardere published her memoir titled "Cut: One Woman's Fight Against FGM in Britain Today", which is not mentioned in Wikipedia but is found in Google Books.
Nonetheless, we found that Wikipedia has high coverage and mentions most of the important information that we were able to find in any other sources on the web. This agrees with prior work that treated Wikipedia as a general knowledge source for the same reason (Chen et al., 2017; Petroni et al., 2021).

B Details of Estimators

B.1 Implementation details
As LM EVAL , we use the best open LMs and the best commercial LM at the time of conducting our experiments: LLAMA 65B (Touvron et al., 2023) and LLAMA 7B trained on Super-NaturalInstructions (Inst-LLAMA; Wang et al., 2022) as the former, and ChatGPT (OpenAI, 2022) as the latter. For computing nonparametric probabilities, we use a single-mask variant of NPM with BM25, as in the original paper (Min et al., 2023), and use 0.3 as the thresholding hyperparameter.
For passage retrieval, we use Generalizable T5-based Retrievers (GTR; the large variant), a dense passage retrieval system (Ni et al., 2022). We restrict retrieved passages to come from the topic entity's page, and use k = 5. We find our estimator is not sensitive to the choice of retrieval system (ablations are provided in Appendix B.3). As a retrieval corpus, we use the English Wikipedia dump from 04/01/2023, which is around the time the data annotation was completed, and split each page into passages of up to 256 tokens.
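The retrieval step above can be sketched as follows. This is a simplified stand-in: we split a page into passages of up to 256 whitespace tokens and score passages by lexical overlap instead of GTR's dense embeddings, keeping the top k = 5.

```python
def split_passages(page_text, max_tokens=256):
    """Split a Wikipedia page into passages of up to `max_tokens` tokens."""
    tokens = page_text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

def retrieve(atomic_fact, passages, k=5):
    """Rank passages by token overlap with the fact (a stand-in for GTR)."""
    fact_tokens = set(atomic_fact.lower().split())
    def overlap(passage):
        return len(fact_tokens & set(passage.lower().split()))
    return sorted(passages, key=overlap, reverse=True)[:k]

page = ("Bridget Moynahan is an American actress . "
        "She appeared in Blue Bloods . "
        "She was born in Binghamton , New York .")
top = retrieve("Bridget Moynahan is an actress",
               split_passages(page, max_tokens=8), k=1)
```

Restricting candidates to the topic entity's own page, as done in the paper, keeps the candidate pool small and topical before ranking.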
Additional baselines. We also compare with Self-check LM, a method from concurrent work by Manakul et al. (2023). Self-check LM needs multiple samples generated from LM SUBJ . It validates a given atomic fact by prompting LM EVAL conditioned on each generated sample, making a judgment (Supported or not) from each, and aggregating the results through a majority vote. This method assumes that (1) LM SUBJ is available at the time of evaluation and (2) the outputs from LM SUBJ are nondeterministic, which makes it not applicable to PerplexityAI. This distinction between aggregated and individual judgments parallels the distinction between system-level and segment-level metrics in developing evaluation metrics for machine translation (Ma et al., 2019; Thompson and Post, 2020) and summarization (Bhandari et al., 2020; Deutsch et al., 2021).
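The majority-vote aggregation in Self-check LM can be sketched as follows. `judge` stands in for prompting LM EVAL conditioned on one sample; the toy judge here is a hypothetical placeholder, not the method's actual prompt.

```python
def self_check(atomic_fact, samples, judge):
    """Majority vote over per-sample judgments; True = Supported."""
    votes = [judge(atomic_fact, sample) for sample in samples]
    return sum(votes) > len(votes) / 2

# Toy judge: "supported" if every word of the fact appears in the sample.
def toy_judge(fact, sample):
    words = set(sample.lower().split())
    return all(w in words for w in fact.lower().split())

samples = ["she is an actress and model",
           "she is an actress",
           "he is a producer"]
self_check("she is an actress", samples, toy_judge)  # 2 of 3 votes -> True
```

The dependence on multiple samples is exactly why the method cannot be applied to a system with (semi-)deterministic outputs such as PerplexityAI.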
Results. Results on F1 MICRO are reported in Table 9. Self-check LM outperforms No-context LM by 4-11%, which confirms findings from Manakul et al. (2023). However, both significantly underperform methods that use retrieval. This contrasts with Manakul et al. (2023), which reports that Self-check without retrieval achieves performance close to that with retrieval, likely because the data in Manakul et al. (2023) contains more frequent entities. The fact that retrieval significantly helps is consistent with the findings in Section 4.2 with ER as the metric.
Adding NP improves Retrieve→LM by 2-9%, again consistent with the findings in Section 4.2. This is likely because Retrieve→LM often makes incorrect predictions when there is a strong bias from the LM or there are distracting passages, and considering nonparametric probabilities makes the model more robust to these factors. For instance, given the unsupported fact Samuel Oboh is Nigerian, No-context LM, Self-check LM, and Retrieve→LM predict Supported due to a strong name-nationality bias, while NPM correctly predicts Not-supported based on the passage Samuel Oboh ... is a Canadian architect, manager, ....
Using a stronger LM EVAL significantly improves F1 MICRO . It is worth noting that these results are somewhat different from the findings in Section 4.2, where ChatGPT is not necessarily better than LLAMA+NP. This is because, although ChatGPT is better at validating each individual atomic fact, most errors from ChatGPT come from incorrectly assigning Supported to Not-supported facts, resulting in an overestimation of FACTSCORE. In contrast, LLAMA+NP is biased toward neither overestimation nor underestimation of factual precision, so its aggregated factual precision is closer to the ground truth. This is similar to the trade-off between system-level and segment-level correlations in summarization evaluation (Bhandari et al., 2020; Deutsch et al., 2021).

B.3 Ablations
QA Prompting vs. TF Prompting. As described in Section 4.1, we include "True or False?" as part of the prompt, so-called TF Prompting. An alternative is QA Prompting, which generates a question and an expected answer from the atomic fact, obtains an answer to the generated question independently of the expected answer, and compares the expected and predicted answers. This approach has been widely studied in the summarization literature and in recent work on factual precision (Kryscinski et al., 2020; Wang et al., 2020; Gao et al., 2022; Manakul et al., 2023). Table 11 provides a comparison between the two types of prompting. The TF approach significantly outperforms the QA approach, consistently over all methods. Our further analysis finds that this is because generated questions are often overly vague or ambiguous. For instance, given the supported fact Samuel Oboh is an architect, the LM generates What is Samuel Oboh's job? as the question and Architect as the expected answer, but the obtained answer is Vice President.
Although both Architect and Vice President are correct, they are not the same, so the model incorrectly predicts Not-supported. Such cases make the model over-predict Not-supported, leading to many incorrect predictions.
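TF Prompting can be sketched as below. The exact prompt wording here is our illustration of the format, not necessarily the paper's template: retrieved passages are followed by the atomic fact and a "True or False?" question.

```python
def tf_prompt(passages, atomic_fact):
    """Build a TF-Prompting input: context passages + fact + True/False query."""
    context = "\n\n".join(passages)
    return (f"{context}\n\n"
            f"Input: {atomic_fact} True or False?\n"
            f"Output:")

prompt = tf_prompt(["Samuel Oboh ... is a Canadian architect, manager ..."],
                   "Samuel Oboh is an architect.")
```

Unlike QA Prompting, no intermediate question is generated, which sidesteps the vague-question failure mode described above.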
Impact of the choice of retrieval. Table 12 indicates that all retrieval systems are equally good and that Retrieve→LM is not sensitive to the choice of the retrieval system.
Qualitative analysis. A categorization of errors made by Retrieve→LM based on ChatGPT is provided with Table 13 at the end of this appendix.

+ Atomic Facts. Additionally, we explore whether adding atomic facts and their labels assists a model with fine-grained editing. Specifically, after the input sentence we add information to the prompt of the form Fact 1 (True/False): <atomic fact 1> Fact 2 (True/False): <atomic fact 2> .... This information is also provided in the exemplars.
Non-edit baselines. Finally, we add trivial baselines to lower-bound our editing metrics. Specifically, we measure the performance of input copying (no edits), as well as an editor that randomly drops or replaces a random 25% subset of tokens.
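The trivial baselines above can be sketched as follows. The replacement vocabulary and drop/replace split are our illustrative assumptions; the description only specifies a random 25% subset of tokens being dropped or replaced.

```python
import random

def copy_baseline(tokens):
    """The no-edit baseline: return the input unchanged."""
    return list(tokens)

def noise_baseline(tokens, rate=0.25, vocab=("the", "a", "of"), seed=0):
    """Randomly drop or replace ~`rate` of the tokens."""
    rng = random.Random(seed)
    out = []
    for t in tokens:
        if rng.random() < rate:
            if rng.random() < 0.5:
                continue                   # drop this token
            out.append(rng.choice(vocab))  # or replace it with a random token
        else:
            out.append(t)
    return out
```

Because these editors ignore the factual content entirely, any learned editor should clear them, which is what the results in Table 14 verify.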

C.2 Evaluation
In our data collection process (Section 3.3), along with our verification data we also collected gold-standard human-written edits. Let X = x_1, ..., x_{N_X} be the input sentence and G = g_1, ..., g_{N_G} be the gold edited sentence. We evaluate the quality of the model-generated edit E = e_1, ..., e_{N_E} using three automatic metrics.

(1) Error Localization (ErrLoc): Our first metric measures how well the editor identifies errors within the input sentence. Specifically, we first create a "token preservation string", marking each token x_i in the input sentence X as "Preserved" or "Not Preserved". We then compute the macro-averaged F1 score between the token preservation strings derived from the gold edit and from the model-generated edit. We remove stopwords and punctuation and lowercase all words before performing this calculation.
To equally weigh every sentence, F1 scores are independently computed for each sentence before a final averaging.
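The ErrLoc computation for a single sentence can be sketched as follows, under our reading of the description: macro-averaging is over the two classes ("Preserved" / "Not Preserved"), and stopword/punctuation filtering is omitted for brevity.

```python
def preservation_labels(input_tokens, edit_tokens):
    """Label each input token True ("Preserved") if it survives in the edit."""
    kept = set(t.lower() for t in edit_tokens)
    return [t.lower() in kept for t in input_tokens]

def class_f1(gold, pred, positive):
    """F1 treating `positive` as the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def err_loc(input_tokens, gold_edit, model_edit):
    """Macro-averaged F1 between gold and model token preservation strings."""
    gold = preservation_labels(input_tokens, gold_edit)
    pred = preservation_labels(input_tokens, model_edit)
    return (class_f1(gold, pred, True) + class_f1(gold, pred, False)) / 2
```

Per the text, these per-sentence scores are then averaged over sentences so that every sentence contributes equally.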
(2) Edit Correctness (EditCorr): Our second metric assesses the quality of the additional tokens added by the model-generated edit. Specifically, we compute the token-level F1 score (Rajpurkar et al., 2016) comparing the new tokens added by the gold edit G and the new tokens added by the model-generated edit E. More concretely, letting A_G and A_E denote the sets of new tokens added by G and E respectively, EditCorr = HM(||A_G ∩ A_E|| / ||A_E||, ||A_G ∩ A_E|| / ||A_G||), where || · || is the set cardinality and HM denotes a harmonic mean. For this metric, we discard data points where the gold edit did not add new tokens. Similar to ErrLoc, we also remove stopwords, remove punctuation, and lowercase strings before calculating EditCorr scores.
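A sketch of EditCorr consistent with the description above (stopword/punctuation filtering again omitted for brevity):

```python
def new_tokens(input_tokens, edit_tokens):
    """Tokens introduced by the edit that were not in the input."""
    return set(edit_tokens) - set(input_tokens)

def edit_corr(input_tokens, gold_edit, model_edit):
    """Harmonic mean of precision/recall over newly added tokens."""
    a_g = new_tokens(input_tokens, gold_edit)   # new tokens in the gold edit
    a_e = new_tokens(input_tokens, model_edit)  # new tokens in the model edit
    if not a_g:
        return None  # data points where the gold edit adds nothing are discarded
    overlap = len(a_g & a_e)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a_e), overlap / len(a_g)
    return 2 * precision * recall / (precision + recall)
```

Scoring only the newly added tokens keeps the metric focused on the correction itself rather than the unchanged parts of the sentence.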
(3) SIM alignment (SimAl): Finally, due to the large output space of possible edits, we also adopt a metric that rewards paraphrases of the gold edit. We use semantic similarity embeddings from Wieting et al. (2022), which map paraphrases to nearby parts of a vector space. We measure the similarity between the model edit E and the gold edit G, normalized by the similarity between G and the original input X: SimAl = s(G, E) / s(G, X), where s(A, B) is the semantic similarity score (normalized to [0, 1]) from the model of Wieting et al. (2022). Intuitively, this metric measures how much closer G and E are compared to G and X.

C.3 Results
We present our editing results in Table 14. Overall, we find the following.

All editing models perform better than trivial lower bounds. All editor models outperform lower-bound baselines such as random noise. This holds even in the no-context LM setting, where ChatGPT is editing its own output (or the search-engine-augmented PerplexityAI's outputs) but can still perform non-trivial corrections (6.8 EditCorr for ChatGPT correcting its own outputs vs. 0.1 for the random-noise editor baseline).
Retrieval significantly helps editing performance. Across all base language models and metrics, augmenting the editor with retrieved paragraphs boosts performance (6.8 → 16.8 EditCorr, 4.0 → 9.5 SimAl for ChatGPT correcting its own outputs). We hypothesize that the internal parametric knowledge in ChatGPT has insufficient information about the topic (as we also observed in Section 3.4) to perform fine-grained editing, and that using external knowledge from Wikipedia greatly simplifies error localization and correction. This also corroborates our findings in Section 4.2.
Please breakdown the following sentence into independent facts: He made his acting debut in the film The Moon is the Sun's Dream (1992), and continued to appear in small and supporting roles throughout the 1990s.
-He made his acting debut in the film.
-He made his acting debut in The Moon is the Sun's Dream.
-The Moon is the Sun's Dream is a film.
-The Moon is the Sun's Dream was released in 1992.
-After his acting debut, he appeared in small and supporting roles.
-After his acting debut, he appeared in small and supporting roles throughout the 1990s.
Please breakdown the following sentence into independent facts: He is also a successful producer and engineer, having worked with a wide variety of artists, including Willie Nelson, Tim McGraw, and Taylor Swift.

Figure 1 :
Figure 1: An overview of FACTSCORE, the fraction of atomic facts (pieces of information) supported by a given knowledge source. FACTSCORE allows a more fine-grained evaluation of factual precision; e.g., in the figure, the top model gets a score of 66.7% and the bottom model gets 10.0%, whereas prior work would assign 0.0 to both. FACTSCORE can either be based on human evaluation or be automated, which allows evaluation of a large set of LMs with no human effort.

Figure 2 :
Figure 2: FACTSCORE across varying frequency levels of human entities (top) and relative positions in a generation (bottom). FACTSCOREs are lower as the rarity of the entities increases and as the position of the fact in the generation is later.
The top of Figure 2 shows factual precision over varying frequency levels of topic entities (humans) in the pretraining corpora (see Appendix A.1).
-He is successful.
-He is a producer.
-He is a engineer.
-He has worked with a wide variety of artists.
-Willie Nelson is an artist.
-He has worked with Willie Nelson.
-Tim McGraw is an artist.
-He has worked with Tim McGraw.
-Taylor Swift is an artist.
-He has worked with Taylor Swift.
Please breakdown the following sentence into independent facts: In 1963, Collins became one of the third group of astronauts selected by NASA and he served as the back-up Command Module Pilot for the Gemini 7 mission.
-Collins became an astronaut.
-Collins became one of the third group of astronauts.
-Collins became one of the third group of astronauts selected.
-Collins became one of the third group of astronauts selected by NASA.
-Collins became one of the third group of astronauts selected by NASA in 1963.
-He served as the Command Module Pilot.
-He served as the back-up Command Module Pilot.
-He served as the Command Module Pilot for the Gemini 7 mission.
Please breakdown the following sentence into independent facts: In addition to his acting roles, Bateman has written and directed two short films and is currently in development on his feature debut.
-Bateman has acting roles.
-Bateman has written two short films.
-Bateman has directed two short films.
-Bateman has written and directed two short films.
-Bateman is currently in development on his feature debut.
Please breakdown the following sentence into independent facts: Michael Collins (born October 31, 1930) is a retired American astronaut and test pilot who was the Command Module Pilot for the Apollo 11 mission in 1969.
-Michael Collins was born on October 31, 1930.
-Michael Collins is retired.
-Michael Collins is an American.
-Michael Collins was an astronaut.
-Michael Collins was a test pilot.
-Michael Collins was the Command Module Pilot.
-Michael Collins was the Command Module Pilot for the Apollo 11 mission.
-Michael Collins was the Command Module Pilot for the Apollo 11 mission in 1969.
Please breakdown the following sentence into independent facts: He was an American composer, conductor, and musical director.
-He was an American.
-He was a composer.
-He was a conductor.
-He was a musical director.
Please breakdown the following sentence into independent facts: She currently stars in the romantic comedy series, Love and Destiny, which premiered in 2019.
-She currently stars in Love and Destiny.
-Love and Destiny is a romantic comedy series.
-Love and Destiny premiered in 2019.
Please breakdown the following sentence into independent facts: During his professional career, McCoy played for the Broncos, the San Diego Chargers, the Minnesota Vikings, and the Jacksonville Jaguars.
-McCoy played for the Broncos.
-McCoy played for the Broncos during his professional career.
-McCoy played for the San Diego Chargers.
-McCoy played for the San Diego Chargers during his professional career.
-McCoy played for the Minnesota Vikings.
-McCoy played for the Minnesota Vikings during his professional career.
-McCoy played for the Jacksonville Jaguars.
-McCoy played for the Jacksonville Jaguars during his professional career.
Please breakdown the following sentence into independent facts:

Table 15 :
A prompt given to InstructGPT to generate atomic facts for a given sentence. Model-generated atomic facts were revised by human editors.

Figure 6 :
Figure 6: Instructions for data annotation in Section 4. We also provided a demonstration video and gave one-on-one feedback during the qualification task.

Figure 7 :
Figure 7: The interface for data annotation in Section 4. Annotators were able to navigate Wikipedia on the left. They annotate three generations from three LMs for the same prompt in one HIT, since this saves time. Since completing one HIT takes a considerable amount of time (about 25 minutes), we added a function that allows annotators to save their work at any stage in the middle of the HIT.
Gen: William Waldegrave's grandfather was James II and VII.
Wiki: His father's title was created ... for the diplomat and ambassador James Waldegrave, 1st Earl Waldegrave, whose grandfather was James II and VII.

Gen: Some of [Julia Faye's] notable films include ... "Cleopatra" (1934).
Comment: No mention of Cleopatra on the Julia Faye page, and no mention of Julia Faye on the Cleopatra page.

Gen: [Kang Ji-hwan] has donated money to various charities and organizations over the years.
Comment: No such mention on the Kang Ji-hwan page.

Gen: His achievements, as an actor and as a cultural force, will surely prove to be as heroic as those of the characters he portrayed.
Wiki: Culture writer Steve Rose, in The Guardian, wrote: "Chadwick Boseman began his career playing African American icons and pioneers; he ends it as one himself. His [...] achievements, as an actor and as a cultural force, will surely prove to be as heroic as those of the characters he portrayed."

Gen: [Zamfir Arbore]'s life is not well-documented, and there is little information available about him.

Gen: Kick (2014) that brought [Sajid Nadiadwala] various debutant director awards.
Wiki: 2015, IIFA Award for Debut Director, Kick. (...) Kick brought him various debutant director awards.
Comment: The first text is from a table that indicates he won one award (accurate). The second is inaccurate, incorrectly citing a news article.

Annotation error (10.0%)
Gen: [Zamfir Arbore] was part of the staff of Românul.
Wiki: The Românul staff came to include Zamfir Arbore.
Comment: Mentioned in the Românul page but not in the Zamfir Arbore page.

Table 2 :
Categorization of precision errors (Not-supported) from PerplexityAI (Section A.5). Gen indicates the generation from PerplexityAI, and Wiki indicates evidence text from Wikipedia. Comment indicates our comments.

Table 3 :
Error Rate (ER) along with the FACTSCORE estimated by each model (FS). 'retrv' indicates whether or not retrieval is used, and a ✓ in 'ranking' indicates that the ranking between the three LM SUBJ s rated by the model is consistent with the ground-truth ranking. + and − respectively indicate that the estimation is an overestimation or an underestimation by more than 5% absolute. Red bold indicates the best (lowest) ER. See Appendix B.2 for results in other metrics that consider individual judgments instead of aggregated ones. We use LLAMA 7B trained on Super-NaturalInstructions (Inst-LLAMA; Touvron et al., 2023; Wang et al., 2022) and ChatGPT as LM EVAL , and Generalizable T5-based Retrievers (GTR; Ni et al., 2022) for passage retrieval. See Appendix B.1 for more implementation details.

Table 4 :
A set of twelve LMs evaluated in Section 4.3. All models are tuned for instruction following or chat. 'Use other LMs' indicates whether the model is trained on any data that includes outputs of another model. 'Open' indicates that model weights are publicly available.

Table 5 :
Statistics of 500 model-generated bios in our unlabeled data from 12 LMs as well as human-written bios. '% responding' indicates the % of generations that do not abstain from responding. '# facts / res' indicates the # of atomic facts per response. LMs are sorted based on # of facts per response. See Figure 3 for their FACTSCOREs.

Figure 3 :
Ranking between 13 subjects (human and 12 LMs), rated by the two best variants of our estimator: ChatGPT (left) and LLAMA+NP (right), both with retrieval. Scores from the two estimators have a Pearson's r of 0.99. See Table 5 for the % of responding and the # of atomic facts per response of each LM. The variance in estimation based on different subsets of prompts is reported in Figure 5 of Appendix B.4.

Table 6 :
An example whose factual precision is high but recall is low. The generation does not mention how Mary I of England got back into the line of succession and eventually became queen.

The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. Sewon Min is supported by a J.P. Morgan fellowship, and Kalpesh Krishna was supported by the Google PhD Fellowship.

Table 7 .
Gen: Gerhard Fischer is an inventor.
Wiki: Gerhard Fischer (inventor). ... was first patented by Dr. Gerhard Fischer in 1931. A metal detector had been invented some forty years earlier (1881) by Alexander Graham Bell ...

Gen: Chadwick Boseman was a producer.
Comment: Chadwick Boseman is not known as a producer, but produced one music video.

Gen: Leach has since become a member of the England Test team.
Comment: Leach is a member of the England Test team, but since when is less clear.

Gen: He made his Test debut for England in March 2018.
Wiki: On 16 March 2018, he was called up to England's Test squad (...) He made his debut in the second Test in Christchurch.

Gen: The building was the first LEED-certificated building in Edmonton.
Wiki: (..) became the first project in the City of Edmonton to achieve a LEED Gold status.

Subjective (21%)
Gen: Chadwick Boseman became an African American pioneer.
Wiki: Culture writer Steve Rose, in The Guardian, said that Boseman's career was revolutionary and he "leaves behind a gamechanging legacy" (...) Rose wrote: "Chadwick Boseman began his career playing African American icons and pioneers; he ends it as one himself."

Gen: [Tim Fischer] was an Ambassador to the Holy See from 2009 to 2012.
Wiki: ... was later Ambassador to the Holy See from 2009 to 2012. (...) Australian Ambassador to the Holy See 2008-2012
Comment: The plain text and the table of the Tim Fischer page, as well as the Australian Ambassador to the Holy See page, are inconsistent about his start year.

Gen: Jack Leach is a left-handed batsman.
Comment: Mentioned in the England cricket team page, in the Current Squad table.

Table 8 :
Categorization of disagreement cases. Gen indicates the generation from PerplexityAI, and Wiki indicates evidence text from Wikipedia. Comment indicates our comments.

Table 9 :
Results in F1 MICRO using Inst-LLAMA 7B as LM EVAL . 'retrv' indicates whether or not retrieval is used. Self-check is not applicable to PerplexityAI, whose outputs are semi-deterministic. Bold indicates the best performance.

Table 10 :
Ablation in F1 MICRO on the choice of LM EVAL . 'retrv' indicates whether or not retrieval is used. Bold and Red bold indicate the best F1 within open-access LMs and commercial LMs, respectively.

Table 11 :
Results on F1 MICRO , comparing QA Prompting and TF Prompting. We use Inst-LLAMA 7B as LM EVAL . Self-check is not applicable to PerplexityAI since its outputs are semi-deterministic. Bold indicates the best F1 MICRO .

Table 12 :
Results on F1 MICRO , comparing different retrieval systems: BM25, GTR Large, and GTR xLarge, all with Retrieve→LM based on Inst-LLAMA 7B. Bold indicates the best F1 MICRO .

Table 13 :
Categorization of 30 samples incorrectly predicted by Retrieve→LM based on ChatGPT.
Table 13 categorizes errors made by Retrieve→LM based on ChatGPT, the evaluator with the best F1 MICRO . 70% of the errors are due to retrieved passages not providing direct evidence (either support or contradiction). These cases are difficult even for state-of-the-art retrieval systems and language models, because validating such facts often requires reading the entire page rather than a single passage, e.g., verifying that an actor did not appear in a particular film. 17% of errors are made because ChatGPT is distracted by other passages, although it assigns the correct label if only the particular, correct passage is given.