Evaluating Verifiability in Generative Search Engines

Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall; all statements are fully supported by citations) and accurately (high citation precision; every citation supports its associated statement). We conduct human evaluation to audit four popular generative search engines -- Bing Chat, NeevaAI, perplexity.ai, and YouChat -- across a diverse set of queries from a variety of sources (e.g., historical Google user queries and dynamically-collected open-ended questions on Reddit). We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5% of generated sentences are fully supported by citations and only 74.5% of citations support their associated sentence. We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. We hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems.


Introduction
Generative search engines fulfill user information needs by directly generating responses to input queries, along with in-line citations (Figure 1). Existing generative search engines are rapidly gaining users: in March 2023, Microsoft reported that "roughly one third of daily preview users are using [Bing] Chat daily", and that Bing Chat served 45 million chats in the first month of its public preview (Mehdi, 2023). Generative search engines have the potential to transform how people find information online, but generated responses from existing large language model-backed generative search engines may not always be accurate (Maynez et al., 2020). Given their potential and rapid mainstream adoption, it is critical to evaluate these systems to better understand their potential limitations (akin to prior work in algorithmic auditing; Metaxas and Pruksachatkun, 2017; Buolamwini and Gebru, 2018; Kiritchenko and Mohammad, 2018; Robertson et al., 2018; Metaxa et al., 2019; Green and Chen, 2019; Birhane et al., 2022, inter alia).

Figure 1: Generative search engines answer user queries by generating a tailored response, along with in-line citations. However, not all generated statements are fully supported by citations (citation recall), and not every citation supports its associated statement (citation precision). The example shows a generated response about recent discoveries from the James Webb Space Telescope, together with its cited webpages; some generated statements may not be fully supported by citations, while others are fully supported.
A prerequisite trait of a trustworthy generative search engine is verifiability, that is, each generated statement about the external world should be fully supported by a set of in-line citations, and each provided citation should support its associated statement. Verifiability enables readers to easily check that any generated statement is supported by its cited source.
We conduct a human evaluation to audit four popular commercial generative search engines (Bing Chat, NeevaAI, perplexity.ai, and YouChat) across a diverse set of information-seeking queries (e.g., various types of historical Google user queries from NaturalQuestions (Kwiatkowski et al., 2019), dynamically-collected open-ended questions from Reddit; see Appendix A for examples).
For each query-response pair, we use human evaluation to measure a variety of dimensions: 1. fluency (whether the generated text is fluent and cohesive; §2.2); 2. perceived utility (whether the generated answer is helpful and informative; §2.2); 3. citation recall (the proportion of generated statements about the external world that are fully supported by their citations; §2.3); and 4. citation precision (the proportion of generated citations that support their associated statements; §2.4). A trustworthy generative search engine should achieve high citation recall and precision, indicating that its generated citations are comprehensive (every generated statement is fully supported by citation) and correct (every citation supports its associated statement).
We find that existing generative search engine responses often have high fluency and perceived utility (§4.1), but frequently contain unsupported statements or inaccurate citations (low citation recall and precision; §4.2). On average, merely 51.5% of generated sentences are fully supported with citations (citation recall), and only 74.5% of citations support their associated sentence (citation precision). Furthermore, citation precision is inversely correlated with perceived utility (r = −0.96); the responses that seem more helpful are often those with inaccurate citations (§4.3). This facade of trustworthiness increases the potential for existing generative search engines to mislead users. For example, in Figure 1, a user with little background knowledge about the James Webb Space Telescope (motivating a query about its recent discoveries) will likely struggle to identify unsupported statements in the generated response. We hypothesize that citation precision is inversely correlated with perceived utility because generative search engines often copy or closely paraphrase from their cited webpages (§4.4). This improves citation precision because copied text is often supported by the cited webpage, but decreases perceived utility when copied statements are irrelevant to the query or the rest of the generated response.
We make the following contributions: first, we define the citation recall and citation precision evaluation metrics, which aim to encourage the development of systems that cite comprehensively and correctly. Second, we conduct a human evaluation of four popular generative search engines, finding that responses are broadly fluent and appear useful, but frequently contain unsupported statements and inaccurate citations, increasing their potential to mislead users. Third, we observe that perceived utility is inversely correlated with citation precision in existing generative search engines, and hypothesize that this inverse correlation occurs when some systems copy or closely paraphrase from cited webpages. To facilitate further work on developing trustworthy generative search engines, we have released our human evaluation annotations.

Human Evaluation of Fluency, Perceived Utility, and Verifiability

In this section, we formalize the inputs and outputs of the generative search engines we study, describe the evaluation of fluency and perceived utility, and define and describe the evaluation of citation recall and precision. Citation recall and precision are designed to reward systems that cite comprehensively (i.e., high recall; all statements are fully supported by citations) and accurately (i.e., high precision; every citation supports its associated statement). We also define citation F1, a metric that combines citation precision and citation recall.

Task Formulation
Given a user query q as input, a generative search engine produces a text response r, which is a string with embedded in-line citations. For the example in Figure 1, the query q is "What are the latest discoveries from the James Webb Space Telescope?" and the response r is the string paragraph "The James Webb Space Telescope ... used to study the next interstellar interloper [3].", with embedded citations "[1]", "[2]", and "[3]".
To evaluate citation precision and recall, we first segment the response r into a set of n statements S = {s_1, ..., s_n}. In this work, the segmentation S is the set of sentences in the response r. For each statement s_i ∈ S, we construct a (possibly empty) set C_i = {c_{i,1}, ..., c_{i,k}} of k citations associated with the statement s_i, where c_{i,j} is the jth citation associated with the ith response statement. For each citation c_{i,j}, we have a URL u_{i,j} and its contents p_{i,j}. In this work, C_i is the set of citations that occur in s_i (e.g., for s_i = "Blueberries[1], cherries[2], and grapes[3] grow on trees.[4]", C_i contains the citations [1], [2], [3], and [4]). In practice, a sentence may contain multiple independently-verifiable claims (e.g., conjuncts such as "Cups can be made of glass [1] or plastic [2]."), and a single in-line citation's scope is often ambiguous (e.g., a cite marker after two statements could be interpreted as either supporting both statements, or merely the final one); we leave finer-grained evaluation to future work.
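As a concrete illustration of this formulation, the sentence segmentation and per-sentence citation sets can be sketched as follows. This is a minimal sketch, not the paper's actual pipeline: it assumes citations appear as bracketed numeric markers like "[1]" and that naive punctuation-based sentence splitting suffices.

```python
import re

def segment_response(response: str):
    """Segment a response r into sentence-level statements s_i and collect
    the in-line citation markers C_i occurring in each statement.
    Assumes citations are bracketed numbers; uses naive sentence splitting."""
    # Split on sentence-final punctuation followed by whitespace; trailing
    # markers like "trees.[4]" stay attached to their sentence.
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    statements = []
    for s in sentences:
        citations = re.findall(r"\[\d+\]", s)  # e.g., ["[1]", "[2]"]
        statements.append((s, citations))
    return statements
```

For the blueberries example above, this yields a single statement whose citation set contains all four markers, including the one placed after the final period.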

Measuring Fluency and Perceived Utility
To measure response fluency, annotators were shown the user query, the generated response, and the claim "The response is fluent and cohesive".We ask annotators to rate their level of agreement with the claim on a five-point Likert scale from Strongly Disagree to Strongly Agree.We use a similar process to measure perceived utility, asking annotators to rate their level of agreement with the claim "The response is a helpful and informative answer to the query".

Measuring Citation Recall
Citation recall is the proportion of verification-worthy statements that are fully supported by their associated citations (see Figure 2 for several examples). Thus, computing citation recall requires (i) identifying the verification-worthy statements in a response and (ii) evaluating whether each verification-worthy statement is fully supported by its associated citations.

Identifying verification-worthy statements.
Given the statements S in a response r, we first ask annotators to remove statements in the response that are not verification-worthy.We take the position that every generated statement about the external world is verification-worthy, even those that might seem obvious, trivially true, or "common sense".Generated statements may be incorrect, and statements that seem obvious to some readers may be less than obvious to others (e.g., "The Pope is Catholic").We believe that systems should aim to provide a source for all generated statements about the external world, enabling readers to easily verify any statement in a generated response.
In practice, almost all system-generated statements are verification-worthy. Notable exceptions include statements about the speaker (the system) itself (e.g., "As a language model, I do not have the ability to ban books.") and questions posed to the user (e.g., "Would you like to learn more?", generated by systems like Bing Chat and YouChat that are deployed in conversational settings).

Evaluating whether a verification-worthy statement is fully supported by its associated citations. Given the verification-worthy statements in a response r, annotators evaluate whether each statement is fully supported by its associated citations (see the sentences of the generated response in Figure 1 for examples). To collect these binary judgments, we use the attributable to identified sources (AIS) evaluation framework of Rashkin et al. (2022). In particular, a statement s_i is fully supported by its associated citations C_i if a generic hearer would affirm the statement "According to cited webpages C_i, s_i", within the context of the query q and response r, and unsupported otherwise.
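Given these per-statement binary AIS judgments, citation recall reduces to a simple ratio. The sketch below uses a list of booleans as a hypothetical encoding of annotator output (not the paper's actual data format); statements judged not verification-worthy are assumed to be excluded upstream.

```python
def citation_recall(statement_judgments):
    """Citation recall: the fraction of verification-worthy statements
    judged (via AIS) to be fully supported by their associated citations.
    `statement_judgments` holds one boolean per verification-worthy
    statement in the response."""
    if not statement_judgments:
        return 0.0
    return sum(statement_judgments) / len(statement_judgments)
```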

Measuring Citation Precision
Citation precision is the proportion of generated citations that support their associated statements (Figure 2). In contrast to citation recall, citation precision rewards systems for citing accurately: a response that cites every webpage on the Internet for each generated statement would have high citation recall, but low citation precision (since many articles are irrelevant and do not support their associated statement). To measure citation precision for a response r, we first ask annotators to judge whether each citation c_{i,j} contributes full, partial, or no support for its associated statement s_i (see cited webpages in Figure 1 for examples):

• Full support: all of the information in the statement is supported by the citation.
• Partial support: some of the information in the statement is supported by the citation, but other parts are not supported (e.g., missing or contradictory).
• No support: the citation does not support any part of the statement (e.g., the cited webpage is completely irrelevant or contradictory).

For statements that have multiple associated citations, we additionally ask annotators whether the union of its associated cited webpages collectively provides full support for the statement (a binary judgment). Similar to citation recall, we use the AIS evaluation framework of Rashkin et al. (2022) to collect these binary judgments.
To calculate citation precision, let T_fs be the number of citations that fully support their associated statement, and let T_ps be the number of citations that partially support their associated statement, where the associated statement is fully supported by the union of its associated citations and no associated citation fully supports the statement by itself. Let N be the total number of citations in the response. Then, the citation precision is (T_fs + T_ps) / N.
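This calculation can be sketched directly from the definition. The tuple encoding of the annotator judgments below is a hypothetical format introduced for illustration:

```python
def citation_precision(citation_judgments):
    """Citation precision, following the definition above.
    `citation_judgments` holds one (support, jointly_full) pair per citation:
      support      -- "full", "partial", or "none" for the citation alone
      jointly_full -- True if the union of the statement's citations fully
                      supports the statement and no single associated
                      citation fully supports it by itself
    """
    if not citation_judgments:
        return 0.0
    t_fs = sum(1 for support, _ in citation_judgments if support == "full")
    # A partially supporting citation counts only when it is part of a
    # citation set that jointly provides full support.
    t_ps = sum(1 for support, jointly_full in citation_judgments
               if support == "partial" and jointly_full)
    return (t_fs + t_ps) / len(citation_judgments)
```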

Citation F1
Citation F1 is a metric that combines citation precision and citation recall by taking their harmonic mean: citation F1 = (2 × citation precision × citation recall) / (citation precision + citation recall). To achieve a high citation F1, systems must have high citation precision and high citation recall.
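The harmonic mean in code form:

```python
def citation_f1(precision: float, recall: float) -> float:
    """Harmonic mean of citation precision and citation recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of its two arguments, a system cannot achieve high citation F1 by trading one metric against the other; the average precision (74.5%) and recall (51.5%) reported in this paper combine to a citation F1 of roughly 60.9%.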

Evaluation Setup
In this section, we describe the evaluated generative search engines ( §3.1), the diverse query distributions we use for evaluation ( §3.2), and the details of our human evaluation protocol ( §3.3).

Evaluated Generative Search Engines
We evaluate four existing commercial generative search engines: Bing Chat, NeevaAI, perplexity.ai, and YouChat. These systems pattern after prior work (e.g., Nakano et al., 2021; Menick et al., 2022; Glaese et al., 2022; Thoppilan et al., 2022, inter alia) and generate responses by conditioning large language models on the input query and retrieved content (e.g., search results from a conventional search engine). For each input, we save the system's first complete response (i.e., single-turn). Responses were scraped between late February and late March 2023. Note that the evaluated generative search engines have differing abstention rates (Table 1), which can make direct comparison difficult: one might expect that systems with higher abstention rates would also have higher evaluation performance, since they can simply abstain from generating responses to difficult queries (we do not find this to be the case in practice). NeevaAI abstains from responding on nearly 23% of evaluated queries, since its response is displayed within a conventional search engine results page. In contrast, Bing Chat, perplexity.ai, and YouChat respond to almost every user query.

(Footnote on jointly supporting citations: systems should not be penalized for aggregating information from multiple citations. Consider a statement about the health benefits of cycling with associated citations [1] and [2], where each citation contributes partial support for the entire statement: the first citation [1] only states that "Health benefits of cycling include improved cardiovascular health", and the second citation [2] only states that "Health benefits of cycling include lowered cholesterol levels". Taken together, the citations offer full support for the statement. Although these citations do not fully support the statement on their own, they still meaningfully contribute to its verifiability.)

Evaluated Query Distributions
To gain a broader understanding of the strengths and weaknesses of existing commercial generative search engines, we evaluate on a diverse set of queries from a variety of sources (e.g., Google user queries, open-ended Reddit questions, how-to queries) requiring knowledge from several different answer types (e.g., short textual spans, long-form paragraphs, lists, or tables). See Appendix A for example queries from each distribution. Each system is evaluated on 1450 queries: 150 randomly-sampled queries from each of AllSouls, davinci-debate, ELI5 (KILT), ELI5 (Live), and WikiHowKeywords, and 100 randomly-sampled queries for each of the seven NaturalQuestions subdistributions.
AllSouls. We evaluate systems on open-ended essay questions taken from the entrance exam (general paper component) for All Souls College, Oxford University. These questions cover topics including the arts, science, politics, literature, current events, and issues in education and sport.
davinci-debate. We evaluate systems on debate topics generated from text-davinci-003. To generate debate queries, we follow the procedure of Bakker et al. (2022); see Appendix B.1 for details.
ELI5. We take queries from the "Explain Like I'm Five" (ELI5) subreddit, where users provide long-form layperson-accessible answers to submitted questions. Submitted questions are required to admit objective explanations, and answering them often requires long-form textual responses. We consider two subdistributions of ELI5 queries: ELI5 (KILT) and ELI5 (Live). ELI5 (KILT) uses historical queries from the KILT ELI5 dataset (Fan et al., 2019; Petroni et al., 2021), drawn from posts created before July 2018. A retrieval-based system could hypothetically perform well on ELI5 (KILT) by simply identifying the query's source Reddit ELI5 post and copying its content. As a result, we also evaluate generative search engines on the ELI5 (Live) subdistribution, which increases ecological validity by evaluating systems on real user queries at their time of creation and reducing the incidence of search results with the query's exact keywords. We continuously listen to the stream of new Reddit ELI5 posts and immediately query generative search engines for responses whenever a new post is created. This ensures that the source ELI5 post will not have been indexed (and thus, cannot be retrieved) by conventional search engines, minimizing the possibility that the generative search engine has access to the source ELI5 post.

WikiHowKeywords. We evaluate systems on queries derived from WikiHow articles. We found that directly querying generative search engines with WikiHow article titles yields responses that largely paraphrase or copy text directly from WikiHow. As a result, we use text-davinci-003 to paraphrase article titles (e.g., "How to Cut An Avocado") into keyword queries (e.g., "cut avocado").

NaturalQuestions. We evaluate generative search engines on NaturalQuestions (Kwiatkowski et al., 2019) queries, stratified by their answer type. NaturalQuestions contains historical queries issued to the Google search engine coupled with long and short answers extracted from Wikipedia. We evaluate on queries from 7 NaturalQuestions subdistributions: queries with paragraph-type long answers (i) with and (ii) without short answers, queries with list-type long answers (iii) with and (iv) without short answers, queries with table-type long answers (v) with and (vi) without short answers, and finally (vii) queries with no long answer (and thus no short answer either).

Summary. In total, we evaluate existing generative search engines on 12 total query distributions. Eight query distributions are taken from prior work (ELI5 (KILT) and the seven NaturalQuestions query distributions), while four query distributions were constructed for this work: AllSouls, davinci-debate, ELI5 (Live), and WikiHowKeywords. These diverse settings provide broad coverage of several potential use cases and information needs, helping us gain a comprehensive understanding of systems' strengths and weaknesses.

Human Evaluation Protocol
Annotation process. Evaluating a single query-response pair requires human annotators to complete a three-step annotation procedure. The first step measures the response's fluency and perceived utility (§2.2), and the second and third steps provide the judgments necessary to measure citation recall (§2.3) and precision (§2.4). See Appendix C for screenshots of the annotation interface and Appendix D for the annotation guidelines.
Annotator recruitment and training. Annotation was performed on Amazon Mechanical Turk. Annotators were pre-screened with a qualification study, which required them to read an annotation guidelines document and evaluate five representative query-response pairs. We individually reviewed the submitted annotations for the qualification study and provided annotators with personalized feedback to help correct any misconceptions or confusion about the task. Annotators who performed well on the qualification study and demonstrated thorough understanding of the task and annotation guidelines were permitted to participate in the main round of human evaluation. We remained in constant contact with annotators throughout the human evaluation process to answer questions about corner cases and clarify intended behavior. In total, 34 annotators participated in human evaluation.
Annotator compensation. Annotators were compensated $1.00 per query-response pair for responses with citations, and $0.38 per query-response pair for responses without citations ($15.00 per hour, by conservative time estimates). On average, annotators took approximately four minutes to complete all three steps for a single query-response pair for responses that contained at least one citation.
Annotation agreement. Each query-response pair is annotated once in the human evaluation process. To measure inter-annotator agreement, we collected three annotations for 250 randomly-sampled query-response pairs, finding high agreement rates (greater than 82.0% pairwise agreement and 91.0 F1 for all judgments; see Appendix E).

Results and Analysis
This section presents the results of our human evaluation study and discusses our main observations and analyses. We see that fluency and perceived utility are generally high across different generative search engines (§4.1), while citation recall and precision are quite low (§4.2), though performance certainly varies by system and query distribution. The low citation recall and precision, when combined with the facade of trustworthiness created by fluency and high perceived utility, increase the potential for existing generative search engines to mislead users. Our results also show that citation precision is inversely correlated with perceived utility in existing generative search engines (§4.3). We hypothesize that this is a byproduct of systems' propensity to copy or closely paraphrase text from cited webpages, which may increase citation precision and decrease perceived utility (§4.4).

Fluency and Perceived Utility
See Appendix F for full fluency and perceived utility results for every generative search engine on each of our query distributions.
Generated responses are fluent and appear helpful. Averaging across all systems and responses yields an average rating of 4.48 for fluency and 4.50 for perceived utility, indicating that annotators generally found generated responses fluent and helpful for answering the user's input query.

Comparing fluency across query distributions.
Comparing average fluency ratings across different query distributions, we see similar ratings between NaturalQuestions queries that have a long answer (i.e., an extractive answer of some length exists on Wikipedia) and non-NaturalQuestions distributions (4.50 vs. 4.47, respectively). Comparing average fluency ratings between NaturalQuestions subdistributions, we see that generated responses to queries that have a short extractive answer are generally more fluent (4.55) than responses to queries with only a long answer (4.46) or those without a long answer (4.46), perhaps because responses to questions with short answers are generally shorter and often only require factoid knowledge.
A notable outlier distribution is NaturalQuestions queries with table-type long answers and no short answers, where system responses are dramatically less fluent (average of 4.36 across systems vs. an average of 4.48 across all query distributions). These challenging queries often require aggregating information across table cells or retrieved sources, since the lack of a short answer implies that no single Wikipedia table cell directly answers the question (e.g., the query "how many grammys does beyonce have without destiny's child"). When the retrieved webpages do not contain a clear extractive answer to the query, but contain facts that seem relevant (e.g., information about Destiny's Child's first Grammy, or Beyonce's total number of career Grammy awards), the generated response is often a stilted agglomeration of statements from various sources, reducing overall fluency.
Comparing perceived utility across query distributions. In contrast to fluency, perceived utility can differ substantially between different query distributions. Perceived utility is much higher for NaturalQuestions queries containing a long answer (4.59), as opposed to non-NaturalQuestions queries (4.43). Comparing between different NaturalQuestions subdistributions, we see that perceived utility is highest for queries that have a short answer (4.62), followed by queries that have only a long answer (4.55), and finally by queries that have no long (or short) answer (4.52). Overall, perceived utility decreases as queries require longer-form and less-extractive answers (e.g., factoid NaturalQuestions queries with short answers versus ELI5 queries).

Citation Recall and Precision
See Appendix G for full citation recall and precision results for every generative search engine on each of our query distributions.
Existing generative search engines often do not cite comprehensively or correctly. When averaging across all systems, a mere 51.5% of generated statements are fully supported with citations (recall), and only 74.5% of citations fully support their associated statements (precision). We believe these results are unacceptably low for systems that are quickly becoming a popular tool for answering user queries and already have millions of users, especially given that generated responses often appear informative and useful.
Comparing citation recall across query distributions. Modifying the evaluation query distribution appears to affect citation recall more than citation precision. For example, the gap in citation recall between NaturalQuestions queries with a long answer and non-NaturalQuestions queries is nearly 11 points (58.5 vs. 47.8, respectively). Similarly, the difference in citation recall between NaturalQuestions queries with and without short answers is nearly 10 points (63.4 for queries with a short answer, 53.6 for queries with only a long answer, and 53.4 for queries with no long or short answer).
We hypothesize that citation recall is driven by the relevance of retrieved webpages. In the absence of retrieved evidence that directly answers the input user query, systems generate statements that are unsubstantiated by citations, resulting in lower recall. For example, generative search engines struggle with citation recall when evaluated on the open-ended AllSouls essay questions (average recall of 44.3), because these queries generally have no extractive answer on the Internet.
Comparing citation precision across query distributions. Precision on NaturalQuestions queries with long answers is higher than on non-NaturalQuestions distributions (76.1 vs. 72.3, respectively). Precision is highest on NaturalQuestions queries with paragraph answer types (precision of 81.5 when a short answer exists and 78.7 when only a long answer exists). On the other hand, citation precision is lowest when systems are evaluated on AllSouls open-ended essay questions (67.8) and davinci-debate queries (70.3). Comparing between NaturalQuestions subdistributions, average system precision is higher on queries with short answers (77.4) than on those with only long answers (74.8) or no long answer (73.5).

Figure 3: Average perceived utility plotted against average citation F1 for each evaluated generative search engine. Different systems make different trade-offs between perceived utility and citation F1. Note that these systems are difficult to directly compare, since they may have different abstention rates (Table 1).
Summary. To summarize our human evaluation results, Figure 3 plots average perceived utility against average citation F1. Existing systems make different trade-offs between citation recall, citation precision, and perceived utility. See Appendix H for full citation F1 results for every generative search engine on each of our query distributions.

Citation Precision is Inversely Related to Perceived Utility
We find that citation precision is inversely correlated with perceived utility in existing generative search engines (r = −0.96). For example, Bing Chat achieves the highest precision, but has the lowest perceived utility. In contrast, YouChat has the lowest citation precision, but its responses attain the highest perceived utility ratings. This inverse relationship between citation precision and perceived utility is symptomatic of a trade-off between faithfulness and abstractiveness (Ladhak et al., 2022). In particular, we find that system-generated statements often closely paraphrase or directly copy from their associated citations (see §4.4 for further analysis). This results in high citation precision (since extractively copied text is almost always fully supported by the source citation), but lower perceived utility (since the extractive snippets may not actually answer the user's input query). In contrast, systems that frequently deviate from cited content (resulting in low citation precision) may have greater freedom to generate fluent responses that appear relevant and helpful to the user's input query.

Figure 4: Example Bing Chat response (higher citation precision, lower perceived utility) with its cited webpages. (*Some generated statements may not be fully supported by citations, while others are fully supported.)

This tradeoff is especially apparent on the AllSouls query distribution, which contains open-ended essay questions. AllSouls queries often cannot be answered via extraction from a single webpage on the Internet. For example, given the query "Is cooperation or competition the driving force guiding the evolution of society?", conventional search engine results focus on biological evolution, rather than societal evolution. Bing Chat simply copies irrelevant statements directly from the cited sources, resulting in high citation precision but low perceived utility (Figure 4).

Generative Search Engines Closely Paraphrase From Cited Webpages
To better understand how generative search engines use citations to support their responses, we analyze the similarity between generated statements and their supporting cited webpages. For citations that provide full or partial support for their associated statement, annotators were asked to provide evidence by copy-pasting the minimal set of sentences from the cited webpage that support their judgment (if any such sentences exist). We compute the BLEU (Papineni et al., 2002) and BERTScore (Zhang et al., 2020) between each generated statement and the annotator-provided evidence from the associated citation. For statements with multiple associated citations, we take the maximum similarity with any associated citation's evidence. Table 2 presents similarity metrics between generated statements and extracted evidence from supporting webpages: when statements are fully or partially supported by their citations, they often copy or closely paraphrase from their cited articles. Furthermore, systems with higher similarity between their generated statements and cited webpages also have higher average citation precision (r = 0.80 between each of BLEU and BERTScore with average citation precision), indicating that their improved precision may largely be a byproduct of their increased tendency to copy or paraphrase from cited webpages.
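The flavor of this similarity analysis can be reproduced with a small sketch: a modified n-gram precision in the spirit of BLEU, plus the max-over-citations aggregation described above. This is a simplified stand-in for illustration, not the actual BLEU or BERTScore implementations used in the paper.

```python
from collections import Counter

def ngram_overlap(statement: str, evidence: str, n: int = 2) -> float:
    """Clipped n-gram precision of a generated statement against
    annotator-extracted evidence; values near 1.0 suggest the statement
    copies or closely paraphrases the cited webpage."""
    s_tokens = statement.lower().split()
    e_tokens = evidence.lower().split()
    s_ngrams = Counter(tuple(s_tokens[i:i + n]) for i in range(len(s_tokens) - n + 1))
    e_ngrams = Counter(tuple(e_tokens[i:i + n]) for i in range(len(e_tokens) - n + 1))
    if not s_ngrams:
        return 0.0
    # Clip each n-gram's count by its count in the evidence (as in BLEU).
    clipped = sum(min(count, e_ngrams[gram]) for gram, count in s_ngrams.items())
    return clipped / sum(s_ngrams.values())

def max_similarity(statement: str, evidences: list[str]) -> float:
    """For statements with multiple associated citations, take the maximum
    similarity with any associated citation's evidence."""
    return max((ngram_overlap(statement, e) for e in evidences), default=0.0)
```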

Related Work
Existing work has proposed a variety of techniques for building language models that provide references to support generated text. Nakano et al. (2021) use reinforcement learning from human preferences to train language models to answer questions and provide supporting evidence. Similarly, Menick et al. (2022) also use reinforcement learning from human preferences to train language models to answer user questions, but their system generates responses by conditioning on evidence retrieved from a Google search for the given user query. Finally, the LaMDA system of Thoppilan et al. (2022) is trained to provide URLs that support its generated statements. In contrast to the aforementioned line of work on training systems to generate citations, Gao et al. (2022) propose a method for post-editing generated output to reflect and cite retrieved evidence.
Existing work has also proposed evaluation protocols and benchmarks for improving verifiability in language generation systems. Rashkin et al. (2022) propose the attributed to identified sources (AIS) evaluation framework to assess whether a particular statement is supported by provided evidence, and validate their guidelines on conversational question answering, summarization, and table-to-text systems. Bohnet et al. (2023) introduce the task of attributed question answering, where systems are given an input question and must output an answer string with a pointer to evidence text supporting the answer, and propose a reproducible evaluation setup with NaturalQuestions queries (only the paragraph answer type containing long and short answers) with Wikipedia as the evidence corpus.
In contemporaneous work, Peskoff and Stewart (2023) have domain experts evaluate ChatGPT and YouChat responses to 100 expert-written questions. They find that generated responses are coherent and concise, but frequently undersourced and inaccurate; our results also show that YouChat responses frequently lack citations for generated statements (i.e., low citation recall).

Conclusion
In this work, we used human evaluation to audit the verifiability of four popular commercial generative search engines: Bing Chat, NeevaAI, perplexity.ai, and YouChat. We find that responses from existing generative search engines are generally fluent and often appear informative, but frequently contain unsupported statements and inaccurate citations (low citation recall and precision): a mere 51.5% of generated statements are fully supported by citations (recall), and only 74.5% of citations support their associated statements (precision). We believe that existing systems' citation recall and precision are unacceptably low, given that they are quickly becoming a popular tool for answering user queries and already have millions of users. Moreover, we find that citation precision is inversely correlated with perceived utility in existing generative search engines: the responses that seem more helpful are often those with more unsupported statements or inaccurate citations. Analysis suggests that this inverse correlation occurs in existing systems because of their propensity to copy or closely paraphrase from cited webpages, which inflates citation precision at the cost of lower perceived utility. We hope our results and insights further motivate the development of trustworthy generative search engines and help researchers and users better understand their current shortcomings.

Limitations
The primary goal of this work was to assess verifiability in generative search engine responses. However, note that verifiability is not factuality: rather than arbitrating whether a generated statement is true (difficult for all but the simplest claims; Rashkin et al., 2022), verifiability enables users to easily check any generated statement's source, allowing them to draw their own conclusions about whether to trust the generated statement. Studying the factuality of generative search engines (which may or may not provide citations) is an important direction for future work: users may not necessarily bother to check the sources, especially given that responses often seem helpful and sound confident, and we would thus like responses to be as factual as possible.
In our evaluation of verifiability, we consider sentence-level claims. However, sentences often contain multiple claims (e.g., "Cats [1] and dogs [2] are common pets."), and there is currently no clear linguistic definition of what constitutes a claim. As a result, we use sentences for simplicity and reproducibility. Proposing a concrete definition of a "claim" and performing a finer-grained evaluation is an interesting direction for future work.

C Annotation Interface
Figures 5-7 show the annotation interface used for human evaluation.
In the first step, annotators were shown the query and the generated response (without citations) and asked to rate response fluency and perceived utility on a five-point Likert scale.
In the second step, annotators were shown the statements in the generated response (including any generated citations) and asked to filter out statements that are not verification-worthy.
Finally, in the third step, annotators were shown the statements that were previously judged to require verification (in the prior step), as well as each statement's associated system-generated citations. For each statement and associated citation, annotators judged whether the citation fully supports, partially supports, or does not support the statement, as interpreted within the broader context of the query and system response. For statements with multiple associated citations, annotators are asked to judge whether the citations, when taken together, fully support the statement; this captures cases where multiple citations support disjoint parts of a statement (e.g., "Health benefits of cycling include improved cardiovascular health [1] and lowered cholesterol levels [2].").

Figure 7: Third step of the annotation interface, where annotators provide judgments on whether each citation supports its associated statement, and whether each statement is supported by the union of its citations (only applicable when a statement has multiple associated citations).

Figures 8-12 show the annotation guidelines we used for the task. We ask crowd annotators to read these guidelines as part of the qualification study. Only annotators that demonstrated a thorough understanding of the guidelines and task were permitted to participate in the main round of human evaluation.

Overview
Hi! We are a team of Stanford researchers interested in evaluating the trustworthiness of AI systems.
In this task, you will evaluate an AI system's response to a user query. The AI system outputs a paragraph that contains information relevant to the user's query, and we would like to evaluate whether the AI system can accurately cite sources for statements it makes about the external world.
At a high level, this task breaks down into three steps:
1. Evaluating response quality.
2. Filtering sentences that do not require citation.
3. Judging whether each statement is fully supported by its citation(s).
Please carefully read the guidelines below before starting on the task. The task compensation accounts for the time needed to read the guidelines.

Preliminaries: Logging In
When first entering the site, you will be prompted to select a username. Please use your worker ID as the username, so we can keep track of the examples you've annotated. The top of the interface displays your worker ID and the total number of examples submitted from this username, and will show a completion code when you have finished the task.
If something is wrong with the example, you may press the "Flag Example" button in the top-right corner to report the error.Please do not submit annotations for such examples.
Your task ends after you've completed 5 responses. A completion code will appear at the top of the interface; there is no need to complete more than 5 responses to receive credit for the study.
Step 1: Evaluating response quality
You will be shown the user's original query and the system's response to the query; please carefully read both of them. Then, you will be asked to rate your level of agreement with two questions:
1. The response is fluent and cohesive.
2. The response is a helpful and informative answer to the query.
Once you have finished selecting a response for each of the two questions, press the "Next Step" button in the top-right corner to continue.
Step 2: Filtering sentences that do not require citation.
The goal of this step is to filter the sentences in the system response by removing sentences that do not require citation (unchecking them in the interface).We expect the majority of sentences produced by the system to require citation, so don't worry if you find yourself rarely unchecking sentences.
In general, we take the position that all statements about the external world require citation, even if they are trivially true or "common sense" (since users may differ in their background, which affects their basic beliefs). For example, the sentences in examples (1a)-(1h) all require citation. In particular, note that sentences can require citation despite being nearly impossible to verify. Consider example (1e): it's highly unlikely that anyone knows exactly how many breaths LeBron James took in February 2023, let alone that such information could be linked to in a citation. However, it's still a statement about the external world, and it's still possible to find out for certain whether the statement is true or false. Thus, the statement requires citation.
In contrast, consider the following examples of sentences that do not require citation:
- (2a): I believe that the moon landing was staged.
  - Explanation: In general, all sentences pertaining to "I" do not require citation. This statement expresses a belief held by the speaker. The speaker is unknown, so this statement does not require citation. Note that the similar-looking statement "The moon landing was staged" (example 1d) requires citation and is verifiable.

The response "Yes, it is wrong" is uninterpretable on its own, because it is not clear what "it" refers to. However, by using the context of the query, it becomes clear that the statement is equivalent to "Yes, [exaggerating in a letter of recommendation] is wrong".
For another example, consider:
Query: how many characters are in the prologue of canterbury tales
Response (statement highlighted): In Geoffrey Chaucer's The Canterbury Tales, 32 characters make the journey to Canterbury. This includes the narrator, the host, and the Canon's yeoman, who join the group later.
The statement "This includes the narrator, the host, and the Canon's yeoman, who join the group later." is uninterpretable on its own, because it is not clear what "This" refers to, or what "group" they join. The preceding sentence of the response is essential for realizing that this sentence is equivalent to "[The 32 characters that make the journey to Canterbury] include the narrator, the host, and the Canon's yeoman, who join the [32 characters] later".
In general, use your best judgment to determine the information provided by the system response.

(B): According to the citation(s), is this statement true?
Again, you should use your best judgment in determining whether all of the information provided by the statement is supported by the associated citation(s).
It may be helpful to ask yourself whether it is accurate to say "according to the citation" with a statement following this phrase. For example, is it accurate to say "according to the citation, in Geoffrey Chaucer's The Canterbury Tales, 32 characters make the journey to Canterbury"? Be sure to check all of the information in the statement. You will be given six options:
- "Full Support": All of the information in the statement is supported in the document.
- "Partial Support": Only some of the information is supported in the document, but other parts of the information are missing from the document.
- "No Support": This document does not support any part of the statement.
- "Article Not Accessible": Not able to access the document (e.g., paywall or the link is dead).
- "Citation Has Support but also Refutes Statement": The citation has information that supports the statement, but also has information that refutes the statement.
- "Statement is Unclear, Can't Make Judgment": The statement is so incomprehensible that it cannot be determined if the citation supports the statement.

If the citation offers "Full Support" or "Partial Support" for the statement, you will also be asked to copy and paste the minimal set of sentences from the article that support your judgment. In cases where you can't localize the judgment to particular sentence(s) (e.g., the entire article supports the statement, or the support comes from an image or graphic), feel free to leave this input blank.
When a statement has more than one associated citation, you will also judge whether the citations, when taken together, fully support the statement (Yes/No). In other words, if you merged all of these citations into one big webpage (and it became a single citation), would this citation fully support the statement? If the citations contradict each other (e.g., one fully supports the statement, whereas another refutes the statement), then select "Citations Contradict Each Other".

Questions or feedback?
If you have questions about the task, or any feedback about how we could make it better or what your experience was like with it, please email nfliu@cs.stanford.edu, and we'll get back to you promptly. Thanks!

E Annotation Quality
Table 4 presents inter-annotator agreement statistics, computed on a random sample of 250 query-response pairs that each received annotations from multiple annotators. We measure the pairwise agreement between individual pairs of ratings and an F1 score comparing individual ratings to the majority consensus. We compute agreement on judgments of (i) fluency and perceived utility, (ii) whether a statement is verification-worthy, (iii) whether a citation supports its associated statement, and (iv) whether a statement is fully supported by the union of its citations (in the case where multiple webpages are cited). When calculating agreement on fluency and perceived utility judgments, we coarsen the 5-point Likert judgments into three options: "Disagree", "Neutral", and "Agree". Agreement rates between annotators are high (pairwise agreement greater than 82.0% and F1 greater than 91.0 for all judgments).
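The agreement statistics described above (Likert coarsening, pairwise agreement, and per-label F1 against the majority vote) can be sketched as follows. This is a minimal sketch with our own function names; the exact aggregation used for the reported numbers may differ.

```python
from collections import Counter
from itertools import combinations

def coarsen(likert: int) -> str:
    """Collapse a 5-point Likert rating into three bins."""
    return "Disagree" if likert <= 2 else ("Neutral" if likert == 3 else "Agree")

def pairwise_agreement(ratings_per_item: list[list[str]]) -> float:
    """Proportion of within-item pairs of individual judgments that agree."""
    agree = total = 0
    for ratings in ratings_per_item:
        for a, b in combinations(ratings, 2):
            total += 1
            agree += int(a == b)
    return agree / total

def majority_f1(ratings_per_item: list[list[str]], label: str) -> float:
    """F1 of individual judgments against each item's majority consensus."""
    tp = fp = fn = 0
    for ratings in ratings_per_item:
        majority = Counter(ratings).most_common(1)[0][0]
        for r in ratings:
            tp += int(r == label and majority == label)
            fp += int(r == label and majority != label)
            fn += int(r != label and majority == label)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, an item rated ["Agree", "Agree", "Neutral"] by three annotators contributes one agreeing pair out of three, and its majority consensus is "Agree".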

F Fluency and Perceived Utility
(Figure 2 example excerpts: a nasa.gov article, "NASA's Webb Confirms Its First Exoplanet", whose citation does not support its associated statement; a cnn.com article, "Pillars of Creation: James Webb Space Telescope", whose citation partially supports its associated statement; and a nasa.gov article whose citation fully supports its associated statement.)

Figure 2 :
Figure 2: Examples of calculating citation recall and precision. Citation recall measures the proportion of generated statements that are supported by citations. Citation precision measures the proportion of citations that support their associated statements. Partially-supporting citations only improve citation precision when their associated statement is supported by the union of its citations and no other associated citation fully supports the statement by itself (middle example).
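The recall and precision rules described in the caption can be sketched as follows. This is a minimal sketch assuming each annotated statement carries per-citation judgments and a union-support flag; the field and function names are hypothetical, not from the paper's released code.

```python
def citation_recall_precision(statements: list[dict]) -> tuple[float, float]:
    """Compute citation recall and precision from annotated statements.

    Each statement is a dict with hypothetical fields:
      "citations": per-citation judgments ("full" | "partial" | "none")
      "union_supported": whether the union of citations fully supports it
    """
    supported = total_cites = good_cites = 0
    for s in statements:
        if s["union_supported"]:
            supported += 1
        has_full = "full" in s["citations"]
        for c in s["citations"]:
            total_cites += 1
            # Fully-supporting citations always count toward precision;
            # partially-supporting citations count only when the union fully
            # supports the statement and no single citation fully supports
            # it by itself (the middle example of Figure 2).
            if c == "full" or (c == "partial" and s["union_supported"] and not has_full):
                good_cites += 1
    recall = supported / len(statements) if statements else 0.0
    precision = good_cites / total_cites if total_cites else 0.0
    return recall, precision
```

For instance, a statement with two partially-supporting citations whose union provides full support yields a precision of 1.0, matching the middle example of the figure.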
(Figure 4 example: for the query "Is cooperation or competition the driving force guiding the evolution of society?", the response reads: "There are different opinions on this topic. Some researchers believe that cooperation, not struggle for survival, drives evolution [1]. Others think that there are two driving forces of evolution: mutation (change) and competition [2]. What do you think?" Citation [1], a phys.org article titled "Cooperation, not struggle for survival, drives evolution", fully supports its associated statement, and citation [2], a cooperation.cool page titled "Game Theory - Cooperation is cool", fully supports its associated statement; both sources discuss biological rather than societal evolution.)

Figure 4 :
Figure 4: Citation precision is inversely correlated with perceived utility in existing generative search engines. Bing Chat often achieves high citation precision because it closely paraphrases from cited webpages (bolded). However, since these citations are largely irrelevant to the user's input query (biological evolution vs. societal evolution), copying this content results in lower perceived utility.

Figure 5 :
Figure 5: First step of the annotation interface, where annotators judge response fluency and perceived utility.

Figure 6 :
Figure 6: Second step of the annotation interface, where annotators uncheck statements that are not verification-worthy. Statements that contain generated citations must be verification-worthy, so we automatically mark them as such in the interface (greyed-out checkboxes next to the 2nd and 4th sentences above).

Figure 8 :
Figure 8: First page of the annotation guidelines.

- (1a): The House of Lords is a topic of ongoing debate in the UK.
- (1b): However, there is still no consensus on what should replace the Electoral College.
- (1c): The sky is blue.
- (1d): The moon landing was staged.
- (1e): In February 2023, LeBron James took 261,960 total breaths.
- (1f): Patrick Henry once said "Give me liberty, or give me death".
- (1g): Thanksgiving dinners usually taste bad.
- (1h): Voting rights are controversial.
- (2b): Have you listened to that song?
  - Explanation: Questions do not have information to verify.
- (2c): Pick up the ball on the floor.
  - Explanation: Commands do not have information to verify.
- (2d): It is the year 2300. Robots rule the earth.

Figure 9 :
Figure 9: Second page of the annotation guidelines.
Response (statement highlighted): Yes, it is wrong. Letters of recommendation should reflect the author's honest perspective on the candidate.

Figure 11 :
Figure 11: Fourth page of the annotation guidelines.

Figure 12 :
Figure 12: Fifth page of the annotation guidelines.

Table 1 :
Generative search engines may be designed for deployment in different contexts. NeevaAI abstains from responding to 22.7% of our 1450 queries, since its response is designed for display within a conventional search results page. In contrast, the conversational interfaces of Bing Chat and YouChat mean that systems must generate a response for nearly every input user query (excepting, e.g., queries beyond character length limits).

Table 4 :
Inter-annotator agreement statistics (↑). Pairwise Agreement % computes the proportion of individual judgment pairs that agree, and F1 compares individual judgments to the majority consensus judgment. Inter-annotator agreement is high (greater than 82.0% pairwise agreement and 91.0 F1 for all judgments).

Table 5 :
Table 5 presents the fluency of generative search engine responses on each of our query distributions, and Table 6 presents the perceived utility. Human evaluation results for generated response fluency are five-point Likert ratings. In general, existing generative search engines produce fluent text. Performance is notably lower on NaturalQuestions queries with table-type long answers and no short answers, which often require aggregating information within or across citations.