Automatic Evaluation of Attribution by Large Language Models

A recent focus of large language model (LLM) development, as exemplified by generative search engines, is to incorporate external references to generate and support their claims. However, evaluating the attribution, i.e., verifying whether the generated statement is fully supported by the cited reference, remains an open problem. Although human evaluation is common practice, it is costly and time-consuming. In this paper, we investigate the automatic evaluation of attribution given by LLMs. We begin by defining different types of attribution errors, and then explore two approaches for automatic evaluation: prompting LLMs and fine-tuning smaller LMs. The fine-tuning data is repurposed from related tasks such as question answering, fact-checking, natural language inference, and summarization. We manually curate a set of test examples covering 12 domains from a generative search engine, New Bing. Our results on this curated test set and simulated examples from existing benchmarks highlight both promising signals and challenges. We hope our problem formulation, testbeds, and findings will help lay the foundation for future studies on this important problem.

Incorporating external references for generation inherently implies that the generated statement is backed by these references. However, the validity of such attribution, i.e., whether the generated statement is fully supported by the cited reference, remains questionable. According to Liu et al. (2023), only 52% of the statements generated by state-of-the-art generative search engines such as New Bing and PerplexityAI are fully supported by their respective cited references. Inaccurate attribution compromises the trustworthiness of LLMs, introducing significant safety risks and potential harm. For instance, in healthcare, an LLM might attribute incorrect medical advice to a credible source, potentially leading users to make harmful health decisions. Similarly, in finance, faulty investment advice attributed to a reliable source may cause substantial financial losses.
To identify attribution errors, existing attributed LLMs (Nakano et al., 2021; Thoppilan et al., 2022) rely heavily on human evaluation, which is both expensive and time-consuming. For instance, the average cost of annotating a single (query, answer, reference) example is about $1 in Liu et al. (2023). In actual use of attributed LLMs, it is the user who needs to be wary of the attribution and manually verify it, which puts a tremendous burden on their side. Therefore, effective and reliable methods to automatically evaluate attribution and identify potential attribution errors are highly desired.
Towards this goal, we take the first step by introducing AttrScore (Figure 1), a framework designed for automatic evaluation of attribution and identification of specific types of attribution errors. We propose a new problem formulation that categorizes attribution into three types: 1) attributable: the reference fully supports the generated statement; 2) extrapolatory: the reference lacks sufficient information to support the generated statement; and 3) contradictory: the generated statement directly contradicts the cited reference. Unlike existing work (Bohnet et al., 2022) that uses binary categorization (i.e., attributable or not) and Liu et al. (2023) that defines the degree of reference support for the generated statement as "full", "partial", or "no support", our fine-grained error categorization aids humans in better understanding the type of an attribution error made by an LLM. This not only enhances safe system usage but also provides valuable insights for the future development of mechanisms tailored to correct specific errors.

[Figure 1: We make the first step towards automatically evaluating attribution and identifying specific types of errors with AttrScore. We explore two approaches in AttrScore: (1) prompting LLMs, and (2) fine-tuning LMs on simulated and repurposed datasets from related tasks.]
We explore two approaches in AttrScore: 1) prompting LLMs and 2) fine-tuning LMs on simulated and repurposed data from related tasks such as question answering (QA), fact-checking, natural language inference (NLI), and summarization. For evaluation, unlike existing work (Liu et al., 2023; Gao et al., 2023) that only uses queries from existing benchmarks, we curate a set of test examples covering 12 different domains from a generative search engine, New Bing. This is the first evaluation set for measuring the attribution of LLMs with queries created based on real-life interactions, hence avoiding the data contamination issue.
Our results indicate that both approaches show reasonable performance on our curated and simulated test sets, yet there is still substantial room for further improvement. Major sources of evaluation failures include insensitivity to fine-grained information comparisons, such as overlooking contextual cues in the reference, disregarding numerical values, and failing to perform symbolic operations. In light of these findings, we discuss potential directions for improving AttrScore, including training models to be more strongly conditioned on the reference and augmenting them with external tools for numerical and logical operations.
With the new formulation of attribution errors, the development of AttrScore, the introduction of new test sets, and the insights into challenges and potential directions for future work, we hope our work can help lay the foundation for the important task of automatically evaluating LLM attributions.
Attribution evaluation verifies whether a reference provides sufficient support for a generated answer to a user's query. Our task setting prioritizes one reference per statement, a unit task that more complex scenarios can be decomposed into. We study this setting as it forms the basis for dealing with multiple references or distinct segments (Liu et al., 2023; Gao et al., 2023).
Prior work, such as Rashkin et al. (2021); Gao et al. (2022); Bohnet et al. (2022), mainly focuses on binary verification, i.e., determining if a reference supports the generated answer or not. We propose advancing this task by introducing a more fine-grained categorization. Specifically, we classify attributions into three distinct categories:
• Attributable: The reference fully supports the generated answer.
• Extrapolatory: The reference lacks sufficient information to validate the generated answer.
• Contradictory: The generated answer contradicts the information presented in the reference.
To illustrate, consider a contradictory example (Figure 1). The query is "What was the unemployment rate in Germany in 2020?", and the generated answer is "4.31%". However, the reference states that the rate was "3.81%", contradicting the generated answer. An extrapolatory instance, on the other hand, would be a query about the "gas price in California". While the reference is relevant, it does not contain specific information to verify the correctness of the generated answer.
Following these examples, we see the importance of granularity in error classification. A fine-grained classification allows us to pinpoint the nature of the errors, be it contradiction or extrapolation. Users can better understand the types of errors an LLM might make, enabling them to use the model more safely. Additionally, such an error identification system can guide future training of attributed LLMs, leading to the development of specific mechanisms to correct such errors.
Our categorization also offers a departure from the existing approach (Liu et al., 2023), which emphasizes the degree of support ("full", "partial", or "none") rather than attribution error types. Our approach highlights specific issues in attribution evaluation for more effective error management and system improvement.
Formally, the task of attribution evaluation involves a natural language query q, a generated answer a, and a reference x from an attributed LLM. The goal is to develop a function f that takes (q, a, x) as input and outputs a class label indicating whether "according to x, the answer a to the query q is attributable, extrapolatory, or contradictory."
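This formulation can be pinned down as a small interface. The sketch below is illustrative: the class and function names are ours, not the paper's, and the function body is a stub standing in for either of the two approaches explored later.

```python
from dataclasses import dataclass
from typing import Literal

# The three attribution classes defined above.
AttributionLabel = Literal["attributable", "extrapolatory", "contradictory"]

@dataclass
class AttributionExample:
    query: str      # q: the user's natural language query
    answer: str     # a: the answer generated by the attributed LLM
    reference: str  # x: the reference cited for the answer

def evaluate_attribution(example: AttributionExample) -> AttributionLabel:
    """f(q, a, x) -> class label. A real implementation would prompt an
    LLM or run a fine-tuned classifier; this stub only fixes the interface."""
    raise NotImplementedError
```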

Automatic Evaluation of Attribution
Following our problem definition, we introduce two approaches for automatic evaluation of attribution: prompting LLMs and fine-tuning LMs on simulated and repurposed data from related tasks.

Prompting LLMs
Recent research (Fu et al., 2023) has demonstrated the possibility of prompting LLMs to evaluate the quality of generated text using their emergent capabilities (Wei et al., 2022b), such as zero-shot instruction (Wei et al., 2022a) and in-context learning (Brown et al., 2020). Following this approach, we prompt LLMs, such as ChatGPT (OpenAI, 2023a), using a clear instruction that includes definitions of the two types of errors (as shown in Figure 1) and an input triple of the query, answer, and reference for evaluation. The complete prompt used in our study can be found in Appendix Table 6.
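As a concrete illustration, such a zero-shot prompt can be assembled as below. This is a hedged sketch: the template wording is ours, not the paper's verbatim prompt (which is in their Appendix Table 6), and the actual API call is only indicated in a comment.

```python
# Illustrative prompt template for zero-shot attribution evaluation.
# The wording is an assumption, not the paper's exact prompt.
PROMPT_TEMPLATE = """\
Verify whether a given reference can support the claim. Options:
- Attributable: the reference fully supports the claim.
- Extrapolatory: the reference lacks sufficient information to validate the claim.
- Contradictory: the claim contradicts the reference.

Claim: {query} {answer}
Reference: {reference}
Label:"""

def build_attribution_prompt(query: str, answer: str, reference: str) -> str:
    """Fill the template with one (query, answer, reference) triple."""
    return PROMPT_TEMPLATE.format(query=query, answer=answer, reference=reference)

# With an LLM client, this prompt would be sent to a chat model at
# temperature 0, and the reply parsed for the predicted label.
```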

Fine-tuning LMs on Repurposed Data
The primary challenge in fine-tuning LMs for automatic attribution evaluation is the lack of training data.One potential approach is to hire annotators to collect real samples, but the cost can be prohibitive.
Here, we first repurpose datasets from three related tasks (fact-checking, NLI, and summarization). We then propose to further simulate more realistic samples from existing QA benchmarks.

Repurpose data from fact-checking, NLI, and summarization tasks. Given the connections between our attribution evaluation task and the tasks of fact-checking, NLI, and summarization, we propose to utilize datasets from these fields to enrich our training examples. Fact-checking and NLI data, with their emphasis on assessing the consistency and logical relationship between claims (hypotheses) and evidence (premises), mirror our task's objective of checking the supporting relationship between reference documents and generated statements. Summarization datasets, especially those involving the detection of hallucinations, both intrinsic and extrinsic (Maynez et al., 2020), could provide a useful starting point for identifying attribution inconsistencies. Nevertheless, these datasets require suitable adaptation: we keep their original data sequences and modify their label space to suit the specific needs of the attribution evaluation definition. Additional information on this can be found in Appendix A.

[Figure 2: Examples simulated from open-domain QA. We 1) use the original (question, answer, context) pair as an attributable instance (A), 2) substitute the answer or the answer span in the context to simulate a contradictory error example (B, C), and 3) replace the context with alternatives to simulate an extrapolatory error example (D). To help models trained on the simulated data generalize to the long-answer setting in real-life search engines like New Bing, we convert the short answers to long ones using ChatGPT.]
Simulate data from open-domain QA. QA benchmarks provide an ideal platform for data simulation, as they comprise questions, their corresponding ground truth answers, and reference contexts. These elements can be directly employed as attributable examples (Figure 2, A). In open-domain QA datasets, answers are typically brief text spans.
To cater to the long-answer setting in most attributed LLMs, we convert these short answers into longer sentences using ChatGPT. For simulating contradictory errors, we propose two methods: (1) the first involves modifying the correct answer with an alternative candidate from an off-the-shelf QA model, an answer substitution model, or a random span generator (Figure 2, B); (2) the second retains the original answer but replaces the answer span in the reference context with a comparable candidate (Figure 2, C). To emulate extrapolatory errors, we employ a BM25 retriever on the question, retrieving relevant external documents from resources such as Wikipedia, which do not contain the ground truth answers (Figure 2, D). More details regarding the simulation of these errors from QA datasets can be found in Appendix A.
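The four simulation strategies (A-D) can be sketched as follows. This is an illustrative toy version: `alt_answers` stands in for candidates from a QA or answer-substitution model and `retrieved_docs` for BM25 results, neither of which is implemented here.

```python
import random

def simulate_examples(question, short_answer, context, alt_answers, retrieved_docs):
    """Simulate training triples from one open-domain QA record, following
    the four strategies (A-D) described above. `alt_answers` stands in for
    a QA / answer-substitution model's output, `retrieved_docs` for BM25
    results; both are assumptions of this sketch."""
    examples = []
    # (A) the original triple is attributable
    examples.append((question, short_answer, context, "attributable"))
    # (B) swap the answer for a wrong candidate -> contradictory
    wrong = random.choice([a for a in alt_answers if a != short_answer])
    examples.append((question, wrong, context, "contradictory"))
    # (C) swap the answer span inside the context instead -> contradictory
    examples.append((question, short_answer,
                     context.replace(short_answer, wrong), "contradictory"))
    # (D) replace the context with a retrieved doc lacking the answer
    for doc in retrieved_docs:
        if short_answer not in doc:
            examples.append((question, short_answer, doc, "extrapolatory"))
            break
    return examples
```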
4 Experimental Setup

Datasets
This section presents the datasets utilized for training and testing methods for automatic attribution evaluation. In particular, we develop two evaluation sets, AttrEval-Simulation and AttrEval-GenSearch, derived from existing QA datasets and a generative search engine, respectively. The dataset statistics are presented in Table 1.
Training data. To repurpose and simulate training examples, we follow the method in Section 3.2 based on datasets from four related tasks. For QA, we consider NaturalQuestions (Kwiatkowski et al., 2019). For fact-checking, we include FEVER (Thorne et al., 2018), Adversarial FEVER (Thorne et al., 2019), FEVEROUS (Aly et al., 2021), VITAMINC (Schuster et al., 2021), MultiFC (Augenstein et al., 2019), PubHealth (Kotonya and Toni, 2020), and SciFact (Wadden et al., 2020). For NLI, we include SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018), ANLI (Nie et al., 2020), and SciTail (Khot et al., 2018). For summarization, we include XSum-Halluc. (Maynez et al., 2020), XENT (Cao et al., 2022), and FactCC (Kryscinski et al., 2020).

Test data (AttrEval-GenSearch). Keywords from a specific domain are randomly generated using ChatGPT, and relevant facts within that domain are compiled from the Web. In the verification process, queries are sent to the New Bing search engine under a balanced mode following Liu et al. (2023), which balances accuracy and creativity. The validity of the output generated by New Bing is evaluated, where we consider only the first sentence that answers the question along with its reference. As we state in Section 2, our evaluation emphasizes the error type in a single reference per statement. In the case of a sentence having multiple references or distinct segments (for example, "XXX [1][2]" or "XXX [1] and YYY [2]"), each reference or segment is treated as a separate sample, and the attributions are verified individually. Finally, the samples are categorized by the annotators as attributable, contradictory, or extrapolatory. Detailed annotation guidelines can be found in Appendix D.
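The per-reference splitting convention described above (e.g., "XXX [1][2]" becoming two samples) can be sketched with a small regex helper. This is our simplification of the protocol, not the authors' annotation tooling.

```python
import re

def split_by_reference(sentence):
    """Split a generated sentence with multiple citations (e.g.
    "XXX [1][2]" or "XXX [1] and YYY [2]") into one (segment, ref_id)
    sample per citation, so each attribution is verified individually.
    A simplified sketch of the protocol described above."""
    samples = []
    # each segment ends at a run of one or more [n] citation markers
    for match in re.finditer(r"(.+?)((?:\[\d+\])+)", sentence):
        segment = match.group(1).strip()
        for ref_id in re.findall(r"\[(\d+)\]", match.group(2)):
            samples.append((segment, int(ref_id)))
    return samples
```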

Implementation Details
In the "prompting LLMs" configuration, we test Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), ChatGPT (OpenAI, 2023a), and GPT-4 (OpenAI, 2023b), where we use OpenAI's official APIs (gpt-3.5-turbo, gpt-4-0314) and weights for Alpaca and Vicuna from their official repositories. For Alpaca and Vicuna inference, documents are tokenized and truncated at a maximum of 2048 tokens. We generate text with a temperature of 0. The prompts for the task of evaluating attribution are provided in the Appendix, and our main results are averaged over 4 different prompts. For the few-shot prompting setting, we manually write 3 examples as demonstrations for both test sets, as shown in Table 7. If LLMs yield an attribution label with an explanation, we extract the predicted label with a regular expression.
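A sketch of the regex-based label extraction mentioned above; the exact patterns the authors used are not given, so this version simply takes the first label word appearing in the model's reply.

```python
import re
from typing import Optional

# The three label words defined in the task formulation.
LABELS = ("Attributable", "Extrapolatory", "Contradictory")
_PATTERN = re.compile("|".join(LABELS), flags=re.IGNORECASE)

def extract_label(llm_output: str) -> Optional[str]:
    """Pull the predicted attribution label out of a free-form LLM reply.
    Returns the first label word found (case-insensitive), or None."""
    match = _PATTERN.search(llm_output)
    return match.group(0).capitalize() if match else None
```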
Metrics. For evaluation, we present the F1 score for each individual class as well as the micro-F1 score, which is equivalent to the overall accuracy.
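For reference, per-class F1 and micro-F1 can be computed from scratch as below; note that for single-label multi-class prediction, micro-F1 reduces to overall accuracy, as stated above.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score for one class (one-vs-rest)."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    pred_pos = sum(p == label for p in y_pred)
    true_pos = sum(t == label for t in y_true)
    if tp == 0:
        return 0.0
    precision, recall = tp / pred_pos, tp / true_pos
    return 2 * precision * recall / (precision + recall)

def micro_f1(y_true, y_pred):
    """For single-label multi-class classification, micro-F1 equals
    overall accuracy: every misclassification is one FP and one FN."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```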

Overall Performance
Table 2 presents an evaluation of different models on both the simulated dataset (AttrEval-Simulation) and the annotated dataset on New Bing (AttrEval-GenSearch). Our primary findings are as follows:

GPT-4 achieves promising results, reaching an overall accuracy of 81-83% on AttrEval-GenSearch and significantly outperforming other models. This suggests a promising potential for employing GPT-4 for automatic attribution evaluation to alleviate human annotation workloads, aligning with the emerging trend that uses GPT-4 for different evaluation tasks (Chiang et al., 2023; Zheng et al., 2023). However, it may still not be sufficiently accurate for practical use. We also note some potential concerns of bias (see Limitations, Section 8).

Automatic attribution evaluation presents substantial challenges. This complex task requires not only understanding the reference information but also comparing it with the information in the statement, all of which can significantly vary across different datasets and test conditions. Against these challenges, models other than GPT-4 exhibit suboptimal performance in zero-shot and few-shot settings.

Fine-tuning LMs on the simulated datasets from related tasks significantly improves the performance. For instance, the Vicuna (13B) model sees the overall accuracy on the two test sets rise from 34.6% and 41.4% in the zero-shot setting to 66.0% and 71.3%, respectively, and the fine-tuned FLAN-T5 (770M) model can even surpass ChatGPT on both test sets. Despite this, there is still large room for further improvement: some models that yield better results on the simulated test set may be less effective on the annotated test set, indicating a lack of consistency across diverse testing settings and signaling generalizability challenges.

Models struggle most notably with contradictory errors. Detecting contradictions is particularly complex because it requires the model to weigh one piece of information in the statement against another in the reference, a process that necessitates advanced fine-grained information comparison and reasoning capabilities. Consequently, even the best-performing model GPT-4 and the fine-tuned models often fail when faced with contradictory inputs, most often treating them as attributable (see qualitative analysis in Section 5.2).

Qualitative Analysis
To shed light on the space for future improvements in attribution evaluation, we qualitatively examine all the error examples of GPT-4 in the zero-shot setting. Representative examples are in Table 3.
Our first observation is that a significant portion (30.6%) of errors happen due to fine-grained information insensitivity: failure in comparing very fine-grained information such as numerical values, numbers, dates, and times. Besides, the model misunderstands the task definition and misinterprets logical relations implied by labels (22.2%). The model also struggles with symbolic operators (13.9%); for example, it fails to distinguish 'equal to' (=) from 'approximately equal to' (≈) in numeric comparisons. In the remaining cases, the model tends to overlook context clues and does not make judgments conditioned on the reference (e.g., potentially relying on its own parametric knowledge).
Our observations point to two potential directions for improvement: 1) training or prompting models to be more faithful and strongly conditioned on the reference (Zhou et al., 2023), especially paying attention to fine-grained information; and 2) augmenting an LM-based evaluation method with external tools for different types of numerical and logical operations that are hard for the LM to perform accurately by itself (Chen et al., 2020; Mialon et al., 2023). We conduct a similar qualitative analysis for ChatGPT in Appendix E.
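The second direction can be illustrated with a tiny numeric-comparison tool that an evaluator could call instead of reasoning over numbers in text. The function and its tolerance parameter are our own illustration, not part of the paper's method.

```python
import math

def numeric_claim_matches(claimed: float, referenced: float,
                          rel_tol: float = 0.0) -> bool:
    """Tool-style numeric comparison an LM-based evaluator could delegate
    to. With rel_tol=0 this enforces strict equality (=); a nonzero
    tolerance implements approximate equality (≈), distinguishing the two
    operators the qualitative analysis above found GPT-4 confusing."""
    return math.isclose(claimed, referenced, rel_tol=rel_tol)
```

For instance, a claimed salary of $131,930 against a referenced $132,147 fails a strict check but passes a 1% tolerance, matching the "very close but contradictory" case in Table 3.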

Ablation Study
In this section, we perform an ablation study to test how each task influences the fine-tuned LMs' results, and analyze the prompt sensitivity in zero-shot and few-shot settings for prompting LLMs.

Contribution of individual tasks. We show the performance of models fine-tuned on individual task datasets and their combinations in Figure 4. We select a representative from each group of models under the fine-tuned setting in Table 2. Our findings suggest that examples from our simulated QA and fact-checking tasks most significantly improve performance for the attribution evaluation task, hinting at a strong link between these tasks. Furthermore, integrating various related task datasets generally leads to better performance, particularly on out-of-domain test instances in AttrEval-GenSearch.

Sensitivity of prompts. The choice of prompts used to evaluate language models can have an impact on their performance. We evaluate the sensitivity of prompts for AttrScore under both zero-shot and few-shot settings of Alpaca (7B) and ChatGPT.
We show four types of prompts as mentioned earlier: a prompt designed specifically for our evaluation setting (Attri.), an NLI prompt, a fact-checking prompt (Fact.), and a summarization hallucination detection prompt (Sum.). These prompts are presented in Appendix Table 6. As shown in Table 4, fact-checking and NLI prompts generally perform better, as similar tasks may have been seen during the models' instruction tuning phase.
[Table 4: Prompt sensitivity of AttrScore. The prompts include a prompt for attribution (Attri.), an NLI prompt, a fact-checking prompt (Fact.), and a summarization hallucination detection prompt (Sum.).]
Evaluation of attribution. To evaluate attribution, Liu et al. (2023) conduct a human evaluation to audit the verifiability of responses from generative search engines. They find that these engines frequently contain unsupported statements and inaccurate citations, which strengthens the need to carefully examine the attribution of generations (Rashkin et al., 2021). However, human evaluations are very expensive and time-consuming. Gao et al. (2022); Bohnet et al. (2022); Gao et al. (2023) propose to automatically evaluate attribution by leveraging NLI models (Honovich et al., 2022; Kamoi et al., 2023; Gekhman et al., 2023). We study this problem in a more comprehensive and realistic manner: 1) we explore how helpful other relevant tasks besides NLI are to attribution evaluation; 2) our evaluation setting is based on both benchmark examples and real examples.

Conclusion
In this paper, we investigate the important problem of automatically evaluating attribution given by LLMs. We begin by defining different types of attribution errors and then explore two approaches for automatic evaluation: prompting LLMs and fine-tuning smaller LMs. We experiment with both simulated test examples and manually curated test examples from a real-life generative search engine.
The results highlight both promising signals and remaining challenges for the automatic evaluation of attribution. We hope our work can help lay the foundation for future studies on this important problem.

Limitations
Currently, smaller models in AttrScore are fine-tuned on a combination of simulated or repurposed datasets from related tasks. However, this data still has gaps from the real scenario. Moreover, the error patterns in these simulated datasets might be overly simplistic and lack diversity, which can limit the models' ability to effectively handle more complex and varied real-world errors. It is also worth noting that these simulated datasets may contain noise and erroneous labels, which could further impede the models' learning and subsequent performance. How to obtain higher-quality training data for attribution evaluation at scale can be a major focus for future development.
Our annotated evaluation set, AttrEval-GenSearch, is derived from New Bing, which uses GPT-4 as its backbone. It is crucial to note that we also use GPT-4 for evaluating attribution on AttrEval-GenSearch, where it achieves the best performance with around 85% overall accuracy. Some bias might come from GPT-4 both generating the test examples and evaluating the attribution, which could potentially skew our understanding of the model's true performance. We therefore caution against over-optimism. We also acknowledge that the size of AttrEval-GenSearch is moderate, which may not fully represent the real use setting of attributed LLMs.
While acknowledging current limitations, several promising directions emerge for future research and enhancement.For example, one can diversify data sources to include examples from a variety of generative search engines, not just New Bing.In addition, it may be beneficial to annotate larger-scale queries that cover a broad spectrum of topics, styles, and perspectives.

Ethics Statement
This research project involves evaluating attribution given by attributed LLMs.We collect and annotate data for evaluation using publicly available information on the web, with the assistance of a generative search engine, New Bing.We acknowledge that LLMs have the potential to reproduce and amplify harmful information present in the data.We made an effort to mitigate this risk by carefully selecting our evaluation data and by conducting analyses to identify and mitigate potential risks in the process.

A Data Simulation
A.1 Simulation - QA

Attributable. Since we have questions and their ground truth answers and reference contexts, we can directly treat them as "Attributable" examples.
Contradictory. To simulate contradictory errors, we consider two methods. The first method involves modifying the correct answer by replacing it with a different candidate generated from an off-the-shelf QA model, an answer substitution model, or a random span generator. The second method involves keeping the original answer and replacing the answer span in the reference context with a similar candidate. The QA model, the answer substitution model, and the random span generator are all implemented by prompting FLAN-T5-XL (3B) (Chung et al., 2022) with different task prompts in Appendix Table 5.
Extrapolatory. To simulate extrapolatory errors, we employ a BM25 retriever to retrieve external documents that do not contain the ground truth answers from knowledge sources like Wikipedia or the Web. We then replace the original paragraph with one of the retrieved documents. For the answer, we either keep the original ground truth answer or leverage a QA model to generate an answer. Below are more details on constructing negative retrieved documents for each dataset.
Following previous work (Karpukhin et al., 2020), we utilize passages from Wikipedia dumps for constructing evidence for the NaturalQuestions (Kwiatkowski et al., 2019), WebQuestions (Berant et al., 2013), and TREC (Baudis and Sedivý, 2015) datasets. In particular, we regard the highest-ranked passage from BM25 that includes the answers as positive evidence and the top passage without answers as negative evidence.
For TriviaQA (Joshi et al., 2017), we select the passage with the highest overlap with answers from web texts as positive evidence and the top-ranked wiki passage without answers from BM25 as negative evidence. We exclude examples where the positive evidence has an overlap ratio of less than 0.5 with the answers. For HotpotQA (Yang et al., 2018), we combine the provided ground truth passages as positive evidence and randomly select two out of the eight provided passages as negative evidence. Similarly, in PopQA (Mallen et al., 2022), we find positive evidence from Wikipedia content through the provided link and retrieve negative evidence from Wikipedia dumps using BM25. In EntityQuestions (Sciavolino et al., 2021), we match positive evidence in Wikipedia texts searched by the question entity and retrieve negative evidence via BM25.
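One possible reading of the overlap filter for TriviaQA is sketched below; the paper does not spell out the exact overlap definition, so this string-containment version is an assumption.

```python
def answer_overlap_ratio(passage: str, answers: list[str]) -> float:
    """Fraction of gold answer strings that appear (case-insensitively)
    in the passage. A guess at the 'overlap ratio' filter described above
    (examples below 0.5 are excluded); the authors' exact definition may
    differ, e.g. token-level overlap."""
    if not answers:
        return 0.0
    passage_lower = passage.lower()
    hits = sum(ans.lower() in passage_lower for ans in answers)
    return hits / len(answers)
```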
Converting short answers to long sentences. Since many attributed LLMs generate long sentences in response to the query, to make our simulated data more realistic, we convert short answers to long answers using ChatGPT. Specifically, we prompt ChatGPT with the instruction "Convert a given question and answer pair into plain sentences. [Question] [Answer]".

A.2 Simulation -Fact Checking
With the provided Wiki content as evidence in the FEVER (Thorne et al., 2018) and Adversarial FEVER (Thorne et al., 2019) datasets, we repurpose 'SUPPORTS' examples as attributable, 'REFUTES' as contradictory, and 'NOT ENOUGH INFO' as extrapolatory. Using the same label mapping, we apply this approach to the claims and evidence provided in VITAMINC (Schuster et al., 2021), after removing examples duplicated from FEVER. For FEVEROUS (Aly et al., 2021), we concatenate all pieces of evidence, including tables and texts, and prepend an increasing index to form the final evidence; we then map the labels into our three categories using the same label mapping. Regarding natural claim datasets with various label spaces, we keep the top 6 classes out of 117 in MultiFC (Augenstein et al., 2019) and map them to our defined three categories. In PUBHEALTH (Kotonya and Toni, 2020), we consider both the 'unproven' and 'mixture' classes as extrapolatory; we also regard the abstract of the article as evidence. For SciFact (Wadden et al., 2020), we repurpose 'SUPPORT' as attributable and 'CONTRADICT' as contradictory. Additionally, we randomly select one sentence from the abstracts of other articles as evidence for the 'Not enough information' class to construct extrapolatory examples.
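The FEVER-style label mapping described above can be written down directly; the dictionary below mirrors the stated mapping, while the helper function and output field names are ours.

```python
# Mapping repurposed fact-checking labels onto the three attribution
# categories, as described above for FEVER-style datasets.
FEVER_TO_ATTRIBUTION = {
    "SUPPORTS": "attributable",
    "REFUTES": "contradictory",
    "NOT ENOUGH INFO": "extrapolatory",
}

def repurpose_fever(claim, evidence, fever_label):
    """Turn one FEVER example into an attribution-evaluation example,
    treating the claim as the generated statement and the Wiki content
    as the reference."""
    return {"statement": claim,
            "reference": evidence,
            "label": FEVER_TO_ATTRIBUTION[fever_label]}
```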

A.4 Simulation - Summarization
Summarization involves condensing a given passage or article into brief sentences while preserving its original meaning. To simulate contradictory examples, we use datasets with hallucination annotations. For XSum-Hallucination (Maynez et al., 2020), we merge examples with the same ID and consider those with the most intrinsic hallucination as contradictory and those with the most extrinsic hallucination as extrapolatory; paired full articles and ground truth summaries are treated as attributable examples. For XENT (Cao et al., 2022), 'Non-factual Hallucination' and 'Intrinsic Hallucination' are seen as contradictory, 'Factual Hallucination' as extrapolatory, and 'Non-hallucinated' as attributable; each article and reference summary are paired as attributable examples. Finally, we re-split the manually annotated dev and test sets of FactCC (Kryscinski et al., 2020) for training and evaluation, with 'INCORRECT' labeled as extrapolatory and 'CORRECT' as attributable.

B Label and Subset Distributions of Training and Test Sets
We show the label and data sources' distributions of training and AttrEval-Simulation sets in Figure 5 and Figure 6.

C Prompts for LLMs as AttrScore
We show different kinds of prompts for using LLMs as AttrScore in Table 6, and the few-shot demonstrations in Table 7.

D Generative Search Engine Examples Annotation Protocol
We show the detailed annotation guidelines in the following.

Annotation Guidelines
Overview
Thank you for participating in this annotation task. The goal of this task is to create a query and verify whether a given reference document fully supports the generation for the query.
There are two sub-annotation tasks: 1. Create a query based on a few given keywords under a topic.
2. Verify whether a given answer to a query is fully supported by its references.
Task 1: Create a query for a specific domain.
You will be shown a list of keywords (e.g., inflation rate, CPI, GDP, unemployment rate, etc.) from a specific domain or topic (e.g., economics) and a demo question (e.g., What was the unemployment rate in Germany in 2020?) as inspiration. Then you will be asked to create a new query based on these keywords.
Task 2: Verify whether the generated statement is supported by its reference.
You will be shown a user query, a generative search engine's response, and associated references. You will need to read the query, response, and reference carefully and verify whether the cited evidence fully supports the generation for the query.

E Additional Qualitative Analysis
The qualitative results of ChatGPT are shown in Table 8. Our first observation is that a significant portion (79.4%) of errors happen due to ChatGPT overlooking context clues and failing to make judgments conditioned on the reference (e.g., potentially relying on its own parametric knowledge). The remaining error cases are: 1) fine-grained information insensitivity (13.8%): failure in comparing very fine-grained information such as numerical values, numbers, dates, and times; 2) failure in performing symbolic operations (6.8%): the model fails to verify claims that require performing symbolic operations over the reference, such as verifying set relationships.

F Author Contribution Statement
Xiang Yue conceived the project, conceptualized and designed the study, conducted experiments, wrote the manuscript, and annotated New Bing test examples.Boshi Wang provided critical feedback and edits, revised the manuscript, contributed to the

Figure 4: The influence of individual task data. Combining datasets generally improves model performance.
You will be given three options: a. ''Attributable'': The reference fully supports the generation of the query. b. ''Extrapolatory'': The reference does not contain sufficient information to infer the correctness of the generation. c. ''Contradictory'': The generation contradicts the reference.

Figure 6: Data source distribution of combined training and AttrEval-Simulation sets.

Table 1: Statistics of the training and test datasets for attribution evaluation. We include the distributions of the labels and data sources in Appendix B.

Table 2: The performance (F1 score) of AttrScore with different models on the AttrEval-Simulation and AttrEval-GenSearch sets. The best-performing result in each setting is in bold. The results show both promising signals and challenges (e.g., all models struggle with contradictory errors) in automatic evaluation of attribution.
Claim: In the early 1800s, it was believed that ketchup could be used to treat indigestion, jaundice, and diarrhea.
Reference: According to John Cook Bennett, tomatoes could be used to treat diarrhea, ... Soon, Bennett was publishing tomato ketchup recipes in pill form ... According to Dr. Bennett, this tomato blend can help people with diarrhea, indigestion, rheumatism, and jaundice.
▷ Ground Truth: Extrapolatory
▷ GPT-4: Attributable. The reference supports the claim that in the early 1800s, ketchup was believed to be used as a medicine to treat indigestion, jaundice, and diarrhea. It mentions John Cook Bennett's belief in the medicinal properties of tomatoes and his publication of tomato ketchup recipes in pill form, which were sold as patent medicines in supermarkets and pharmacies.

▷ GPT-4 (for another example): Extrapolatory. The reference provides an estimated release date for GTA 6 based on a legal filing from Microsoft and insider leaks, but it does not confirm an official announcement from Rockstar Games regarding the release date.

Claim: According to Indeed.com, the average salary for a software engineer working at Amazon in the United States is $131,930 per year.
Reference: Average salary $132,147. Salary estimated from 3,612 employees, users, and past and present job advertisements on Indeed in the past 12 months. Last updated: April 18, 2023.
▷ Ground Truth: Contradictory
▷ GPT-4: Attributable. The reference does provide an average salary of $132,147 for a software engineer working at Amazon, which is very close to the claim's figure of $131,930.

Table 3: Qualitative analysis of misclassified examples from prompting GPT-4 in AttrScore.

Table 5: Prompts for QA, answer substitution, and random span generation when simulating contradictory errors. Example (answer substitution): "Please provide a related term or substitution for the given input, which should be different from the input.\nInput: Biden; Output: Obama\n"

Figure 5: Label distribution of training and test sets.