Enabling Large Language Models to Generate Text with Citations

Large language models (LLMs) have emerged as a widely-used tool for information seeking, but their generated outputs are prone to hallucination. In this work, our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability. Existing work mainly relies on commercial search engines and human evaluation, making it challenging to reproduce and compare different modeling approaches. We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations. We develop automatic metrics along three dimensions -- fluency, correctness, and citation quality -- and demonstrate their strong correlation with human judgements. Our experiments with state-of-the-art LLMs and novel prompting strategies show that current systems have considerable room for improvement -- For example, on the ELI5 dataset, even the best models lack complete citation support 50% of the time. Our analyses further highlight promising future directions, including developing better retrievers, advancing long-context LLMs, and improving the ability to synthesize information from multiple sources.


Introduction
Large language models (LLMs; Brown et al., 2020; OpenAI, 2023) have gained increasing popularity as a tool for information seeking. While they generate engaging and coherent responses, their outputs are prone to hallucination and often contain factually incorrect information (Ji et al., 2023). This makes it harder for users to trust and verify LLM-generated outputs without any supporting evidence.
In this work, we study a new generation paradigm for LLMs, in which we require LLMs to provide citations to one or a few text passages for any statement they generate (Figure 1). Incorporating citations brings several benefits: (1) users can easily verify LLMs' claims with the provided citations; (2) LLMs can generate text that faithfully follows cited passages, which has the promise to improve correctness and alleviate hallucination.
Multiple commercial systems have adopted this paradigm: Bing Chat and perplexity.ai respond to user questions in natural language with references to Web pages. Nakano et al. (2021) and Menick et al. (2022) share a similar motivation, but they mainly experiment with commercial search engines and closed-source models, making their results difficult to evaluate. Retrieval-augmented LMs (Borgeaud et al., 2022; Izacard et al., 2022) incorporate retrieved passages during both training and inference, but do not guarantee faithfulness to retrieved passages or explicitly provide citations. Additionally, previous studies mostly rely on human evaluation (Nakano et al., 2021; Menick et al., 2022; Liu et al., 2023), which is expensive and difficult to reproduce. We argue that the absence of automated evaluation hinders the advances of such systems.

Table 1: The three datasets used in our ALCE benchmark. These datasets cover a wide range of question types, and the corresponding corpora span from Wikipedia to a Web-scale document collection.
We present ALCE, the first reproducible benchmark for automatically evaluating LLMs' generations with citations. ALCE assumes a natural-language question and a retrieval corpus, and requires building end-to-end systems to retrieve relevant passages from the corpus, generate a response to the question, and cite corresponding supporting passages. We compile three datasets that cover different types of questions and corpora -- ASQA (Stelmakh et al., 2022), QAMPARI (Rubin et al., 2022), and ELI5 (Fan et al., 2019) -- as shown in Table 1. Different from previous benchmarks (Lee et al., 2019; Bohnet et al., 2022), ALCE evaluates long-text generation, focuses on automatically evaluating citation quality, and allows citing multiple passages for individual statements.
We design automatic evaluation methods in three dimensions: fluency, correctness, and citation quality. Specifically, we use MAUVE (Pillutla et al., 2021) to measure fluency, propose tailored correctness metrics for each dataset, and adopt a natural language inference (NLI) model (Honovich et al., 2022) to measure citation quality. We showcase how the three dimensions together contribute to a robust evaluation, preventing systems from exploiting shortcuts. Additionally, we conduct human evaluation and demonstrate a strong correlation with our automatic metrics.
We experiment with multiple systems built on state-of-the-art LLMs and retrievers, and also propose novel prompting strategies to synthesize retrieved text into the generation. Although all systems are capable of providing fluent and coherent responses, there remains substantial room for improvement in correctness and citation quality: for example, on the ELI5 dataset, around 50% of the generations from our ChatGPT and GPT-4 baselines are not fully supported by the cited passages. Additionally, we find that (1) a closed-book model (generating answers without accessing any retrieved documents) with post-hoc citing achieves good correctness but much worse citation quality; (2) although interactive retrieval approaches (Yao et al., 2023; Schick et al., 2023) offer more flexibility in when/what to retrieve, they do not improve performance on this challenging benchmark; (3) summarizing the retrieved passages into shorter text improves correctness but not citation quality; (4) reranking multiple generations boosts citation quality as measured by human evaluation; (5) incorporating more retrieved passages in context does not help ChatGPT but improves GPT-4's performance.
Our extensive analyses highlight three major challenges of building LLMs to generate text with citations: (1) the retrieval quality is crucial to the final performance and has substantial room for improvement; (2) LLMs' limited context window restricts the number of passages they can incorporate; (3) current LLMs struggle to synthesize multiple documents in context without being distracted by irrelevant ones, although better instruction tuning brings significant improvement. These challenges pose promising research directions for developing better systems integrating retrieval and LLMs.

Task Setup and Datasets
Our task is formalized as follows: Given a query q and a corpus of text passages D, the system is required to return an output S, which consists of n statements s_1, ..., s_n, where each statement s_i cites a list of passages C_i = {c_i,1, c_i,2, ...} with c_i,j ∈ D. In this work, we segment LLMs' output into statements by sentence boundaries. While LLMs may include sentences that do not require a citation, such as "I'm happy to help", we observe that almost all sentences that LLMs output provide valuable information and require citations, similar to findings in Liu et al. (2023). In this work, citations are enclosed in square brackets, such as [1][2].
We divide the corpus D into 100-word passages following previous work on open-domain question answering (Karpukhin et al., 2020; Petroni et al., 2021; Piktus et al., 2021), in contrast to commercial systems like Bing Chat, which cite entire Web pages. We use 100-word passages because they are easier for humans to verify and allow more retrieved passages to fit in LLMs' limited context.
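For concreteness, the passage segmentation can be sketched as a simple whitespace-based procedure; the actual preprocessing in prior work also keeps page titles and sentence boundaries, so the function below is illustrative only:

def chunk_into_passages(document: str, passage_len: int = 100) -> list[str]:
    # Whitespace-based splitting into consecutive ~100-word passages.
    words = document.split()
    return [
        " ".join(words[i : i + passage_len])
        for i in range(0, len(words), passage_len)
    ]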
We choose QA datasets so that (1) they contain factual questions, for which references are important; (2) questions require long-text answers that cover multiple aspects; (3) answering the questions requires synthesizing multiple sources. We select three datasets (Table 1) and introduce them below. See §B for additional statistics.
ASQA (Stelmakh et al., 2022) is a long-form factoid dataset. As shown in Figure 1, each question is an ambiguous question from AmbigQA (Min et al., 2020) that requires multiple short answers to cover different aspects, and the dataset provides a long-form answer that covers all short answers. Since most questions can be answered by Wikipedia, we use the 2018-12-20 Wikipedia snapshot as D.
QAMPARI (Rubin et al., 2022) is a factoid QA dataset constructed from Wikipedia, where the answer is a list of entities drawn from different passages. As with ASQA, we use the 2018-12-20 Wikipedia snapshot as the corpus.
ELI5 (Fan et al., 2019) is a long-form QA dataset built on the Reddit forum "Explain Like I'm Five". Most ELI5 questions are how/why/what questions that require long answers and multiple passages as evidence. Due to the diverse topics discussed in the questions, we use Sphere (Piktus et al., 2021) -- a filtered version of Common Crawl -- as the corpus. The ELI5 dataset is widely used in related work due to its challenging nature (Nakano et al., 2021; Menick et al., 2022; Liu et al., 2023).
We randomly select 1,000 examples from the development set of each dataset for ALCE. Our benchmark primarily assesses the citation capabilities of existing LLMs and does not provide training data, as there are no available examples that provide supervision for citations in these datasets.

Automatic Evaluation
Our benchmark measures the following three dimensions of system responses:
• Fluency: whether the model's generated text is fluent and coherent.
• Correctness: whether the answer is accurate and covers all aspects of interest.
• Citation quality: whether the answer is well supported by the cited passages and no irrelevant passages are cited.
In the following, we present automatic metrics for each dimension and discuss why the combination of the three metrics provides a robust evaluation.

Fluency
We use MAUVE (Pillutla et al., 2021) to evaluate the fluency of the output (§C). We deploy MAUVE for ASQA and ELI5 and omit it for QAMPARI, as QAMPARI only requires a list of short answers as the response and LLMs consistently adhere to this format in our experiments. As MAUVE is sensitive to output length and text style, and most LLMs are capable of producing fluent text, we mainly employ it as a sanity check: we only require that the MAUVE scores are sufficiently high.

Correctness
Our objective is to measure the informativeness and utility of the generation with respect to the question. Liu et al. (2023) propose to directly evaluate perceived utility by humans, a process that is difficult to automate. Therefore, we use correctness -- whether the response is accurate compared to a ground-truth answer -- as a proxy. Evaluating the correctness of long-form generation is a challenging task (Krishna et al., 2021), and we describe our strategy for each dataset below. Figure 2 illustrates the metrics, and we include additional implementation details in §C.
For ASQA, we follow Stelmakh et al. (2022) and calculate the recall of correct short answers by checking whether the short answers (provided by the dataset) are exact substrings of the generation (exact match recall; EM recall).

For QAMPARI, we follow Rubin et al. (2022) and calculate the precision and recall of the model prediction by checking the exact match to the gold answer list. We add one additional adjustment: considering that users often only want to know a few example answers to the question, our evaluation considers recall to be 100% if the prediction includes at least 5 correct answers (recall-5).

Unlike ASQA and QAMPARI, the ELI5 dataset does not provide short entity answers. Fan et al. (2019) use ROUGE for evaluation, which does not reflect correctness well (Krishna et al., 2021; §A). Inspired by work on summarization evaluation (Zhang and Bansal, 2021; Kamoi et al., 2023; Wang et al., 2020), we use InstructGPT (text-davinci-003; Ouyang et al., 2022) to generate three "sub-claims" from each gold answer. Then we use TRUE (Honovich et al., 2022), a T5-11B (Raffel et al., 2020) model fine-tuned on a collection of natural language inference (NLI) datasets, to check whether the model output entails the sub-claims (claim recall). TRUE targets factual correctness and has been used by previous work in similar contexts (Bohnet et al., 2022; Gao et al., 2023). We demonstrate that claim recall provides a more accurate measure of correctness than existing metrics (more details in §A).
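As a rough illustration of the ASQA and QAMPARI correctness metrics, the computations can be sketched as follows; the answer normalization following Rajpurkar et al. (2016) that we apply in practice (§C) is omitted, and only lowercasing is shown:

def em_recall(response: str, short_answer_sets: list[list[str]]) -> float:
    # ASQA: fraction of gold short answers (any alias counts) that appear
    # as exact substrings of the model response.
    text = response.lower()
    hits = sum(
        any(alias.lower() in text for alias in aliases)
        for aliases in short_answer_sets
    )
    return hits / len(short_answer_sets)


def recall_5(predicted: list[str], gold: list[str]) -> float:
    # QAMPARI: ordinary recall over the gold entity list, except that
    # recall is counted as 100% once at least 5 gold answers are predicted.
    gold_set = {g.lower() for g in gold}
    correct = len({p.lower() for p in predicted} & gold_set)
    return 1.0 if correct >= 5 else correct / len(gold_set)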

Citation Quality
We evaluate citation quality using two metrics: (1) citation recall, which determines if the output is entirely supported by cited passages, and (2) citation precision, which identifies any irrelevant citations. Although we prioritize citation recall, as it entails a well-supported and truthful answer, enhancing precision is crucial for better user satisfaction, reducing the need for human review of extraneous passages. Figure 3 provides an illustrated example.

Figure 3: Illustration of our citation quality metrics. Citation recall is 1 for a statement if the concatenation of all its cited passages fully supports the statement (verified with an NLI model); in the illustrated example, two of three statements are supported, giving a citation recall of 2/3. Citation precision detects "irrelevant" citations: a citation that alone does not support the statement and whose removal does not affect whether the remaining citations support it. If recall is 0, precision is 0.
We use the NLI model TRUE (Honovich et al., 2022) again to automatically examine whether the cited passages entail the model generation. We conduct human evaluation (§6) to demonstrate the strong correlation between our metric and human judgement.
Citation recall. We calculate the citation recall of each statement (0 or 1) and average over all statements in the model response. For each statement s_i, its citation recall is 1 if and only if there is at least one citation (C_i ≠ ∅) and ϕ(concat(C_i), s_i) = 1, where ϕ(premise, hypothesis) is the NLI model that outputs 1 if the premise entails the hypothesis and 0 otherwise, and concat(C_i) concatenates all passages in C_i (details in §C). The NLI evaluation is in accordance with the attributable-to-identified-sources (AIS) framework (Rashkin et al., 2023).

Citation precision. Our citation precision evaluation detects citations that are irrelevant, but it does not require citing a minimal set. We follow this design because human writing often cites redundant sources to enhance credibility; human readers may also appreciate multiple citations, especially for critical claims such as medical advice.
We calculate the citation precision for each citation (0 or 1) and average over all citations in the response. We first define when a citation is "irrelevant". Intuitively, a citation c_i,j is "irrelevant" if (a) c_i,j itself cannot support s_i and (b) removing c_i,j does not affect whether the rest of the citations support s_i. Formally, c_i,j is "irrelevant" if and only if ϕ(c_i,j, s_i) = 0 and ϕ(concat(C_i \ {c_i,j}), s_i) = 1; a citation c_i,j has a precision of 1 if s_i has recall = 1 and c_i,j is not irrelevant. For example (Figure 3), when s_3 cites three references [2][4][5] and recall = 1, [2] is "irrelevant" if ϕ([2], s_3) = 0 and ϕ([4][5], s_3) = 1. For condition (b) to work, we set recall = 1 as a prerequisite for precision = 1. Note that this algorithm overlooks the scenario where one citation partially supports the statement. We discuss the details in §E.
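The two citation metrics can be sketched as follows; the nli_model interface is a placeholder standing in for the TRUE entailment model ϕ, not its actual API:

def entails(nli_model, premise: str, hypothesis: str) -> bool:
    # phi(premise, hypothesis): True if the premise entails the hypothesis.
    # nli_model.predict is an assumed placeholder interface.
    return bool(nli_model.predict(premise, hypothesis))


def citation_scores(statements, citations, passages, nli_model):
    # statements: list of sentences s_i; citations: list of lists of passage
    # ids C_i; passages: mapping from passage id to passage text.
    recalls, precisions = [], []
    for s_i, c_i in zip(statements, citations):
        concat_all = " ".join(passages[c] for c in c_i)
        recall = int(bool(c_i) and entails(nli_model, concat_all, s_i))
        recalls.append(recall)
        for c in c_i:
            if recall == 0:
                precisions.append(0)  # recall = 0 forces precision = 0
                continue
            rest = " ".join(passages[x] for x in c_i if x != c)
            # c is "irrelevant" if it neither supports s_i on its own nor is
            # needed for the remaining citations to support s_i
            irrelevant = (
                not entails(nli_model, passages[c], s_i)
                and len(c_i) > 1
                and entails(nli_model, rest, s_i)
            )
            precisions.append(0 if irrelevant else 1)
    recall_score = sum(recalls) / max(len(recalls), 1)
    precision_score = sum(precisions) / max(len(precisions), 1)
    return recall_score, precision_score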

ALCE is Robust to Shortcut Cases
We showcase how the ALCE evaluation is robust to two possible shortcuts in §D: (1) using the top-1 retrieved passage as the response and citing itself, and (2) using the first two sentences of the top-1 passage. Both cases achieve almost-perfect citation scores, but (1) has low fluency due to its unnaturally long length compared to human answers, and (2) has low correctness due to low coverage.

Modeling
In this section, we discuss three major modeling components for an ALCE system -- retrieval, synthesis, and post-editing.

Retrieval
We explore simple, off-the-shelf retrievers. We use dense retrievers for Wikipedia, including GTR (Ni et al., 2022) and DPR (Karpukhin et al., 2020); we use BM25 for Sphere. For each question, we retrieve the top-100 passages.
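A minimal sketch of the two retrieval setups is shown below, assuming the sentence-transformers and rank_bm25 packages; the GTR checkpoint name is a smaller stand-in rather than the checkpoint we use, and cosine similarity is used for simplicity although GTR is typically paired with dot-product scores:

from sentence_transformers import SentenceTransformer, util
from rank_bm25 import BM25Okapi


def gtr_topk(question: str, passages: list[str], k: int = 100) -> list[int]:
    # Dense retrieval with a GTR checkpoint (small stand-in model name).
    model = SentenceTransformer("sentence-transformers/gtr-t5-base")
    q_emb = model.encode(question, convert_to_tensor=True)
    p_emb = model.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, p_emb)[0]
    return scores.topk(min(k, len(passages))).indices.tolist()


def bm25_topk(question: str, passages: list[str], k: int = 100) -> list[int]:
    # Sparse retrieval with BM25 over whitespace-tokenized passages.
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    scores = bm25.get_scores(question.lower().split())
    return sorted(range(len(passages)), key=lambda i: -scores[i])[:k]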

Synthesis
We focus on how to prompt an LLM to interact with the retriever and to synthesize and cite the evidence (without fine-tuning internal parameters). One noteworthy challenge is that existing LLMs all have limited context windows and thus can only fit a handful of passages.
VANILLA. We simply provide the model with the top-k passages and instruct the model to cite accordingly (Table 2). We also use in-context learning (Brown et al., 2020) and prepend two demonstrations. The complete instruction is in Table 23.

SUMM/SNIPPET. With a 4K context window, we can at most safely fit k = 5 passages. As shown in Figure 4, the top-5 retrieved passages only cover 56.8% of the answers in ASQA.
To tackle this limitation, we propose to provide summaries or snippets of passages instead of the full text (summaries are abstractive, while snippets are spans extracted from passages). We acquire summaries and snippets by prompting ChatGPT with instructions (prompts in Tables 25 and 26). Then we replace all passages with their summaries/snippets. Summaries and snippets significantly reduce passage length, allowing more passages to fit in context: for ASQA, they reduce passage length by 6× on average.
Though SUMM/SNIPPET allows for more retrieved passages, they are lossy compressions. To alleviate this problem, we propose INTERACT, an interactive prompting scheme that allows the model to check the full text of certain passages. At each step, the model can execute one of three actions: (1) "Check: Document [1][2]" to check the full text of the corresponding documents; (2) "Output:" to output a statement of the answer; (3) "End." to end the generation. §C provides more details.

INLINESEARCH. The above methods all display retrieval results at the beginning of the prompt. In INLINESEARCH, we instead allow LLMs to call "search" during the generation process (Yao et al., 2023; Press et al., 2022; Jiang et al., 2023). At each step, the model can execute one of three actions: "Search: {query}" to search among the top-100 passages using GTR; the "Output" and "End" actions are the same as in INTERACT. For each "Search" action, we display the best retrieved passage in the context. The passage is removed after one action to save context space. Table 3 shows an example.
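The interactive loop can be sketched as follows, where llm and search are placeholder callables rather than our actual implementation:

def inline_search_generate(llm, search, question: str, max_steps: int = 20) -> str:
    # llm(prompt) -> the next action string; search(query) -> best passage text.
    context, answer, passage = f"Question: {question}\n", [], ""
    for _ in range(max_steps):
        action = llm(context + passage).strip()
        passage = ""  # the retrieved passage is dropped after one step
        if action.startswith("Search:"):
            query = action[len("Search:"):].strip()
            passage = "Passage: " + search(query) + "\n"
            context += action + "\n"
        elif action.startswith("Output:"):
            answer.append(action[len("Output:"):].strip())
            context += action + "\n"
        else:  # "End." or an unrecognized action stops generation
            break
    return " ".join(answer)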
CLOSEDBOOK. We also add a simple closed-book baseline, where the model is only prompted with the instruction and the question, without any retrieved passages. Consequently, this variant does not cite any evidence.

Post-editing
In this section we discuss two strategies for refining the output to further improve its quality.
RERANK. We randomly sample n_sample = 4 responses for each question and select the best response using the automatic citation recall score. We expect RERANK to improve citation quality.
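A sketch of RERANK, with generate and score_citation_recall as placeholders for the sampling call and the NLI-based citation recall metric described above:

def rerank(generate, score_citation_recall, question: str, n_sample: int = 4) -> str:
    # Sample several candidate responses and keep the one with the highest
    # automatic citation recall.
    candidates = [generate(question) for _ in range(n_sample)]
    return max(candidates, key=score_citation_recall)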

POSTCITE. For each statement, we find the best-matching passage among the top-100 retrieved passages using GTR and cite it. We combine this with CLOSEDBOOK in our experiments.
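A sketch of POSTCITE, where the encoder argument is a placeholder for a GTR-style sentence encoder:

from sentence_transformers import util


def postcite(statements: list[str], passages: list[str], encoder) -> str:
    # For each statement, cite the single best-matching passage among the
    # top-100 retrieved ones by embedding similarity.
    p_emb = encoder.encode(passages, convert_to_tensor=True)
    cited = []
    for s in statements:
        s_emb = encoder.encode(s, convert_to_tensor=True)
        best = int(util.cos_sim(s_emb, p_emb)[0].argmax())
        cited.append(f"{s} [{best + 1}]")
    return " ".join(cited)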

Main Results
We present the main results on the three datasets in Tables 4, 5, and 6, respectively (full results in §G.6). We first note that all models achieve good fluency scores (except some models on ELI5, mainly due to their longer generations). We summarize the main takeaways from the experiments below.
VANILLA achieves strong performance. GPT-4 brings limited improvement but is better at using long context. We evaluate GPT-4 with VANILLA and different numbers of passages (more results in §G.6). GPT-4 brings consistent (but limited) improvement on correctness, though often at a cost to citation quality. GPT-4 can also incorporate more passages due to its longer context window, which boosts both correctness and citation quality. In contrast, including more passages with ChatGPT-16K does not improve the results (Table 7), suggesting that processing more passages is non-trivial and that GPT-4 is better at synthesizing information from its long context than ChatGPT.

Comparison of Different LLMs
Table 7 compares different LLMs on ASQA using VANILLA (more results in §G.6). Notably, instruction-tuned models (Vicuna-13B and LLaMA-2-Chat) outperform the original LLaMA models in correctness and considerably enhance citation quality. We observe that while the original LLaMA models are able to copy facts from the context, they struggle to cite sources accurately or simply do not cite. The best open-source model, LLaMA-2-70B-Chat, achieves a correctness score comparable to the OpenAI models, but still lags behind in citation quality.

Retrieval Analysis
The retrieval results play a crucial role in both the correctness and the citation quality. Figure 4 presents the retrieval recall@k on the three datasets.

Figure 4: Retrieval recall@k on ASQA (EM recall), QAMPARI (recall-5), and ELI5 (claim recall). Retrieval recall serves as an upper bound for model performance, and we compare it with two models' correctness results in the figure (dashed lines): "Vanilla (5-psg)" is ChatGPT VANILLA with top-5 passages in context; "Oracle" is the same model except that it uses 5 gold passages (§G.1), whose recall matches recall@100 on all three datasets.

Figure 4 also shows the correctness performance of two models: (1) ChatGPT VANILLA with top-5 passages (our primary baseline); (2) an oracle version of the same model employing 5 gold passages (§G.1; the 5 gold passages match the retrieval recall@100). Notably, both models' correctness lags behind the corresponding retrieval recall (except for ELI5 top-5). This discrepancy suggests that despite the presence of accurate answers in context, LLMs struggle to utilize them in their outputs.
We compare the impact of different retrievers and different numbers of passages on LLMs. Figure 4 (right) shows that GTR outperforms DPR in both correctness and citation quality, emphasizing the importance of deploying better retrievers. Contrary to the retrieval recall trend in Figure 4, more passages in context do not yield substantial improvement for ChatGPT: correctness plateaus at the top-1 passage and citation quality plateaus at top-3. GPT-4 (Table 7) exhibits an increasing trend with more passages, but the improvement is not proportional to the retrieval performance. This indicates the limited ability of LLMs to utilize multiple passages within context.

Other Ablations
We provide additional ablations in §G. In summary, we find that (1) using comprehensive instructions enhances the citation quality of instruction-tuned models (§G.2); (2) including at least one demonstration improves the performance (§G.3); (3) fine-tuned models (FiD; Izacard and Grave, 2021) with POSTCITE lag behind LLMs in both correctness and citation quality and fail to generalize (§G.4).

Human Evaluation
To verify that our automatic evaluation correlates with human judgement, we conduct human evaluation on selected models and ask workers to judge model generations on three dimensions, similar to Liu et al. (2023): (1) utility: a 1-to-5 score indicating whether the generation helps answer the question; (2) citation recall: the annotator is given a sentence and all passages that the sentence cites, and is asked to judge whether the passages fully support the sentence; (3) citation precision: given a sentence and one of its citations, the annotator is asked to judge whether the citation "fully supports", "partially supports", or "does not support" the sentence. Each citation gets a precision score of 1 if the output sentence has a citation recall of 1 and this citation at least "partially supports" it. See Appendix F for more details.
Model outputs score high utility. The utility scores do not differ significantly between models, ranging from 3.7 to 3.9 for ASQA and from 3.5 to 3.6 for ELI5. Upon inspection, all tested models are mostly able to output fluent answers that are related to the question, despite differences in factual correctness.
Our automatic evaluation of citation quality strongly correlates with human judgements. As shown in Table 8 (ASQA) and Table 9 (ELI5), the relative rankings induced by human and by our automatic metrics are consistent. The absolute citation scores from humans and ALCE are very close except for RERANK (which uses the automatic citation recall for reranking). This suggests that an improvement on the ALCE citation metrics translates to an improvement on human preferences. Furthermore, the Cohen's kappa coefficient between human and ALCE judgements suggests substantial agreement for citation recall (0.698) and moderate agreement for citation precision (0.525). We also show in §G.5 that our automatic evaluation achieves high accuracy when treating human annotations as gold labels (85.1% for citation recall and 77.6% for citation precision).

Related Work

Generating text with citations is closely related to attribution. Rashkin et al. (2023) define the "attributable to identified sources" (AIS) score to measure how faithful a generated text is to its sources. Bohnet et al. (2022) apply AIS scores on a single-document short-answer QA dataset. Honovich et al. (2022) and Yue et al. (2023) study automatic evaluations for the AIS score. A concurrent work (Liu et al., 2023) conducts human evaluation on commercial generative search engines to examine their citation quality. Scientific citation text generation (Funkquist et al., 2022) is a related task to ALCE, where the model is provided the papers-to-cite and the context and is required to recover the citing text. It differs from ALCE in that all citations are provided and the model only needs to perform the summarization.

Retrieval-augmented LMs. Many studies have explored augmenting LMs with externally retrieved information. Guu et al. (2020), Borgeaud et al. (2022), and Izacard et al. (2022) pre-train language models with retrieved passages, while Khandelwal et al. (2020) and Zhong et al. (2022) augment LLMs' output by interpolating it with a kNN module; none of them explicitly provide citations to the retrieved sources. Other works prompt or fine-tune LLMs to "retrieve on-the-fly" (Parisi et al., 2022; Schick et al., 2023; Shuster et al., 2022; Jiang et al., 2023; Yao et al., 2023; Press et al., 2022), which offers flexibility in when and what to search. Gao et al. (2023) and He et al. (2022) propose to first generate text without accessing external documents and then retrieve relevant documents and revise the generation to be consistent.
Among previous explorations, Nakano et al. (2021) and Menick et al. (2022) are the closest to our setting, where LLMs are trained to answer questions while providing citations. However, they do not explore retrieval strategies and simply use commercial search engines, which are not reproducible, and their models and training data are closed-source. To the best of our knowledge, we are the first to implement end-to-end systems that retrieve, synthesize, and cite documents with LLMs.

Conclusion
We propose ALCE, the first automatic benchmark for evaluating LLM generations with citations. We deploy automatic metrics to measure fluency, correctness, and citation quality, and verify their efficacy via human evaluation. We explore a variety of strategies for incorporating citations in LLMs and demonstrate that current systems have considerable room for improvement on ALCE.
Our experiments highlight a number of promising research directions, including (1) enhancing retrieval and refining retrieval integration in LLMs, (2) developing long-context LLMs, and (3) advancing LLMs' ability to synthesize multiple sources. What is even more intriguing is that these research proposals extend beyond the ALCE setup (for example, long-context LLMs have numerous exciting applications), and ALCE can serve as a valuable testbed for their development.

Limitations
Our evaluation still has room for improvement: (1) MAUVE is found to be sensitive to output length and may provide unstable results; (2) for ELI5's correctness evaluation, the automatically generated claims may not cover all possible answers due to the open-ended nature of the questions; (3) our citation quality evaluation is limited by the accuracy of the NLI model; for citation precision, the NLI model cannot detect the case of "partial support" and thus leads to lower citation precision scores than the human evaluation.
Although we believe our curated datasets closely resemble the distribution of real-world user questions, we acknowledge that they do not cover more challenging scenarios, such as multi-hop reasoning, math reasoning, and code completion.
In our experiments, we focus on prompting LLMs without updating their model weights. Training a model directly to incorporate citations remains challenging due to the lack of supervised data. However, we observe that certain human-instruction datasets contain examples similar to our task setup. We leave the exploration of training LLMs to generate citations for future work.

A Generating Claims for ELI5
We elect not to use ROUGE-L as our main correctness metric, since it does not account for the different ways of expressing the same answer and can be easily gamed (Krishna et al., 2021). We further illustrate this issue in Table 10: a system can easily achieve a high ROUGE-L score by retrieving and returning the top passage from a BM25 index. However, the claim-based evaluation metric does not reward this approach, since the output often lacks different aspects of the answers. Instead, we leverage the original answers to generate sub-claims and use them as an estimate of the different aspects of the answers that we expect the model to cover. This approach is inspired by work on summarization evaluation and claim verification (Zhang and Bansal, 2021; Kamoi et al., 2023; Wang et al., 2020).
Specifically, we use text-davinci-003 to generate the sub-claims. We first manually annotate three question-answer pairs from the original ELI5 training set with 3 sub-claims each. Then, we prompt text-davinci-003 with these pairs as demonstrations. The full prompt with an example is shown in Table 22.
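A sketch of the generation call, using the legacy (pre-v1) openai package that exposes text-davinci-003; the prompt layout and output parsing below are illustrative simplifications, and the exact prompt is the one shown in Table 22:

import openai  # legacy pre-v1 interface exposing the Completion endpoint


def generate_subclaims(question: str, answer: str, demos: str) -> list[str]:
    # demos holds the three hand-annotated demonstrations concatenated as text.
    prompt = demos + f"\nOriginal question: {question}\nPassage: {answer}\n"
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=200, temperature=0.0
    )
    text = resp["choices"][0]["text"]
    return [
        line.split(":", 1)[1].strip()
        for line in text.splitlines()
        if line.startswith("Claim")
    ]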
InstructGPT generates coherent and faithful sub-claims. To ensure that the generated sub-claims are of good quality, we manually inspect a random sample of 40 answers and their generated sub-claims (totaling 120 sub-claims). For each sub-claim, we assign a score of 1 if it is relevant to the question and faithful to the facts presented in the ground truth, and 0 otherwise. We find that 112 out of the 120 (93.33%) sub-claims receive a score of 1, meaning that our generated sub-claims are of high quality and faithful to the ground truth. Furthermore, the generated sub-claims contain 14 words on average and are typically just one sentence long. This is aligned with the intent behind the metric -- to capture short factual claims made by the original answer.
NLI model accurately predicts the entailment of sub-claims. We further analyze our sub-claim evaluation metric by checking the error rate of the final prediction of the NLI model. To this end, we first manually annotate the entailment scores between 40 outputs and their sub-claims (120 pairs in total; these are the same questions as in the previous analysis). We then use the NLI model to obtain the entailment scores for the outputs and sub-claims. Using the human annotations as the ground-truth labels, we find that the NLI model achieves an accuracy of 80.0%.

B Dataset Statistics
For ASQA, human answers have an average length of 65 words. For QAMPARI, each question has 13 answers on average. For ELI5, human answers have an average length of 131 words.

C Implementation Details

MAUVE. When running MAUVE, we concatenate the question and the model output (or human answer) with a space. We truncate both the references and the model generations to 100 words, as we found MAUVE results to be unstable beyond this length for ELI5 (many ELI5 human answers are extremely long).
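A sketch of this computation, assuming the mauve-text package; whether the prepended question counts toward the 100-word truncation is a simplification made here:

import mauve  # the mauve-text package


def mauve_score(questions, references, generations, max_words: int = 100) -> float:
    def prep(question: str, text: str) -> str:
        # concatenate question and text with a space, then truncate by words
        return " ".join((question + " " + text).split()[:max_words])

    p_text = [prep(q, r) for q, r in zip(questions, references)]
    q_text = [prep(q, g) for q, g in zip(questions, generations)]
    return mauve.compute_mauve(p_text=p_text, q_text=q_text).mauve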
Exact match for ASQA and QAMPARI. Both ASQA and QAMPARI provide aliases for their short answers. We normalize the response and the short answers similarly to Rajpurkar et al. (2016) and report the score with the best-matching aliases. For ASQA, Stelmakh et al. (2022) also propose a QA-based evaluation, which we found to be less stable, and thus we do not report it in our paper.
Output truncation. Before evaluation, we truncate model output at new lines, as non-instruction-tuned models may generate irrelevant content after new lines.
INTERACT. Empirically, we found that models tend to execute too many consecutive "check" actions, so we force the model to always "output" after each "check". We limit the maximum number of passages to check to 3 to avoid exceeding the length limit. The full passages are removed from the context after one action to save context space. Table 27 provides an example of INTERACT.

Main experiments. For all experiments except ChatGPT RERANK, we run each model three times with different seeds, and each time we sample two demonstrations from a pool of four. We report the averaged scores for all experiments in the main paper and report the standard deviations in Appendix G.6.
Decoding methods. Based on preliminary experiments, we choose the following decoding methods: for ChatGPT and GPT-4, we use sampling with temperature 0.5; for all open-source models, we use nucleus sampling (Holtzman et al., 2020).

D Shortcut Cases

Table 11 shows the experiments demonstrating that ALCE is robust to shortcut cases. Using the top-1 passage, or the first two sentences of the top-1 passage, yields almost perfect citation quality, but fluency and correctness are dramatically lower.

E Citation Precision Discussion
Our citation precision evaluation cannot detect a citation that partially supports the statement and hence will falsely penalize it. Consider a statement s_3 with citations [2][4][5], where [2] partially supports s_3: [2] will be counted as "irrelevant" while it should not be penalized. Liu et al. (2023) conduct human evaluation on citation precision in a different way: for each citation, they ask annotators to judge whether the citation (1) fully supports, (2) partially supports, or (3) does not support s_i. A citation c_i,j is precise if (a) c_i,j fully supports s_i, or (b) C_i fully supports s_i, c_i,j partially supports s_i, and no c ∈ C_i alone fully supports s_i. This evaluation resolves the corner case we mention in the main paper (one citation partially supports the claim but is identified as "irrelevant"). However, it is challenging to conduct such an evaluation automatically, as there is no existing model that can judge whether a citation "partially" supports a claim. We also explore prompting ChatGPT to perform this task, which yields poor results. We leave it to future work to collect supervised data and train a better ϕ that can detect "partial support".

F Human Evaluation
We employ Surge AI (https://www.surgehq.ai/) for our human evaluation. The average pay to workers is 20 USD per hour. We randomly sample 100 examples from ASQA and ELI5 and annotate the outputs of selected models: ChatGPT VANILLA, ChatGPT RERANK, and Vicuna-13B VANILLA.

F.1 Utility
To check if the model output is useful to downstream users, we measure the utility of the response S. We first show the query q and model response S to the worker and ask them to rate their agreement with the statement "The response is a helpful and informative answer to the query" on a Likert scale of 1-5, corresponding to Strongly Disagree, Disagree, Neutral, Agree, and Strongly Agree.

F.2 Citation Recall
The annotators are shown the question q, the statement s_i, and all of its citations C_i, and they rate whether the joint set of citations fully supports the statement (recall = 1) or does not support all of its claims (recall = 0). We calculate the overall recall score for the generation by averaging the recall scores of all statements.

F.3 Citation Precision
We show the question q and a pair of a statement s_i and one of its citations c_i,j ∈ C_i to the annotator. We ask the annotator whether the citation fully supports, partially supports, or does not support the factual claims in s_i. Citation c_i,j has a citation precision of 1 if s_i has a recall of 1 and c_i,j fully or partially supports s_i. Finally, we average the precision scores of all citations in the response S to obtain the citation precision score.

G.1 Retrieval Analysis
Oracle. Since the original datasets do not contain gold passages at the same granularity as our setting (100-word passages), we approximate gold passages by running the following algorithm on the top-100 retrieved passages. We first calculate the recall score of each passage. Then, we sort the passages by recall score and take the top 5 as our initial oracle set. Finally, we iterate through all passages not initially in the oracle set and try to replace passages in the oracle set in a greedy fashion: we calculate the change in the recall score of the oracle set for every possible replacement and proceed with the replacement that yields the largest recall improvement. The resulting set of 5 oracle passages matches the recall scores of the top-100 retrieved passages.
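A sketch of this greedy procedure, where recall_fn is a placeholder that scores a set of passages against the gold answers (EM recall, recall-5, or claim recall):

def select_oracle_passages(passages: list[str], recall_fn, k: int = 5) -> list[str]:
    # recall_fn(list_of_passages) -> recall of gold answers covered by the set.
    ranked = sorted(range(len(passages)), key=lambda i: -recall_fn([passages[i]]))
    oracle = ranked[:k]  # start from the k individually best passages
    for cand in ranked[k:]:
        base = recall_fn([passages[i] for i in oracle])
        best_gain, best_slot = 0.0, None
        for slot in range(k):
            trial = oracle[:slot] + [cand] + oracle[slot + 1:]
            gain = recall_fn([passages[i] for i in trial]) - base
            if gain > best_gain:
                best_gain, best_slot = gain, slot
        if best_slot is not None:  # keep the swap with the largest improvement
            oracle[best_slot] = cand
    return [passages[i] for i in oracle]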
Detailed retrieval results. We show detailed retrieval results in Tables 12, 13, and 14.

G.2 Effect of Instructions
Table 15 shows the results of using the full instruction (Table 23) and a short version of the instruction (Table 24). We see that the full version induces better correctness and citation recall, while the two instructions lead to similar citation precision.

G.3 Effect of Demonstrations
Table 16 shows the effect of different numbers of demonstrations. We see that the number of demonstrations does not affect ChatGPT's correctness, but using at least one demonstration ensures high citation recall. For the original LLaMA model, Table 16 shows that more demonstrations lead to better performance.

G.4 Fine-tuned Models
To better understand the differences between fine-tuned models and prompted large language models, we train a state-of-the-art question answering model, Fusion-in-Decoder (FiD; Izacard and Grave, 2021), and evaluate it in conjunction with POSTCITE. Due to the lack of training data with citation annotations, we first train a T5-base FiD model for 5 epochs on the ASQA training set with a batch size of 64 and a learning rate of 1e-4. During evaluation, we use POSTCITE to add citations to the output. We use k = 5 passages during both training and evaluation of the FiD model.
Then, we evaluate this model on both ASQA (in-domain) and ELI5 (out-of-domain); the results are shown in Tables 17 and 18. Note that this is not a direct comparison, as ALCE assumes only evaluation data is available and uses only few-shot data for prompting. As the results show, the FiD baseline still significantly lags behind prompting ChatGPT in both correctness and citation quality (even though it is trained on 4,000+ examples). When tested on another dataset (ELI5), FiD performs even worse, showing that it is challenging to solve the problem by fine-tuning a small pre-trained model.

G.5 More Human Evaluation
We evaluate the accuracy of our automatic metrics by treating the human annotations as gold labels.
For citation recall, ALCE achieves an accuracy of 85.1%; for citation precision, ALCE has an accuracy of 77.6%. Regarding detecting insufficient citations, ALCE has a recall of 82.3% and a precision of 84.2%; regarding detecting "irrelevant" citations, ALCE has a recall of 75.6% and a precision of 66.1%. ALCE is effective at detecting "irrelevant" citations, but due to the limitation of the NLI model (it cannot detect "partial support"), it has a relatively high false positive rate.

G.6 Main Results
We show the full results of our experiments, along with standard deviations, in Tables 19, 20, and 21. We repeat all experiments with three different random seeds. However, for ChatGPT RERANK, we use only one seeded run, since each run repeats the generation step four times and more experiments would incur significant costs.
Original question: How do we hear differences in sound besides volume and pitch?
Passage: Pitch refers to the frequency of soundwave, and volumn refers to the amplitude of the soundwave. Besides volumn and pitch, we can also tell the difference between sounds based on the tone of sound. For example, we can differentiate the sound of different instruments based on the tone of the sounds.
Claim 1: Volume of sound is the amplitude of the soundwave.
Claim 2: Pitch is the frequency of soundwave.
Claim 3: We can use the tone of the sounds to differentiate the sound of different instruments.

Original question: How are we able to discern whether a sound is coming from in front of us or behind us?
Passage: There are multiple explanations for why we can localize sounds. One explanation is that sounds travelling to the corresponding side of one's ear will be slightly louder. Another explanation is that there is a slight difference in the hitting time to one's left and right ear based on the sound's direction. However, these explanation means that when a sound is exactly in front of someone or exactly behind someone, he or she can not tell the difference.
Claim 1: We can localize sounds by recognizing that the sound travelling to the corresponding side of one's ear will be slightly louder.
Claim 2: We can also localize sounds by recognizing the difference in hitting time to one's left and right ear based on the sound's direction.
Claim 3: We cannot tell the difference between a sound that is exactly in front of us or exactly behind us.

Table 22: Prompt used to generate the sub-claims for ELI5 questions. Blue text is model generation. Brown text is the ELI5 example that we want to generate sub-claims for. We construct the prompt by manually writing the sub-claims for three questions from the training set.
Instruction: Write an accurate, engaging, and concise answer for the given question using only the provided search results (some of which might be irrelevant) and cite them properly. Use an unbiased and journalistic tone. Always cite for any factual claim. When citing several search results, use [1][2][3]. Cite at least one document and at most three documents in each sentence.
If multiple documents support the sentence, only cite a minimum sufficient subset of the documents.
Table 23: Instruction for VANILLA.
Instruction: Write a high-quality answer for the given question using only the provided search results and cite them properly using [1][2][3].

Table 24: Short instruction for VANILLA.

Figure 1: The task setup of ALCE. Given a question, the system generates text while citing passages from a large retrieval corpus. Each statement may contain multiple citations (e.g., [1][2]).

When did the US break away from England?
A: The US declared independence on July 2, 1776 [1][2] ... The Treaty of Paris was later signed on September 3, 1783 [3].

Table 2: An example of our VANILLA method. Different colors represent the prompt, model generation, and <actions>. We also provide two in-context demonstrations before the test example.

Table 3: An example of INLINESEARCH.

Table 4: Experiments on ASQA. For CLOSEDBOOK, we use POSTCITE to get citations. k-psg: putting the top-k passages from the retrieval results into the context. Chat-13B and Chat-70B refer to LLaMA-2-Chat.

Table 5: Experiments on QAMPARI. "Rec.-5": we set the recall to 100% if the prediction includes at least 5 correct answers.
... current LLMs are not proficient in interactive usage.

Retrieving text on the fly does not improve performance. All datasets show that VANILLA outperforms INLINESEARCH on citation quality (and on correctness for ASQA and ELI5). By manually examining the examples, we find that it is challenging for the model to ask detailed questions without seeing any passages. To improve INLINESEARCH, one may need to provide more context about the questions in advance or encourage the model to call retrievers with more detailed and diverse queries.

Table 6: Experiments on ELI5. We use claim recall for the correctness evaluation. Chat-13B and Chat-70B refer to LLaMA-2-Chat.

Table 8: Human citation quality evaluation vs. ALCE citation quality evaluation on ASQA.

Table 9: Human citation quality evaluation vs. ALCE citation quality evaluation on ELI5.

Table 15: Effect of different instructions on ASQA.

Table 17: Comparison of Fusion-in-Decoder with ChatGPT on ASQA. Both models use top-5 GTR passages.

Table 18: Comparison of Fusion-in-Decoder with ChatGPT on ELI5. Both models use top-5 GTR passages.

Table 21: ELI5 full results.

the original question and passage, and generate 3 additional claims that are supported by the passage and answer the question.
Original question: What's the difference between Shia vs. Sunni Islam?
Passage: The main difference between Shia and Sunni Muslim is related to ideological heritage and issues of leadership. This difference is first formed after the death of the Prophet Muhammad in 632 A.D. The ideological practice of the Sunni branch strictly follows Prophet Muhammad and his teachings, while the Shia branch follows Prophet Muhammad's son-in-law Ali. Nowadays, Sunni and Shia are the major branches of Islam.
Claim 1: The major branches of Islam are Sunni and Shia.
Claim 2: Prophet Muhammad died in 632 A.D.
Claim 3: The ideological practice of the Sunni branch strictly follows Prophet Muhammad and his teachings.

Instruction: Write an accurate, engaging, and concise answer for ...

Document [1] (Title: How to Treat and Prevent Food Poisoning - MsPrepper) just a typical gastro upset. Salmonella is most commonly caused by eating undercooked or raw foods like eggs or meat. You know how your mom always warned you not to eat raw cookie dough? This is why. Most people do eat cookie dough and they are fine, but salmonella is a risk. If you do contract salmonella, you could start to feel bad within a couple of hours after eating contaminated food, and sometimes it could take a day or two. Common symptoms are nausea and vomiting, loose stools (sometimes bloody), flu like symptoms, and stomach cramps. To treat

Document [2] (Title: FDA Issues Warning About Eating Raw Cookie Dough, But Not For Salmonella Risks) FDA Issues Warning About Eating Raw Cookie Dough, But Not For Salmonella Risks Used to licking the spoon or placating yourself with full-on chunks of raw cookie dough? The Food and Drug Administration issued a warning on Tuesday that strongly advises against continuing the habit. The agency asserted that consuming raw batter of any kind, whether for bread, cookies or pizza, could make a person sick. While you may have been warned in the past against eating raw dough due to the risk of contracting salmonella from raw eggs, the FDA is citing raw flour as the culprit for a

Document [3] (Title: It's Probably OK to Eat Raw Cookie Dough - As Long As You're Smart About It - The Crux - Very Top Secret Information) First, when most people think about health risks and cookie dough, they think about raw egg. Eggs can be contaminated with salmonella bacteria, and food safety recommendations encourage people to cook eggs until the white and yolk are firm in order to kill any bacteria. However, anyone making cookies can do things to reduce this risk by using pasteurized egg products. When my kids and I make cookie dough, we never use regular eggs. Instead, we use shell eggs that have been pasteurized to kill any harmful bacteria without actually cooking the egg itself. (A great public health innovation, if

Document [4] (Title: How Dangerous Is It to Eat Raw Cookie Dough? | Men's Health) Can Eating Raw Cookie Dough Really Make You Sick? Scientists reveal the truth about this supposedly dangerous delicacy By Katherine Dempsey There are few things more tempting in life than eyeing a bowl of cookie dough and deciding whether or not to stick your finger in for a scoop. It's a bit like playing Russian roulette. You could get lucky and enjoy the delicious dough without conseqence, but there's always the risk of getting serously sick with a food-borne illness. That's because multiple ingredients within the dough could be contaminated by pathogens such as Salmonella and E. coli, says Soohyoun

Document [5] (Title: How Dangerous Is It to Eat Raw Cookie Dough? | Men's Health) is usually pasteurized, so it's not likely that the egg would make you sick. However, other ingredients in the dough could potentially harbor pathogens. Experts say that a prime suspect in a 2009 E. coli outbreak linked to prepackaged cookie dough was actually the flour. Nuts and chocolate have also been linked to Salmonella outbreaks. Bottom line: You're better off skipping raw cookie dough. At least you'll have killer cookies as a consolation. The article How Bad Is It To Eat Raw Cookie Dough? originally ran on Prevention.com The 9 Foods Most Likely to Make You Sick Mmm, Just Don't ...

Question: Why is it bad to eat cookie dough for risk of salmonella but things like Cookie Dough Bites are ok?
Answer: Raw cookie dough is not recommended to be eaten due to the risk of salmonella [1][2]. Eating raw flour is also a risk for food poisoning [2]. The egg in cookie dough is usually pasteurized, so it's not likely that the egg would make you sick, but other ingredients in the dough could potentially harbor pathogens [4][5]. However, prepackaged cookie dough like Cookie Dough Bites is safe to eat because the dough is made with pasteurized egg products and heat-treated flour [2][3].
1. Cookie Dough Bites are safe to eat since they do not contain eggs.
2. Salmonella is a bacteria found in eggs.
3. Eating raw cookie dough carries the risk of salmonella infection.