RARR: Researching and Revising What Language Models Say, Using Language Models

Language models (LMs) now excel at many tasks such as question answering, reasoning, and dialog. However, they sometimes generate unsupported or misleading content. A user cannot easily determine whether their outputs are trustworthy or not, because most LMs do not have any built-in mechanism for attribution to external evidence. To enable attribution while still preserving all the powerful advantages of recent generation models, we propose RARR (Retrofit Attribution using Research and Revision), a system that 1) automatically finds attribution for the output of any text generation model, and 2) post-edits the output to fix unsupported content while preserving the original output as much as possible. When applied to the output of several state-of-the-art LMs on a diverse set of generation tasks, we find that RARR significantly improves attribution while otherwise preserving the original input to a much greater degree than previously explored edit models. Furthermore, the implementation of RARR requires only a handful of training examples, a large language model, and standard web search.

Despite these incredible advances, state-of-the-art text generation models (TGMs) still frequently produce biased, misleading or unsupported content (colloquially called "hallucinations") (Maynez et al., 2020; Menick et al., 2022). To trust the output of a TGM, we would like each generation to be justified by an attribution report (Rashkin et al., 2021) that contains one or more pieces of supporting evidence from a trusted source (e.g., encyclopedia or articles) where appropriate.

[Figure 1: The input x is a text passage produced by a generation model. A Research & Revision model searches for evidence over a document corpus, and outputs an attribution report A, containing evidence snippets, along with a revised passage y whose content can be attributed to the evidence in A while preserving the intent of x as much as possible.]
Most existing TGMs, such as those based on sequence-to-sequence architectures, lack a built-in mechanism for attribution. Recent work on retrieval-augmented models (RAMs) (Guu et al., 2020; Lewis et al., 2020) first retrieve a document (or other source of information), and then condition on it to generate text. While this often improves end-task performance, it does not necessarily guarantee attribution. For example, prior work has shown that RAMs generate text that either includes additional information not present in the retrieved document (Dziri et al., 2022), ignores the document altogether (Krishna et al., 2021), or even contradicts the document (Longpre et al., 2021). In fact, occasionally ignoring or overriding retrieved information can make RAMs more robust to bad retrievals (e.g., in language modeling (Khandelwal et al., 2020), semantic parsing (Pasupat et al., 2021), or tasks requiring creativity), illustrating that end-task performance and attribution are not always aligned.
To augment existing and future generation models with attribution, we are inspired by recent successes in the fact-checking community, where simple research-and-revise workflows are increasingly effective in attributing and/or correcting unattributable claims made by people (Thorne et al., 2018; Schuster et al., 2021; Thorne and Vlachos, 2021).
Instead of constraining TGMs to generate attributed text, we propose a model-agnostic approach to improve the attribution of existing TGMs: Retrofit Attribution using Research and Revision (RARR). RARR first generates text using any TGM, then does research to retrieve relevant evidence and finally revises the text to make it consistent with the evidence found while ideally preserving qualities like style, structure or argumentation, enabling the revised output to be seamlessly used in place of the original. This can be viewed as a RAM where retrieval happens after generation, rather than before.
Although TGMs often struggle to accurately memorize "factoid" knowledge within their parameters, they have developed increasingly good "procedural" knowledge of what should be discussed and how it should be presented, which only continues to improve with recent model scaling efforts. RARR can "stand on the shoulders" of the latest generation models while targeting attribution.
In our effort to expand the scope of research-and-revise models to handle the output of arbitrary TGMs, we found it necessary to develop new evaluation metrics, benchmarks, and modeling strategies. First, we propose new metrics that evaluate editing models not just on their ability to produce well-attributed revisions, but also on their ability to otherwise preserve original properties of the text. We then use these metrics to study how editors perform on outputs from state-of-the-art language models, such as knowledge-intensive statements, reasoning chains (Geva et al., 2021) and dialog responses (Anantha et al., 2021). Finally, we find that previously published editors do not always generalize across many tasks (and were not originally intended to), and therefore propose a new editing model that leverages the power of large language model few-shot prompting to robustly generalize to new domains.

Task formulation
We propose the task of Editing for Attribution as follows. As shown in Figure 1, the input to the system is a text passage x produced by a TGM. The output is a revised text passage y along with an attribution report A = {e_1, ..., e_M}, which is a collection of evidence snippets e_i retrieved from a document corpus that support the content in y.
We propose to measure the quality of the revised text y and attribution report A along two dimensions:

• Attribution: how much of the revised text y can be attributed to the evidence in A.

• Preservation: how much the revised text y preserves aspects of the original text x.
Note that attribution depends on both 1) retrieving the right evidence for the attribution report A and 2) revising x so that it is consistent with A.

Measuring attribution
Previously, Rashkin et al. (2021) proposed Attributable to Identified Sources (AIS), an evaluation framework for human annotators which considers a binary notion of attribution. Roughly speaking, a text passage y is attributable to a set A of evidence if a generic hearer would affirm the statement "According to A, y" under the context of y. (See the original paper for the formal definition.) A system receives full credit (1.0) if all content in y can be attributed to A, and no credit (0.0) otherwise.
We propose a more fine-grained extension of AIS: we separately measure AIS for each sentence of y, and then report the average AIS score across sentences. 3 The resulting metric Attr_AIS(y, A) measures the percentage of sentences in y that are fully attributed to A. When judging each sentence, we also give annotators access to the surrounding sentences and other necessary context (e.g., a question that the original x was a response to, or in dialog, any previous utterances leading up to the currently edited utterance). See Appendix C for annotator instructions.
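The aggregation behind Attr_AIS can be sketched in a few lines; the per-sentence binary judgments stand in for annotator (or automatic) AIS decisions:

```python
# Minimal sketch of sentence-level Attr_AIS: the mean of binary
# per-sentence attribution judgments (1.0 = fully attributed to A).

def attr_ais(sentence_judgments):
    """Fraction of sentences in y fully attributed to the report A."""
    if not sentence_judgments:
        return 0.0
    return sum(sentence_judgments) / len(sentence_judgments)

# Example: 3 of 4 sentences judged fully attributable.
print(attr_ais([1.0, 1.0, 0.0, 1.0]))  # 0.75
```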
For an attribution report to be useful, it should be concise enough for a person to review. Consequently, we impose a maximum number of evidence snippets returned by the system. We found M = 5 snippets to be sufficient for most examples in our benchmarks.
During model development, we need an automatic metric that approximates human AIS judgments. For this, we use a version of the E2E NLI model from Honovich et al. (2022). 4 We refer to this model as auto-AIS. To improve its accuracy, we first decontextualize (Choi et al., 2021) each sentence in y based on the entire context of y before measuring its auto-AIS score. We found that the resulting metric correlates well with human judgments (Appendix B).

Measuring preservation
To measure preservation, we first ask annotators to decide if the revision preserves the text's original intent (completely, somewhat, or not at all; see Appendix C for annotator instructions). Similar to the AIS evaluation, we give annotators the necessary surrounding context. We define the binary metric Pres_intent(x, y) to be 1.0 if the revision completely preserves the original intent, and 0.0 otherwise.

[Footnote 3] While each sentence may contain multiple claims that could be attributed independently, there is currently no linguistic consensus on what constitutes a claim. Instead of depending on a particular definition of claims, we use sentences for simplicity and reproducibility.

[Footnote 4] We obtained a newer version of the E2E NLI model from the authors, which was trained on MNLI, SNLI, FEVER, PAWS, SciTail and VitaminC (Williams et al., 2018; Bowman et al., 2015; Thorne et al., 2018; Zhang et al., 2019; Khot et al., 2018; Schuster et al., 2021).
Even if a revision preserves intent, it may still make superfluous modifications, such as word reordering, changing textual style, or including unnecessary additional information (Thorne and Vlachos, 2021). Different tasks have different requirements for what should be preserved. Here, we desire a simple metric that can be readily computed for many tasks and that generally penalizes unnecessary change. We thus define a metric based on the Levenshtein edit distance (Levenshtein, 1965) between x and y:

Pres_Lev(x, y) = max(1 − Lev(x, y) / length(x), 0).   (1)

This metric is 1.0 if x and y are the same, and 0.0 if y completely overwrites all parts of x. Pres_Lev is generally sensitive to any kind of change, but certainly does not capture all notions of preservation that could be relevant (e.g., preserving a particular rhyme scheme or pun). We want the revision to preserve the original intent while avoiding superfluous edits. To reflect this, we finally combine the two metrics as

Pres_comb(x, y) = Pres_intent(x, y) · Pres_Lev(x, y),   (2)

which is 0.0 if the revision changes the intent and equal to Pres_Lev(x, y) otherwise. Since Pres_intent requires human annotation, we used Pres_Lev as an automated metric for model development.
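A minimal sketch of these preservation metrics, assuming a character-level Levenshtein distance normalized by the length of x (the granularity chosen here is an illustrative assumption):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def pres_lev(x: str, y: str) -> float:
    """1.0 when y equals x; 0.0 when y overwrites all of x."""
    return max(1.0 - levenshtein(x, y) / len(x), 0.0)

def pres_comb(x: str, y: str, intent_preserved: bool) -> float:
    """Pres_intent (a human judgment, modeled as a boolean) gates Pres_Lev."""
    return pres_lev(x, y) if intent_preserved else 0.0
```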

Discussion
Optimizing for attribution alone cannot ensure a good revision: for example, an adversarial editor could ensure 100% attribution by simply replacing the input x with the text of any arbitrary retrieved document, which is trivially attributable to itself. Ideally, we want to maximize both attribution and preservation, while navigating any trade-offs between the two. In our experiments, we report both metrics, as well as their harmonic mean (F1_AP, analogous to how recall and precision are combined in F1).
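The harmonic-mean combination can be sketched directly:

```python
# Minimal sketch of F1_AP: the harmonic mean of attribution and
# preservation scores (both in [0, 1]).

def f1_ap(attribution: float, preservation: float) -> float:
    if attribution + preservation == 0:
        return 0.0
    return 2 * attribution * preservation / (attribution + preservation)
```

As with precision/recall F1, the score is dragged down when either component is low, so an editor cannot score well by sacrificing one dimension entirely.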
We wish to highlight that this evaluation scheme does not require any "gold" or "reference" edits (unlike many prior evaluations of edit models), which are often only available for specialized domains such as Wikipedia. This enables us to broaden the scope of research to a much wider range of generation tasks.

Approach
We now present a method for solving the task of Editing for Attribution. Figure 2 presents an overview of our system, RARR. Given an input passage x (e.g., "Millie Inbetween premiered on 24 February 2014 on CBBC."), the research stage first generates a set of queries {q_1, ..., q_N}, each investigating one aspect of x that potentially requires attribution (e.g., "When did Millie Inbetween premiere?"). Then for each query q_i, we retrieve documents from the web and select the best evidence snippets {e_{i1}, e_{i2}, ...} (e.g., "... the first series premiered on 1 October 2014."). Next, the revision stage takes the original text x and the retrieval results {(q_1, e_{11}), ...} and produces a revised text y (e.g., "Millie Inbetween premiered on 1 October 2014 on CBBC.").
To develop many of the components for RARR, we use few-shot prompting with a large language model (LLM), also known as in-context learning (Brown et al., 2020). In particular, we use PaLM as our language model (Chowdhery et al., 2022). Figure 3 shows some few-shot examples we use (explained below), while Appendix A lists the full prompts. The development cycle for few-shot prompting is extremely fast because it involves no model training; we believe this is important for quickly adapting RARR to new and future tasks. In the rest of this section, we provide the details of each component.

[Figure 3: Example few-shot demonstrations.
(a) Comprehensive query generation, x → {q_1, ..., q_N}. You said: Your nose switches back and forth between nostrils. When you sleep, you switch about every 45 minutes. This is to prevent a buildup of mucus. It's called the nasal cycle. To verify it, a) I googled: Does your nose switch between nostrils? b) I googled: How often does your nostrils switch? c) I googled: Why does your nostril switch? d) I googled: What is nasal cycle?
(b) Agreement model, (y, q, e) → {0, 1}. You said: Your nose switches ... (same as above) ... nasal cycle. I checked: How often do your nostrils switch? I found this article: Although we don't usually notice it, during the nasal cycle one nostril becomes congested and thus contributes less to airflow, while the other becomes decongested. On average, the congestion pattern switches about every 2 hours, according to a small 2016 study published in the journal PLOS One. Your nose's switching time is about every 2 hours, not 45 minutes. This disagrees with what you said.
(c) Revision model, (y, q, e) → new y. You said: Your nose switches ... (same as above) ... nasal cycle. I checked: How often do your nostrils switch? I found this article: Although we ... (same as above) ... One. This suggests 45 minutes switch time in your statement is wrong. My fix: Your nose switches back and forth between nostrils. When you sleep, you switch about every 2 hours. This is to prevent a buildup of mucus. It's called the nasal cycle.]

Research
The research stage of RARR performs two steps: generating questions covering different aspects of the input x, and retrieving evidence for each question.
Query generation Question generation has proven to be a useful tool for many tasks, such as open domain information extraction (Lewis et al., 2021; Chen et al., 2022), summarization (Weiss et al., 2021; Narayan et al., 2022), document retrieval (Ma et al., 2021; Hui et al., 2022), fact-checking (Honovich et al., 2021) and dialog (Dai et al., 2022). In these tasks, questions represent the individual "units" of information present in text. Since this representation is based on natural language, it is both interpretable and highly flexible (Michael et al., 2017; Weiss et al., 2021).
The goal of query generation is to identify claims in x which need to be verified and attributed. A human might do this by scanning through x and asking a question about each assertion encountered. We refer to this as comprehensive question generation (CQGen), since we aim to produce a single sequence of questions covering all aspects, rather than one question targeting a single aspect (as done in most prior work on question generation). A similar strategy has been employed to train text-planning models (Narayan et al., 2022). We found that a prompt with six human demonstrations of CQGen, as shown in Figure 3a, was sufficient for PaLM to adequately learn the task. To increase diversity and coverage, we sample from our CQGen model three times and take the union of the resulting queries.
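The sample-and-union step can be sketched as follows; the three query lists stand in for three stochastic samples from the few-shot CQGen prompt, and the normalization used as a deduplication key is an assumption (the paper does not specify one):

```python
def union_queries(samples):
    """Union of sampled query lists, preserving first-seen order."""
    seen, queries = set(), []
    for sample in samples:
        for q in sample:
            key = q.strip().lower()  # assumed dedup key
            if key not in seen:
                seen.add(key)
                queries.append(q)
    return queries

samples = [
    ["When did Millie Inbetween premiere?", "What channel is it on?"],
    ["When did Millie Inbetween premiere?", "Who created the show?"],
    ["What channel is it on?"],
]
print(union_queries(samples))  # 3 unique queries, first-seen order
```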
Evidence retrieval We submit the queries generated by CQGen to a retrieval system. In this work, we use Google Search to retrieve web documents, but any retrieval system that accepts text queries could also be used. For each query q_i, we take the top K = 5 web pages returned by the search engine, and for each page, we extract its title and snippets. We construct snippets by running a sliding window of four sentences across the document, breaking at document headings. The snippets from all web pages are then pooled together, forming a set of candidate evidence snippets for the query.
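The snippet construction can be sketched as a sliding window over sentences; sentence splitting and breaking at headings are assumed to happen upstream:

```python
def make_snippets(sentences, window=4):
    """Overlapping windows of `window` sentences over one document section."""
    if not sentences:
        return []
    if len(sentences) <= window:
        return [" ".join(sentences)]
    return [" ".join(sentences[i:i + window])
            for i in range(len(sentences) - window + 1)]

sents = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(make_snippets(sents))  # two overlapping four-sentence snippets
```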
We then select the evidence snippets that are most relevant to the query. For this, we use an existing query-document relevance model trained following Ni et al. (2021), which computes a relevance score S_relevance(q, e) between a query q and an evidence snippet e. We then keep the top J = 1 evidence for each query. The final retrieval result is [(q_1, e_{11}), ..., (q_1, e_{1J}), ..., (q_N, e_{N1}), ..., (q_N, e_{NJ})], where e_{ij} denotes the j-th evidence for the i-th query, and N denotes the total number of queries from CQGen (which can be different for each input x). In theory, we could have used few-shot LLM prompting instead of the fine-tuned ranking model for relevance scoring, but this would have been expensive for inference due to the large number of snippets to be scored.
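Evidence selection then reduces to ranking the pooled snippets per query; `overlap_score` below is only a toy stand-in for the fine-tuned relevance model S_relevance:

```python
def select_evidence(query, snippets, score, j=1):
    """Keep the top-J snippets for a query under a relevance score."""
    return sorted(snippets, key=lambda e: score(query, e), reverse=True)[:j]

def overlap_score(q, e):
    # Toy relevance: shared lowercase words between query and snippet.
    return len(set(q.lower().split()) & set(e.lower().split()))

snips = ["the first series premiered on 1 October 2014",
         "full cast list for the show"]
print(select_evidence("when did the series premiere", snips, overlap_score))
```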

Revision
After retrieving relevant evidence, some parts of x may now be properly attributed, but other parts may still remain unattributed. Revision is necessary to increase the total percentage of attributed sentences (either by removing or editing assertions).
The revision stage initializes y = x and iterates through each pair (q, e) = (q_i, e_{ij}) in the retrieval result. For each pair, we perform two steps:

1. Check for disagreement: we run the agreement model to determine whether the evidence e disagrees with the passage y regarding the issue in query q.

2. Edit: if disagreement is detected, the revision model edits y to agree with e. If not, the editor does nothing.
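The loop above can be sketched with the two models as injected callables (in RARR both are few-shot PaLM prompts; the stand-ins here are purely for illustration):

```python
def revise(x, retrieval_results, disagrees, edit):
    """Iteratively revise y against each (query, evidence) pair."""
    y = x
    for q, e in retrieval_results:
        if disagrees(y, q, e):   # step 1: agreement model
            y = edit(y, q, e)    # step 2: revision model
    return y

# Toy stand-ins: "disagree" when the evidence string is absent from y,
# and "edit" by appending the evidence.
disagrees = lambda y, q, e: e not in y
edit = lambda y, q, e: y + " " + e

print(revise("Base claim.", [("q1", "Evidence.")], disagrees, edit))
```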
Agreement model The agreement model takes the partially edited passage y, a query q, and the evidence e as input. It then decides whether both y and e imply the same answer to the question in q. This form of question-guided agreement was previously explored by Honovich et al. (2021). Following our strategy for CQGen, we implement this by few-shot prompting PaLM. We employ a chain-of-thought (Wei et al., 2022) style prompt, where we ask the model to explicitly state the implied answers for both y and e before producing its judgment about their agreement (Figure 3b).
Edit model If a disagreement is detected, we run the edit model. It takes y, q and e as input, and outputs a new version of y that aims to agree with e while otherwise minimally altering y.
We again use few-shot prompting and chain-of-thought, where we ask the model to first identify a particular span in y that needs to be edited before generating a completely new version of y (Figure 3c). We found that this beneficially reduces the editor's deviation from the current y. We also found that the editor occasionally produces large edits that bring the new revision close to e but far from the current y. Since this is rarely desirable, we simply reject edits with edit distance above τ = 0.5 times the original text length.
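The rejection rule can be sketched as follows, assuming a character-level edit distance (the distance granularity is an assumption of this sketch):

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def accept_edit(original, proposed, tau=0.5):
    """Reject revisions changing more than tau * len(original) characters."""
    return levenshtein(original, proposed) <= tau * len(original)

print(accept_edit("abcdefgh", "abcdefgx"))  # True: a one-character edit
print(accept_edit("abcdefgh", "zzzzzzzz"))  # False: a wholesale rewrite
```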

Attribution report
Finally, we need to select at most M evidence snippets to form an attribution report A. Note that during evidence retrieval and revision, we may have encountered and used more than M snippets. Our goal now is to find a smaller subset of snippets that maximizes coverage over the potentially attributable points in the passage y, represented by the questions q_1, ..., q_N. We use the relevance model from Section 3.1 as a proxy for measuring how much an evidence e covers the point raised by a query q. Then, we exhaustively search for A ⊆ {e_{11}, ..., e_{NJ}} of size at most M that maximizes

Σ_{i=1}^{N} max_{e ∈ A} S_relevance(q_i, e).   (3)
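The subset search can be sketched with `itertools.combinations`; `overlap` below is a toy stand-in for the fine-tuned relevance model S_relevance:

```python
from itertools import combinations

def best_report(queries, evidence, score, m=2):
    """Exhaustively pick the evidence subset of size <= m maximizing
    the sum over queries of the best per-query relevance score."""
    best, best_val = [], float("-inf")
    for k in range(1, min(m, len(evidence)) + 1):
        for subset in combinations(evidence, k):
            val = sum(max(score(q, e) for e in subset) for q in queries)
            if val > best_val:
                best, best_val = list(subset), val
    return best

def overlap(q, e):
    # Toy relevance score: shared words.
    return len(set(q.split()) & set(e.split()))

report = best_report(["premiere date", "channel name"],
                     ["premiere date evidence", "channel name evidence",
                      "unrelated text"], overlap)
print(report)
```

Since M is small (at most 5 in our setting), the exhaustive search over subsets remains tractable.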

Related work
Fact-checking Most models for fact-checking perform the following task: given a (statement, evidence) pair as input, they output a label indicating whether the statement is supported or refuted by the evidence (Thorne et al., 2018;Wang, 2017;Karadzhov et al., 2017;Augenstein et al., 2019;Wadden et al., 2020). In many real-world situations, relevant evidence for a statement is not readily available, so some works have explored how to automatically retrieve evidence that may help support or refute a statement (Fan et al., 2020;Piktus et al., 2021), just as we do.
More recent works have also explored how to "correct" statements so that they are more factually consistent with retrieved evidence (Shah et al., 2020;Thorne and Vlachos, 2021;Schuster et al., 2021;Logan IV et al., 2022;Schick et al., 2022). Our approach and Thorne and Vlachos (2021) both implement a full research-and-revise workflow, so we compare to them in our experiments (we refer to them as EFEC, short for their paper's title, "Evidence-based Factual Error Correction"). EFEC used human-authored edits from FEVER (Thorne et al., 2018) as both their training and automatic evaluation data, as this is one of the few resources where non-noisy human edits are available in significant quantities. FEVER's claims were extracted from Wikipedia.
Our ultimate goal is to post-edit the output of any generation model across many domains, so we did not wish to be constrained by Wikipedia-specific resources for either training or evaluation. This motivated both our new evaluation setup and new modeling approach. For evaluation, our attribution and preservation metrics notably do not depend on any human-authored "gold edit", making it possible to cheaply evaluate on any domain. For modeling, we utilize few-shot prompting with large language models, which removes any dependence on Wikipedia-specific edit data and helps us generalize to other domains better.

Schick et al. (2022) also shares our goal of generalizing editing beyond Wikipedia, and developed an innovative strategy for bootstrapping Wikipedia data into synthetic data for other domains. While their approach incorporates a method for using evidence, they do not specifically study or measure attribution, and also assume access to gold evidence in their experiments. As we found in our experiments, being robust to irrelevant evidence or lack of evidence is a key challenge to address.

Most RAMs can display their retrievals as a form of attribution, but they typically cannot be used to revise an existing output. LaMDA, a language model for dialog (Thoppilan et al., 2022), is an example of a RAM that does perform revision. LaMDA's generation process consists of three steps: 1) it first generates a "base response" using a standard language model, 2) it performs a sequence of "research" steps, where it can query web search for information, 3) it generates a "final response" using any retrieved information to improve the base response. Steps 2 and 3 in LaMDA implement a research-and-revise workflow that is similar to RARR, so we compare to it in our experiments. While we use few-shot prompting with LLMs to enable editing, LaMDA learns to edit from specially collected human-annotated data.
When collecting this data, annotators were asked to rewrite any statements that required attribution, but there was not an emphasis on preserving the original text as much as possible. Consequently, we found in our experiments that LaMDA's regeneration step often "overrides" the original response with something different, rather than doing what one would consider a small "edit". This was not a problem for LaMDA's original application, but makes it less useful for our desired goal of improving the attribution of existing generation outputs while preserving as much as possible.

Models
We compare RARR to several systems that have a research-and-revise workflow.
EFEC EFEC (Thorne and Vlachos, 2021) fine-tunes a T5-based model to revise text conditioned on multiple evidence snippets using both semi-supervised and fully-supervised approaches. We compare against their fully-supervised approach, which performed best in their experiments. EFEC uses a neural retrieval model (Karpukhin et al., 2020) to retrieve evidence from Wikipedia; however, not all passages in our experiments are supported by evidence from Wikipedia. To more fairly compare the editing capabilities of EFEC, we use the evidence snippets retrieved by our research stage (CQGen and retrieval), and format the evidence to match what is expected by EFEC. Note that the EFEC editor conditions on multiple pieces of evidence at once, while our editor iteratively conditions on one at a time.
LaMDA LaMDA (Thoppilan et al., 2022) generates responses in three steps: 1) it generates a "base response", 2) it generates search queries from the base response, 3) it generates a "revised response" conditioned on the base response and retrieved evidence. To apply LaMDA on a given text x, we simply set the base response in step 1 to x, and then run steps 2 and 3 (we call these latter two stages "LaMDA Research"). LaMDA was trained as a dialog system, and always expects a dialog context where the user speaks first. So, for non-dialog tasks, we insert an artificial user utterance as dialog history: "Tell me something interesting." For the attribution report, we take all evidence documents retrieved by LaMDA during its research process (step 2).
RARR As outlined in Section 3, our model uses few-shot prompting on PaLM 540B for query generation, the agreement model, and the edit model. We use the same prompts for all tasks except when the context comes from a dialog, where we slightly modify the prompts to use the dialog context (e.g., CQGen now maps dialog context + x to queries).

Evaluation setups
RARR aspires to be a general-purpose method for improving the attribution of any text generation model in any text domain. We thus evaluate RARR on the outputs generated by several well-studied generation models across multiple tasks. In each of the following setups, we construct test examples x for RARR by taking the task input from the corresponding dataset, and prompting the generation model to produce long-form outputs which may contain "hallucinations". We generate 150 development and 150 test passages for each setup. Figure 4 shows examples of the test passages.
Factoid statements We prompt PaLM 540B to generate long-form answers to questions from the Natural Questions dataset (NQ; Kwiatkowski et al., 2019). The resulting passages are mostly coherent but often contain factual errors. This setup examines RARR's ability to attribute a diverse range of factoid knowledge.
Reasoning chains Recent works showed that language models are capable of generating reasoning chains (Wei et al., 2022) to effectively answer complex questions. We use PaLM 540B with chain-of-thought prompts to generate reasoning chains for the StrategyQA dataset (Geva et al., 2021). This setup tests whether RARR can provide better attribution for intermediate steps of reasoning, while preserving the overall reasoning process.

[Figure 5: Dashed lines indicate the highest attribution score obtained by any of the models before editing: points above the line have better attribution after revision. The contours are F1_AP level curves: points along a contour have equivalent F1_AP. Different models make very different trade-offs between attribution and preservation, with only RARR having a robust F1_AP across all tasks.]
Knowledge-intensive dialogs Finally, we test RARR on its ability to handle dialog. We consider the conversational question answering task from the QReCC dataset (Anantha et al., 2021). Given the previous dialog turns, which are rounds of questions and answers (q_1, a_1, q_2, a_2, ..., q_k), we use LaMDA to generate the next utterance to answer the question q_k in the latest dialog turn, conditioned on the gold dialog history. As mentioned earlier, LaMDA produces an initial "base response" from a standard language model, and then performs research and revision to produce a "final response". We feed LaMDA's base response as the input to RARR and other baselines, making them directly comparable to the research-and-revise stage of LaMDA (a.k.a. LaMDA Research).

Main Results

Table 1 and Figure 5 show attribution and preservation results for each model and dataset. We also report F1_AP, the harmonic mean of the two metrics; it will be low if either attribution or preservation is low, signifying that both metrics are important. The F1_AP score is shown as level curves in Figure 5. We observe that RARR significantly improves attribution while preserving most of the original text. In terms of F1_AP, RARR is the only method that performs robustly across all three datasets, and significantly outperforms prior methods on NQ and StrategyQA.
It is interesting to note that RARR is the only method that preserves the original intent of x over 90% of the time; prior systems (EFEC and LaMDA) only manage to preserve the original intent 6-40% of the time. We also see that editing is crucial to improve attribution: if we only retrieve evidence to support the original response (x) but do not edit, attribution ranges from the low 10s to mid 30s. After editing, RARR can increase attribution by up to 13.7% absolute, while changing only 10-20% of the text.
As noted in Section 2, one can sacrifice preservation for higher attribution. EFEC is able to obtain strong F1_AP on QReCC by making larger changes to the text in exchange for a higher attribution score. However, it occupies a very different point from RARR on the attribution-preservation trade-off curve, as visualized in Figure 5.
Analysis

Qualitative analysis

Human oracle To understand the headroom remaining in our task, we must ask: what is the minimal amount of editing needed to make a passage of text fully attributed? This question depends on the quality of the original generation model and the difficulty of the task. To approximate "optimal" performance, we (the authors) manually edited 30 examples in the NQ dataset until we judged them to be 100% attributable. We achieved a preservation score of 88%, which (when combined with 100% attribution) translates to 93.6 F1_AP, indicating that significant headroom remains.
Analyzing the baselines We found that EFEC frequently attempts to summarize the entire passage into one sentence, or drops later sentences, as exemplified in Figure 6. This is likely due to EFEC's training data, which was limited to single sentences. This behavior generally increases the attribution score, because it is usually easier to make one sentence fully attributable than many sentences. However, in datasets where the claim contains multiple sentences (NQ and StrategyQA), such a behavior yields low preservation scores, and also results in outputs that are less informative and "interesting". We expect that EFEC could perform much better if its training data were augmented to include multiple sentences.

[Table 1: Main evaluation results. For attribution, we report the AIS scores of the texts both before and after editing (before → after). For preservation, we report intent preservation Pres_intent, Levenshtein similarity Pres_Lev, and the combined Pres_comb. We summarize the combination of AIS and Pres_comb using their harmonic mean (F1_AP), illustrating that RARR is the most robust across all three tasks.]

[Figure 6: Example model outputs together with human judgement of their attribution and preservation scores. EFEC reduces the input passage x into a single sentence. LaMDA changes the writing style. RARR preserves the structure of the input passage. We show one evidence retrieved by RARR to help explain the example.]

LaMDA Research achieves similar attribution scores to RARR. But as mentioned in Section 5.1, the intent and linguistic style of the output tend to deviate from the input, resulting in lower preservation scores (Figure 6). We emphasize that this is not a purely apples-to-apples comparison since LaMDA was not optimized for preservation. Overall, these experiments are mainly meant to illustrate that prior models were simply not designed for the task of Editing for Attribution, rather than to mark RARR as the best method.
Next, we analyze RARR's behavior on the development sets.
Analyzing RARR's research step We found that the model was strongest at researching content involving distinct entities (e.g., a movie, a major historical event, or a person). In contrast, we found significant headroom for better attribution of statements involving generic objects and more abstract claims (e.g., "Video games require electricity."; since this is obvious to most humans, retrieved articles from the web tend to address related but different topics). We suspect that a significant amount of attribution headroom on our benchmarks would benefit from better research.
Analyzing RARR's revision steps We found that RARR was able to revise many unattributed claims, especially those involving entities and numbers (Figures 7a and 7b). It can also perform larger revisions when necessary (Figure 7c). However, the system also has several shortcomings. Some erroneous edits arise from the model being misled by evidence that seems relevant but actually discusses a different event or topic (Figure 7d). We observed another interesting challenge when revising reasoning chains from StrategyQA: the model successfully revised an incorrect claim, but did not revise subsequent reasoning steps that depended on it (Figure 7e). In this case, further editing to improve logical coherence (rather than attribution) could help.

Ablations
To evaluate the impact of different components of RARR, we explore several model variations and report the results in Table 2.

Table 2: Ablation results. We report the automatic metrics: auto-AIS, Pres Lev , and the harmonic mean between the two (F1 AP ). We show auto-AIS scores both before and after editing (before → after), with respect to the attribution report A. Even though sentences-as-queries may achieve a similar F1 AP to RARR, it is less robust to corpus shifts and tends to retrieve passages that may encourage confirmation bias (see discussion in Section 6.2).

Impact of model scale Many components in RARR work by few-shot prompting PaLM, a large 540B-parameter LM. To assess the benefit of LM scaling, we replaced PaLM 540B with the smaller 62B-parameter PaLM. We found that 540B outperforms 62B by a large margin, suggesting that RARR could potentially improve further with even more scaling. We also experimented with keeping the editor stage at 540B while shrinking the query generation stage to 62B; this yielded a relatively small performance drop, suggesting that model scaling is more important for the editor.
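The F1 AP summary used in Tables 1 and 2 can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the Levenshtein stand-in uses Python's difflib, and the way Pres comb combines the two preservation scores here is an assumption.

```python
from difflib import SequenceMatcher

def levenshtein_similarity(original: str, revised: str) -> float:
    """Stand-in for Pres_Lev: character-level similarity in [0, 1].
    (difflib's ratio is related to, but not identical to, true Levenshtein.)"""
    return SequenceMatcher(None, original, revised).ratio()

def harmonic_mean(attribution: float, preservation: float) -> float:
    """F1_AP: harmonic mean of attribution (AIS) and preservation (Pres_comb)."""
    total = attribution + preservation
    return 2 * attribution * preservation / total if total > 0 else 0.0

# Hypothetical scores for a single revised passage, for illustration only.
ais_after_edit = 0.8   # attribution score of the revised passage
pres_intent = 1.0      # binary intent-preservation judgment
pres_lev = levenshtein_similarity(
    "God Save the Queen became the anthem in 1745.",
    "God Save the Queen became the anthem in 1825.",
)
pres_comb = pres_intent * pres_lev  # assumed combination; see Section 2.2
f1_ap = harmonic_mean(ais_after_edit, pres_comb)
```

The harmonic mean rewards methods that balance attribution and preservation, which is why heavy rewrites with high attribution but low preservation score poorly on F1 AP.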
Agreement model We try removing the agreement model from the editor (Section 3.2), which effectively forces the model to revise the passage based on every retrieved evidence snippet. As expected, more revision leads to lower preservation scores and spurious changes to the text passage, as demonstrated in Figure 8.
Query generation RARR uses generated questions as search queries for evidence retrieval. We consider two natural alternatives: using the entire input passage as a single search query, or using each sentence as a search query. Using the entire input passage as the query gives poor results, as the retrieved evidence tends not to focus on the potentially unattributed parts of the passage. Using sentences as queries gives attribution numbers closer to the full CQGen, but a closer analysis reveals two caveats. First, we hypothesize that sentences-as-queries are more effective when those sentences "mimic" content on the Web, and less effective otherwise. In Table 3, we test this by excluding all of Wikipedia from web search results (since many PaLM outputs for NQ have a Wikipedia style). The attribution performance of sentences-as-queries subsequently drops significantly, while CQGen is more robust.
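For concreteness, the two baseline query strategies can be sketched as follows; the naive sentence splitter is illustrative only, and CQGen itself (which prompts an LLM to generate questions) is not shown.

```python
import re

def sentences_as_queries(passage: str) -> list[str]:
    """Baseline: issue each sentence of the passage as its own search query."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]

def passage_as_query(passage: str) -> list[str]:
    """Baseline: issue the entire passage as a single search query."""
    return [passage]

passage = ("Georgia is called the Peach State. "
           "California produces the most peaches.")
queries = sentences_as_queries(passage)
# Each query would then be sent to web search to retrieve evidence.
```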
Second, sentences-as-queries tend to retrieve passages that may encourage undesirable confirmation bias. Consider the example "Georgia is called the Peach State, but California actually produces the most peaches." Retrieval using sentences-as-queries found an article echoing that California produces the most peaches, while CQGen generated the more impartial query "Which state produces the most peaches?" and found a newer article saying that South Carolina replaced California as the top peach producer. In this case, RARR using CQGen must sacrifice more preservation score to edit the text, leading to a lower F1 AP score. This underscores that attribution alone cannot measure "correctness", since not all evidence is up-to-date or reliable.

Figure 7 (continued): (b) Correctly revising a number. y old : God Save the Queen became the British national anthem in 1745. . . . e: [www.britannica.com] The oldest national anthem is Great Britain's "God Save the Queen," which was described as a national anthem in 1825, although it had been popular as a patriotic song and used on occasions of royal ceremonial since the mid-18th century. ynew: God Save the Queen became the British national anthem in 1825. . . . (The year 1745 was when the song was first performed.) (c) Performing a necessary larger revision. y old : "It's My Party" is a song written and composed by American singer-songwriter and producer Walter Gold. . . .

Impact on downstream task performance
So far, we have measured preservation using the metric defined in Section 2.2. However, another measure of preservation is whether the revised text can still be used to perform the task that it was originally generated for. On NQ and StrategyQA, it is possible for us to quantitatively evaluate this, and we summarize the result in Figure 9. For NQ, each original text x is a long-form response to a factoid question. To determine whether the revised text y still serves this purpose, we feed the factoid question and y back into PaLM and prompt it to extract a short answer from y. Since a gold short answer is known for each factoid question, we can compute answer accuracy, as is commonly done for open-domain question answering (Chen et al., 2017). We find that RARR not only preserves short-answer accuracy but actually improves it: accuracy for the revised y is roughly 5% higher than for the original x.
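The NQ short-answer check can be sketched as follows; `extract_short_answer` is a hypothetical stand-in for prompting PaLM, and the normalization mirrors common open-domain QA evaluation practice.

```python
import string

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace, as is common
    in open-domain QA evaluation."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())

def answer_accuracy(examples, extract_short_answer) -> float:
    """examples: (question, revised_text, gold_answers) triples.
    extract_short_answer: callable that prompts the LM to read a short
    answer off the revised text (hypothetical stand-in for PaLM)."""
    correct = 0
    for question, text, golds in examples:
        prediction = normalize(extract_short_answer(question, text))
        if any(prediction == normalize(gold) for gold in golds):
            correct += 1
    return correct / len(examples)
```

The same harness works for StrategyQA by having the extractor return "yes" or "no" instead of a short span.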
For StrategyQA, each original text is a reasoning chain that helps to answer a yes/no question. To check whether the revised y still serves this purpose, we feed the StrategyQA question and y back into PaLM and prompt it to output a yes/no answer, and evaluate answer accuracy. Here, we find that increasing attribution comes at a slight cost in downstream task performance. Answer accuracy drops modestly for all models (up to 2.6%). We suspect that this may be due to noisy retrievals, which sometimes provide misleading evidence (exemplified in Figure 7d).
Furthermore, even though revisions can address factual errors in the passage (e.g., "Homer Simpson has 5 fingers" from Figure 7e), RARR currently does not try to modify subsequent reasoning steps that may no longer be logically entailed (e.g., "He only needs one hand to count to 5").

Challenging domains
Here, we report results on several more tasks where attribution was particularly hard, and significant future work is needed. We considered news article summaries produced by summarization models from SummEval (Fabbri et al., 2021) (e.g., "John Doe was left homeless when the storms hit Staten Island, New York . . . "). Results are shown in Table 4. First, we note that the before-edit auto-AIS scores for all models are low. These news article summaries are often about less widely known people and events, which is challenging for retrievers, leading to low attribution. For example, our query generator may ask "where does John Doe live" but get results for a different John Doe. EFEC and LaMDA also face this issue, but instead trade preservation for attribution and rewrite the text to a different topic. This result suggests that using web search with standard question generation methods may fail to capture important context from the input, and is not sufficient for the attribution task.
We also considered long-form explanations generated by PaLM for the ELI5 dataset (Fan et al., 2019) (Table 4). ELI5 was collected from online forums, so many answers tend to have subjective opinions instead of specific entities and facts (e.g., "How do our brains interpret scary music? To me, scary music often sounds a little bit like a person . . . "), and are thus difficult to attribute. Sometimes the whole output is based on a false premise and needs to be completely rewritten, in which case RARR cannot satisfactorily edit due to our revision threshold (Section 3.2).
Finally, we considered explanations generated by PaLM for questions from the MMLU dataset (Hendrycks et al., 2021), which covers diverse subjects from social science, humanities, STEM, and others. An example input looks like "Every time you remove an edge from a complete graph, you divide it into two connected components. So, a complete graph with 13 vertices must have 12 connected components." Results are shown in Table 2. RARR improves attribution of the explanations in all four categories of MMLU, although the increases are relatively small. We also found that RARR's performance is low on examples with mathematical reasoning, as these are beyond the capability of the edit model with our current prompt.

Discussion
In this work, we have presented Editing for Attribution, a new task for improving the attribution of existing and future generation models by retroactively researching and revising their outputs. As our results show, major headroom remains, and we consider RARR only an initial baseline for future work in this area. Editing for Attribution was primarily inspired by two emerging trends within language model research. First, progress in language modeling has been accelerating, with stunning new capabilities emerging every month. Second, state-of-the-art LMs may require months to train and major investments in compute. We therefore sought an attribution recipe that could "sit on top" of these rapid advances and major investments without needing to redesign LMs from scratch. Since the research-and-revise paradigm requires minimal change to existing LMs, it may serve as a natural baseline for future approaches that seek to integrate attribution more deeply into LM architectures. This being said, we discuss several possible limitations of our approach below.
Limitations of our task definition Depending on the application, attribution and preservation may not deserve equal weight. For instance, if there are multiple acceptable options for the output, such as in a dialog system, we might trade off preservation for attribution, similar to how LaMDA behaves in our experiments. Our evaluation metrics also do not measure all aspects of attribution and preservation. For instance, some sentences are self-evident and do not require attribution (e.g., "I agree.") but would be penalized in our evaluation. For preservation, we wish to explore other properties that should be preserved, such as discourse or logical coherence. Finally, if the original output of a generation model is completely misguided or flawed, it can be difficult to revise the text without significant changes.
Limitations of our model While we aspire to improve attribution for arbitrary text, it is clear that RARR is not yet fully general. For example, the current implementation of RARR would not be well prepared to edit poetry (where preserving rhyme matters) or long documents, primarily because we do not provide examples of such inputs in our few-shot LLM prompts. However, we believe that future developers may be able to quickly adapt RARR to such tasks by simply changing the prompts. Second, RARR tends to preserve rather than delete claims that it cannot attribute. Some of these claims genuinely do not require attribution, but others are hallucinations and should be removed. Judging whether a claim requires attribution can be subjective and challenging. Finally, our model is computationally costly, since it is based on prompting a large language model. One potential solution is to leverage recent synthetic data generation recipes to train a smaller model (Lee et al., 2021; Schick et al., 2022).
Ethical considerations RARR seeks to improve attribution for the output of any generative model. However, even if RARR can attribute content to a particular source, the user must still consider whether the source itself is trustworthy. Even for sources that are traditionally considered "authoritative" (such as an encyclopedia), there may still be factual inaccuracies or biases. This work does not address the question of whether a source is trustworthy, or the related topic of misinformation. While we do not provide a means for judging trustworthiness, the design of RARR does allow the research stage to restrict its search to a user-specified corpus, based on what the user deems trustworthy.
There is also the possibility that some content may be simultaneously supported by certain sources and contradicted by others. This can easily occur for content involving subjective or imprecise claims. The current implementation and evaluation of RARR do not explicitly address this issue: we adopted a "permissive" definition of attribution, where we consider content to be attributed if there exists any source that supports it. For some applications, users may prefer a more restrictive definition that requires both the existence of supporting sources and the absence of contradicting sources.
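The distinction between the two definitions can be made concrete with a small sketch; the verdict labels below are hypothetical stand-ins, not the output format of RARR's actual entailment checks.

```python
def permissive_attributed(verdicts: list[str]) -> bool:
    """The definition used in this work: attributed if any source supports it."""
    return "supports" in verdicts

def restrictive_attributed(verdicts: list[str]) -> bool:
    """A stricter alternative: some source supports it and none contradicts it."""
    return "supports" in verdicts and "contradicts" not in verdicts

# One claim checked against three sources (hypothetical verdict labels).
verdicts = ["supports", "neutral", "contradicts"]
permissive = permissive_attributed(verdicts)    # True
restrictive = restrictive_attributed(verdicts)  # False
```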
It is also necessary to note that linguistic assertions have varying scope: for example, there is a difference between "Frozen is a scary movie" and "I got scared watching Frozen" -while expressing a similar sentiment, the former makes a more general statement that many would disagree with, while the latter is scoped to the speaker's own experience. In some applications, one could even argue that the latter case does not require attribution, since the speaker is their own source-of-truth. This is a subtle but important issue that we plan to explore in future work. In addition to varying scope, utterances can also make assertions with varying levels of directness. For example, according to standard linguistics, "John ate some of the cookies" yields the implicature that John did not eat all of the cookies, even though it is not logically entailed. This raises the question of which implicatures or implied assertions should be detected and attributed, which should be explored in future work. For more nuances, we refer to Rashkin et al. (2021).
Beyond these subtle issues, there is also the more obvious limitation that RARR is not 100% successful in making text consistent with retrieved evidence, so its output may still lack full attribution. RARR and its evaluation are also currently implemented only for English; we plan to apply RARR to other languages in future work, which should be made easier by the promising multilingual generalization of large language models.

Acknowledgments
We wish to thank Raphael Hoffmann, Slav Petrov, Dipanjan Das, Michael Collins, Sundeep Tirumalareddy, Samer Hassan, Quoc Le and Heng-Tze Cheng for their research mentorship, feedback and support. We are grateful to Hao Zhou and Petr Pilar for helping us experiment with LaMDA and motivating our dialog experiments. We also wish to thank Tal Schuster for pointing us to relevant work in the fact checking literature, and helping us reproduce it. We thank Vitaly Nikolaev, David Reitter and Roee Aharoni for helping us use AIS and auto-AIS. We also wish to thank Jianmo Ni and Honglei Zhuang for developing the query-evidence relevance model we use, Daniel Andor for developing the sentence decontextualization model we use, and Ran Tian for the initial prototype of CQGen. Finally, we thank Kathy Meier-Hellstern, Philip Parham and Diane Korngiebel for their thoughtful feedback on ethical considerations.

Contributions
Luyu Gao: Designed RARR's few-shot prompting strategies and implemented the first PaLM-based prototype. Analyzed results, and advised on the design of human and automatic evaluation.
Zhuyun Dai: Proposed the evaluation setup of editing long-form generations from PaLM/LaMDA on various QA datasets. Hosted and mentored Luyu Gao (student researcher) in prototyping RARR. Implemented the final models, designed overall experiments, and obtained main results and ablations (together with Ice Pasupat). Contributed many parts of the writing.
Ice Pasupat: Implemented the final models, designed overall experiments, and obtained main results and ablations (together with Zhuyun Dai). Automated experimental infrastructure, conducted error analyses, and oversaw many parts of the paper writing.
Anthony Chen: Developed the automatic evaluation for attribution and preservation, and worked with Arun Chaganty to design human evaluation.
Arun Chaganty: Led and implemented all human evaluation. Proposed the two-dimensional attribution + preservation metric (together with Kelvin Guu). Advised on model design and contributed many parts of the writing.
Yicheng Fan: Worked with Kelvin Guu to develop the first prototype of RARR. Proposed multiple retrieval strategies and implemented the EFEC baseline.
Vincent Zhao: Co-hosted and mentored Luyu Gao (student researcher) in prototyping RARR. Enabled bulk inference for PaLM. Proposed the downstream task evaluation.
Ni Lao: Research mentorship, advising and contributed many parts of the writing.
Hongrae Lee: Research mentorship and advising. Helped integrate RARR with Google Search and evaluate LaMDA.
Da-Cheng Juan: Research mentorship and early design discussions.
Kelvin Guu: Proposed the original research-and-revise concept, implemented the first prototype, initiated the project and involved all collaborators. Implemented baselines (together with Yicheng Fan). Research mentorship, oversaw project coordination and paper writing.

Figure 10: Violin plot illustrating the strong correlation between human AIS and auto-AIS labels on our NQ benchmark. Pearson correlation is 74.0 (N=450). The y-axis is the auto-AIS score; the two violins correspond to a human label of 0 or 1.

A Few-shot prompting with LLMs
We implement many sub-tasks within RARR using few-shot prompting of LLMs (also known as in-context learning (Brown et al., 2020)) as follows: 1. For each sub-task, we manually author a small number of training examples (input_j, output_j) for j = 1, . . . , J, where J ranges between 5 and 10 and where both the input and output are strings.
2. We form the following prompt: input_1 ␤ output_1 ⊕ input_2 ␤ output_2 ⊕ · · · ⊕ input_J ␤ output_J, where ␤ denotes a newline character and ⊕ denotes a double newline character.
3. To perform inference on a new input, we condition the LLM on the prompt and sample continuations of the prompt up until the next double newline character. All of our prompts are included in Figures 13, 14, and 15. The contextual versions used for QReCC are in Figures 16, 17, and 18.
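The recipe above can be sketched as follows; `llm_sample` is a hypothetical stand-in for the PaLM sampling call, and appending the new input as the final prompt block is our reading of the procedure.

```python
def build_prompt(examples, new_input: str) -> str:
    """Write each (input, output) pair as input + newline + output, join the
    pairs with double newlines, then append the new input as the last block."""
    blocks = [f"{inp}\n{out}" for inp, out in examples]
    blocks.append(new_input)
    return "\n\n".join(blocks)

def run_subtask(llm_sample, examples, new_input: str) -> str:
    """Condition the LLM on the prompt and keep the continuation up to the
    next double newline (llm_sample is a hypothetical sampling call)."""
    continuation = llm_sample(build_prompt(examples, new_input))
    return continuation.split("\n\n")[0].strip()
```

Stopping at the double newline relies on the same ⊕ separator that delimits the few-shot examples, so the model's answer to the new input ends exactly where a new example would begin.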

B Comparing human and automated evaluation
We visualize the correlation between human and automated evaluation scores on NQ and StrategyQA in Figure 10. We found that, for attribution, AIS scores from humans correlate well with auto-AIS scores, with some bias for non-attributed sentences to be judged as attributed by auto-AIS.
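The reported agreement can be computed as follows; the score lists below are short hypothetical stand-ins for the N=450 human labels and auto-AIS scores.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

human_ais = [0, 0, 1, 1, 1]            # binary human judgments (hypothetical)
auto_ais = [0.1, 0.4, 0.7, 0.9, 0.8]   # auto-AIS scores (hypothetical)
r = pearson(human_ais, auto_ais)       # values near 1 indicate strong agreement
```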

C Evaluating attribution and preservation
The end goal of RARR is to improve the attribution of generation models through post-editing while preserving the original intent. Attribution and preservation are both subjective properties that may change with even small edits. In the main paper, we present two automatic metrics to conveniently gauge these properties, but rely on human evaluation as the gold standard. In this section, we describe how we conducted the human evaluation and what instructions and examples annotators were provided.
Rater recruitment and training We engaged a vendor supplying full-time crowd workers to recruit human annotators for our task. Annotators were asked to review the instructions below and were given direct feedback on their responses during the pilot annotation runs. We had 3 annotators rate each example in the pilot phase to measure inter-annotator agreement, and a single rater annotate each example afterwards.

C.1 Instructions: Overview
In this task you will evaluate the quality of text generated by a system (the "passage") based on how well it represents information from multiple pieces of "evidence".
We will be using two categories to evaluate the quality of the passage: Attribution and Intent Similarity. You will evaluate these categories in succession. In some tasks, you will only evaluate Attribution. The task interface will guide you through the flow; you can also see the overall task flow in the diagram below.
Note: The passage may appear very fluent and well-formed, but still contain slight inaccuracies that are not easy to discern at first glance. Pay close attention to the text. Read it carefully as you would when proofreading.

C.2 Instructions: Attribution
Figure 11: Screenshot of the interface used to annotate attribution at the sentence level. Annotators were asked to mark sentences as fully attributable or not fully attributable by clicking each sentence, and to rate each piece of evidence as useful or not in helping determine the attribution of the passage. Annotators were also presented with the context of the generation.
In this step, you will evaluate how much of the passage is attributable to one or more pieces of evidence ( Figure 11).
In the interface, the passage of text and the context in which it was generated is shown on the left, and each piece of evidence is shown on the right. You will use all three (context, passage, evidence) to answer the following question for each sentence in the passage: Is all of the information provided by this sentence fully supported by at least one piece of evidence?
Determining the information provided by the sentence. Three points are key when determining the information provided by the sentence: 1. The context and the other sentences of the passage are often critical in understanding the information provided by the sentence.
2. The context should only be used to understand the information provided by the sentence.
3. The evidence should be completely ignored for this step.
Figure 12: Screenshot of the preservation interface. Annotators are asked to read and compare two passages and rate how similar the intents conveyed by the two passages are.

Consider the following example:
Context: who plays doug williams in days of our lives
Passage: In the American daytime drama Days of Our Lives, Doug Williams and Julie Williams are portrayed by Bill Hayes and Susan Seaforth Hayes.
In the above example, the meaning of the passage is clear even without seeing the query. But consider another example:

Context: who plays doug williams in days of our lives
Passage: he is played by Bill Hayes
Passage (interpreted): Doug Williams is played by Bill Hayes in days of our lives

In this case the pronoun "he" depends on the context, but it is clear that the intended meaning of the passage can be reasonably interpreted as "Doug Williams is played by Bill Hayes in days of our lives". This interpretation is the "information provided by the passage".
Pronouns such as he/she/it/they are one case where context is needed to figure out the intended meaning of the system response. The context should only be used to determine the information provided by the passage; at times, the passage may be about a slightly different topic than the context, for example:

Context: the south west wind blows across nigeria between
Passage: The Harmattan is a dry and dusty northeasterly trade wind that blows across West Africa from December to March. It is very dusty because it blows across the Sahara.
Here, the passage talks about a northeasterly wind, while the context asks about a south-west wind, but the passage can be fully understood.
In general, use your best judgment to determine the information provided by the passage. If the passage is hard to understand and you are unsure what its intended meaning is, mark the sentences as not attributed and enter a comment with an explanation. As one example, take the following:

Context: how many NBA championships did Michael Jordan win?
Passage: it is the best team in the NBA

Determining if the information accurately represents the evidence. Two points are key when determining whether the information accurately represents the evidence: 1. When interpreting a piece of evidence, use only the title and text of that specific evidence; completely ignore the context, the passage, and all other evidence. 2. Check all the information in a sentence; if only some of the information is supported by the evidence, mark the sentence as not fully attributable.
In one example, while the evidence also mentions the "South Carolina Gamecocks", it isn't clear that the national championship being mentioned is indeed the 2017 NCAA Women's Division I Basketball Tournament. The passage should be marked as not attributable.
Finally, when the passage contains multiple sentences, evaluate whether each sentence can be fully attributed to one or more pieces of evidence-it is possible for one sentence to be attributed while another is not. For example: The first two sentences cannot be attributed to either evidence for the same reason as the previous example, but the last sentence is fully supported by Evidence 2 and should be marked as attributed.
In general, you should use your best judgment in determining whether all of the information provided by the passage is "an accurate representation of information in at least one evidence". See Table 6 for additional examples.
We give the following final notes of guidance:  • Marking evidence as useful. When reviewing each piece of evidence, mark it as useful if it helps you judge the attributability of any sentence, and mark it not useful if not. In the above example Evidence 1 is not useful because it didn't contain enough context to actually help you assess if the passage was attributable, but Evidence 2 was useful.
• Contradicting evidence. Mark a sentence as being attributed if any piece of evidence supports it: if two pieces of evidence contradict each other, but one of them supports the passage, mark the sentence as fully attributable.
• More on the concept of "accurate representation". We take as inspiration the journalist's conception of "accurate representation". For example, take this excerpt on Accuracy in the NPR Ethics Handbook: "When quoting or paraphrasing anyone . . . consider whether the source would agree with the interpretation..." In other words, if you had written the source document, consider whether you would view the system response as an accurate representation of information in that source document.

C.3 Instructions: Intent Similarity
In this step, you will evaluate how similar the passage is to another passage (Figure 12). In the interface, passage A and passage B are both text generated by a system given the same context. You will use all three (context, passage A, passage B) to answer the following question: How similar is the intent expressed by Passage A and Passage B? Please ignore any differences in details.
Two points are key when determining whether the two passages convey the same intent:

(6) You said: Social work is a profession that is based in the philosophical tradition of humanism. It is an intellectual discipline that has its roots in the 1800s.
To verify it,
a) I googled: What philosophical tradition is social work based on?
b) I googled: What year does social work has its root in?

(7) You said: {text}
To verify it,
_____

Figure 13: Few-shot prompt for query generation (excerpt). To increase diversity and coverage, we sample the model three times and combine the resulting lists of queries.