Active Retrieval Augmented Generation

Despite the remarkable ability of large language models (LMs) to comprehend and generate language, they have a tendency to hallucinate and create factually inaccurate output. Augmenting LMs by retrieving information from external knowledge resources is one promising solution. Most existing retrieval augmented LMs employ a retrieve-and-generate setup that only retrieves information once based on the input. This is limiting, however, in more general scenarios involving generation of long texts, where continually gathering information throughout generation is essential. In this work, we provide a generalized view of active retrieval augmented generation, methods that actively decide when and what to retrieve across the course of the generation. We propose Forward-Looking Active REtrieval augmented generation (FLARE), a generic method which iteratively uses a prediction of the upcoming sentence to anticipate future content, which is then utilized as a query to retrieve relevant documents to regenerate the sentence if it contains low-confidence tokens. We test FLARE along with baselines comprehensively over 4 long-form knowledge-intensive generation tasks/datasets. FLARE achieves superior or competitive performance on all tasks, demonstrating the effectiveness of our method. Code and datasets are available at https://github.com/jzbjyb/FLARE.

Retrieval augmented LMs commonly use a retrieve-and-generate setup where they retrieve documents based on the user's input, and then generate a complete answer conditioning on the retrieved documents (Chen et al., 2017; Guu et al., 2020; Lewis et al., 2020; Izacard and Grave, 2021; Sachan et al., 2021; Lee et al., 2021; Jiang et al., 2022; Izacard et al., 2022; Nakano et al., 2021; Qian et al., 2023; Lazaridou et al., 2022; Shi et al., 2023). These single-time retrieval augmented LMs outperform purely parametric LMs, particularly for short-form knowledge-intensive generation tasks such as factoid question answering (QA) (Kwiatkowski et al., 2019; Joshi et al., 2017), where the information needs are clear in the user's input, and it is sufficient to retrieve relevant knowledge once solely based on the input.
Increasingly powerful large LMs have also demonstrated abilities in more complex tasks that involve generating long-form output, such as long-form QA (Fan et al., 2019; Stelmakh et al., 2022), open-domain summarization (Cohen et al., 2021; Hayashi et al., 2021; Giorgi et al., 2022), and chain-of-thought (CoT) reasoning (Wei et al., 2022; Ho et al., 2020; Geva et al., 2021; Hendrycks et al., 2020). In contrast to short-form generation, long-form generation presents complex information needs that are not always evident from the input alone. Similar to how humans gradually gather information as we create content such as papers, essays, or books, long-form generation with LMs requires gathering multiple pieces of knowledge throughout the generation process. For example, to generate a summary about a particular topic, the initial retrieval based on the topic name (e.g., Joe Biden) may not cover all aspects and details. It is crucial to retrieve extra information as needed during generation, such as when generating a certain aspect (e.g., Joe Biden's education history) or a specific detail (e.g., the date of Joe Biden's presidential campaign announcement).
Several attempts have been made to retrieve multiple times throughout generation. These include methods that passively use the past context to retrieve additional information at a fixed interval (Khandelwal et al., 2020; Borgeaud et al., 2022; Ram et al., 2023; Trivedi et al., 2022), which might not accurately reflect what LMs intend to generate in the future or might retrieve at inappropriate points. Some works in multihop QA decompose the full question into sub-questions, each of which is used to retrieve extra information (Press et al., 2022; Yao et al., 2022; Khot et al., 2022; Khattab et al., 2022).
We ask the following question: can we create a simple and generic retrieval augmented LM that actively decides when and what to retrieve throughout the generation process, and is applicable to a variety of long-form generation tasks? We provide a generalized view of active retrieval augmented generation. Our hypothesis regarding when to retrieve is that LMs should retrieve information only when they lack the required knowledge, to avoid the unnecessary or inappropriate retrieval that occurs in passive retrieval augmented LMs (Khandelwal et al., 2020; Borgeaud et al., 2022; Ram et al., 2023; Trivedi et al., 2022). Given the observation that large LMs tend to be well calibrated and low probability/confidence often indicates a lack of knowledge (Kadavath et al., 2022), we adopt an active retrieval strategy that only retrieves when LMs generate low-probability tokens. When deciding what to retrieve, it is important to consider what LMs intend to generate in the future, as the goal of active retrieval is to benefit future generation. Therefore, we propose anticipating the future by generating a temporary next sentence, using it as a query to retrieve relevant documents, and then regenerating the next sentence conditioning on the retrieved documents. Combining the two aspects, we propose Forward-Looking Active REtrieval augmented generation (FLARE), as illustrated in Figure 1. FLARE iteratively generates a temporary next sentence, uses it as the query to retrieve relevant documents if it contains low-probability tokens, and regenerates the next sentence, repeating until the end of the output is reached.
FLARE is applicable to any existing LM at inference time without additional training. Considering the impressive performance achieved by GPT-3.5 (Ouyang et al., 2022) on a variety of tasks, we examine the effectiveness of our methods on text-davinci-003. We evaluate FLARE on 4 diverse tasks/datasets involving generating long outputs, including multihop QA (2WikiMultihopQA), commonsense reasoning (StrategyQA), long-form QA (ASQA), and open-domain summarization (WikiAsp) (Ho et al., 2020; Geva et al., 2021; Stelmakh et al., 2022; Hayashi et al., 2021). Over all tasks, FLARE achieves superior or competitive performance compared to single-time and multi-time retrieval baselines, demonstrating the effectiveness and generalizability of our method.

Retrieval Augmented Generation
We formally define single-time retrieval augmented generation and propose the framework of active retrieval augmented generation.

Notations and Definitions
Given a user input $x$ and a document corpus $D = \{d_i\}_{i=1}^{|D|}$ (such as all Wikipedia articles), the goal of retrieval augmented LMs is to generate the answer $y = [s_1, s_2, \ldots, s_m] = [w_1, w_2, \ldots, w_n]$ containing $m$ sentences or $n$ tokens, leveraging information retrieved from the corpus.
In retrieval augmented LMs, the LM is typically paired with a retriever that can return a list of documents $D_q = \mathrm{ret}(q)$ for a query $q$; the LM conditions on both the user input $x$ and the retrieved documents $D_q$ to generate the answer. Since we focus on examining various methods of determining when and what to retrieve, we follow existing methods (Ram et al., 2023; Trivedi et al., 2022) and prepend the retrieved documents before the user input to aid future generation, for both baselines and our method, to ensure fair comparisons: $y = \mathrm{LM}([D_q, x])$, where $[\cdot, \cdot]$ denotes concatenation in the specified order.

Single-time Retrieval Augmented Generation
The most common choice is to directly use the user input as the query for retrieval and generate the complete answer at once: $y = \mathrm{LM}([D_x, x])$.
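As a concrete (hypothetical) illustration, the single-time setup can be sketched in a few lines of Python, assuming a generic `retrieve` function, a `format_docs` helper that linearizes documents (in the spirit of Prompt D.1), and an `lm_generate` wrapper around the LM; all three names are illustrative placeholders rather than part of our released code:

```python
def single_time_rag(x, retrieve, format_docs, lm_generate, k=5):
    """Single-time retrieval: query once with the user input x, then
    generate the complete answer y = LM([D_x, x]) in one pass."""
    docs = retrieve(query=x, k=k)           # D_x: documents retrieved with the input as query
    prompt = format_docs(docs) + "\n" + x   # prepend retrieved documents before the input
    return lm_generate(prompt)              # y
```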

Active Retrieval Augmented Generation
To aid long-form generation with retrieval, we propose active retrieval augmented generation. It is a generic framework that actively decides when and what to retrieve throughout the generation process, resulting in the interleaving of retrieval and generation. Formally, at step $t$ ($t \geq 1$), the retrieval query $q_t$ is formulated based on both the user input $x$ and the previously generated output $y_{<t} = [y_0, \ldots, y_{t-1}]$:
$$q_t = \mathrm{qry}(x, y_{<t}),$$
where $\mathrm{qry}(\cdot)$ is the query formulation function. At the beginning ($t = 1$), the previous generation is empty ($y_{<1} = \emptyset$), and the user input is used as the initial query ($q_1 = x$). Given the retrieved documents $D_{q_t}$, LMs continually generate the answer until the next retrieval is triggered or the end is reached:
$$y_t = \mathrm{LM}([D_{q_t}, x, y_{<t}]),$$
where $y_t$ represents the generated tokens at the current step $t$, and the input to LMs is the concatenation of the retrieved documents $D_{q_t}$, the user input $x$, and the previous generation $y_{<t}$. We discard previously retrieved documents $\cup_{t' < t} D_{q_{t'}}$ and only use the documents retrieved at the current step to condition the next generation, to prevent reaching the input length limit of LMs.
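A minimal sketch of this framework is shown below, assuming hypothetical `retrieve`, `qry`, and `generate_until_trigger` helpers (the last returns the text produced at the current step together with a flag indicating whether generation has finished):

```python
def active_rag(x, retrieve, qry, generate_until_trigger):
    """Active retrieval augmented generation: interleave retrieval and generation.
    At each step t, retrieve D_{q_t}, generate y_t = LM([D_{q_t}, x, y_<t]) until the
    next retrieval is triggered, and form the next query from (x, y_<t)."""
    y = ""          # y_<t: the answer generated so far
    query = x       # q_1 = x: the user input serves as the initial query
    while True:
        docs = retrieve(query)                        # D_{q_t}; earlier documents are discarded
        chunk, finished = generate_until_trigger(docs, x, y)
        y += chunk                                    # append y_t
        if finished:
            return y
        query = qry(x, y)                             # q_{t+1} = qry(x, y_<t)
```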

FLARE: Forward-Looking Active REtrieval Augmented Generation
Our intuition is that (1) LMs should only retrieve information when they do not have the necessary knowledge, to avoid unnecessary or inappropriate retrieval, and (2) the retrieval queries should reflect the intents of future generations. We propose two forward-looking active retrieval augmented generation (FLARE) methods to implement the active retrieval augmented generation framework. The first method prompts the LM to generate retrieval queries when necessary while generating the answer using retrieval-encouraging instructions, denoted as FLARE instruct. The second method directly uses the LM's generation as search queries, denoted as FLARE direct, which iteratively generates the next sentence to gain insight into the future topic and, if uncertain tokens are present, retrieves relevant documents to regenerate the next sentence.

FLARE with Retrieval Instructions
Inspired by Toolformer (Schick et al., 2023), a straightforward way of expressing information needs for retrieval is to generate "[Search(query)]" when additional information is needed, e.g., "The colors on the flag of Ghana have the following meanings. Red is for [Search(Ghana flag red meaning)] the blood of martyrs, ..." When working with GPT-3.5 models that offer only API access, we elicit such behavior by few-shot prompting (Brown et al., 2020).
Specifically, for a downstream task, we place the search-related instruction and exemplars at the beginning as skill 1, followed by the instruction and exemplars of the downstream task as skill 2. Given a test case, we ask LMs to combine skills 1 and 2 to generate search queries while performing the task. The structure of the prompt is shown in Prompt 3.1, and full details can be found in Prompt D.3.

Prompt 3.1: retrieval instructions
Skill 1. An instruction to guide LMs to generate search queries. Several search-related exemplars.
Skill 2. An instruction to guide LMs to perform a specific downstream task (e.g., multihop QA). Several task-related exemplars.
An instruction to guide LMs to combine skills 1 and 2 for the test case. The input of the test case.
As shown in Figure 2, when the LM generates "[Search(query)]" (shown in gray italic), we stop the generation and use the query terms to retrieve relevant documents, which are prepended before the user input to aid future generation, until the next search query is generated or the end is reached. Additional implementation details are included in Appendix A.
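A rough sketch of how the "[Search(query)]" markers could be handled is given below, assuming hypothetical `lm_generate` and `retrieve` wrappers; the regular expression simply mirrors the marker format described above:

```python
import re

SEARCH = re.compile(r"\[Search\((.+?)\)\]")

def flare_instruct_step(prompt, lm_generate, retrieve):
    """Generate until a [Search(query)] marker appears; if one does, extract the query,
    retrieve documents, and strip the marker so that generation can resume with the
    retrieved documents prepended to the prompt."""
    text = lm_generate(prompt)
    match = SEARCH.search(text)
    if match is None:
        return text, None                 # no search query: keep the generated text as-is
    query = match.group(1)
    docs = retrieve(query)
    kept = text[:match.start()]           # discard the marker and anything generated after it
    return kept, docs                     # the caller prepends docs and continues generating
```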

Direct FLARE
Since we cannot fine-tune black-box LMs, we found that the queries generated by FLARE instruct through retrieval instructions might not be reliable. Therefore, we propose a more direct way of forward-looking active retrieval that uses the next sentence to decide when and what to retrieve.

Confidence-based Active Retrieval
As shown in Figure 1, at step $t$, we first generate a temporary next sentence $\hat{s}_t = \mathrm{LM}([x, y_{<t}])$ without conditioning on retrieved documents. Then we decide whether to trigger retrieval and how to formulate queries based on $\hat{s}_t$. If the LM is confident about $\hat{s}_t$, we accept it without retrieving additional information; if not, we use $\hat{s}_t$ to formulate search queries $q_t$, retrieve relevant documents, and then regenerate the next sentence $s_t$. We use sentences as the basis of our iteration because they are semantic units that are neither too short (like phrases) nor too long (like paragraphs); however, our approach can also use phrases or paragraphs as the basis.
Since LMs tend to be well calibrated, in the sense that low probability/confidence often indicates a lack of knowledge (Jiang et al., 2021; Kadavath et al., 2022; Varshney et al., 2022), we actively trigger retrieval if any token of $\hat{s}_t$ has a probability lower than a threshold $\theta \in [0, 1]$; $\theta = 0$ means retrieval is never triggered, while $\theta = 1$ triggers retrieval for every sentence:
$$y_t = \begin{cases} \hat{s}_t & \text{if all tokens of } \hat{s}_t \text{ have probabilities} \geq \theta, \\ s_t = \mathrm{LM}([D_{q_t}, x, y_{<t}]) & \text{otherwise,} \end{cases}$$
where the query $q_t$ is formulated based on $\hat{s}_t$.
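A sketch of the triggering rule is shown below, assuming token-level log-probabilities are available for the temporary sentence (e.g., via the API's logprobs field); `form_query` and the LM wrappers are hypothetical placeholders for the query formulation and regeneration steps described in the next subsection:

```python
import math

def needs_retrieval(token_logprobs, theta=0.2):
    """Trigger retrieval if ANY token of the temporary sentence has probability < theta."""
    return any(math.exp(lp) < theta for lp in token_logprobs)

def flare_direct_step(x, y_prev, lm_with_logprobs, retrieve, regenerate, form_query, theta=0.2):
    """One FLARE_direct step: propose s_hat_t without retrieval, then either accept it
    or retrieve-and-regenerate it, depending on token confidence."""
    s_hat, logprobs = lm_with_logprobs(x, y_prev)    # \hat{s}_t = LM([x, y_<t])
    if not needs_retrieval(logprobs, theta):
        return s_hat                                 # confident: accept the temporary sentence
    docs = retrieve(form_query(s_hat, logprobs))     # q_t formulated from \hat{s}_t
    return regenerate(docs, x, y_prev)               # s_t = LM([D_{q_t}, x, y_<t])
```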

Confidence-based Query Formulation
One way to perform retrieval is to directly use the next sentence $\hat{s}_t$ as the query $q_t$. This shares a similar spirit with methods that use generated hypothetical titles or paragraphs from LMs as retrieval queries or evidences (Gao et al., 2022; Sun et al., 2022; Yu et al., 2022; Mao et al., 2021). We generalize such techniques to long-form generation where active information access is essential. We found that retrieving with the next sentence achieves significantly better results than retrieving with the previous context, as shown later in subsection 6.2. However, it carries the risk of perpetuating errors contained in the sentence. For example, if the LM produces the sentence "Joe Biden attended the University of Pennsylvania" instead of the correct fact that he attended the University of Delaware, using this erroneous sentence as a query might retrieve misleading information. We propose two simple methods to overcome this issue, as illustrated in Figure 3.
Masked sentences as implicit queries. The first method masks out low-confidence tokens in $\hat{s}_t$ with probabilities below a threshold $\beta \in [0, 1]$, where a higher $\beta$ results in more aggressive masking. This removes potential distractions from the sentence and improves retrieval accuracy.

Generated questions as explicit queries. Another method is to generate explicit questions that target the low-confidence span in $\hat{s}_t$. For example, if the LM is uncertain about "the University of Pennsylvania", a question like "Which university did Joe Biden attend?" can help retrieve relevant information. Self-ask (Press et al., 2022) achieved this by manually inserting follow-up questions into downstream task exemplars as shown later in Prompt D.2, which requires task-specific annotation efforts. Instead, we develop a universal approach that generates questions for low-confidence spans without additional annotation. Specifically, we first extract all spans from $\hat{s}_t$ with probabilities below $\beta$. For each extracted span $z$, we prompt gpt-3.5-turbo to generate a question $q_{t,z}$ that can be answered with the span:

Prompt 3.2: zero-shot question generation
User input x. Generated output so far $y_{\leq t}$.
Given the above passage, ask a question to which the answer is the term/entity/phrase "z".
We retrieve using each generated question and interleave the returned documents into a single ranked list to aid future generation. In summary, queries $q_t$ are formulated based on $\hat{s}_t$ as follows:
$$q_t = \begin{cases} \emptyset & \text{if all tokens of } \hat{s}_t \text{ have probabilities} \geq \theta, \\ \mathrm{mask}(\hat{s}_t) \text{ or } \mathrm{qgen}(\hat{s}_t) & \text{otherwise.} \end{cases}$$
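The two query formulation strategies can be sketched as below, assuming the temporary sentence is available as a list of tokens with log-probabilities; `ask_lm` is a hypothetical wrapper around a chat model that applies the zero-shot prompt above:

```python
import math

def masked_query(tokens, logprobs, beta=0.4):
    """Implicit query mask(s_hat_t): drop tokens whose probability falls below beta."""
    return " ".join(t for t, lp in zip(tokens, logprobs) if math.exp(lp) >= beta)

def low_confidence_spans(tokens, logprobs, beta=0.4):
    """Group consecutive low-probability tokens into spans z."""
    spans, current = [], []
    for t, lp in zip(tokens, logprobs):
        if math.exp(lp) < beta:
            current.append(t)
        else:
            if current:
                spans.append(" ".join(current))
                current = []
    if current:
        spans.append(" ".join(current))
    return spans

def question_queries(x, y_so_far, tokens, logprobs, ask_lm, beta=0.4):
    """Explicit queries qgen(s_hat_t): one generated question per low-confidence span."""
    template = ("{x}\n{y}\n"
                "Given the above passage, ask a question to which the answer is "
                "the term/entity/phrase \"{z}\".")
    return [ask_lm(template.format(x=x, y=y_so_far, z=z))
            for z in low_confidence_spans(tokens, logprobs, beta)]
```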

Implementation Details
Base LM. We validate our method on one of the most advanced GPT-3.5 LMs, text-davinci-003, by iteratively querying its API.

Document corpus and retrievers. Since we focus on the integration of retrieval and generation, we use off-the-shelf retrievers that take queries as inputs and return a list of relevant documents. For datasets that mainly rely on knowledge from Wikipedia, we use the Wikipedia dump from Karpukhin et al. (2020) and employ BM25 (Robertson and Zaragoza, 2009) as the retriever. For datasets that rely on knowledge from the open web, we use the Bing search engine as our retriever.

Retrieved document formatting. Multiple retrieved documents are linearized according to their ranking and then added to the beginning of the user input using Prompt D.1.
Other implementation details, such as sentence tokenization and efficiency, are included in Appendix A.

Multi-time Retrieval Baselines
Existing passive multi-time retrieval augmented LMs can also be formulated using our framework (subsection 2.3). In this section, we formally introduce three baseline categories based on when and what to retrieve. These baselines are not exact reproductions of the corresponding papers, because many design choices differ, which makes direct comparison impossible. We implemented them using the same settings, with the only variation being when and what to retrieve.
Previous-window approaches trigger retrieval every $l$ tokens, where $l$ represents the window size. The query is $q_t = y_{t-1}$, the $l$ tokens generated in the previous window. Some existing methods in this category are RETRO (Borgeaud et al., 2022) and IC-RALM (Ram et al., 2023), which retrieve every few tokens, and KNN-LM (Khandelwal et al., 2020), which retrieves every token. We follow Ram et al. (2023) and use a window size of $l = 16$.

Previous-sentence approaches trigger retrieval every sentence and use the previous sentence as the query: $q_t = s_{t-1}$. IRCoT (Trivedi et al., 2022) belongs to this category.

Question decomposition approaches use manually annotated task-specific exemplars to guide LMs to generate decomposed sub-questions while producing outputs. For example, self-ask (Press et al., 2022), a method in this category, manually inserts sub-questions into exemplars using Prompt D.2. For the test case, retrieval is triggered dynamically whenever the model generates a sub-question.
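For concreteness, the query formulation of the first two baseline categories can be sketched as follows (a sketch only; the window size matches the $l = 16$ setting mentioned above):

```python
def previous_window_query(generated_tokens, l=16):
    """Previous-window baseline: every l tokens, use the last generated window as the query."""
    return " ".join(generated_tokens[-l:])

def previous_sentence_query(generated_sentences):
    """Previous-sentence baseline: every sentence, use the last generated sentence as the query."""
    return generated_sentences[-1]

# Question decomposition baselines (e.g., self-ask) instead rely on exemplars that teach the
# LM to emit follow-up sub-questions; retrieval is triggered whenever such a sub-question appears.
```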
The aforementioned approaches can retrieve additional information while generating. However, they have notable drawbacks: (1) Using previously generated tokens as queries might not reflect what LMs intend to generate in the future. (2) Retrieving information at a fixed interval can be inefficient because it might occur at inappropriate points. (3) Question decomposition approaches require task-specific prompt engineering, which restricts their generalizability to new tasks.

Experimental Setup
We evaluate the effectiveness of FLARE on 4 diverse knowledge-intensive tasks using few-shot in-context learning (Radford et al., 2019; Brown et al., 2020; Liu et al., 2023). We follow previous works (Trivedi et al., 2022) and sub-sample at most 500 examples from each dataset due to the cost of running experiments. Datasets, metrics, and settings are summarized in Table 7 of Appendix B. The hyperparameters of FLARE are selected based on the development set and listed in Table 9. FLARE refers to FLARE direct if not specifically stated.

Multihop QA The goal of multihop QA is to answer complex questions through information retrieval and reasoning. We use 2WikiMultihopQA (Ho et al., 2020), which contains 2-hop complex questions sourced from Wikipedia articles that require composition, comparison, or inference, e.g., "Why did the founder of Versus die?" We follow Wang et al. (2022) to generate both the chain-of-thought and the final answer. Experimental setting details are included in Appendix B.
We use regular expressions to extract the final answer from the output and compare it with the reference answer using exact match (EM) and token-level F1, precision, and recall.
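For reference, exact match and token-level F1 can be computed as in the sketch below; the normalization (lowercasing, stripping punctuation and articles) follows common QA evaluation practice and is a sketch rather than the exact evaluation script:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```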
Commonsense reasoning Commonsense reasoning requires world and commonsense knowledge to generate answers. We use StrategyQA (Geva et al., 2021), a collection of crowdsourced yes/no questions, e.g., "Would a pear sink in water?" We follow Wei et al. (2022) to generate both the chain-of-thought and the final yes/no answer. Details are included in Appendix B.
We extract the final answer and match it against the gold answer using exact match.
Long-form QA Long-form QA aims to generate comprehensive answers to questions seeking complex information (Fan et al., 2019; Stelmakh et al., 2022). We use ASQA (Stelmakh et al., 2022) as our testbed, where inputs are ambiguous questions with multiple interpretations, and outputs should cover all of them. For example, "Where do the Philadelphia Eagles play their home games?" could be asking about the city, sports complex, or stadium. We found that in many cases it is challenging even for humans to identify which aspect of the question is ambiguous. Therefore, we created another setting (ASQA-hint) where we provide a brief hint to guide LMs to stay on track when generating answers. The hint for the above case is "This question is ambiguous in terms of which specific location or venue is being referred to." Experimental setting details are included in Appendix B.

Open-domain summarization
The goal of open-domain summarization is to generate a comprehensive summary about a topic by gathering information from the open web (Giorgi et al., 2022). We use WikiAsp (Hayashi et al., 2021) as our testbed.

Experimental Results
We first report overall results across 4 tasks/datasets and compare the performance of FLARE with all the baselines introduced in section 4. We then run ablation experiments to study the efficacy of various design choices of our method.

Comparison with Baselines
Overall results. The overall performance of FLARE and the baselines across all tasks/datasets is reported in Figure 4. FLARE outperforms all baselines on all tasks/datasets, indicating that FLARE is a generic method that can effectively retrieve additional information throughout generation.
Among the various tasks, multihop QA shows the most significant improvement. This is largely due to the task's clear definition and specific objective of producing the final answer through a 2-hop reasoning process, which makes it easier for LMs to generate on-topic output. In contrast, ASQA and WikiAsp are more open-ended, which increases the difficulty of both generation and evaluation. The improvement on ASQA-hint is larger than that on ASQA because identifying ambiguous aspects is challenging even for humans in many cases, and providing a generic hint helps LMs to stay on topic.
Thorough comparisons with baselines. The performance of all baselines on 2WikiMultihopQA is reported in Table 1, where the previous-window (Borgeaud et al., 2022; Ram et al., 2023), previous-sentence (Trivedi et al., 2022), and question decomposition (Press et al., 2022; Yao et al., 2022) methods are reimplemented for fair comparison. FLARE outperforms all baselines by a large margin, which confirms that forward-looking active retrieval is highly effective. Most multi-time retrieval augmented approaches outperform single-time retrieval, but with different margins. The improvement from retrieving using the previous sentence is relatively small, which we hypothesize is mainly because the previous sentence often describes entities or relations different from those in the next sentence in 2WikiMultihopQA, while the previous-window approach might use the first half of a sentence to retrieve information potentially helpful for generating the second half. Among all baselines, the question decomposition approach (Press et al., 2022) achieves the best performance, which is not surprising since its in-context exemplars, manually annotated with decomposed sub-questions (Prompt D.2), guide LMs to generate sub-questions that align with the topic/intent of future generation. FLARE outperforms this baseline, indicating that manual exemplar annotation is not necessary for effective future-aware retrieval. The gap between FLARE instruct and question decomposition is large, indicating that teaching LMs to generate search queries using task-generic retrieval instructions and exemplars is challenging. We report all metrics for the other datasets in Table 2. FLARE outperforms the baselines with respect to all metrics. Retrieval using the previous window underperforms single-time retrieval on ASQA, which we hypothesize is because the previous window does not accurately reflect future intent. Since we focus on evaluating factuality, metrics with an emphasis on factual content (such as EM, Disambig-F1, and UniEval) are more reliable than metrics computed over all tokens (ROUGE-L).

Ablation Study
Importance of forward-looking retrieval. We first validate that forward-looking retrieval is more effective than past-context-based retrieval. We run ablation experiments on 2WikiMultihopQA and ASQA-hint comparing retrieval using the previous versus the next sentence. Specifically, both methods retrieve every sentence and directly use the complete previous/next sentence as the query. As shown in Table 3, using the next sentence to retrieve is clearly better than using the previous sentence, confirming our hypothesis. We also run previous-window approaches using different numbers of past tokens as queries. As shown in Table 4, using too many past tokens (> 32) hurts performance, further confirming our hypothesis that the previous context might not be relevant to the intent of future generation.
Importance of active retrieval. Next, we investigate how the active retrieval threshold θ affects performance. To vary our method from never retrieving to retrieving every sentence, we adjust the confidence threshold θ that determines when to trigger retrieval from 0 to 1. We then calculate the proportion of steps/sentences for which retrieval is activated and present performance as a function of it. As shown in Figure 5, on 2WikiMultihopQA, the performance plateaus when the retrieval percentage exceeds 60%, indicating that retrieval when LMs are confident is not necessary. On StrategyQA, the performance drops when the retrieval percentage exceeds 50%, indicating that unnecessary retrieval can introduce noise and impede the original generation process. We found that triggering retrieval for 40%-80% of sentences usually leads to good performance across tasks/datasets.

Effectiveness of different query formulation methods
We study implicit query formulation by masking and explicit query formulation through question generation. In Table 5, we report the performance of FLARE with respect to different masking thresholds β. Retrieving directly with the complete sentence (β = 0) is worse than masking tokens with low probabilities, confirming our hypothesis that low-confidence erroneous tokens can distract retrievers. We compare implicit and explicit query formulation methods in Table 6. The performance of the two methods is similar, indicating that both can effectively reflect information needs.

Related Work
We refer to subsection 2.2 and section 4 for an extensive discussion of single-time and multi-time retrieval augmented LMs, which is the area most relevant to this paper.
Iterative and adaptive retrieval Iterative retrieval and refinement have been studied in both text and code generation tasks (Peng et al., 2023; Zhang et al., 2023; Zemlyanskiy et al., 2022; Yu et al., 2023). FLARE differs from these methods in the granularity of generation and in its retrieval strategy.
Adaptive retrieval has been studied in single-time retrieval scenarios based on either question popularity or generation probabilities (Mallen et al., 2022; Li et al., 2023), while we focus on long-form generation requiring active information access.
Browser-enhanced LMs WebGPT (Nakano et al., 2021) and WebCPM (Qin et al., 2023) train LMs to interact with a browser to enhance factuality using reinforcement learning or supervised training, where multiple queries can be triggered before generation. FLARE is built on text-based retrievers but can be combined with a browser to potentially improve retrieval quality.

Conclusion
To aid long-form generation with retrieval augmentation, we propose an active retrieval augmented generation framework that decides when and what to retrieve during generation. We implement this framework with forward-looking active retrieval, which iteratively uses the upcoming sentence to retrieve relevant information if it contains low-confidence tokens and then regenerates the next sentence. Experimental results on 4 tasks/datasets demonstrate the effectiveness of our methods. Future directions include better strategies for active retrieval and developing efficient LM architectures for active information integration.

Limitations
We also conducted experiments on Wizard of Wikipedia (Dinan et al., 2019) and ELI5 (Fan et al., 2019) and found that FLARE did not provide significant gains. Wizard of Wikipedia is a knowledge-intensive dialogue generation dataset where the output is relatively short (∼20 tokens on average), so retrieving multiple disparate pieces of information might not be necessary. ELI5 (Fan et al., 2019) is a long-form QA dataset requiring in-depth answers to open-ended questions. Due to issues mentioned in Krishna et al. (2021), such as difficulties in grounding generation in retrieval and in evaluation, neither single-time retrieval nor FLARE provided significant gains over not using retrieval.
From an engineering perspective, interleaving generation and retrieval with a naive implementation increases both the overhead and the cost of generation. LMs need to be activated multiple times (once for each retrieval), and a caching-free implementation also requires recomputing the previous activations each time after retrieval. This issue can potentially be alleviated with special architectural designs that encode the retrieved documents $D_{q_t}$ and the input/generation ($x$/$y_{<t}$) independently.
A FLARE Implementation Details

FLARE instruct implementation details. We found that LMs can effectively combine retrieval and downstream task-related skills and generate meaningful search queries while performing the task. However, there are two issues: (1) LMs tend to generate fewer search queries than necessary; (2) generating excessive search queries can disrupt answer generation and adversely affect performance. We address these issues with two methods, respectively. First, we increase the logit of the token "[" by 2.0 to improve the chances of LMs generating "[Search(query)]". Second, whenever LMs generate a search query, we use it to retrieve relevant information, promptly remove it from the generation, and generate the next few tokens while forbidding "[" by adding a large negative value to the logit of "[".
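A sketch of how these two logit adjustments could be applied through the (legacy) OpenAI completions endpoint's logit_bias parameter is shown below; the token id for "[" is a placeholder assumption and must be looked up in the tokenizer of the specific model:

```python
import openai  # legacy completions API, as used with text-davinci-003

BRACKET_TOKEN_ID = 58  # placeholder: look up the actual id of "[" in the model's tokenizer

def generate(prompt, encourage_search=True, forbid_search=False, max_tokens=64):
    """Raise the logit of "[" to encourage emitting [Search(query)] markers, or push it
    strongly negative right after a retrieval to temporarily suppress further queries."""
    bias = {}
    if encourage_search:
        bias[BRACKET_TOKEN_ID] = 2.0      # increase the chance of generating "["
    if forbid_search:
        bias[BRACKET_TOKEN_ID] = -100.0   # effectively forbid "[" for the next few tokens
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=max_tokens,
        logit_bias=bias,
    )
    return resp["choices"][0]["text"]
```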
The initial query of FLARE. FLARE starts with the user input $x$ as the initial query to retrieve documents and generate the first sentence $\hat{s}_1 = \mathrm{LM}([D_x, x])$, bootstrapping the iterative generation process. For the following steps, the temporary forward-looking sentence is generated without retrieved documents.
Sentence tokenization. At each step $t$, we generate 64 tokens, which is longer than most sentences, and use the NLTK sentence tokenizer to extract the first sentence and discard the rest.
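The first-sentence extraction can be done with NLTK's punkt sentence tokenizer, for example:

```python
import nltk
nltk.download("punkt", quiet=True)  # one-time download of the sentence tokenizer models
from nltk.tokenize import sent_tokenize

def first_sentence(generated_text):
    """Keep only the first complete sentence of the newly generated tokens; discard the rest."""
    sentences = sent_tokenize(generated_text)
    return sentences[0] if sentences else generated_text
```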
Efficiency As shown in subsection 6.2, on average retrieval is triggered for 30%∼60% of sentences depending on the downstream task. In comparison, KNN-LM (Khandelwal et al., 2020) retrieves every token, RETRO and IC-RALM (Borgeaud et al., 2022; Ram et al., 2023) retrieve every 4∼32 tokens, and IRCoT (Trivedi et al., 2022) retrieves every sentence. Compared to single-time retrieval, however, interleaving retrieval and generation with a naive implementation indeed increases overhead, which we discuss in the limitations section.

B Datasets and Settings
Datasets, metrics, and experimental settings are summarized in Table 7.
Multihop QA For "Why did the founder of Versus die?", the output we aim to generate is "The founder of Versus was Gianni Versace. Gianni Versace was shot and killed on the steps of his Miami Beach mansion on July 15, 1997. So the answer is shot." We use 8 exemplars from Trivedi et al. (2022), listed in Prompt D.4, for in-context learning, BM25 as the retriever, and Wikipedia articles as the retrieval corpus. Similar to the observation in Trivedi et al. (2022), we found that incorporating retrieval results for exemplars improves performance, so we use the input x of each exemplar to retrieve several documents and then add them using the format in Prompt D.1. We found that increasing the number of retrieved documents often increases performance. Therefore, we use the maximum number of documents that can fit within the input length limit of text-davinci-003, which is 2 for 2WikiMultihopQA.
Commonsense Reasoning For "Would a pear sink in water?", the output we aim to generate is "The density of a pear is about 0.6 g/cm³, which is less than water. Objects less dense than water float. Thus, a pear would float. So the final answer is no." We use 6 exemplars from Wei et al. (2022), listed in Prompt D.5, BM25 on the Wikipedia corpus, and 3 retrieved documents to run experiments.
Long-form QA For "Where do the Philadelphia Eagles play their home games?", the output we aim to generate is "We need to consider the different possible locations or venues that could be considered the home field of the Philadelphia Eagles. These include the city, the sports complex, or the stadium. Therefore, this question has 3 interpretations and the answers are: (1) The city is Philadelphia. (2) The sports complex is the South Philadelphia Sports Complex. (3) The stadium is the Lincoln Financial Field stadium." For both the original setting (ASQA) and the setting with hints (ASQA-hint), we manually annotate 8 exemplars (Prompt D.6 and D.8), use BM25 on the Wikipedia corpus, and 3 retrieved documents to run experiments.

C Hyperparameters
Hyperparameters of FLARE on different datasets are listed in Table 9.

D Prompts and Few-shot exemplars
The prompt used to linearize multiple documents is shown in Prompt D.1. The prompt used in self-ask (Press et al., 2022) is shown in Prompt D.2. Prompts and exemplars of different tasks/datasets are shown in Prompt D.3, D.4, D.5, D.6, D.8, and D.10, respectively.

Generate a yes or no answer to the following question. Question: Could Brooke Shields succeed at University of Pennsylvania? Answer: Brooke Shields went to Princeton University. Princeton University is about as academically rigorous as the University of Pennsylvania. Thus, Brooke Shields could also succeed at the University of Pennsylvania. So the final answer is yes.
Generate a yes or no answer to the following question. Question: Hydrogen's atomic number squared exceeds number of Spice Girls? Answer: Hydrogen has an atomic number of 1. 1 squared is 1. There are 5 Spice Girls. Thus, Hydrogen's atomic number squared is less than 5. So the final answer is no.
Generate a yes or no answer to the following question. Question: Is it common to see frost during some college commencements? Answer: College commencement ceremonies can happen in December, May, and June. December is in the winter, so there can be frost. Thus, there could be frost at some commencements. So the final answer is yes.
Generate a yes or no answer to the following question. Question: Could a llama birth twice during War in Vietnam (1945-46)? Answer: The War in Vietnam was 6 months. The gestation period for a llama is 11 months, which is more than 6 months. Thus, a llama could not give birth twice during the War in Vietnam. So the final answer is no.
Generate a yes or no answer to the following question. Question: Would a pear sink in water? Answer: The density of a pear is about 0.6 g/cm³, which is less than water. Objects less dense than water float. Thus, a pear would float. So the final answer is no.
Generate a yes or no answer to the following question. Question: Did Spiderman fight against Falcon in the MCU? Answer:

Figure 1: An illustration of forward-looking active retrieval augmented generation (FLARE). Starting with the user input x and the initial retrieval results D_x, FLARE iteratively generates a temporary next sentence (shown in gray italic) and checks whether it contains low-probability tokens (indicated with underline). If so (steps 2 and 3), the system retrieves relevant documents and regenerates the sentence.

Figure 2: An illustration of forward-looking active retrieval augmented generation with retrieval instructions (FLARE instruct). It iteratively generates search queries (shown in gray italic) to retrieve relevant information to aid future generations.

Figure 3: Implicit and explicit query formulation. Tokens with low probabilities are marked with underlines.

Figure 5: Performance (EM) of FLARE with respect to the percentage of steps/sentences with retrieval on 2WikiMultihopQA and StrategyQA.

Table 3: A head-to-head comparison between using the previous sentence and the next sentence for retrieval.

Table 5: Performance of FLARE with respect to the masking threshold β on 2WikiMultihopQA.

Table 7: Dataset statistics and experimental settings of different tasks.

Prompt D.2: multihop QA with self-ask
Question: Who lived longer, Theodor Haecker or Harry Vaughan Watkins? Are follow up questions needed here: Yes. Follow up: How old was Theodor Haecker when he died? Intermediate answer: Theodor Haecker was 65 years old when he died. Follow up: How old was Harry Vaughan Watkins when he died? Intermediate answer: Harry Vaughan Watkins was 69 years old when he died. So the final answer is: Harry Vaughan Watkins.

Table 8: Wikipedia-related domains excluded from Bing's search results: wikipedia.org, wikiwand.com, wiki2.org, wikimedia.org.

Table 9: Hyperparameters of FLARE on different datasets.

Prompt D.5: exemplars of StrategyQA
Generate a yes or no answer to the following question. Question: Do hamsters provide food for any animals? Answer: Hamsters are prey animals. Prey are food for predators. Thus, hamsters provide food for some animals. So the final answer is yes.