Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy

Large language models are powerful text processors and reasoners, but are still subject to limitations including outdated knowledge and hallucinations, which necessitates connecting them to the world. Retrieval-augmented large language models have attracted extensive attention for grounding model generation on external knowledge. However, retrievers struggle to capture relevance, especially for queries with complex information needs. Recent work has proposed to improve relevance modeling by having large language models actively involved in retrieval, i.e., to improve retrieval with generation. In this paper, we show that strong performance can be achieved by a method we call ITER-RETGEN, which synergizes retrieval and generation in an iterative manner. A model output shows what might be needed to finish a task, and thus provides an informative context for retrieving more relevant knowledge, which in turn helps generate a better output in the next iteration. Compared with recent work which interleaves retrieval with generation when producing an output, ITER-RETGEN processes all retrieved knowledge as a whole and largely preserves flexibility in generation without structural constraints. We evaluate ITER-RETGEN on multi-hop question answering, fact verification, and commonsense reasoning, and show that it can flexibly leverage both parametric and non-parametric knowledge, and is superior to or competitive with state-of-the-art retrieval-augmented baselines while incurring lower retrieval and generation overheads. We can further improve performance via generation-augmented retrieval adaptation.


Introduction
Generative Large Language Models (LLMs) have powered numerous applications, with well-perceived utility. Despite being powerful, LLMs lack knowledge that is under-represented in their training data, and are prone to hallucinations, especially in open-domain settings (OpenAI, 2023).
Retrieval-augmented LLMs, therefore, have attracted widespread attention, as LLM outputs can be potentially grounded on external knowledge.
Previous retrieval-augmented LMs (Izacard et al., 2022b; Shi et al., 2023) typically adopted one-time retrieval, i.e., retrieving knowledge using only the task input (e.g., a user question for open-domain question answering). One-time retrieval suffices when the information needs are clearly stated in the original input, as in factoid question answering (Kwiatkowski et al., 2019) and single-hop fact verification (Thorne et al., 2018), but not in tasks with complex information needs, e.g., multi-hop reasoning (Yang et al., 2018) and long-form question answering (Fan et al., 2019).
To fulfill complex information needs, recent work proposes to gather required knowledge multiple times throughout the generation process, using partial generation (Trivedi et al., 2022a; Press et al., 2022) or forward-looking sentences (Jiang et al., 2023) as search queries. However, such structured workflows of interleaving retrieval with generation have two limitations: (1) because intermediate generation is conditioned only on knowledge retrieved before it, with no awareness of knowledge retrieved afterwards, they fail to process all retrieved knowledge as a whole during generation; (2) they require multi-round retrieval to gather a comprehensive set of knowledge, and may frequently change the prompts by inserting newly retrieved knowledge, thus increasing the overheads of both retrieval and generation.
In this paper, we find it simple but effective to enhance retrieval-augmented LLMs through iterative retrieval-generation synergy (ITER-RETGEN, Fig. 1). ITER-RETGEN alternates retrieval-augmented generation and generation-augmented retrieval: retrieval-augmented generation outputs a response to a task input based on all retrieved knowledge (initially using the task input as the query). This output shows what might be needed to fulfill the task, and thus can serve as an informative context to retrieve more relevant knowledge, i.e., generation-augmented retrieval. The newly retrieved knowledge can benefit another iteration of retrieval-augmented generation. We can also leverage model generations to adapt retrieval, by distilling knowledge from a re-ranker with access to model generations into a dense retriever with access to task inputs only, which may be beneficial in scenarios where user inputs can be easily collected, but relevant knowledge or desirable outputs are not annotated.
We evaluate our method on three tasks: multi-hop question answering, fact verification, and commonsense reasoning. Our method prompts an LLM to produce a chain of reasoning steps followed by the final answer under a few-shot setting. For in-context demonstrations, we focus on problem-solving and follow Wei et al. (2022) to annotate chains of thought, without explicitly considering how generation-augmented retrieval might be affected, which makes the method conceptually simple and easy to implement. Our method achieves up to 8.6% absolute gains over previous state-of-the-art retrieval-augmented methods on four out of six datasets while being competitive on the remaining two. According to our experiments, generation generally benefits from more iterations, with two iterations giving the most performance gains. One may customize the performance-cost tradeoff by choosing an appropriate number of iterations. We can further improve performance and also reduce iterations via the aforementioned generation-augmented retrieval adaptation.
We summarize our findings as follows:

• Automatic metrics such as exact match can significantly underestimate the performance of LLMs on question answering tasks. Moreover, improvements in exact match do not always reflect improvements in generations. Evaluation using LLMs may be more reliable.
• ITER-RETGEN is superior to or competitive with state-of-the-art retrieval-augmented methods, while being simpler and incurring lower retrieval and generation overheads. With generation-augmented retrieval adaptation, we can further improve performance and also reduce overheads (by reducing iterations).
• It is desirable for an LLM to leverage both parametric and non-parametric knowledge effectively. ITER-RETGEN consistently outperforms Self-Ask on question answering tasks, regardless of whether in-context non-parametric knowledge mentions the answers or not.

Related Work
In recent months, there has been a surge in LLM-powered applications, such as ChatGPT, Bing Chat, and Copilot (Chen et al., 2021). While showing an unprecedented level of performance, LLMs are subject to the following limitations: (1) due to the high demand for compute and data, it remains an open research question how to continually update LLMs both efficiently and effectively (Scialom et al., 2022); (2) LLMs also tend to hallucinate (OpenAI, 2023), i.e., generate plausible but non-factual text. To alleviate these issues, there is a growing trend of augmenting LLMs with tools (Mialon et al., 2023; Gou et al., 2023), e.g., a code interpreter (Gao et al., 2022b; Shao et al., 2023) or a search engine (Nakano et al., 2021), in an attempt to offload subtasks to more qualified experts, or to enrich the input context for LLMs by providing more relevant information.
Retrieval augmentation is a mainstream direction for connecting LLMs to the external world. Previous retrieval-augmented LMs (Izacard and Grave, 2021; Shao and Huang, 2022) typically receive retrieved knowledge in a passive way: knowledge is retrieved based on the task inputs without the LMs' intervention. As it is difficult for a retriever to capture relevance, especially in the zero-shot setting, recent work shows a shift towards having LLMs actively involved in retrieval to improve relevance modeling, e.g., by providing a specific context for retrieval with model generations (e.g., generated search queries (Nakano et al., 2021; Press et al., 2022; Yao et al., 2022), partial generation (Trivedi et al., 2022a), or forward-looking sentences (Jiang et al., 2023)). Khattab et al. (2022) proposed the DSP programming framework, which supports various retrieval-augmented methods.
Recent work interleaves retrieval with generation when completing a single output. Such a structured workflow may reduce flexibility in generation (Yao et al., 2022). ITER-RETGEN avoids interrupting generation with retrieval; instead, it iterates retrieval and generation, i.e., it leverages the complete generation from the previous iteration to retrieve more relevant information, which helps improve generation in the next iteration. ITER-RETGEN also has the advantage of processing all retrieved knowledge as a whole during generation, and is conceptually simpler and easier to implement, while being empirically strong on multi-hop question answering, fact verification, and commonsense reasoning.
A closely related work called GAR (Mao et al., 2021) augments queries with generated background information.HyDE (Gao et al., 2022a) also shares a similar spirit, but focuses on zero-shot information retrieval, and proposes to first prompt an LLM to produce "hypothetical" paragraphs that cover the information needed to answer a given question, and then use the generated paragraphs to retrieve the real ones.RepoCoder (Zhang et al., 2023) focuses on repository-level code completion, and proposes a 2-iteration retrieval-generation paradigm where the second iteration leverages the intermediate code completion for retrieval.By contrast, we propose to synergize retrieval and generation with ITER-RETGEN on various natural language tasks, and explore how we can further adapt retrieval with model generations.

Overview
Given a question q and a retrieval corpus D = {d}, where d is a paragraph, ITER-RETGEN repeats retrieval-generation for T iterations. In iteration t, we (1) leverage the generation y_{t-1} from the previous iteration, concatenated with q, to retrieve the top-k paragraphs, and then (2) prompt an LLM M to produce an output y_t, with both the retrieved paragraphs (denoted as D_{y_{t-1} ∥ q}) and q integrated into the prompt. Each iteration can therefore be formulated as:

$$ y_t = \mathcal{M}\big(\mathrm{prompt}(D_{y_{t-1} \| q},\ q)\big), \qquad D_{y_{t-1} \| q} = \mathrm{Retrieve}(y_{t-1} \| q) \tag{1} $$

The last output y_T is produced as the final response.
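To make the loop concrete, below is a minimal Python sketch of Eq. 1; the `retrieve` and `llm` callables are hypothetical stand-ins for the dense retriever and the LLM API, not the paper's actual implementation.

```python
def iter_retgen(question: str, retrieve, llm, T: int = 2, k: int = 5) -> str:
    """Minimal sketch of ITER-RETGEN (Eq. 1): T rounds of retrieval-augmented
    generation, where each round retrieves with the previous round's output."""
    output = ""  # before iteration 1 there is no generation, so the query is just q
    for t in range(1, T + 1):
        # Generation-augmented retrieval: use y_{t-1} concatenated with q as the query.
        query = (output + " " + question).strip()
        paragraphs = retrieve(query, top_k=k)  # D_{y_{t-1} || q}
        # Retrieval-augmented generation: all retrieved knowledge is processed
        # as a whole, placed before the question in the prompt.
        prompt = "\n".join(paragraphs) + "\nQuestion: " + question
        output = llm(prompt)  # y_t: a chain of reasoning steps plus the final answer
    return output  # y_T serves as the final response
```

With T = 1, the sketch reduces to ordinary retrieval-augmented CoT prompting, which matches the note in the implementation details that the first iteration of ITER-RETGEN is CoT prompting with retrieval augmentation.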

Generation-Augmented Retrieval
There are many natural language tasks with complex information needs. For example, in open-domain multi-hop question answering, specific information needs may manifest themselves only after correctly answering some prerequisite sub-questions. In other words, there may exist semantic gaps between the original question q and its supporting knowledge, which cannot be effectively addressed by a retriever with a representation bottleneck. In the first iteration, we can retrieve knowledge with only the question q. In later iterations, the LLM output from the previous iteration, though having no guarantee of correctness, shows what might be needed to answer the question, and thus can be leveraged to bridge the semantic gaps; with improved retrieval, an LLM can potentially produce a better output.

Retrieval-Augmented Generation
In each iteration, we generate an output using Chain-of-Thought prompting, except that we also prepend retrieved knowledge to the question q. Though there may exist more advanced prompting variants, e.g., incorporating previous generations into the prompt to enable direct refinement, we leave such explorations for future work, and focus on investigating the synergy between retrieval and generation in a straightforward manner.
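As an illustration, a prompt for one iteration might be assembled as sketched below; the layout (few-shot demonstrations, then retrieved paragraphs, then the question) follows the description above, while the exact demonstration wording is given in Appendix B. The helper name and formatting details are assumptions for illustration.

```python
def build_prompt(demonstrations: str, paragraphs: list[str], question: str) -> str:
    """Assemble a retrieval-augmented CoT prompt: few-shot demonstrations with
    annotated reasoning chains, then retrieved knowledge, then the question."""
    knowledge = "\n".join(f"({i}) {p}" for i, p in enumerate(paragraphs, 1))
    return f"{demonstrations}\n\n{knowledge}\nQuestion: {question}\nAnswer:"
```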

Generation-Augmented Retrieval Adaptation
Model generations not only provide specific contexts for retrieval, but can also be leveraged to optimize the retriever, so that information needs in a question can be better captured by the retriever.
Dense Retriever  We adopted dense retrieval in our experiments. Given a dense retriever parametrized by θ = {θ_q, θ_d}, where θ_q and θ_d denote the parameters of the query encoder and the paragraph encoder, respectively, the similarity score between a query and a paragraph is calculated as the inner product of their encoded vectors:

$$ s_\theta(q, d) = E_{\theta_q}(q)^\top E_{\theta_d}(d) \tag{2} $$

Re-ranker  A re-ranker, parametrized by ϕ, outputs the probability of a paragraph being relevant to a query; we denote this probability as s_ϕ(q, d).
Distillation  A re-ranker is typically better at capturing the relevance between a query and a paragraph than a retriever. Therefore, we distill knowledge from the re-ranker into the retriever. To help the retriever better bridge the semantic gaps between a question and its supporting knowledge, we allow the re-ranker access to y_1 (the LLM output from the first iteration). We optimize only the query encoder of the retriever using the following training objective:

$$ \mathcal{L}(\theta_q) = \sum_{q} \mathrm{KL}\big(\, p_\phi(\cdot \mid y_1 \| q)\ \big\|\ p_\theta(\cdot \mid q)\, \big) \tag{3} $$

where p_θ(d | q) and p_ϕ(d | y_1 ∥ q) are obtained by normalizing s_θ(q, d) and s_ϕ(y_1 ∥ q, d) over the retrieved candidate paragraphs, and KL(·, ·) denotes the KL divergence between two probability distributions.
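A single training step consistent with Eq. 3 might look like the sketch below, assuming the re-ranker's probabilities over the candidate paragraphs are precomputed; the function name and tensor shapes are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(query_vec: torch.Tensor,       # (dim,) from the trainable query encoder
                      paragraph_vecs: torch.Tensor,  # (n, dim) frozen paragraph encodings
                      reranker_probs: torch.Tensor   # (n,) p_phi(d | y_1 || q), sums to 1
                      ) -> torch.Tensor:
    """KL distillation for one query: pull the retriever's distribution over
    candidate paragraphs toward the re-ranker's (Eq. 3)."""
    scores = paragraph_vecs @ query_vec             # inner-product similarities (Eq. 2)
    log_p_retriever = F.log_softmax(scores, dim=0)  # log p_theta(d | q)
    # F.kl_div(log_student, teacher) computes KL(teacher || student); gradients
    # reach only the query encoder, through query_vec.
    return F.kl_div(log_p_retriever, reranker_probs, reduction="sum")
```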

Evaluation Settings
We conducted evaluations on all 125 questions from Bamboogle, the first 500 questions from the train set of StrategyQA, and the first 500 questions from the development sets of the other datasets.All methods are evaluated under the 3-shot setting, sharing the same questions in demonstrations.
Evaluation metrics are exact match (EM) and F1 for the multi-hop question answering datasets, and accuracy for both the fact verification and commonsense reasoning datasets. For more robust evaluation, we also evaluate the correctness of model outputs using text-davinci-003, with the resulting metric denoted as Acc†. The prompt used for evaluation is as follows, where {question}, {model output}, and {answer} are placeholders.
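The paper's exact evaluation prompt is not reproduced in this text; the sketch below is a hypothetical prompt of the same shape, using the three placeholders, intended only to illustrate the setup.

```python
# Hypothetical Acc† judging prompt (illustrative wording, not the paper's exact
# text); {question}, {model output}, and {answer} are substituted in before the
# prompt is sent to text-davinci-003.
EVAL_PROMPT = (
    "Question: {question}\n"
    "Prediction: {model output}\n"
    "Ground-truth answer: {answer}\n"
    "Does the Prediction answer the Question correctly, judging against the "
    "Ground-truth answer? Answer Yes or No."
)
```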

Baselines
Direct Prompting (Brown et al., 2020) prompts an LLM to directly generate the final answer without an explanation. When augmenting Direct Prompting with retrieval, we used the question to retrieve knowledge, which is placed before the question in the prompt.
CoT Prompting (Wei et al., 2022) prompts an LLM to generate natural language reasoning steps followed by the final answer.
ReAct (Yao et al., 2022) interleaves reasoning, action, and observation steps until reaching the action of finalizing an answer. An action either generates a query to search for information or finalizes an answer. An observation is the concatenation of retrieved paragraphs.

Self-Ask (Press et al., 2022) interleaves (i) follow-up question generation, (ii) retrieval using the follow-up, and (iii) answering the follow-up conditioned on the retrieved knowledge, until no more follow-up questions are generated and the LLM gives an answer to the original question. We followed Yoran et al. (2023) to prepend newly retrieved paragraphs to the original question. On our evaluated tasks, Self-Ask is conceptually similar to ReAct, with the main difference being that Self-Ask accumulates retrieved knowledge before the original question in the prompt, while ReAct places retrieved knowledge right after its query. Self-Ask and IRCoT (Trivedi et al., 2022a) also share the spirit of synergizing reasoning and retrieval.

DSP (Khattab et al., 2022) comprises a multi-hop retrieval stage and an answer prediction stage. For each hop within the retrieval stage, the model is prompted to generate search queries and to summarize retrieved knowledge for subsequent use. In the prediction stage, DSP generates the answer using CoT based on the summarized knowledge and retrieved documents.

Implementation Details
We used the text-davinci-003 version of InstructGPT (Ouyang et al., 2022) as the backend LLM.
We also present experiments using the open-source Llama-2 models (Touvron et al., 2023) in Appendix A. All experiments used greedy decoding. Contriever-MSMARCO (Izacard et al., 2022a) was used for retrieval. We retrieved the top-5 paragraphs for each query. We allowed at most 5 interactions with retrieval for ReAct and Self-Ask. We adapted the implementation of DSP to use the same generation model and retrieval systems as the other methods.
Note that the first iteration of ITER-RETGEN is CoT prompting with retrieval augmentation. Therefore, ITER-RETGEN and CoT prompting share the same annotated in-context demonstrations. All prompts are presented in the Appendix.
It is worth noting that, as shown by Table 3, ITER-RETGEN (T = 2) is superior to or competitive with ReAct and Self-Ask while using fewer API calls to the LLM (i.e., 2) and fewer retrieved paragraphs (i.e., 5 per iteration, 10 in total). ITER-RETGEN is also conceptually simple: it iterates retrieval-augmented CoT without complex processing.
We also compared ITER-RETGEN with DSP, which likewise generates the answer using CoT based on retrieved knowledge, but differs in how information is collected and processed. In each iteration, ITER-RETGEN retrieves knowledge based on (1) the question and (2) the previous model output, which shows what may be needed to answer the question. As the number of iterations increases, we tend to obtain a more comprehensive and relevant set of knowledge. Besides, unlike DSP, we do not summarize the retrieved documents for answer generation, and thus do not introduce summarization errors. As shown in Table 2, ITER-RETGEN outperforms DSP significantly. We manually investigated 10 random questions where DSP fails but ITER-RETGEN provides correct answers. On 40% of them, DSP fails to retrieve documents that cover the correct answers, while on 50% of them, the summarized knowledge is misleading. For example, for the question "What occupation do Chris Menges and Aram Avakian share?", DSP generates the wrong summary "Chris Menges and Aram Avakian are both members of the American and British Societies of Cinematographers.", while the retrieved documents mention that Aram Avakian is a film editor and director, and only Chris Menges is with the American and British Societies of Cinematographers.
Acc† is a Reliable Metric  To investigate how reliable Acc† is, we focused on model outputs where EM and Acc† disagree, and manually checked which metric gives more correct labels. On each of the four multi-hop question answering datasets, we randomly sampled 20 model outputs from the second iteration of ITER-RETGEN, resulting in 80 samples in total. For 98.75% of the samples, EM is 0 and Acc† is 1, and Acc† gives the correct label 97.5% of the time, indicating that EM severely underestimates model performance. We also carried out the same evaluation for Self-Ask: Acc† gives the correct label 98.75% of the time when it is inconsistent with EM.

Table 5: Comparisons between Self-Ask and ITER-RETGEN (T = 2) on different subsets, in terms of Acc†. CoT ✓ is the subset of questions which CoT answers correctly without retrieval; CoT ✗ is the complement. w/ Answer Retrieved is the subset of questions for which a method (Self-Ask or ITER-RETGEN) successfully retrieves paragraphs that mention the answers; w/o Answer Retrieved is the complement. ITER-RETGEN tends to be much better at preserving the LLM's performance on questions that can be solved using CoT without retrieval, and is consistently more accurate regardless of whether retrieved knowledge mentions the answers or not.
Acc † offers the advantage of identifying model outputs that are semantically correct, even if their surface forms differ from the annotated answers.As an illustration, for the question "Which country Jan Baptist Van Rensselaer's father is from?", the annotated answer is Dutch, while the model prediction is Netherlands, which is correct in terms of Acc † but is penalized by EM.
Notably, ITER-RETGEN (T ≥ 2) consistently demonstrates lower EM but higher Acc† than Self-Ask on 2WikiMultiHopQA, suggesting that enhancements in EM do not necessarily reflect improvements in the quality of generated answers.

Generation Benefits Retrieval Adaptation  To investigate how LLM outputs can be leveraged for retrieval adaptation, we experimented on HotPotQA and Feverous. Specifically, on each dataset, we sampled 9,000 random questions from the train set for training and 1,000 for validation. We applied ITER-RETGEN for one iteration, and used the model outputs y_1 for retrieval adaptation as in Section 3.4. We used TART (Asai et al., 2022) as the re-ranker, and distilled knowledge from TART into the dense retriever for no more than 1,000 steps, with a batch size of 32 and a learning rate of 1e-5. We used the retriever checkpoint with the lowest distillation loss.
As shown by Table 4, retrieval adaptation enables ITER-RETGEN to achieve significantly higher Acc† with fewer iterations. We also demonstrated the benefits of using y_1 for adaptation by showing its improvements over a variant which only differs in that the re-ranker has no access to y_1; the training objective of this variant can be obtained by removing all y_1 notations in Eq. 3.

Generation Augments Retrieval
Table 6 shows the answer recall of retrieval in different iterations. The first iteration uses only the questions for retrieval and suffers from low answer recall. In the second iteration, retrieval, augmented with the LLM output from the first iteration, achieves significantly higher recall, indicating that LLM generations can help bridge the semantic gaps between complex questions and their supporting knowledge. However, performance quickly hits a plateau afterwards.

ITER-RETGEN Leverages Parametric and Non-Parametric Knowledge Better
Ideally, an LLM should flexibly utilize non-parametric or parametric knowledge depending on whether the in-context non-parametric knowledge is relevant. Table 5 presents performance breakdowns on different subsets of questions for investigation. We considered the ability of CoT to answer a question correctly without retrieval as a proxy for assessing an LLM's capability to answer the question using its parametric knowledge. Compared with Self-Ask, ITER-RETGEN tends to be significantly better at preserving the LLM's performance on questions that the LLM can solve using CoT without retrieval, while being competitive on the complementary subset. This may be because the structural constraints of Self-Ask make an LLM over-sensitive to the precision and comprehensiveness of follow-up question generation and answering, and Self-Ask is also incapable of processing all retrieved knowledge as a whole, thus reducing the LLM's flexibility in solving a question. Moreover, ITER-RETGEN consistently outperforms Self-Ask by a large margin, regardless of whether the in-context non-parametric knowledge mentions the answers or not. This indicates that when the in-context non-parametric knowledge is irrelevant or incomplete, ITER-RETGEN exploits parametric knowledge better than Self-Ask.

Error Analysis
On HotPotQA, we manually analyzed 20 random cases where ITER-RETGEN (T = 2) fails. 25% of the predictions are false negatives. In 10% of the cases, ITER-RETGEN retrieves all necessary information but fails to perform correct reasoning. The remaining 65% of error cases are related to retrieval: in 76.9% of these, retrieval is misled by completely wrong reasoning from the first iteration, while in the other cases, reasoning in the first iteration is partially correct, but the retriever fails to retrieve the missing pieces in the second iteration. We also observed that, in the first iteration, reasoning can be negatively affected by noisy and possibly distracting knowledge retrieved using only the questions as queries.

Case Study
Table 7 demonstrates retrieval-generation synergy with two examples, from HotPotQA and StrategyQA, respectively. In the first iteration, as both questions need multi-hop reasoning, the retriever fails to retrieve all supporting knowledge using only the questions. Despite being affected by distracting retrieved knowledge (the capacity of a different arena in the HotPotQA example) and showing imperfect parametric knowledge (the generated statement that Raclette is unlikely to be found in Paris in the StrategyQA example) in the first iteration, the LLM generates phrases that help retrieve relevant knowledge in the second iteration, and successfully corrects its outputs.

Conclusion
We demonstrate the effectiveness of ITER-RETGEN in answering questions with complex information needs. Despite its simplicity, ITER-RETGEN outperforms retrieval-augmented methods with more complex workflows, and we believe it can serve as a strong baseline for future research on retrieval-augmented generation. We also show that generation-augmented retrieval adaptation can further improve the performance of ITER-RETGEN while also reducing overheads.

Limitations
In this work, we propose to enhance retrieval-augmented large language models with ITER-RETGEN, which synergizes retrieval and generation in an iterative manner and demonstrates strong performance compared to more structured prompting techniques such as Self-Ask. However, it is worth noting that our experiments used a fixed black-box large language model, which may not have been equally optimized for the various forms of prompting. It would be intriguing to investigate the potential of prompting-specific (gradient-based) optimization in pushing the limits further, e.g., enabling a large language model to leverage parametric and non-parametric knowledge more flexibly and effectively. Furthermore, our experiments did not cover long-form generation, which would probably benefit from more fine-grained retrieval than ITER-RETGEN performs in this work. We acknowledge that this area warrants further exploration, and we leave it for future work.

A Experiments Using Llama-2
To demonstrate the effectiveness of ITER-RETGEN on open-source models, we replaced the generation model text-davinci-003 in Table 2 with Llama-2 models (Touvron et al., 2023), and re-ran the evaluation. As shown in Table 8, ITER-RETGEN consistently outperforms all baselines significantly.

B Few-Shot Prompts
In this section, we present all few-shot prompts used in our experiments. We replace retrieved paragraphs with the placeholder {Knowledge} for brevity. CoT prompting shares the same in-context demonstrations as ITER-RETGEN, except that it is not augmented with retrieval.

B.5 Feverous
Prompts for Direct Prompting, ReAct, Self-Ask, and ITER-RETGEN are presented in the tables below.

Figure 1: ITER-RETGEN iterates retrieval and generation. In each iteration, ITER-RETGEN leverages the model output from the previous iteration as a specific context to help retrieve more relevant knowledge, which may help improve model generation (e.g., correcting the height of Hesse Hogan in this figure). We only show two iterations for brevity. Solid arrows connect queries to the retrieved knowledge, and dashed arrows denote retrieval-augmented generation.

HotPotQA
Question: The arena where the Lewiston Maineiacs played their home games can seat how many people?
Answer: 3,677 seated

Iteration 1
Retrieved Knowledge:
(1) Title: Lewiston Maineiacs. Context: The Lewiston Maineiacs were a junior ice hockey team ... The team played its home games at the Androscoggin Bank Colisée ...
(2) Title: Bangor Auditorium. Context: ... a 5,948 seat multi-purpose arena located in Bangor, Maine ...
Generation: The Lewiston Maineiacs were a junior ice hockey team of the Quebec Major Junior Hockey League based in Lewiston, Maine. The team played its home games at the Androscoggin Bank Colisée. The Androscoggin Bank Colisée can seat up to 5,948 people. So the answer is 5,948

Iteration 2
Retrieved Knowledge:
(1) Title: Lewiston Maineiacs. Context: The Lewiston Maineiacs were a junior ice hockey team ... The team played its home games at the Androscoggin Bank Colisée ...
(2) Title: Androscoggin Bank Colisée. Context: The Androscoggin Bank Colisée (formerly Central Maine Civic Center and Lewiston Colisee) is a 4,000 capacity (3,677 seated) multi-purpose arena ...
Generation: The Lewiston Maineiacs were a junior ice hockey team of the Quebec Major Junior Hockey League based in Lewiston, Maine. They played their home games at the Androscoggin Bank Colisée. The Androscoggin Bank Colisée has a capacity of 4,000 (3,677 seated). So the answer is 3,677

StrategyQA
Question: Can you get Raclette in YMCA headquarters city?
Answer: Yes

Iteration 1
Retrieved Knowledge:
(1) Title: YMCA Building (Racine, Wisconsin). Context: ... Since the 50's, headquarters of the publishing house and the bookstore are located in 11 rue de la Montagne Sainte-Geneviève 75005 Paris ...
Generation: The YMCA headquarters is located in Paris, France. Raclette is a traditional Swiss dish, made with melted cheese served with potatoes, pickles, and other accompaniments. It is not a traditional French dish, so it is unlikely to be found in Paris. So the answer is No

Iteration 2
Retrieved Knowledge:
(1) Title: Raclette. Context: Raclette is a Swiss dish, also popular in France, based on heating cheese and scraping off the melted part ...
(2) Title: YMCA Building (Racine, Wisconsin). Context: ... Since the 50's, headquarters of the publishing house and the bookstore are located in 11 rue de la Montagne Sainte-Geneviève 75005 Paris ...
Generation: YMCA headquarters is located in Paris, France. Raclette is a dish native to parts of Switzerland, but it is also popular in France. So it is likely that Raclette can be found in Paris. So the answer is Yes

Table 7: Two examples demonstrating retrieval-generation synergy. We underline generated phrases that help retrieve relevant knowledge, which successfully corrects factual errors (in red) in the second iteration. Irrelevant retrieved paragraphs are not shown in the table for brevity.

Table 1: Example questions from six datasets.

Table 2: Evaluation results on multi-hop question answering, fact verification, and commonsense reasoning datasets. Acc† is the accuracy of model outputs evaluated with text-davinci-003. For ITER-RETGEN, we evaluated LLM outputs in different iterations (up to 7 iterations). Underlined metric values are higher than those of Self-Ask.

Table 4: Effect of using the LLM generation y_1 on optimizing a dense retriever (columns: Original, Distilled w/o y_1, Distilled w/ y_1). We evaluated ITER-RETGEN on HotPotQA and Feverous in terms of Acc†.

Table 6: Answer recall of retrieved paragraphs in different iterations for ITER-RETGEN.

Table 8: Experiments using the open-source Llama-2 models. We used Acc† as the evaluation metric, i.e., we evaluate the accuracy of model outputs with text-davinci-003.

Excerpt from a Direct Prompting few-shot prompt:
{Knowledge} Question: What is the name of this American musician, singer, actor, comedian, and songwriter, who worked with Modern Records and born in December 5, 1932? The answer is Little Richard
{Knowledge} Question: Between Chinua Achebe and Rachel Carson, who had more diverse jobs? The answer is Chinua Achebe
{Knowledge} Question: Remember Me Ballin' is a CD single by Indo G that features an American rapper born in what year? The answer is 1979
{Knowledge} Question: