Allies: Prompting Large Language Model with Beam Search

With the advance of large language models (LLMs), research on LLM applications has become increasingly popular, and the idea of constructing pipelines that accomplish complex tasks by stacking LLM API calls has become a reality. However, such methods face two limitations: narrow information coverage and low fault tolerance. In this work, we propose a novel method called ALLIES. Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query, enabling an iterative reasoning process. By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly obtainable through retrieval. We take zero-shot open-domain question answering (ODQA) as an application scenario and evaluate ALLIES on widely used benchmarks such as NQ, WebQ, and TriviaQA. The experimental results demonstrate that ALLIES significantly outperforms other zero-shot baselines, indicating its effectiveness in tackling those challenges. Our code is available at https://github.com/microsoft/SimXNS/tree/main/ALLIES.

However, despite the advancements, these methods still face two main limitations. (1) The first is narrow information coverage. When incorporating relevant information, the majority of these approaches employ only the query itself to find or retrieve additional contextual information. Nonetheless, there are instances where responding to the query necessitates implicit knowledge that is related to the query but cannot easily be found using the given query alone. Consequently, the LLM may fail to acquire crucial information required for accurately responding to the query.
(2) The second is low fault tolerance. Most of these methods follow the pipeline style, consisting of distinct steps that call LLM APIs to generate responses to fulfill different needs in a single turn. This means that the model is expected to provide the correct response in a single attempt. If an internal step fails, either the whole pipeline risks raising an exception or the error is propagated to downstream steps. Consequently, if the model fails to find the necessary information or misinterprets the question, it may produce an incorrect response.
To address the aforementioned limitations, we propose a novel approach called ALLIES that applies a beam search strategy to generate responses. To better elaborate the method, we take open-domain question answering as the application scenario and show an example of how ALLIES works in Figure 1. We adopt an interactive and iterative process. Initially, we generate additional queries by asking the LLM what other information it requires, based on the existing query-evidence pair. These generated queries serve as prompts for retrieving relevant evidence from external sources.
Figure 1: An example of answering the question "when was the first driver's license required?" using ALLIES. The correct answer is "January 1, 1904".
The retrieved evidence is then added to the existing query-evidence pair. Next, we employ the LLM to respond to the initial query based on the augmented query-evidence pairs. Subsequently, we ask the LLM to score the response, taking into account the query and the augmented query-evidence pair. This scoring process provides a measure of confidence in the generated response. The iterations continue until the score surpasses a predefined threshold, indicating a sufficiently confident answer, or until the maximum depth of the tree traversal is reached. Once either of these conditions is fulfilled, the process terminates, and the answer is output as the final result. Responding to the query using ALLIES can be conceptualized as a tree traversal process, starting from the root node and progressing towards the leaf nodes, where each internal node in the tree represents a generated query.
The main advantages of our method are twofold: (1) First, we employ an extension strategy that extends the original question to multiple relevant questions, broadening the information coverage. This approach enables the LLM to gain a deeper understanding of the complex question by focusing on its constituent parts. By providing the LLM with more specific and targeted queries, we enhance its ability to comprehend and process the question effectively. (2) Second, during the iterative process, we employ a dynamic pruning technique that retains only the top B answers at each step. This increases the fault tolerance and robustness of our model by allowing the LLM to make mistakes during the reasoning process. Any erroneous answers can be replaced by alternative answers, leading to more accurate and reliable responses. This flexibility and adaptability contribute to the improved performance of our approach.
With the idea of ALLIES, we take zero-shot open-domain question answering (ODQA) as an application scenario and evaluate ALLIES on several popular benchmarks. We conduct experiments on the NQ, TriviaQA, and WebQ datasets. The results demonstrate that ALLIES significantly outperforms several representative baselines while maintaining an acceptable cost. The case study further confirms the aforementioned advantages of our method.
In summary, our main contributions are as follows:
1. We propose ALLIES, which leverages a beam search strategy for response generation. Within this framework, we adopt an interactive and iterative process to enhance the accuracy and robustness of the responses.
2. By extending the original question into multiple relevant questions and employing a dynamic pruning technique, we improve the understanding of complex questions and increase the model's robustness. This allows for mistakes and alternative answers, resulting in more accurate and robust responses.
3. Taking zero-shot ODQA as an application scenario, results on the NQ, TriviaQA, and WebQ datasets demonstrate that our method significantly outperforms baseline approaches. The case study further validates the advantages of our approach.
Related Work

Open-Domain Question Answering
Open-domain question answering is a task that aims to provide answers to questions without relying on specific context. This task can be categorized into two settings: the open-book setting and the closed-book setting. In the open-book setting, models [Chen et al., 2017, Izacard and Grave, 2021, 2020] typically consist of a retriever and a reader component. The retriever's role is to retrieve relevant information from a corpus such as Wikipedia [Chen et al., 2017, Izacard and Grave, 2021] or web pages [Lazaridou et al., 2022, Nakano et al., 2021], while the reader focuses on answering the question based on the retrieved information.
In the closed-book setting, models have no access to an external corpus and have to rely on their parameters to store all the information. Recent works find that large-scale language models like T5 [Raffel et al., 2020] can already answer questions without access to an external corpus. However, small-scale language models like RoBERTa [Liu et al., 2019] or GPT-2 [Radford et al., 2019] still face challenges in accurately answering questions in this setting.

Large Language Model Enhanced Question Answering
In recent times, there has been a shift towards utilizing large language models (LLMs) for question answering [Chowdhery et al., 2022, Du et al., 2022, Liu et al., 2021]. This research can be broadly categorized into two lines of work. The first line of work focuses on preprocess methods [Borgeaud et al., 2022, Ram et al., 2023, Shi et al., 2023], which involve obtaining relevant documents and then utilizing LLMs to generate answers. Within this line of work, there are two main approaches. Retrieve-then-read methods [Ram et al., 2023, Shi et al., 2023] employ a retrieval model to retrieve relevant documents, while generate-then-read methods [Yu et al., 2022, Sun et al., 2022] fully leverage the capabilities of LLMs. Furthermore, researchers have demonstrated that combining generation and retrieval can lead to further gains [Yu et al., 2022].
The second line focuses on post-hoc methods (such as work on QA with attribution) [Rashkin et al., 2021, Gao et al., 2022, Bohnet et al., 2022, Menick et al., 2022], which involve generating an answer using an LLM and then refining it with the help of a verifier and a retriever. The documents retrieved in the second stage serve as explanations for the generated answer.

Main Idea
The main idea of ALLIES is an interactive and iterative process based on a widely used search algorithm, beam search. We use a tuple with five slots to represent a state, which is an element of a beam. Each state ⟨q, Q, E, r, s⟩ consists of the original query q, the set of historical query completions Q, the set of historical external evidences E, the current response r, and the estimated score s of the current state. Assuming the maximum search depth is D, ALLIES has four main stages, as illustrated in Figure 2.
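As a minimal sketch of the five-slot state described above, the tuple can be modeled as an immutable record. The class name `State`, its field names, and the `extend` helper are hypothetical and do not come from the released code; tuples are used for Q and E (rather than sets) on the assumption that the order of the reasoning history matters.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class State:
    """One beam element <q, Q, E, r, s> (field names are illustrative)."""
    query: str                        # q: the original query
    asked: Tuple[str, ...] = ()       # Q: previously generated queries
    evidence: Tuple[str, ...] = ()    # E: evidence gathered for each generated query
    response: str = ""                # r: current response to the original query
    score: float = 0.0                # s: estimated confidence of the response

    def extend(self, new_query: str, new_evidence: str,
               response: str, score: float) -> "State":
        """Return a new state with one more query-evidence pair appended."""
        return State(
            self.query,
            self.asked + (new_query,),
            self.evidence + (new_evidence,),
            response,
            score,
        )
```

Because the record is frozen, extending a state never mutates its parent, so several children can safely share the same history prefix.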

Beam Initialization
In the beginning, we initialize the beam by asking the LLM to answer the query directly and by asking it to answer the query based on retrieved evidence. The retrieved evidence is obtained by first retrieving related documents using the original query and then summarizing the documents. The generated tuples are added to the beam.
Algorithm 1 The process of generating the response to a given query using ALLIES.
Hyperparameters: The maximum number K of generated queries, the maximum depth D of extension, the number N of documents from retrieval, the score threshold S, and the beam size B.
Input: A query q.
  …    ▷ The second seed.
  for extension depth d in 1 → D do    ▷ Extending within the depth.
    Clear the beam for the current depth S_d = ∅.
    for each tuple in the previous beam ⟨q, Q, E, a, s⟩ ∈ S_{d−1} do    ▷ Iterate the previous tuples.
      Find the extended queries Q′ = Ask(q, Q, E, K).
      for each extended query q′ ∈ Q′ do    ▷ Try each possible extension.
        Retrieve an evidence e′ = Retrieve(q_ori, q′, N).
        Try to answer with all the evidences a′ = Answer(q, Q ∪ {q′}, E ∪ {e′}).
        Score the answer s′ = Score(q, Q ∪ {q′}, E ∪ {e′}, a′).
        Add the current extended tuple to the beam …
    …
    end if
  end for
  Find the tuple ⟨q, Q, E, â, s_max⟩ ∈ S_D with the largest score s_max; â is the final answer.

Beam Expansion
During the beam search process, we iteratively pop one element from the front of the beam. For each element, we generate queries using the Ask function. Then, for each generated query, we retrieve relevant evidence and ask the LLM to answer the query based on both the retrieved evidence and the reasoning history. The LLM scores the generated answers based on the reasoning history, and the newly formed tuples are added to the end of the beam.

Beam Pruning
At the end of each search depth, we rank the newly generated answers by score and keep only the top B answers.

Beam Termination
If the highest-ranking answer in the beam has a score exceeding the predefined threshold, the search process terminates, and the answer is output. Otherwise, the process continues. If none of the elements in the beam reaches the threshold, we output the highest-scoring answer when the search reaches the maximum depth.
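The expansion, pruning, and termination stages can be sketched together as a single loop. This is an illustrative sketch under stated assumptions, not the released implementation: the function name `allies_search`, the `State` tuple, and the callables `ask`, `retrieve`, `answer`, and `score` are hypothetical stand-ins for the LLM and retriever calls, the seed states are assumed to be built beforehand by the initialization stage, and the threshold is assumed to be checked only after each expansion depth.

```python
import heapq
from typing import Callable, List, NamedTuple, Sequence, Tuple


class State(NamedTuple):
    query: str                  # q: original query
    asked: Tuple[str, ...]      # Q: generated queries so far
    evidence: Tuple[str, ...]   # E: evidence for each generated query
    answer: str                 # current answer to the original query
    score: float                # confidence assigned to the answer


def allies_search(
    seeds: Sequence[State],
    ask: Callable[[str, Tuple[str, ...], Tuple[str, ...], int], List[str]],
    retrieve: Callable[[str, str], str],
    answer: Callable[[str, Tuple[str, ...], Tuple[str, ...]], str],
    score: Callable[[str, Tuple[str, ...], Tuple[str, ...], str], float],
    K: int = 2, D: int = 2, B: int = 2, S: float = 0.8,
) -> State:
    beam = list(seeds)
    for _ in range(D):
        candidates = []
        # Expansion: extend every state in the previous beam with up to K new queries.
        for st in beam:
            for q_new in ask(st.query, st.asked, st.evidence, K)[:K]:
                e_new = retrieve(st.query, q_new)
                Q = st.asked + (q_new,)
                E = st.evidence + (e_new,)
                a = answer(st.query, Q, E)
                candidates.append(State(st.query, Q, E, a, score(st.query, Q, E, a)))
        if not candidates:  # nothing left to ask; keep the previous beam
            break
        # Pruning: keep the B highest-scoring states, best first.
        beam = heapq.nlargest(B, candidates, key=lambda s: s.score)
        # Termination: stop early once the best state reaches the threshold.
        if beam[0].score >= S:
            break
    return max(beam, key=lambda s: s.score)
```

Passing the four functions as arguments keeps the search logic independent of any particular LLM or retriever, which also makes the loop easy to exercise with deterministic stubs.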

Detailed Approach for ODQA
In this section, we present the application of ALLIES in ODQA, whose algorithm is illustrated in Algorithm 1. There are four key functions used in ALLIES, each serving a specific purpose. The corresponding prompts are illustrated in Appendix C.

Answering Function
This function takes as input the original query q, previously generated queries Q, and corresponding retrieval evidence E. It constructs a reasoning history {⟨q_1, e_1⟩, ⟨q_2, e_2⟩, ...} by extracting q_i ∈ Q and e_i ∈ E. The function then asks the LLM to reason over the reasoning history and provide an answer to the original query.
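To make the reasoning-history construction concrete, the sketch below pairs each generated query with its evidence and renders the pairs into a prompt. The helper names and the prompt layout are hypothetical; the actual prompts are the ones given in Appendix C, which this sketch does not attempt to reproduce.

```python
from typing import List, Sequence, Tuple


def build_history(queries: Sequence[str], evidences: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each generated query q_i with its evidence e_i."""
    if len(queries) != len(evidences):
        raise ValueError("Q and E must have the same length")
    return list(zip(queries, evidences))


def render_answer_prompt(query: str, history: List[Tuple[str, str]]) -> str:
    """Render the history followed by the original question (illustrative layout only)."""
    lines = []
    for i, (q_i, e_i) in enumerate(history, start=1):
        lines.append(f"Query {i}: {q_i}")
        lines.append(f"Evidence {i}: {e_i}")
    lines.append(f"Question: {query}")
    lines.append("Answer:")
    return "\n".join(lines)
```

The same history structure could be reused by the asking and scoring functions, with only the final instruction changed.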

Asking Function
Given the query q, previously generated queries Q, corresponding retrieval evidence E, and the maximum number of queries to be generated K, this function constructs a reasoning history {⟨q_1, e_1⟩, ⟨q_2, e_2⟩, ...} by extracting q_i ∈ Q and e_i ∈ E. The LLM is then asked to reason over the reasoning history and determine what additional information it requires to answer the question. The function outputs the generated queries.

Retrieval Function
Retrieve(q_ori, q, N): Given the original query q_ori, the query q, and the maximum number of documents to be retrieved N, this function uses a dense retriever to retrieve the top-N most similar documents. The LLM is then asked to extract the most useful information from the documents and summarize them, providing a concise version of the retrieved information. Alternatively, we can use the LLM to directly generate a background document, as in GENREAD [Yu et al., 2022]; we call this function Retrieve′(q_ori).
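The top-N selection step of a dense retriever can be sketched as follows. This assumes that the query and the documents have already been encoded into vectors and that similarity is the inner product, which is a common choice but is not confirmed by this section (the retriever details are in Appendix D). The function names are hypothetical, and the subsequent LLM summarization step is omitted.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_n(query_vec: Sequence[float],
                   doc_vecs: Sequence[Sequence[float]],
                   n: int) -> List[int]:
    """Return the indices of the n documents most similar to the query."""
    scored = [(dot(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    # Sort by descending similarity; break ties by document index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:n]]
```

A corpus of millions of passages would normally use an approximate nearest-neighbor index instead of this exhaustive scan; the scan is shown only because it makes the ranking rule explicit.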

Scoring Function
Score(q, Q, E, a): Given the original query q, previously generated queries Q, corresponding retrieval evidence E, and the generated answer a from the LLM, this function constructs a reasoning history {⟨q_1, e_1⟩, ⟨q_2, e_2⟩, ...} by extracting q_i ∈ Q and e_i ∈ E. The LLM is then asked to consider the reasoning history and assess the probability that the candidate answer is the true answer. The function outputs a score representing the confidence in the generated answer.

Experimental Setting
In this section, we conduct experiments on three open-domain question-answering (QA) datasets: NQ [Kwiatkowski et al., 2019], TriviaQA [Joshi et al., 2017], and WebQ [Berant et al., 2013]. Since we focus on zero-shot ODQA, we utilize only the complete test sets of NQ and WebQ. To reduce costs, we randomly select 1000 samples from the TriviaQA test set for evaluation. Detailed statistics of the three original datasets can be found in Appendix A. We evaluate the performance using two metrics: the exact match (EM) score and the F1 score. Specifically, a predicted answer is considered correct only if its normalized form matches any of the normalized versions of the answers provided in the answer list. The F1 score measures the word overlap between the normalized version of the predicted answer and the answers in the provided answer list.
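The two metrics can be sketched as below. The normalization steps (lowercasing, removing punctuation and English articles, collapsing whitespace) follow the convention commonly used in QA evaluation scripts and are an assumption here, as is taking the maximum F1 over the answer list; the section does not spell out either detail.

```python
import re
import string
from collections import Counter
from typing import List

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(pred) == normalize(a) for a in answers)


def f1_score(pred: str, answers: List[str]) -> float:
    """Best word-overlap F1 between the prediction and any gold answer."""
    def one(p: str, g: str) -> float:
        p_tokens, g_tokens = normalize(p).split(), normalize(g).split()
        overlap = sum((Counter(p_tokens) & Counter(g_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(p_tokens)
        recall = overlap / len(g_tokens)
        return 2 * precision * recall / (precision + recall)

    return max((one(pred, a) for a in answers), default=0.0)
```

For instance, the prediction "1904" against the gold answer "January 1, 1904" is not an exact match but shares one of three gold tokens, giving precision 1, recall 1/3, and F1 0.5.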

Implementation
We employ GPT-3.5-Turbo hosted by Azure OpenAI services as our large language model (LLM). As for the retriever component, we fine-tune it separately for the NQ, TriviaQA, and WebQ datasets using their respective training sets. The architecture and performance of the dense retrieval component can be found in Appendix D. For the retrieval corpus, we use the Wikipedia dump from Dec. 20, 2018, which contains 21,015,324 documents.

Baselines
We compare our method with three groups of zero-shot QA baselines.
The first group comprises baselines that utilize a retriever in their approach. This includes models such as BM25 + InstructGPT, Contriever + InstructGPT, Google + InstructGPT, and DPR + InstructGPT. These models employ a retriever to retrieve relevant information, which is then used by InstructGPT for answer generation. We obtained the reported performance numbers for these baselines from GENREAD [Yu et al., 2022].
The second group consists of baselines that do not utilize a retriever in their approach. This group includes models such as GPT-3 [Brown et al., 2020], InstructGPT [Yu et al., 2022], FLAN [Wei et al., 2021], GLaM [Du et al., 2022], and GENREAD [Yu et al., 2022]. The reported performance numbers for these baselines are obtained from their respective original papers.
The third group consists of models that we implemented ourselves, including directly answer, retrieve-then-answer, GENREAD [Yu et al., 2022], Self-Ask [Press et al., 2022], and MCR [Yoran et al., 2023]. Directly answer refers to using the LLM to answer the question directly. Retrieve-then-answer, a simplified version of ALLIES without beam search, performs retrieval before answering; we experimented with different numbers of retrieved documents and report the corresponding performance. We implemented GENREAD, Self-Ask, and MCR based on their open-source code. However, we evaluate MCR only on the NQ dataset due to its high API cost. To ensure fairness among the baselines, we use the same retriever and LLM configurations for all of them.

Main Results
We present the main results of our zero-shot experiments in Table 1.
Table 1 (excerpt):
GPT-3 [Brown et al., 2020]       14.6   -   -      -   14.4   -
InstructGPT [Yu et al., 2022]    20.9   -   57.5   -   18.6   -
FLAN [Wei et al., 2021]          18.6   -   55.0   -   -      -
GLaM [Du et al., 2022]           24 …
Based on these results, several observations can be made: (1) Among the methods that utilize a retriever, the choice of the retriever has a significant impact on the model's performance. This indicates that the quality of the retrieved documents plays a crucial role in determining the overall system performance.
(2) Among the methods that do not use a retriever, GENREAD achieves the highest performance. This demonstrates the effectiveness of the generate-then-read pipeline, where the model generates background documents based on its own knowledge without relying on an external corpus.
(3) Our implemented baselines, such as MCR and Self-Ask, may not perform as well as expected. This is mainly because these methods heavily rely on result parsing, which limits their generalizability to other applications.
(4) Our proposed method, ALLIES, outperforms all existing baselines and achieves the highest performance on all datasets. This confirms the effectiveness of our model and demonstrates its superiority in open-domain question answering tasks. Additionally, our method relies less on result parsing, making it more generalizable to other applications.

Ablation Study
In ALLIES, we utilize LLMs to ask questions and retrieve evidence based on those questions. To investigate the effects of the evidence, we conduct ablations by removing the evidence and using different types of evidence, as shown in Table 2.
Based on the results, we draw several conclusions: (1) When the evidence is removed, we only provide the LLM with related queries without any background information. In this case, the model's performance drops significantly, which confirms that incorporating evidence into the model can greatly improve its understanding of the query. (2) When using the LLM-generated background document (GENREAD), we observe that our model achieves slightly better results compared to retrieval & summary. This finding aligns with the observations made in GENREAD [Yu et al., 2022]. The improved performance can be attributed to the fact that LLMs have seen these related documents during pretraining, and the generated documents are more specific and refined.

Query Complementation Analysis
By iteratively generating new queries to complement the original query, ALLIES is capable of expanding the information coverage of the original query and capturing hidden knowledge that may not be directly obtainable through retrieval with the original query. To verify this, we conduct a query complementation analysis that compares the retrieval results of retrieve-then-answer and ALLIES. Specifically, we record the percentage of retrieval results containing the ground truth answer and present the findings in Table 3.
Table 4 (excerpt):
Method                          Retrieval Times   API Times   Tokens Per API   Tokens Per Query
Directly Answer                 0                 1           54               1 × 54 = 54
GENREAD [Yu et al., 2022]       0                 1           342              1 × 342 = 342
Self-Ask [Press et al., 2022]   …
From the results, we find that the retrieval results of ALLIES outperform those of retrieve-then-answer across all datasets, which verifies the effectiveness of ALLIES. By iteratively generating new queries, we can expand the knowledge scope of the retrieval results, leading to a more comprehensive understanding of the original query and naturally producing better answers.
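The statistic used in this analysis, the percentage of retrieval results that contain a ground truth answer, can be computed as sketched below. Case-insensitive substring matching is an assumption made for illustration; the exact matching rule is not stated here, and the function name is hypothetical.

```python
from typing import List, Sequence


def answer_hit_rate(retrieved_texts: Sequence[str],
                    answer_lists: Sequence[List[str]]) -> float:
    """Percentage of questions whose retrieved text contains any gold answer."""
    if not retrieved_texts:
        return 0.0
    hits = 0
    for text, answers in zip(retrieved_texts, answer_lists):
        lowered = text.lower()
        # Count the question as a hit if at least one gold answer appears verbatim.
        if any(a.lower() in lowered for a in answers):
            hits += 1
    return 100.0 * hits / len(retrieved_texts)
```

For ALLIES, the text for one question would be the concatenation of the evidence gathered along the search path; for retrieve-then-answer, it would be the evidence retrieved with the original query alone.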

Effectiveness Analysis
In ALLIES, the use of multiple iterations of retrieval and generation may introduce additional costs. To analyze this, we use the complete set of questions from the NQ dataset and systematically compare the costs and performance of several methods.
As shown in Table 4, we can draw the following conclusions: (1) Multi-turn QA methods, including ALLIES and MCR, incur higher model inference costs than single-turn QA methods such as Directly Answer, GENREAD, Self-Ask, and Retrieve-Then-Answer. This increase in cost is primarily due to the multiple API calls involved.
(2) Among the multi-turn QA methods, although ALLIES requires more API calls, its token consumption per API call is significantly lower than that of MCR, resulting in about 1/6 of the inference cost of MCR. The higher token consumption per API call in MCR can be attributed to the demonstrations, which consume a substantial number of tokens. (3) Generally, single-turn QA methods have lower token costs but exhibit lower model performance. In contrast, ALLIES achieves significantly better model performance while maintaining an acceptable token cost compared to MCR, thus demonstrating the effectiveness of our method.
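The per-query token cost in Table 4 is the number of API calls multiplied by the average number of tokens per call. The short sketch below reproduces that arithmetic for the two rows whose values appear in this text (Directly Answer and GENREAD); the function names are hypothetical, and no values are assumed for the other methods.

```python
def tokens_per_query(api_calls: int, tokens_per_api: int) -> int:
    """Total tokens consumed for one query: API calls times tokens per call."""
    return api_calls * tokens_per_api


def relative_cost(cost_a: int, cost_b: int) -> float:
    """Cost of method A expressed as a fraction of the cost of method B."""
    return cost_a / cost_b


directly_answer = tokens_per_query(1, 54)    # 1 x 54
genread = tokens_per_query(1, 342)           # 1 x 342
```

The same ratio helper expresses statements such as "1/6 of the inference cost of MCR" once the per-query totals of both methods are known.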

Human Evaluation
In this section, we conducted a human evaluation to assess the accuracy of the scores generated by LLMs in our scoring function.We randomly selected 100 samples for score calculation and manually verified the generated scores.
Our findings indicate that 93 percent of the generated scores align with the requirements for score calculation.This validation confirms the rationale behind using LLMs to calculate the scores.However, we also observed some rare cases where two answers could both potentially address the question, but one of them was more accurate.In these cases, the LLMs assigned the same score to both answers, potentially leading to the selection of the less accurate answer.This issue can be attributed to the coarse nature of the prompt used for scoring, which can only assess the general relevance score.To address this issue, one possible solution for future work is to calculate the scores using an ensemble-and-vote approach.This would involve asking LLMs to rank all possible answers instead of scoring them individually, which would potentially achieve more accurate and reliable scores.

Hyper-parameter Study
Beam size B and beam depth D are two important hyper-parameters in our method. We study their effects by changing one parameter while fixing the other parameters and observing the performance trends, which are shown in Figure 3.
Table 5 (excerpt):
Question: Who led the soldiers in ending the raid on the harper's ferry arsenal?
Answer: [Brevet Colonel Robert E. Lee, First Lieutenant Israel Greene]
Generated Query:
- What was the name of the leader who led the soldiers in ending the raid on the Harper's Ferry arsenal?
- Who was the overall commander of the soldiers who led the operation to retake the arsenal at Harpers Ferry?
Retrieved Evidence:
- The soldiers who led the operation to retake the arsenal at Harpers Ferry were under the overall command of Colonel Robert E. Lee.
- Colonel Robert E. Lee was in overall command of the operation to retake the arsenal. It is possible that he may have played a role in leading the soldiers to end the raid.
Study on Beam Size B. Beam size refers to the number of questions we keep at each layer during answer searching. From the results, we observe that the performance reaches its peak when the beam size (B) is set to 2. Values smaller or larger than this lead to performance degradation. This is primarily because a larger beam size gives the model more room to make mistakes and recover from them. However, when the beam size is too large, the model struggles to effectively rank the multiple candidates and select the best answer. Additionally, an increase in beam size also incurs additional computational costs.
Study on Beam Depth D. Beam depth refers to the maximum depth our model can reach during answer searching. From the results, we find that the performance change during beam depth tuning is relatively small. This is mainly due to the early stop mechanism we implemented, where the answer searching can terminate before reaching the maximum search depth if the answer score surpasses the threshold. However, we also observe that when the beam depth is too large (e.g., 4), the model's performance starts to decline. We believe this is mainly because, in most cases, a beam depth of 2 provides the model with sufficient background information. Increasing the beam depth beyond that only introduces more noisy information, which may make it harder for the LLM to generate the correct answer.

Case Study
In this section, we provide examples that illustrate the reasoning process of ALLIES, shown in Table 5. From these examples, we draw the following conclusions: (1) The generated queries in our method are more specific and focused than the original query. This specificity improves the accuracy of the retrieval process, resulting in more accurate and relevant retrieved evidence. Consequently, the generated answers are of higher quality.
(2) During the answer generation process, there might be instances where wrong answers are initially predicted.However, our scoring function effectively assigns lower scores to these wrong answers based on the reasoning history.As a result, the final output is the correct answer.This demonstrates the robustness of our method in handling potential mistakes and effectively filtering out incorrect answers.

Conclusion
In this paper, we introduce ALLIES, a novel method that addresses the limitations of using large language models (LLMs) for complex tasks. By leveraging LLMs to generate related queries iteratively, ALLIES enables iterative reasoning and expands the original query's scope to capture hidden knowledge. We evaluate ALLIES in zero-shot open-domain question answering and demonstrate its superiority over other baselines on benchmarks. As for future work, we plan to apply ALLIES to other complex tasks such as mathematical reasoning.

Limitations
In this work, we propose an effective response generation method, ALLIES. The limitations of the proposed method are as follows: (1) The computational cost of ALLIES is relatively high due to the need for multiple API calls and document retrieval. This can limit its practicality in resource-intensive scenarios or systems with limited computational resources.
(2) The operation of the model is based on the designed prompt.When applied to a new application scenario, crafting effective prompts may require additional time and effort from users.

Figure 2: The abstract process of ALLIES.
… S_d by keeping only the B tuples with the largest scores.    ▷ Prune the beam.
if a tuple ⟨q, Q, E, a, s⟩ ∈ S_d meets s ≥ S then    ▷ Examine the exit.

Table 2: Ablation study results on NQ and WebQ.

Table 4: The effectiveness analysis of ALLIES.

Table 5: Case studies of the process of ALLIES.
- In which country was the first driver's license required?
- When did the UK implement mandatory licensing for drivers and what was the minimum qualifying age?
Retrieved Evidence:
- The first driver's license requirement was mandated on January 1, 1904, in the United Kingdom after the Motor Car Act 1903 received royal assent. The minimum qualifying age was set at 17, and every car owner...
- The first formal driving test in the UK was introduced with the Road Traffic Act 1934, which made compulsory testing for all new drivers. Prior to this, UK driving licenses were introduced by the Motor Car Act 1903...