Adaptive Information Seeking for Open-Domain Question Answering

Information seeking is an essential step for open-domain question answering to efficiently gather evidence from a large corpus. Recently, iterative approaches have proven effective for complex questions, by recursively retrieving new evidence at each step. However, almost all existing iterative approaches use predefined strategies, either applying the same retrieval function multiple times or fixing the order of different retrieval functions, which cannot fulfill the diverse requirements of various questions. In this paper, we propose a novel adaptive information-seeking strategy for open-domain question answering, namely AISO. Specifically, the whole retrieval and answer process is modeled as a partially observed Markov decision process, where three types of retrieval operations (i.e., BM25, DPR, and hyperlink) and one answer operation are defined as actions. According to the learned policy, AISO can adaptively select a proper retrieval action to seek the missing evidence at each step, based on the collected evidence and the reformulated query, or directly output the answer when the evidence set is sufficient for the question. Experiments on SQuAD Open and HotpotQA fullwiki, which serve as single-hop and multi-hop open-domain QA benchmarks, show that AISO outperforms all baseline methods with predefined strategies in terms of both retrieval and answer evaluations.


Introduction
Open-domain question answering (QA) (Voorhees et al., 1999) is a task of answering questions using a large collection of texts (e.g., Wikipedia). It relies on a powerful information-seeking method to efficiently retrieve evidence from the given large corpus.
Traditional open-domain QA approaches mainly follow the two-stage retriever-reader pipeline (Chen et al., 2017; Yang et al., 2018; Karpukhin et al., 2020), in which the retriever uses a determinate sparse or dense retrieval function to retrieve evidence, independently from the reading stage. But these approaches have limitations in answering complex questions, which need multi-hop or logical reasoning (Xiong et al., 2021). To tackle this issue, iterative approaches have been proposed to recurrently retrieve passages and reformulate the query based on the original question and the previously collected passages. Nevertheless, all of these approaches adopt fixed information-seeking strategies in the iterative process. For example, some works employ a single retrieval function multiple times (Qi et al., 2019; Xiong et al., 2021), and the other works use a pre-defined sequence of retrieval functions (Asai et al., 2020; Dhingra et al., 2020).

[Figure 1: An example question, of which P2 and P3 are supporting passages essential to answering it. Except for the adaptive strategy in the last row, fixed-strategy methods, such as using BM25 or dense retrieval multiple times, or first using BM25 and then entity linking, have failed, because the remaining supporting passages rank beyond 1k. The number between two arrows indicates the highest rank of the remaining supporting passages in the retrieval list, unless ranked first.]
However, fixed information-seeking strategies cannot meet the diverse requirements of various questions. Taking Figure 1 as an example, the answer to the question is 'Catwoman' in P3. Due to the lack of essential supporting passages, simply applying BM25/dense retrieval (DR) multiple times (strategy 1 (Qi et al., 2019) or 2 (Xiong et al., 2021)), or using the mixed but fixed strategy (strategy 3 (Asai et al., 2020)), cannot answer the question. Specifically, it is hard for Qi et al. (2019) to generate the ideal query 'Catwoman game' by considering P1 or P2, so BM25 (Robertson and Zaragoza, 2009) suffers from the mismatch problem and fails to find the next supporting passage P3. The representation learning of salient but rare phrases (e.g., 'Pitof') remains a challenging problem (Karpukhin et al., 2020), which may affect the effectiveness of dense retrieval: at the first step, the supporting passage P3 is ranked 65, while P1 and P2 do not appear in the top-1000 list. Furthermore, link retrieval functions fail when the current passage, e.g., P2, has no valid entity links.
Motivated by the above observations, we propose an Adaptive Information-Seeking approach for Open-domain QA, namely AISO. Firstly, the task of open-domain QA is formulated as a partially observed Markov decision process (POMDP) to reflect the interactive characteristics between the QA model (i.e., agent) and the intractable large-scale corpus (i.e., environment). The agent is asked to perform an action according to its state (belief module) and the policy it has learned (policy module). Specifically, the belief module of the agent maintains a set of evidence to form its state. Moreover, there are two groups of actions for the policy module to choose from: 1) a retrieval action that consists of the type of retrieval function and the reformulated query for requesting evidence, and 2) an answer action that returns a piece of text to answer the question and then completes the process. Thus, at each step, the agent emits an action to the environment, which returns a passage as the observation back to the agent. The agent updates the evidence set and generates the next action, step by step, until the evidence set is sufficient to trigger the answer action to answer the question. To learn such a strategy, we train the policy via imitation learning by cloning the behavior of an oracle online, which avoids the hassle of designing reward functions and solves the POMDP in the fashion of supervised learning.
Our experimental results show that our approach achieves better retrieval and answering performance than the state-of-the-art approaches on SQuAD Open and HotpotQA fullwiki, which are the representative single-hop and multi-hop datasets for open-domain QA. Furthermore, AISO significantly reduces the number of reading steps in the inference stage.
In summary, our contributions include:
• To the best of our knowledge, we are the first to introduce an adaptive information-seeking strategy to the open-domain QA task;
• Modeling adaptive information seeking as a POMDP, we propose AISO, which learns the policy via imitation learning and has great potential for extension;
• The proposed AISO achieves state-of-the-art performance on two public datasets and wins first place on the HotpotQA fullwiki leaderboard. Our code is available at https://github.com/zycdev/AISO.

Related Work
Traditional approaches of open-domain QA mainly follow the two-stage retriever-reader pipeline (Chen et al., 2017): a retriever first gathers relevant passages as evidence candidates, then a reader reads the retrieved candidates to form an answer.
In the retrieval stage, most approaches employ a determinate retrieval function and treat each passage independently (Wang et al., 2018; Lin et al., 2018; Lee et al., 2018; Yang et al., 2018; Pang et al., 2019; Lee et al., 2019; Guu et al., 2020; Karpukhin et al., 2020; Izacard and Grave, 2021). As an extension, some approaches further consider the relations between passages through hyperlinks or entity links and extend the evidence with linked neighbor passages (Nie et al., 2019; Das et al., 2019b; Zhao et al., 2020). However, pipeline approaches retrieve evidence independently from the reader, which 1) introduces evidence less relevant to the question, and 2) makes it hard to model complex questions with high-order relationships between question and evidence. Instead, recent iterative approaches sequentially retrieve new passages by updating the query fed to a specific retrieval function at each step, conditioned on the information already gathered. Qi et al. (2020) update the natural language query. After a first retrieval step using TF-IDF, Asai et al. (2020), among others, recursively select subsequent supporting passages on top of a hyperlinked passage graph. Nevertheless, all of these approaches adopt fixed information-seeking strategies, employing the same retrieval function multiple times (Feldman and El-Yaniv, 2019; Xiong et al., 2021; Ding et al., 2019; Qi et al., 2019; Zhang et al., 2020; Qi et al., 2020) or a pre-designated sequence of retrieval functions (Asai et al., 2020). Due to the diversity of questions, these fixed strategies established in advance may not be optimal for all questions, or may even fail to collect the evidence.

Method
In this section, we first formulate the open-domain QA task as a partially observed Markov decision process (POMDP) and introduce the dynamics of the environment. Then, we elaborate on how the agent interacts with the environment to seek evidence and answer a question. Finally, to solve the POMDP, we describe how to train the agent via imitation learning.

Open-Domain QA as a POMDP
Given a question q and a large corpus P composed of passages, the task of open-domain QA is to collect a set of evidence E ⊂ P and answer the question based on the gathered evidence.
The fashion of iterative evidence gathering, proven effective by previous works (Asai et al., 2020; Xiong et al., 2021), is essentially a sequential decision-making process. Besides, since the corpus is large, ranging from millions (e.g., Wikipedia) to billions (e.g., the Web) of passages, and the input length of a QA model is limited, the QA model can only observe a part of the corpus. Owing to these two reasons, we model open-domain QA as a partially observed Markov decision process.
In the POMDP we designed, as shown in Figure 2, the agent is the QA model that needs to issue actions to seek evidence from the large-scale corpus hidden in the environment and finally respond to the question. By executing the received action, the environment can return a retrieved passage to the agent as an observation of the corpus. Formally, the POMDP is defined by (S, A, O, Ω, Z, R), where R is the reward function.
Actions: At timestep t = 0, 1, · · · , T , the action a t in the action space A = F × U is a request for an executable function f ∈ F, expressed as ⟨f, u⟩, where u ∈ U is the text argument passed to f . The space of executable functions F includes two groups of functions: 1) retrieval functions that take the query u and corpus P as input and rank a retrieval list of passages P f (u) ; 2) an answer function that replies to the question q with the answer u and ends the process. The action a t is performed following the policy Π described in Subsection 3.2.2.
States: The environment state s t in the state space S contains revealing states of retrieval lists of all history retrieval actions. When the agent issues an action a t = ⟨f, u⟩, s t will transfer to s t+1 governed by a deterministic transition dynamics Ω(s t , a t ). Specifically, Ω will mark the topmost unrevealed passage in the retrieval list P f (u) as revealed. If the environment has never executed a t before, it will first search and cache P f (u) for possible repeated retrieval actions in the future.
Observations: On reaching the new environment state s t+1 , the environment will return an observation o t+1 from the observation space O = {q} ∪ P, governed by the deterministic observation dynamics Z. At the initial timestep, the question q will be returned as o 0 . In other cases, Z is designed to return only the last passage marked as revealed in P f (u) at a time. For example, if the action ⟨f, u⟩ is received for the kth time, the kth passage in P f (u) will be returned.
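The environment dynamics above (cache each retrieval list on first request, then reveal the topmost unrevealed passage, so the k-th identical request returns the k-th passage) can be sketched as follows. This is an illustrative toy, not the paper's implementation; the class name and the retrieval-function interface are assumptions.

```python
from collections import defaultdict

class Environment:
    """Toy POMDP environment: executes actions <f, u> and reveals one
    passage per identical request (a sketch; the real system wraps
    BM25 / dense / hyperlink indices over a large corpus)."""

    def __init__(self, question, retrieval_fns):
        self.question = question
        self.retrieval_fns = retrieval_fns   # {name: fn(query) -> ranked passages}
        self.cache = {}                      # (f, u) -> cached retrieval list P_f(u)
        self.revealed = defaultdict(int)     # (f, u) -> number of passages revealed

    def reset(self):
        # At t = 0 the question itself is returned as the observation o_0.
        return self.question

    def step(self, f, u):
        key = (f, u)
        if key not in self.cache:            # search once, cache for repeated actions
            self.cache[key] = self.retrieval_fns[f](u)
        k = self.revealed[key]               # k-th request -> k-th ranked passage
        self.revealed[key] += 1
        ranked = self.cache[key]
        return ranked[k] if k < len(ranked) else None
```

Repeating the same retrieval action thus walks down the cached ranked list one passage at a time, which matches the observation dynamics Z described above.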

Agent
The agent interacts with the environment to collect evidence for answering the question. Without access to the environment state s t , the agent can only perform sub-optimal actions based on current observations. It needs to build its belief b t in the state that the environment may be in, based on its experience h t = (o 0 , a 0 , o 1 , · · · , a t−1 , o t ). Therefore, the agent consists of two modules: belief module Φ that generates the belief state b t = Φ(h t ) from the experience h t , and policy module Π that prescribes the action a t = Π(b t ) to take for current belief state b t .
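The interaction loop above can be sketched as follows; `belief_fn`, `policy_fn`, and the environment interface are hypothetical stand-ins for Φ, Π, and the environment, and the forced-answer behavior at the step budget follows the experimental setup described later.

```python
def run_agent(question, env, belief_fn, policy_fn, max_steps=1000):
    """Belief/policy interaction loop (sketch; interfaces are assumed).

    belief_fn(question, evidence, observation) folds the latest
    observation into the evidence set and returns the belief state;
    policy_fn(belief) maps a belief state to an action <f, u>.
    """
    observation, evidence = question, []
    belief = None
    for _ in range(max_steps):
        belief = belief_fn(question, evidence, observation)   # b_t = Phi(h_t)
        evidence = belief["evidence"]
        f, u = policy_fn(belief)                              # a_t = Pi(b_t)
        if f == "answer":                                     # terminal action
            return u
        observation = env.step(f, u)
    # Step budget exhausted: the agent is forced to answer.
    return policy_fn(belief, force_answer=True)[1]
```

With dummy belief and policy functions that trigger the answer action once two passages are collected, the loop terminates with the extracted answer string.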
Both the belief and policy modules are constructed based on pretrained Transformer encoders (Clark et al., 2020), respectively denoted as Ψ belief and Ψ policy , which encode each inputted token into a d-dimensional contextual representation. The input of both encoders is a belief state, formatted as a single sequence that concatenates the question with the observation passage and the collected evidence passages, where the subscript o denotes the observation passage and the other passages come from the collected evidence set E; [SOP] is a special token to separate the title and content of a passage, [YES] and [NO] are used to indicate yes/no answers, and [NONE] is generally used to indicate that there is no desired answer/query/evidence. In this way, the self-attention mechanism across the concatenated sequence allows each passage in the input to interact with the others, which has been shown to be crucial for multi-hop reasoning (Wang et al., 2019a).

Belief Module
The belief module Φ transforms the agent's experience h t into a belief state b t by maintaining a set of evidence E t−1 . At the end of the process, the evidence set E is expected to contain sufficient evidence necessary to answer the question and no irrelevant passage. In the iterative process, the agent believes that all the passages in E may help answer the question. In other words, those passages that were observed but excluded from the evidence set, i.e., o 1:t−1 \ E t−1 , are believed to be irrelevant to the question.
For simplicity, assuming that the negative passages o 1:t−1 \ E t−1 and the action history a <t are not helpful for subsequent decision-making, the experience h t is compressed into the belief state b t = ⟨q, C t ⟩, where C t = E t−1 ∪ {o t } is the set of evidence candidates. At the beginning, the belief state b 0 is initialized to ⟨q, ∅⟩, and the evidence set E 0 is initialized to ∅.
To maintain the essential evidence set E t , we use a trainable scoring function ϕ(p|b t ) to identify each evidence candidate p ∈ C t . Specifically, each passage is represented as the contextual representation of the special token [SOP] in it, which is encoded by Ψ belief . Then, the representation of each candidate is projected into a score through a linear layer. Besides, we use a pseudo passage p 0 , represented as [NONE], to indicate the dynamic threshold of the evidence set. In this way, after step t, the evidence set is updated as E t = {p ∈ C t | ϕ(p|b t ) > ϕ(p 0 |b t )}. It is worth noting that these evidence candidates are scored jointly, since they are encoded together in the same input, different from conventional rerankers that score passages separately.
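The dynamic-threshold update above reduces to a simple filter; in this sketch the score list stands in for ϕ(p|b t ) and the threshold for ϕ(p 0 |b t ) (names are illustrative).

```python
def update_evidence(candidates, scores, threshold_score):
    """Dynamic-threshold evidence selection: keep every candidate whose
    score exceeds the score of the pseudo passage p0."""
    return [p for p, s in zip(candidates, scores) if s > threshold_score]
```

Because the threshold is itself a learned score, the size of the evidence set adapts per question rather than being a fixed top-k cutoff.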

Policy Module
The policy module Π decides the next action a t to be taken based on the current belief state b t . In this paper, we equip the agent with three retrieval functions and one answer function, which means that the action space A consists of three types of retrieval actions and one type of answer action. However, unlike the finite space of executable functions F, the space of function arguments U includes all possible natural-language queries and answers. To narrow the search space, for each executable function, we employ a suggester to propose a plausible query or answer as the argument passed to the function. Finally, we apply an action scoring function over the narrowed action space and select the action with the highest score.

Equipped Functions Formally, the space of executable functions is defined as F = {f s , f d , f l , f o }. Among them, f o is the answer function used to reply to the question, while the rest are three distinct off-the-shelf retrieval functions (RFs) used to explore the corpus. f s is a sparse RF, implemented as BM25 (Robertson and Zaragoza, 2009). It performs well when the query is concise and contains highly selective keywords, but often fails to capture the semantics of the query. f d is a dense RF, implemented as MDR (Xiong et al., 2021) for multi-hop questions and DPR (Karpukhin et al., 2020) for single-hop questions. Dense RFs can capture lexical variations and semantic relationships, but they struggle when encountering out-of-vocabulary words. f l is a link RF, implemented via hyperlinks. When hyperlink markups are available in a source passage, it can readily map a query (i.e., an anchor text) to the target passage.

Argument Generation
The space of function arguments U, composed of textual queries and answers, is too large to perform an exhaustive search due to the complexity of natural language. To reduce the search complexity, inspired by Yao et al. (2020), we employ four argument generators to generate the most plausible query/answer for the equipped functions.
g o is a trainable reading comprehension model for f o . It is a span extractor built upon the contextual representations outputted by the encoder Ψ policy . Like conventional extractive reading comprehension models (Yang et al., 2018; Clark et al., 2020), g o uses the contextual representations to calculate the start and end positions of the most plausible answer u o . If the current context C t is insufficient to answer the question, the special token [NONE] will be extracted.
g s is a query reformulation model for f s . In this work, we directly employ the well-trained query reformulator from Qi et al. (2019) for multi-hop questions, which takes the belief state b t as input and outputs a span of the input sequence as the sparse query u s . As for single-hop questions, since there exists no off-the-shelf multi-step query reformulator, we leave g s as an identity function that returns the original question directly. In this case, requesting the same RF multiple times is equivalent to traversing the retrieval list of the original question.
g d is a query reformulator for f d . For multi-hop questions, g d concatenates the question q and the passage with the highest score in evidence set E t as the dense query u d , the same as the input of MDR (Xiong et al., 2021). If E t is empty, u d is equal to the question q. Similar to g s , g d for single-hop questions also leaves original questions unchanged.
g l is a trainable multi-class classifier for f l . It selects the most promising anchor text from the belief state b t . To enable rejecting all anchors, [NONE] is also treated as a candidate anchor. g l shares the encoder Ψ policy , where each anchor is represented by the average of contextual representations of its tokens. Upon Ψ policy , we use a linear layer to project the hidden representations of candidate anchors to real values and select the anchor with the highest value as the link query u l .
In this way, the action space is narrowed down to Ǎ = {⟨f s , u s ⟩, ⟨f d , u d ⟩, ⟨f l , u l ⟩, ⟨f o , u o ⟩}.
Action Selection The action scoring function π is also built upon the output of Ψ policy . To score an action ⟨f, u⟩ for the current belief state b t , an additional two-layer (3d × 4d × 1) MLP, with a ReLU activation in between, projects the concatenated representations of b t , the executable function f , and the function argument u, i.e., v [CLS] , w f , and v u , into a real value. w f ∈ R d is a trainable embedding for each executable function, with the same dimension as the token embeddings. v u is specific to each function. Since u s , u l , and u o have explicit text spans in b t , their v u are the averages of their token representations. As for u d , if g d does not expand the original question, v u d is the contextual representation of [NONE]; otherwise, v u d is the [SOP] representation of the passage concatenated to the question.
In short, the next action is selected from the narrowed action space Ǎ by the scoring function π: a t = Π(b t ) = arg max a∈Ǎ π(a|b t ). (3)
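Argument generation plus action selection over the narrowed space Ǎ can be sketched as follows; `suggesters` stands in for the generators g s , g d , g l , g o and `score_fn` for π (all names and interfaces here are illustrative assumptions).

```python
def select_action(belief, suggesters, score_fn):
    """Build the narrowed action space by asking one suggester per
    executable function for its most plausible argument, then pick
    the action with the highest score under the policy."""
    narrowed = [(f, g(belief)) for f, g in suggesters.items()]   # A-check
    return max(narrowed, key=lambda action: score_fn(action, belief))
```

The key design choice mirrored here is that the exhaustive search over all natural-language arguments is replaced by one proposal per function, so π only ever scores four candidate actions.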

Training
In the agent, in addition to the encoders Ψ belief and Ψ policy , we need to train the evidence scoring function ϕ, the link classifier g l , the answer extractor g o , and the action scoring function π, whose losses are L ϕ , L l , L o , and L π . Since the policy module depends on the belief module, we train the agent jointly using the loss L = L ϕ + L l + L o + L π . Unlike ϕ, g l , and g o , which can be trained with supervised learning from the human annotations in QA datasets, the supervision signal for π is hard to derive directly from QA datasets. Although policies are usually trained via reinforcement learning, reinforcement learning algorithms (Sutton et al., 2000; Mnih et al., 2015) are often sensitive to the quality of the reward function. For a complex task, the reward function R is often hard to specify and exhausting to tune. Inspired by Choudhury et al. (2017), we explore the use of imitation learning (IL) by querying a model-based oracle online and imitating the action a ⋆ chosen by the oracle, which avoids the hassle of designing R and solves the POMDP in the fashion of supervised learning. Thus, the loss of π is defined as the cross entropy L π = − log ( exp(π(a ⋆ |b)) / Σ a∈Ǎ exp(π(a|b)) ), where b is the belief state of the agent. The link classifier g l and the answer extractor g o are also optimized with multi-class cross-entropy losses. For g l , whose loss is denoted L l , the classification label is set to the anchor text that links to a gold supporting passage; if there is no such anchor, the pseudo hyperlink [NONE] is labeled. g o is trained as a classifier of start and end positions following previous work (Clark et al., 2020), with its loss denoted L o .
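The imitation loss L π is a standard cross entropy over the four candidate actions in Ǎ; a minimal numpy sketch:

```python
import numpy as np

def imitation_loss(action_scores, expert_index):
    """L_pi = -log softmax(scores)[a*]: cross entropy between the
    policy's scores over the narrowed action space and the oracle's
    chosen action a*."""
    scores = np.array(action_scores, dtype=float)
    scores -= scores.max()                          # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[expert_index]
```

With uniform scores over four actions the loss is log 4, and it shrinks as the policy puts more mass on the oracle's action.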
Considering the belief state b = ⟨q, {p 1 , p 2 , · · · , p |C| }⟩, the ListMLE (Xia et al., 2008) ranking loss of the evidence scoring function ϕ is defined as the negative log-likelihood of the ground-truth permutation, L ϕ = − Σ_{i=0}^{|C|} log ( exp(ϕ(p τ y (i) |b)) / Σ_{j=i}^{|C|} exp(ϕ(p τ y (j) |b)) ), where y is the relevance label of {p 0 , p 1 , · · · , p |C| } and τ y is their ground-truth permutation. To learn the dynamic threshold ϕ(p 0 |b), we set the relevance label of the pseudo passage p 0 to y 0 = 0.5, and passages in C are labeled 1/0 according to whether they are gold supporting passages.
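A minimal numpy sketch of the ListMLE loss under this labeling (gold passages 1, the pseudo passage p 0 at 0.5, negatives 0); breaking ties in the permutation by input order is an assumption made here for concreteness:

```python
import numpy as np

def listmle_loss(scores, relevance):
    """ListMLE (Xia et al., 2008): negative log-likelihood of the
    ground-truth permutation induced by the relevance labels."""
    # Sort candidates by descending relevance to obtain tau_y.
    order = sorted(range(len(scores)), key=lambda i: -relevance[i])
    s = np.asarray([scores[i] for i in order], dtype=float)
    loss = 0.0
    for i in range(len(s)):
        tail = s[i:] - s[i:].max()              # stable log-sum-exp over the suffix
        loss -= tail[0] - np.log(np.exp(tail).sum())
    return loss
```

Scores that already rank gold above the threshold and the threshold above negatives yield a near-zero loss; reversing that order makes the loss large.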
Model-based Oracle The model-based oracle has full access to the environment and can foresee the gold evidence and answer of every question, which means that the oracle can infer the rank of a supporting passage in the retrieval list of any retrieval action. Thus, given a state, the oracle can easily select a near-optimal action from the candidates according to a greedy policy π ⋆ . Specifically, if all gold evidence has been collected and the argument of an answer action is a correct answer, the oracle will select the answer action. Otherwise, the oracle will use a greedy algorithm to select the retrieval action that gathers a missing supporting passage in the fewest steps.
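The oracle's greedy rule can be sketched as follows; `steps_to_missing`, mapping each candidate retrieval action to the number of steps it needs to reach a missing gold passage, is an assumed precomputed quantity that the oracle can derive from its full access to the environment.

```python
def oracle_action(evidence, gold_evidence, candidate_answer, gold_answer,
                  steps_to_missing):
    """Greedy oracle policy pi*: answer when all gold evidence is
    gathered and the extracted answer is correct; otherwise take the
    retrieval action that reaches a missing gold passage fastest."""
    if set(gold_evidence) <= set(evidence) and candidate_answer == gold_answer:
        return "answer"
    return min(steps_to_missing, key=steps_to_missing.get)
```

The action returned here plays the role of a⋆ in the imitation loss: the learned policy is trained to reproduce it from the partial belief state alone.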
Belief States Sampling We train the agent on sampled belief states instead of long trajectories.
In every epoch, one belief state is sampled for each question. To sample a belief state ⟨q, C⟩, we first uniformly sample a subset of q's gold evidence as C, which may be empty. However, at test time, it is impossible for the candidate evidence set C to contain only gold evidence. To alleviate the mismatch of the state distribution between training and testing, we inject a few negative passages into C and shuffle them. We treat the first passage in the candidate set as the observation, and the others as evidence collected before. The distribution of injected negative passages can affect test performance; in this work, for simplicity, we sample 0–2 passages from the top-ranked negative passages in the retrieval lists of f s , f d , and f l .
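The sampling procedure can be sketched as follows; the 0–2 negative injection and shuffling follow the text, while the flat-list data representation and function signature are assumptions for illustration.

```python
import random

def sample_belief_state(question, gold_evidence, negative_pool, rng=random):
    """Sample a training belief state <q, C>: a uniform subset of gold
    evidence plus 0-2 injected top-ranked negatives, shuffled; the
    first candidate plays the role of the observation."""
    k = rng.randrange(len(gold_evidence) + 1)        # subset size, may be 0
    candidates = rng.sample(gold_evidence, k)
    candidates += rng.sample(negative_pool, rng.randrange(3))  # 0-2 negatives
    rng.shuffle(candidates)
    observation = candidates[0] if candidates else None
    return question, candidates, observation
```

Sampling states directly, instead of rolling out full trajectories, keeps training cheap while still exposing the agent to noisy evidence sets like those it will see at test time.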

Experiments
We evaluate AISO and baselines on two Wikipedia-sourced benchmarks. We first introduce the experimental setups, then describe the experimental results on evidence gathering and question answering, and finally discuss detailed analyses.

Experimental Setup
Data HotpotQA (Yang et al., 2018) is a multi-hop QA benchmark. We focus on its fullwiki (open-domain) setting 1 . It requires gathering two supporting passages (paragraphs) to answer a question, given the introductory (first) paragraphs of 5M Wikipedia articles dumped on October 1, 2017.
SQuAD Open (Chen et al., 2017) is a single-hop QA benchmark whose questions come from the SQuAD dataset (Rajpurkar et al., 2016) and can be answered based on a single passage. We preprocess the Wikipedia dump of December 21, 2016 and extract hyperlinks using WikiExtractor 2 . Following Karpukhin et al. (2020), we split articles into disjoint passages, resulting in 20M passages in total. We add two extra hyperlinks to each passage: one linking to its previous passage in the article, the other to the next.
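The prev/next hyperlink augmentation can be sketched as follows, assuming a hypothetical passage representation with `id` and `links` fields:

```python
def add_adjacent_links(article_passages):
    """Augment each passage's hyperlinks with links to its previous
    and next passage within the same article (sketch of the SQuAD
    Open preprocessing; the dict schema is an assumption)."""
    for i, passage in enumerate(article_passages):
        if i > 0:
            passage["links"].append(article_passages[i - 1]["id"])
        if i + 1 < len(article_passages):
            passage["links"].append(article_passages[i + 1]["id"])
    return article_passages
```

This gives the link RF f l something to follow even in articles whose passages contain few explicit hyperlink markups.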
Metrics To test whether the top-2 passages in the evidence set exactly cover both gold supporting passages, we use Supporting Passage Exact Match (P EM) as the evaluation metric, following Asai et al. (2020). To test the performance of answer extraction, we use EM and F1 as our metrics, following Yang et al. (2018).
Implementation Details For sparse retrieval, we index all passages in the corpus with Elasticsearch and implement BM25 following Qi et al. (2019) 3 . For dense retrieval, we leverage the trained passage and query encoders from Karpukhin et al. (2020) 4 and Xiong et al. (2021) 5 and index all passage vectors offline using FAISS (Johnson et al., 2019). During training, we use the HNSW-based index for efficient low-latency retrieval; at test time, we use the exact inner-product search index for better retrieval results. For link retrieval, we use the filtered hyperlinks whose targets are other articles in this dump.
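The P EM and answer EM checks described under Metrics can be sketched as follows; the answer normalization (lowercasing, stripping punctuation and articles) follows the common SQuAD-style evaluation convention, which we assume here:

```python
import re
import string

def p_em(top2, gold_pair):
    """Supporting Passage EM: the top-2 evidence passages must exactly
    cover both gold supporting passages (order-insensitive)."""
    return set(top2) == set(gold_pair)

def answer_em(prediction, gold):
    """Answer EM with assumed SQuAD-style normalization."""
    def norm(s):
        s = s.lower()
        s = "".join(c for c in s if c not in string.punctuation)
        s = re.sub(r"\b(a|an|the)\b", " ", s)   # drop English articles
        return " ".join(s.split())
    return norm(prediction) == norm(gold)
```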
Based on Huggingface Transformers (Wolf et al., 2020), we use ELECTRA (Clark et al., 2020) (d = 768/1024 for base/large) 6 as the initialization for our encoders Ψ belief and Ψ policy . The maximum number of passages inputted into the encoders is set to 3, and the length of the input tokens is limited to 512. To avoid high-confidence passages being truncated, we input the evidence passages in descending order of their belief scores from the previous step. To accelerate model training, Ψ belief and Ψ policy share parameters for the first 24 epochs and are trained separately for the next 6 epochs. The batch size is 32. We use Adam optimization with a learning rate of 2 × 10 −5 . To select the best agent (QA model), we first save several checkpoints that perform well on heuristic single-step metrics, such as action accuracy, and then choose the one that performs best over the whole process on the development set. At test time, the number of interaction steps is limited to T . We set the maximum number of steps to T = 1000 if not specified. Once the agent has exhausted its step budget, it is forced to answer the question.

Footnotes:
1. https://hotpotqa.github.io/wiki-readme.html
2. https://github.com/attardi/wikiextractor. We do not use the processed data provided by Chen et al. (2017) because it removed the hyperlinks required by our link RF.
3. https://github.com/qipeng/golden-retriever
4. https://github.com/facebookresearch/DPR; the multi-set version is used.
5. https://github.com/facebookresearch/multihop_dense_retrieval
6. Many recent approaches are based on ELECTRA, so we use ELECTRA for a fair comparison.

[Table 1: Retrieval performance (P EM) and total number of passages read (# read) on HotpotQA fullwiki for baselines with different information-seeking strategies and AISO. Entries recoverable from the extraction include: (Khattab et al., 2021) 86.70; CogQA (Ding et al., 2019) 57.80; IRRR (Qi et al., 2020) 84.10, ≥150; GRR (Asai et al., 2020) 75.70, ≥500; HopRetriever 82.54, ≥500; HopRetriever-plus 86.94, >500; TPRR (Xinyu et al., 2021).]
Evidence Gathering Table 1 compares AISO with baseline methods grouped by their information-seeking strategies. The first three groups are traditional pipeline approaches, and the others are iterative approaches. For effectiveness, we can conclude that 1) almost all iterative approaches perform better than the pipeline methods, and 2) the proposed adaptive information-seeking approach AISO large outperforms all previous methods and achieves state-of-the-art performance. Moreover, our AISO base model outperforms some baselines that use the large version of pretrained language models, such as HopRetriever, GRR, IRRR, DDRQA, and MDR.
For efficiency, the cost of answering an open-domain question includes the retrieval cost and the reading cost. Since the cost of reading a passage along with the question online is much greater than the cost of a search, the total cost is linear in # read, reported in the last column of Table 1. # read denotes the total number of passages read along with the question throughout the process, which equals the adaptive number of steps. We find that the number of passages read by AISO, about 35, is far smaller than that of the competitive baselines (P EM > 80), which need to read at least 150 passages. That is to say, our AISO model is efficient in practice.
Question Answering Benefiting from high-performance evidence gathering, as shown in Tables 2 and 3, AISO outperforms all existing methods across the evaluation metrics on the HotpotQA fullwiki and SQuAD Open benchmarks. This demonstrates that AISO is applicable to both multi-hop and single-hop questions. Notably, on the HotpotQA fullwiki blind test set 7 , AISO large significantly outperforms the second place, TPRR (Xinyu et al., 2021), by 2.02% in Sup F1 (supporting sentence identification) and 1.69% in Joint F1.

Analysis
We conduct detailed analysis of AISO base on the HotpotQA fullwiki development set.
The effect of the belief and policy modules As shown in the second part of Table 4, we examine variations of AISO with the oracle evidence scoring function ϕ ⋆ or the oracle action scoring function π ⋆ , which are key components of the belief and policy modules. When we replace our learned evidence scoring function with ϕ ⋆ , which identifies supporting passages perfectly, the performance increases a lot while the reading cost does not change much. This means that the belief module has a greater impact on performance than on cost. If we further replace the learned π with π ⋆ , the cost decreases a lot, which shows that a good policy can greatly improve efficiency.
The impact of retrieval functions As shown in the last part of Table 4, the use of a single RF (e.g., only f s or only f d ) leads to poor performance and low efficiency. Moreover, removing any RF degrades performance, which illustrates that all RFs contribute. Specifically, although the link RF f l cannot be used alone, it contributes the most to performance and efficiency. Besides, the sparse RF f s may be better at shortening the information-seeking process than the dense RF f d , since removing f s from the action space increases the number of read passages from 36.64 to 61.41. We conjecture this is because f s can rank evidence that matches a salient query very high.
The impact of the maximum number of steps As shown in Figure 3, with the relaxation of the step limit T , AISO base can filter out negative passages and finally observe low-ranked evidence through more steps, so its performance improves and tends to converge. However, the cost is more paragraphs to read. Besides, once T exceeds 1000, only a few questions (about 1%) can benefit from the subsequent steps.
The ability to recover from mistakes We count three types of mistakes in gathering evidence on the HotpotQA development set. In the process of collecting evidence for 7405 questions, false evidence was added to the evidence set for 1061 questions, true evidence was missed for 449 questions, and true evidence was deleted from the evidence set for 131 questions. We find that AISO recovered from 17.7%, 43.9%, and 35.9% of these three types of errors, respectively, which implies that even without beam search, AISO base can make up for previous mistakes to some extent. Besides, false evidence is the most harmful to evidence gathering and the most difficult to remedy.

Conclusion and Future Work
This work presents an adaptive information-seeking approach for open-domain question answering, called AISO. It models the open-domain QA task as a POMDP, where the environment contains a large corpus and the agent is asked to sequentially select a retrieval function and reformulate the query to collect evidence. AISO achieves state-of-the-art results on two public datasets, which demonstrates the necessity of different retrieval functions for different questions. In the future, we will explore other adaptive retrieval strategies, such as directly optimizing various information-seeking metrics using reinforcement learning techniques.

Ethical Considerations
We honor and support the ACL Code of Ethics. This paper focuses on information seeking and question answering, which aim to answer questions in the open-domain setting. These techniques can be widely used in search engines and QA systems, and can help people find information more accurately and efficiently. The datasets used in this paper are all from previously published works and do not involve privacy or ethical issues.