Mitigating False-Negative Contexts in Multi-document Question Answering with Retrieval Marginalization

Question Answering (QA) tasks requiring information from multiple documents often rely on a retrieval model to identify relevant information for reasoning. The retrieval model is typically trained to maximize the likelihood of the labeled supporting evidence. However, when retrieving from large text corpora such as Wikipedia, the correct answer can often be obtained from multiple evidence candidates. Moreover, not all such candidates are labeled as positive during annotation, rendering the training signal weak and noisy. This problem is exacerbated when the questions are unanswerable or when the answers are Boolean, since the model cannot rely on lexical overlap to make a connection between the answer and supporting evidence. We develop a new parameterization of set-valued retrieval that handles unanswerable queries, and we show that marginalizing over this set during training allows a model to mitigate false negatives in supporting evidence annotations. We test our method on two multi-document QA datasets, IIRC and HotpotQA. On IIRC, we show that joint modeling with marginalization improves model performance by 5.5 F1 points and achieves a new state-of-the-art performance of 50.5 F1. We also show that retrieval marginalization results in 4.1 QA F1 improvement over a non-marginalized baseline on HotpotQA in the fullwiki setting.


Introduction
Multi-document question answering refers to the task of answering questions that require reading multiple documents, extracting relevant facts, and reasoning over them. Systems built for this task typically involve retrieval and reasoning components that work in tandem. The retrieval component needs to extract information from the documents that is suitable for the reasoning model to perform the end-task effectively. Recent advances in reading comprehension have resulted in models that have been shown to answer questions requiring complex reasoning types such as bridging and comparison (Asai et al., 2020; Fang et al., 2020) or even arithmetic (Ran et al., 2019; Gupta et al., 2020), given adequate context. However, when the context needs to be retrieved from a large text corpus (e.g., Wikipedia), the performance of such reading comprehension models is greatly affected by the quality of the retrieval model. Given supervision at all stages (i.e., document, supporting evidence, and answer), it is common to build retrieval and reasoning models independently and connect them as a pipeline at test time. In this case, the retrieval and reasoning models are usually trained to maximize the likelihood of the labeled supporting evidence snippets and of the answer given the gold context, respectively.

[Figure 1: Examples of false-negative contexts in multi-document QA. Equivalent information is marked in blue, and only the checked snippets are annotated as gold evidence. False negatives are outlined in red, and both are retrieved by our proposed framework.]
However, in a multi-document QA setting, it is common to have some relevant snippets not marked as gold. Two such examples are shown in Figure 1. In the first example, only snippet 2 is marked as gold evidence, and consequently snippets 1 and 3 are treated as negative examples during retrieval. This is problematic because, unlike snippet 1, which is actually irrelevant, snippet 3 is not only useful but provides an even more direct way to derive the correct answer, since it does not require a subtraction. Similarly, in the second example two evidence snippets from different documents contain the same information; thus at least two contexts can be used to answer this question, yet only one of them is labeled as a positive example for the training objective. We define contexts that contain non-gold snippets yet can still be used to answer the question as alternative contexts. Alternative contexts are inevitable when datasets are created from large corpora, because it is prohibitively expensive to exhaustively annotate all possible contexts. These alternative contexts are false negatives during training and lead to a noisy and weak learning signal, even with this "fully-supervised" setup.
We design a training procedure for handling these false negatives, as well as cases where retrieval should fail (i.e., when the question is unanswerable). Specifically, we assign probabilities to documents, evidence candidates, and potential answers with parameterized models, and marginalize over a set of potential contexts constructed by combining top retrieved evidence from each document, allowing the model to score false negatives highly. To make the marginalization feasible, we decompose the retrieval problem into document selection and evidence retrieval and show how we can still model contexts as sets. We evaluate our model on two multi-document QA datasets: IIRC (Ferguson et al., 2020) and HotpotQA (Yang et al., 2018). We see 2.8 and 4.8 F1 point improvements on IIRC and HotpotQA respectively by jointly modeling our proposed set-valued retrieval and the reasoning steps, and a further 2.7 and 4.1 F1 point improvements respectively by using retrieval marginalization. Our final result of 50.5 F1 on the test set of IIRC represents a new state-of-the-art.

Multi-Document QA
Here we formally describe the multi-document QA setting and highlight the two main challenges in this setting that our work attempts to address.
Problem Definition Multi-document question answering measures both the retrieval and reasoning abilities of a QA system. Given a question q and a set of documents D = {d_1, d_2, ..., d_n}, each document d_i containing a set of evidence snippets {s^i_1, s^i_2, ..., s^i_{n_i}}, the goal of the model is to output the correct answer a. This task is typically modeled with a retrieval step, which locates a set of evidence C = {s^{i_1}_{j_1}, s^{i_2}_{j_2}, ..., s^{i_k}_{j_k}} to form a context, and a reasoning step that derives the answer from this context C. Though such models can be learned with or without annotations on supporting evidence, we focus on the fully-supervised setting and assume supervision for all stages. It is also common for such documents to have some internal structure (e.g., hyperlinks in Wikipedia, citations for academic papers), which can be used to constrain the space of retrieval.
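To make this notation concrete, the inputs and outputs of the task can be sketched with simple data structures. This is a hypothetical sketch; the field names are our own and do not correspond to any dataset loader.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    title: str
    snippets: list          # the evidence snippets s^i_1 ... s^i_{n_i}

@dataclass
class QAInstance:
    question: str           # q
    documents: list         # D = {d_1, ..., d_n}
    # a context C is a list of (doc_index, snippet_index) pairs,
    # with at most one snippet per document
    gold_context: list = field(default_factory=list)
    answer: str = ""        # a; empty when the question is unanswerable

def context_text(inst: QAInstance, context):
    # Concatenate the selected snippets to form the reasoning context
    return " ".join(inst.documents[i].snippets[j] for i, j in context)
```

A retrieved context is thus a set of pointers into different documents, which the reasoning model consumes as flat text.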

Inevitable False Negatives in Context Retrieval Annotations

Even when supporting evidence is annotated, we claim that the learning signal provided by those labels may be weak and noisy when retrieving from a large corpus such as Wikipedia. This is due to the redundancy of information in such large corpora: it is common to have multiple sets of evidence snippets that can answer the same question, as in Figure 1. To quantify how often alternative contexts exist for the multi-document QA problem, we analyzed IIRC (Ferguson et al., 2020), an information-seeking multi-document QA dataset. We sampled 50 answerable questions with their annotated gold context and manually checked whether equivalent information can be found in sentences of the same document that are not labeled as supporting evidence. We found that more than half of the questions have at least one piece of alternative evidence, and that on average there is more than one sentence in the same document containing the same information as the gold evidence. Note that alternative evidence may also occur in a different document, which would further increase the frequency of questions with false-negative contexts.
Due to the prevalence of such false-negative contexts, simply training the retrieval model to maximize the likelihood of the labeled supporting evidence would result in the models ignoring or even being confused by other unlabeled relevant information that could benefit the reasoning process. The problem is more severe when considering questions with Boolean answers or those that are unanswerable since the answers have no lexical overlap with corresponding evidence, making it harder to identify unlabeled yet relevant evidence snippets. Such false-negative context annotations are also inevitable in the data creation process, since the annotators will have to exhaustively search for evidence snippets from all relevant documents, which is rather impractical. As solving this problem is not typically feasible during data collection, we instead deal with it during learning.
Learning to Reason with Noisy Context Given retrieved supporting evidence as context, the second step of the problem is reading comprehension. Recently proposed models have shown promise in answering questions that require multi-hop (Asai et al., 2020; Fang et al., 2020) or numerical (Ran et al., 2019; Gupta et al., 2020) reasoning given small and sufficient snippets as contexts. However, the performance of such models degrades rapidly when they are evaluated in a pipeline setup and must reason with retrieved contexts that are potentially noisy and incomplete. For instance, Ferguson et al. (2020) found that the performance of reasoning models dropped 39.2 absolute F1 when trained on gold contexts and evaluated on retrieved contexts. This is mainly because the model is exposed to much less noise at training time than at test time.
Handling Retrieval for Unanswerable Questions Unanswerable questions are especially challenging, since it is possible to have seemingly relevant documents that are missing key information (e.g., looking for the birth year of a person when that information is not present even in the Wikipedia article titled with their name). This is common in an information-seeking setting such as IIRC, where the question annotators are only given an initial paragraph to generate questions, with the actual content of the linked documents being unseen. This raises the question of how to make use of such learning signals and correctly model the retrieval step for unanswerable questions.

[Figure 2: Overview of our framework: reasoning marginalizes over the top-m context candidates.]

Learning with Marginalization over Retrieval
To address these challenges, we decompose the retrieval problem into document selection and evidence retrieval, leaving the handling of unanswerable questions entirely to the evidence retrieval component. These two components together produce a probability distribution over sets of retrieved contexts, which we marginalize over during training to account for false negatives. Our general framework is illustrated in Figure 2.

Modeling Multi-Document Retrieval
As described in Section 2, the end product of retrieval for multi-document QA should be a set of evidence snippets from different documents C = {s^{i_1}_{j_1}, s^{i_2}_{j_2}, ..., s^{i_k}_{j_k}}, which are combined to form the reasoning context. When a question is not answerable, C is empty, and supervision of the model is not entirely straightforward. The decision of where to look for evidence is separate from whether the necessary information is present, and a naive supervision that nothing should be retrieved risks erroneously telling the model that the place it chose to look was incorrect, possibly leading to spurious correlations between question artifacts and answerability judgments. For this reason, we separate document selection from evidence retrieval, and we leave answerability determinations entirely to the evidence retrieval step.
Document selection Given the question q and a set of documents D = {d_1, d_2, ..., d_n}, to evaluate the relevance of each document d_i to the question q, we first jointly encode them with a transformer-based model. To model the selection of documents as a set variable, a sigmoid function (σ) is used to compute the document probability:

P(d_i | q) = σ(w_d · Encode(q, d_i))

For simplicity, we assume the selection of each document is independent; thus the joint probability of selecting a set of documents D is:

P(D | q) = ∏_{d_i ∈ D} P(d_i | q) · ∏_{d_j ∉ D} (1 − P(d_j | q))

Evidence retrieval Given the set of selected documents D, the goal of the evidence retrieval model is to select the evidence snippets s^i_j ∈ d_i that are relevant to question q for each document d_i ∈ D. To model the relevance between an evidence snippet and the question, we first use pretrained language models to obtain a joint embedding of the concatenated question-evidence input. To simplify the problem while approximating the set of evidence snippets as context, we take only one evidence snippet from each document. In addition, we allow the evidence retrieval model to retrieve nothing from a document by predicting NULL, a special token which is artificially added to the end of every document. This is essential for our modeling, since it places the responsibility of determining sentence-level relevance solely on the evidence retrieval step and allows it to reject the proposal of the document model by selecting the NULL option, which is especially useful for unanswerable questions. Finally, we model the probability of an evidence snippet being retrieved given its document as:

P(s^i_j | q, d_i) = softmax_j(w_s · Encode(q, s^i_j))

From this we can derive the joint probability of a set of evidence snippets C = {s^{i_1}_{j_1}, s^{i_2}_{j_2}, ..., s^{i_k}_{j_k}} being retrieved as context:

P(C | q, D) = ∏_{d_i ∈ D} P(s^i_{j_i} | q, d_i)
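The two retrieval distributions can be sketched numerically as follows. This is a minimal sketch assuming a single learned scoring step in place of the full transformer encoder; the fixed NULL score and the function names are our own assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def document_set_probability(doc_scores, selected):
    """P(D|q) under the independence assumption: each selected document
    contributes p_i, each unselected one (1 - p_i)."""
    prob = 1.0
    for i, score in enumerate(doc_scores):
        p = sigmoid(score)
        prob *= p if i in selected else (1.0 - p)
    return prob

def snippet_distribution(snippet_scores, null_score=0.0):
    """Softmax over a document's snippets plus an artificial NULL option
    (appended last), which lets the retriever select nothing."""
    scores = list(snippet_scores) + [null_score]
    m = max(scores)                      # stabilize the exponentials
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def context_probability(per_doc_choices):
    """P(C|q,D): product over documents of the chosen snippet's probability.
    per_doc_choices is a list of (distribution, chosen_index) pairs."""
    prob = 1.0
    for dist, j in per_doc_choices:
        prob *= dist[j]
    return prob
```

Choosing the last index of a `snippet_distribution` corresponds to retrieving NULL from that document.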

Joint Modeling with Marginalization
With the retrieved context C, the final step is to predict the answer. For this part, we use existing reading comprehension models that take a question and a relatively small context and output a probability distribution over answer predictions. The retrieved sentences in the context are simply concatenated and treated as context for the reading comprehension model (RC). Given the context C and question q, the probability of the answer is defined as:

P(a | q, C) = RC(a; q, C)

Now we can derive the joint probability of the answer and the retrieved context:

P(a, C, D | q) = P(a | q, C) · P(C | q, D) · P(D | q)

With the objective of maximizing the likelihood of the training set with supervision on the gold D̄, C̄ and a, the loss function is as in Equation 4:

L_sup = G_D + G_C + G_a    (4)

where G_D = −log P(D̄ | q), G_C = −log P(C̄ | q, D̄), and G_a = −log P(a | q, C̄).

Marginalization over Retrieved Evidence As mentioned in Section 2, the learning signals for {G_D, G_C, G_a} may be noisy and weak, because the objectives in Equation 4 assume that given a question-answer pair (q, a) there is only one set of gold context C̄ that can derive the correct answer. To augment the learning signal, we propose to add a weakly-supervised objective with marginalization over a set of alternative contexts S = {(D_1, C_1), (D_2, C_2), ..., (D_m, C_m)} given the selected documents D:

L_marg = −log ∑_{(D_i, C_i) ∈ S} P(a, C_i, D_i | q)

Ideally, we would want the marginalization set S to cover all possible combinations of sentences in different documents, but this is infeasible for large text corpora. We therefore approximate the marginalization set by: 1) using only the top-ranked document set D, and 2) selecting only the top-m contexts from each d_i in D.
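The supervised and marginalized objectives can be sketched as below. The log-probability inputs stand in for hypothetical model outputs; only contexts that pass the validity check (described next) contribute to the marginalized term.

```python
import math

def supervised_loss(logp_gold_docs, logp_gold_context, logp_answer):
    # Sum of negative log-likelihoods for the gold annotations (G_D + G_C + G_a)
    return -(logp_gold_docs + logp_gold_context + logp_answer)

def marginalized_loss(context_logps, answer_logps, valid):
    """-log of the sum, over valid contexts C_i, of P(C_i|q,D) * P(a|q,C_i)."""
    terms = [math.exp(lc + la)
             for lc, la, ok in zip(context_logps, answer_logps, valid) if ok]
    return -math.log(sum(terms))
```

Because the loss sums probability mass over all valid contexts, a context the annotators missed can still receive gradient credit whenever the RC model can answer from it.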
However, not all of the contexts in the top-m set S are good alternative contexts, especially when the retrieval model is under-trained and performs poorly. We use a set of answer-type-dependent heuristics to determine whether a context C is valid: (1) Span: the context has at least one span that matches the gold answer string; (2) Number: the answer can be derived from the numbers in the context with an arithmetic operation supported by our RC model, or it appears as a span in the context; (3) Yes/No/Unanswerable: all contexts are considered valid. Using these heuristics, we divide the top-m retrieved contexts S into two subsets S_1, S_2 ⊆ S, where S_1 contains all valid contexts and S_2 the invalid ones.
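The validity heuristics can be sketched as a small predicate. The token handling here is our own simplification, and we assume only two-number addition/subtraction for the Number case, whereas the paper's RC model supports a richer set of discrete operations.

```python
import math
from itertools import permutations

def _numbers(text):
    # Crude number extraction: split on whitespace, strip common punctuation
    out = []
    for tok in text.replace(",", " ").split():
        tok = tok.strip(".;:()")
        try:
            out.append(float(tok))
        except ValueError:
            pass
    return out

def is_valid_context(answer_type, gold_answer, context_text):
    if answer_type in ("yes", "no", "unanswerable"):
        return True                       # all contexts are considered valid
    if answer_type == "span":
        return gold_answer in context_text
    if answer_type == "number":
        if gold_answer in context_text:   # the answer may appear as a span
            return True
        target = float(gold_answer)
        for a, b in permutations(_numbers(context_text), 2):
            if math.isclose(a + b, target) or math.isclose(a - b, target):
                return True
    return False
```

Contexts passing this check go into S_1 and are marginalized over; the rest form S_2.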
Auxiliary Loss for Invalid Contexts Because the contexts in S_2 are not valid alternative contexts for obtaining the correct answer, we do not marginalize over them. We can still use them during training, however, by formulating an auxiliary loss that encourages the RC model to predict the question as unanswerable (i.e., a = a_N) given these invalid contexts:

L_aux = −α ∑_{(D*, C*) ∈ S_2} log P(a_N | q, C*)

Note that here we do not use the joint probability P(a_N, D*, C* | q), since doing so would also encourage the retrieval models to retrieve irrelevant context for answerable questions. In this way, the auxiliary loss can also be viewed as augmenting the dataset with extra unanswerable question-context pairs for the RC model.
The only weight we tune in the objective is α, which regulates the contribution of the loss from the invalid contexts the RC model encounters at training time.

Datasets and Settings
We test our method on two multi-document question answering datasets: IIRC and HotpotQA.
IIRC (Ferguson et al., 2020) is a dataset consisting of 13K information-seeking questions generated by crowdworkers who had access only to single Wikipedia paragraphs and the list of hyperlinks to other Wikipedia articles, but not the articles themselves. Given an initial paragraph, a model needs to retrieve missing information from multiple linked documents to answer the question. Since the question annotators can only see partial context, the questions and contexts containing the answers have less lexical overlap. The questions in IIRC may have one of four types of answers: 1) span; 2) number (resulting from discrete operations); 3) yes/no; 4) none (when the questions are unanswerable).
HotpotQA fullwiki (Yang et al., 2018) consists of 113K questions and the contexts for answering those questions are a pair of Wikipedia paragraphs.
In the fullwiki setting, the model has access to 5.2M candidate paragraphs and needs to retrieve relevant information from this corpus. We believe this open-domain QA setting provides a different perspective for studying false-negative contexts, especially across documents.

Model Details
Transformer-based pretrained language models are used to encode questions and contexts for retrieval and reasoning in our experiments.

[Table 1: Main results on IIRC. "Baseline" refers to the performance reported in Ferguson et al. (2020), and "-" denotes that no results are available. Work marked with † is by Yoran et al. (2021), which appeared after our initial submission. All pipeline models share the same retrieval model and its output, and thus the same retrieval performance.]
For reasoning, we use NumNet+ (Ran et al., 2019) for IIRC and BERT-wwm for HotpotQA. Note that our general modeling framework is agnostic to the choice of specific models for retrieval and reasoning; we choose these models because they are easy to use and give strong results, as shown in previous work (Groeneveld et al., 2020). More implementation details can be found in Appendix A.

Evaluation Metrics
For HotpotQA, we follow previous work (Yang et al., 2018; Asai et al., 2020; Xiong et al., 2020) and use F1 score and Exact Match (EM) for both answer (QA) and supporting-facts (SP) prediction. Similarly, we report QA F1 and EM for IIRC as in Ferguson et al. (2020). In addition, we define the following metrics for understanding retrieval performance: (1) Document selection F1 (Doc-F1) measures the performance of the document retrieval model against the documents marked as gold; (2) Overall retrieval recall (Rt-Recall) measures the retrieval ability of the overall retrieval system against the annotated set of evidence snippets. Among these metrics, our main goal is to improve question answering performance, which is best measured by QA F1.
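The two retrieval metrics can be sketched as set comparisons. This is our own implementation, not tied to any particular evaluation script; how unanswerable questions (with no gold snippets) are scored is an assumption noted in the code.

```python
def doc_f1(predicted_docs, gold_docs):
    """Doc-F1: F1 of the predicted document set against the gold document set."""
    pred, gold = set(predicted_docs), set(gold_docs)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def retrieval_recall(retrieved_snippets, gold_snippets):
    """Rt-Recall: fraction of annotated evidence snippets that were retrieved."""
    gold = set(gold_snippets)
    if not gold:
        return 1.0  # assumption: nothing to recall for unanswerable questions
    return len(gold & set(retrieved_snippets)) / len(gold)
```

Note that both metrics compare against annotated evidence only, which is why gains from unlabeled alternative contexts do not show up in them.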

Training Settings
Since documents can contain up to hundreds of sentences, for efficient training of our evidence retrieval model we downsample the negative examples to 7 for IIRC and 3 for HotpotQA; no downsampling is done during inference. We take m = 4 for the top-m context marginalization on IIRC and m = 5 for HotpotQA. For the invalid context loss weight, we use 0.5 for IIRC (a simple binary search found 0.5 to work better than 0 or 1) and 0 for HotpotQA, since it does not have unanswerable questions. For memory and storage efficiency, we tie the pretrained language model weights among all the components in our joint model. The models are trained for 30 epochs on IIRC and 5 epochs on HotpotQA fullwiki. Our most expensive experiment takes about 1.5 days to run on two RTX 8000 (48GB) GPUs or one A100 (40GB) GPU, while a typical experiment takes about half of that computing power.

Main Results on IIRC

Table 1 shows our main results on IIRC. Our proposed joint model with marginalization outperforms the pipeline model by 5.2 and 4.8 points in QA exact match and F1 score, respectively. While the 17.6-point improvement over the baseline system seems large, the correct point of comparison for our contribution is our pipeline system, which is simply an improved version of the pipeline used by the baseline system. Another comparison worth noting is that, despite the large improvement on the QA side, the retrieval performance is slightly lower than its pipeline counterpart. Our hypothesis is that the joint model with marginalization can better utilize the alternative contexts that are not marked as gold and derive the correct answers from them. Since retrieval performance is evaluated only against the annotated evidence, the gain from alternative contexts is not reflected in these numbers. We explore this hypothesis in Section 5.1, and some examples are shown in Section 7 and Appendix B.

To further understand the effectiveness of joint modeling with marginalization compared to the pipeline model, we break down the QA performance by answer type in Table 2. Our proposed method yields large performance gains on unanswerable questions and on those with binary and numerical answers. As discussed in Section 2, since the retrieval model cannot rely on lexical overlap between contexts and correct answers for these question types, it is harder for it to learn from the false negatives, and the reasoning model trained in a pipeline is more susceptible to noise in the retrieved context. We also notice that QA F1 on span-type questions drops 1.2 points; we believe this is because the auxiliary loss on invalid contexts slightly altered the answer distribution in favor of unanswerable predictions. To confirm this, we removed the auxiliary loss, and QA F1 on span questions went back up to 48.7 points.

Analysis
Effectiveness of retrieval marginalization Table 3 shows that training with marginalization improves final QA F1 by 2.7 points, while doing slightly worse at retrieving annotated context. To go beyond the raw numbers and explain why retrieval marginalization results in better final QA performance, we analyzed 50 questions where the model with marginalization correctly answered a question that the model without marginalization missed, and 50 questions where the opposite was true. We found that in 24% of the cases where the marginalization model was correct, it relied on non-gold evidence to make its prediction, while this was true only 4% of the time for the model without marginalization. This suggests that marginalization over retrieval improves QA performance by retrieving alternative contexts that can help reasoning. We show some specific examples of this in Section 7 and Appendix B.

[Table 3: Marginalization and other ablations on IIRC. Note that the removal of parts (noted by "-") from the full model is accumulative; this is necessary because marginalization depends on joint modeling, and the auxiliary loss depends on having a marginalization set. The last setting is equivalent to a pipeline model with shared RoBERTa weights.]
Effectiveness of invalid context loss Without the auxiliary loss on the subset of invalid contexts during marginalization, we observe a 1.4-point decrease in QA performance. On further inspection, we found that the main cause was performance on unanswerable questions, which drops 8.9 F1 points (not shown in the table).

Effectiveness of joint modeling
We also explore the setting where both marginalization and joint modeling are removed from our model. This is similar to the pipeline setting, except that we minimize the sum of fully supervised losses from all three models and the pretrained language model weights are shared. The difference between rows 3 and 4 in Table 3 illustrates the improvement from joint modeling alone, which is 2.8 QA F1. We believe this is largely because, when the model is jointly trained, the reasoning model dynamically adapts to the noisy retrieval results, which makes it more resilient to noise at test time.

Results on HotpotQA
The general effectiveness of retrieval marginalization and joint modeling on HotpotQA fullwiki is shown in Table 4. We observe a 4.8 QA F1 improvement with joint modeling and a further 4.1-point improvement with retrieval marginalization. As on IIRC, joint modeling and retrieval marginalization improve final QA performance despite inferior retrieval scores when evaluated against annotated supporting evidence.

[Table 4: HotpotQA answer F1 on the development set. Best performance in our ablation and from previous work are in bold. "-" denotes that no results are available. Previous work: Fang et al. (2020): 50.0 / 76.4 / 56.7 / 69.2; Asai et al. (2020): 49.2 / 76.1 / 60.5 / 73.3; MDR (Xiong et al., 2020): 57.5 / 80.9 / 62.3 / 75.3; RAG (Lewis et al., 2020), adapted to HotpotQA fullwiki as Multi-hop RAG by Xiong et al. (2020): - / - / 51.2 / 63.9.]

To better understand how our proposed methods alter the retrieval and reasoning steps, we compare the output of our full model with the version without marginalization and joint modeling. Interestingly, while most alternative sources of evidence in IIRC come from the same document, the alternative evidence our full model locates for HotpotQA fullwiki is mostly from different documents. We believe this is because HotpotQA fullwiki considers far more documents (5.2M, i.e., the whole of Wikipedia) than IIRC (fewer than 20 linked documents) for each question, while HotpotQA typically uses only the introductory paragraph instead of the whole document, leaving less room for intra-document alternatives.
Concrete examples are shown in Section 7 and Appendix B. While the purpose of these experiments is to show the effectiveness of our proposed methods for mitigating false negatives in HotpotQA rather than to compete with state-of-the-art models, we list some results from previous work in Table 4 to put ours in context. With joint modeling and retrieval marginalization, our full model achieves a QA F1 of 71.2 and EM of 58.6, surpassing several strong baselines presented by previous work, though there is still a non-trivial gap with state-of-the-art models. Potential improvements that are orthogonal to our contributions include modeling document selection as a sequence and conditioning each selection on previously selected documents, as done by Asai et al. (2020) and Xiong et al. (2020), and using stronger language models (e.g., ELECTRA (Clark et al., 2019)). Note that all three modules of our framework are simple classification models on top of BERT, and that the formulation of our proposed set-valued retrieval and marginalization is model-agnostic.

Qualitative Analysis
As mentioned in Section 5.1 and Section 6, our model achieves higher QA scores despite lower scores in matching annotated evidence because it locates false-negative evidence that can serve as alternatives for reasoning. In Table 5, we show two such alternatives our model found in the development sets of IIRC and HotpotQA fullwiki. From the IIRC example, we can see that our model is able to find alternative evidence in different sentences of the same gold document (i.e., "Alexander Hamilton") or even in a non-gold document (i.e., "Aaron Burr"), mitigating false-negative annotations in sentence- or document-level retrieval. The HotpotQA example shows that in multi-hop reasoning, key evidence is not exclusively found in documents titled with the bridging entity (i.e., "E Street Band"), but is sometimes also included in a document about a related third entity (i.e., "David Sancious"). This indicates that such false-negative contexts can be even more prevalent in multi-hop QA settings.

Related Work
False-negative contexts in QA In terms of dealing with false negatives in retrieved texts for question answering, the most similar prior work to ours is by Clark and Gardner (2017). However, they focused only on span-type answers, while we apply similar methods to more complex reasoning types. False-negative contexts are not exclusive to IIRC or HotpotQA, but are a rather common issue when scaling up QA to the document or multi-document level. Prior analyses of TyDiQA (Clark et al., 2020) and Natural Questions (Kwiatkowski et al., 2019) suggest that humans typically have low recall in finding supporting evidence, though those works focused on dealing with this issue in evaluation while we focus on altering the training process. The issue was also found by previous work on HotpotQA: Xiong et al. (2020) noted that many of the "errors" made by their document retrieval model are actually valid alternative contexts. False-negative contexts can also lead to false-negative annotations of answer spans. Some previous work (Chen et al., 2017; Hu et al., 2019; Asai et al., 2020) showed that manually adding distantly-supervised examples to the reasoning model's training is effective, whereas we approach this problem from the retrieval side.

Joint models for retrieval and reasoning More recent work such as ORQA (Lee et al., 2019) and Dense Passage Retrieval (Karpukhin et al., 2020) focuses on modeling retrieval for question answering, but their main goal is to develop efficient neural retrieval systems that help scale QA up to a large corpus, which is orthogonal to our contribution of mitigating false-negative contexts. Some recent work also investigates modeling retrieval as a latent variable and enabling end-to-end training. REALM (Guu et al., 2020) used a neural retriever to augment language models with external knowledge for question answering. While REALM also uses a maximum marginal likelihood objective, it only marginalizes over retrieved documents.
Our model, in contrast, marginalizes over a set of contexts consisting of evidence snippets from different documents to account for set-valued retrieval of variable size, which is crucial for multi-document QA. RAG (Lewis et al., 2020) adopted a method similar to REALM's, but for sequence generation tasks. RAG can be adapted to perform multi-hop question answering (Xiong et al., 2020), and we directly compared our results with multi-hop RAG on HotpotQA fullwiki in our experiments. Finally, our work leverages marginalization over latent variables to deal with weak and noisy supervision signals, which is reminiscent of using maximum marginal likelihood for training weakly supervised semantic parsers (Berant et al., 2013; Krishnamurthy et al., 2017, among others).

Conclusion
We proposed a new probabilistic model for retrieving set-valued contexts for multi-document QA and showed that training the QA model with marginalization over this set can help mitigate false negatives in evidence annotations. Experiments on IIRC and HotpotQA fullwiki show that our proposed framework can learn to retrieve unlabeled alternative contexts and improves QA F1 by 5.5 on IIRC and 8.9 on HotpotQA.
Failure modes In our case, failing to answer a user-issued question may result in incorrect or misleading information. Thus we should be careful when putting our systems into practical use.
Computation power Our most expensive experiment takes about 1.5 days to run on two RTX 8000 (48GB) GPUs or one A100 (40GB) GPU, while a typical experiment takes about half of that computing power. While conducting experiments, we made an effort to take advantage of technologies such as mixed-precision training to shorten training time under the same experimental settings and reduce power consumption.

A Model Implementation Details
Since the two datasets on which we conduct our experiments differ in document length and structure, the reasoning types posed by the questions, and the possible answer types, here we describe the dataset-specific implementation details for IIRC and HotpotQA.

A.1 Document Selection
Though IIRC and both settings of HotpotQA can be seen as multi-document QA problems, their documents are structured differently. Accordingly, we handle the document selection part differently for the two datasets, namely by using different Encode(·) functions, as mentioned in Section 3.1. Note, however, that the outputs are in both cases a set of documents (or paragraphs for HotpotQA) with their probabilities, so the other parts of the framework remain the same.
Link prediction for IIRC. For IIRC, an initial paragraph p is given and we need to follow certain links in it, so the document selection problem can be translated into a link prediction problem given p and q. We therefore define Encode(·) as the BERT embedding of the concatenated question-paragraph sequence at the position of the link l_i to the document d_i:

Encode(q, d_i) = BERT(q; p)[pos(l_i)]

Since we do not know how many links need to be followed to answer a question, we use P(d_i | q, p) = 0.5 as a threshold for document selection.
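A toy version of this link-prediction scoring, with a plain embedding matrix standing in for the BERT encoding of the concatenated question-paragraph sequence, and a hypothetical weight vector in place of the learned classifier:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_links(token_embeddings, link_positions, weight):
    """Score each link l_i from the contextual embedding at its token position."""
    return [sum(w * e for w, e in zip(weight, token_embeddings[pos]))
            for pos in link_positions]

def select_documents(link_scores, threshold=0.5):
    """Follow every link whose selection probability exceeds the threshold."""
    return [i for i, s in enumerate(link_scores) if sigmoid(s) > threshold]
```

The 0.5 threshold mirrors the paper's choice of not fixing the number of followed links in advance.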
HotpotQA fullwiki. To select paragraphs for HotpotQA, we simply concatenate the question and the first 64 tokens of the candidate paragraph d_i, run it through BERT, and take the embedding of the separation token:

Encode(q, d_i) = BERT(q; d_i[:64])[SEP]

We further follow Asai et al. (2020) and apply their trained recurrent retriever to retrieve a small subset of relevant paragraphs D with the highest scores. We choose this model because it is the best performing model on the HotpotQA fullwiki setting with public code. For a better learning signal, we manually add the paragraphs marked as gold if they are not already included in D, but we use only D at test time for a fair comparison under the fullwiki setting.
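The train-time augmentation with gold paragraphs (and the test-time restriction to retrieved ones) can be sketched as follows; the function name is our own.

```python
def build_candidate_set(retrieved, gold, is_training):
    """At training time, guarantee the gold paragraphs are present so the
    retriever gets a usable learning signal; at test time, use only the
    retrieved paragraphs for a fair fullwiki comparison."""
    if not is_training:
        return list(retrieved)
    candidates = list(retrieved)
    for g in gold:
        if g not in candidates:       # preserve order, avoid duplicates
            candidates.append(g)
    return candidates
```

Keeping the test-time set unaugmented is what makes the comparison to other fullwiki systems fair.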

A.2 Reading Comprehension Models
We use existing reading comprehension models for both IIRC and HotpotQA. NumNet+ (Ran et al., 2019) is used for IIRC since it can handle numerical reasoning. We augment the model by adding binary and unanswerable as two additional question types to its question type classifier, and further introduce a binary classification head that outputs "Yes" or "No" when a question is classified as binary. For HotpotQA, to handle questions with binary answers, we prepend "yes or no" to the retrieved context, transforming the reasoning part into a pure span-prediction problem. We then follow previous work and append two linear layers to the contextualized embeddings from transformer-based language models, which separately model the start and end positions of the answer span.
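The yes/no-as-span trick reduces to prepending the option tokens and taking the highest-scoring span. This is a toy decoder with hypothetical logits; the real model scores spans with two linear heads over contextualized embeddings.

```python
def best_span(start_logits, end_logits, max_len=15):
    """Return (start, end) maximizing start_logits[i] + end_logits[j], i <= j."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_logits):
        for j in range(i, min(i + max_len, len(end_logits))):
            if s + end_logits[j] > best_score:
                best_score = s + end_logits[j]
                best = (i, j)
    return best

def decode_answer(context_tokens, start_logits, end_logits):
    # Prepend the option tokens so Boolean answers become extractable spans
    tokens = ["yes", "or", "no"] + context_tokens
    i, j = best_span(start_logits, end_logits)
    return " ".join(tokens[i:j + 1])
```

Because "yes" and "no" are ordinary tokens in the input, a Boolean answer needs no separate classification head in this formulation.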

B Retrieved Alternative Contexts
Here we show more examples of alternative contexts retrieved by our proposed methods, for the development set of IIRC in Table 6 and of HotpotQA fullwiki in Table 7.