End-to-End Training of Neural Retrievers for Open-Domain Question Answering

Recent work on training neural retrievers for open-domain question answering (OpenQA) has employed both supervised and unsupervised approaches. However, it remains unclear how unsupervised and supervised methods can be used most effectively for neural retrievers. In this work, we systematically study retriever pre-training. We first propose an approach of unsupervised pre-training with the Inverse Cloze Task and masked salient spans, followed by supervised finetuning using question-context pairs. This approach leads to absolute gains of 2+ points over the previous best result in the top-20 retrieval accuracy on Natural Questions and TriviaQA datasets. We next explore two approaches for end-to-end training of the reader and retriever components in OpenQA models, which differ in the manner the reader ingests the retrieved documents. Our experiments demonstrate the effectiveness of these approaches as we obtain state-of-the-art results. On the Natural Questions dataset, we obtain a top-20 retrieval accuracy of 84%, an improvement of 5 points over the recent DPR model. We also achieve good results on answer extraction, outperforming recent models like REALM and RAG by 3+ points.


Introduction
The task of open-domain question answering (OpenQA) consists of finding answers to information-seeking questions using a large knowledge source such as Wikipedia. This knowledge source is also referred to as evidence, and it typically contains millions of documents. Most approaches to OpenQA consist of a two-stage pipeline (Chen et al., 2017; Chen, 2018). In the first stage, given a question, a retriever module identifies the most relevant documents, often a very small subset of the evidence known as context documents. Traditionally, document-ranking approaches such as BM25 (Robertson and Zaragoza, 2009) have been used for the retriever. In the second stage, these relevant documents are given as input to the reader module, which processes them and extracts the answer to the question (Figure 1).
The main drawback of the BM25 method is that it is not trainable and hence cannot be adapted to tasks involving open retrieval. Recent work has addressed this limitation by building on advances in self-supervised learning, such as BERT. These approaches model both the retriever and reader using neural networks, allowing the retriever to be trained on task-specific datasets (Guu et al., 2020). Typically, the retriever consists of a dual-encoder architecture (Bromley et al., 1994), where one encoder processes the question and the other processes the context document. Prior work has investigated both unsupervised and supervised approaches to train the retriever. Unsupervised approaches include separately training the retriever with the Inverse Cloze Task (ICT) or training the retriever and reader jointly by predicting masked salient spans (REALM) (Guu et al., 2020), while supervised approaches such as Dense Passage Retrieval (DPR) (Karpukhin et al., 2020) train the retriever using human-annotated sets of question and context pairs.
However, there is no study that investigates the comparative advantages of using these two styles of training when the retrieval task is challenging, i.e., when the evidence contains millions of documents. It is unclear if the unsupervised approaches can further help to improve the performance of strong supervised approaches, and, if so, under what conditions. A core focus of this work is systematically studying these aspects of retriever training.
We propose a unified approach to train the retriever: unsupervised pre-training followed by supervised finetuning. We also investigate key design choices, such as relevance score scaling and longer training, and showcase their effectiveness. Our results demonstrate that the proposed approach obtains substantial accuracy gains when evaluated on benchmark OpenQA datasets. Extensive experiments also highlight the relative importance of different pre-training strategies, revealing important trade-offs when varying the amount of supervised data available to train the retriever.
Furthermore, motivated by recent work (Guu et al., 2020;Lewis et al., 2020a), we also explore two approaches for end-to-end supervised training of the reader and retriever components. In the first approach, the reader considers each retrieved document separately while in the second approach, the reader takes as input all the retrieved documents together. We compare the effectiveness of these approaches on both retrieval accuracy and answer extraction. We show that the first approach leads to an improved retrieval performance, while the second approach results in an improved answer extraction. With end-to-end training, we outperform previous best models to obtain new state-of-the-art results on retrieval accuracy and answer extraction. We also perform experiments by scaling the model size to a large configuration for both retriever and reader and observe consistent improvements, compared with smaller models.
In summary, the contributions of this work are:
• We demonstrate that our proposed method of unsupervised pre-training of the retriever with ICT followed by supervised finetuning leads to absolute gains of more than 2 points in the top-20 retrieval accuracy over the previous best result on the Natural Questions and TriviaQA datasets.
• We show that masked salient spans-based pre-training of the retriever is more effective when the supervised dataset sizes are small.
• Our end-to-end training approach obtains new state-of-the-art performance on retrieval accuracy. On Natural Questions, our top-20 accuracy is 84%, a 5-point gain over the DPR results.
• We achieve competitive results on answer extraction, with gains of more than 3 points over recent models such as REALM (Guu et al., 2020) and RAG (Lewis et al., 2020c).
• We scale up end-to-end training to large models and show consistent gains in performance.
The rest of the paper is organized as follows. Sec. 2 and 3 explain the retriever model and end-to-end training, respectively. Sec. 4-6 describe the experimental details and results. Sec. 7 reviews related work, followed by the conclusion in Sec. 8.

Neural Retriever
In this section, we first describe the retriever architecture and then discuss different approaches to train it, including our proposed approach.

Background
Given a collection of documents in the evidence Z = {z_1, · · · , z_m} and a question q, the task of the retriever is to select a relevant subset of documents for the question. To do this, the retriever ranks the evidence documents conditioned on the question and outputs the top-ranked ones.
The retriever model consists of two modules: a question encoder (f_Q) and a context encoder (f_Z). Such a model is often referred to as a dual-encoder model (Bromley et al., 1994). Here, we detail the training methodology of the dual-encoder model given a question (q) and context documents (z_i) from Z. First, we compute the relevance score between the question and context, defined as the dot product between the question and context representations:

s(q, z_i) = f_Q(q)^T f_Z(z_i),    (1)

where f_Q(q) ∈ R^d and f_Z(z) ∈ R^d denote the outputs of the question and context encoders, respectively, which are parameterized by φ = [φ_Q, φ_Z]. We model f_Q and f_Z using BERT-style transformer networks (Vaswani et al., 2017). We take the hidden state of the first token of the sequence (i.e., the [CLS] token) as the encoder's output. The probability of a context document z_i being relevant to the question q is calculated as

p(z_i | q) = exp(s(q, z_i) / τ) / Σ_{z_j ∈ Z} exp(s(q, z_j) / τ),    (2)

where τ is a scaling factor. While previous work used the setting τ = 1, in this work we set τ = √d. A bigger scaling factor helps optimization when the model hidden size (d) is large. We refer to this as relevance score scaling. To train the retriever, we maximize the log-likelihood computed from Eq. 2.
In practice, as the evidence set consists of millions of documents, the normalization term would be expensive to compute. Hence, we approximate the denominator of the above equation by using the context documents in the batch as negative examples, a technique that has been shown to perform well in practice.
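This in-batch approximation can be sketched as follows, with plain NumPy arrays standing in for the actual BERT encoder outputs (a minimal illustration under that assumption, not the paper's implementation):

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def in_batch_contrastive_loss(q_emb, ctx_emb, tau):
    """q_emb, ctx_emb: (B, d) arrays. Row i of ctx_emb is the gold context
    for question i; the other B-1 rows serve as in-batch negatives."""
    scores = q_emb @ ctx_emb.T / tau        # (B, B) scaled relevance scores (Eq. 1-2)
    logp = log_softmax(scores)              # normalize over the in-batch contexts
    # maximize log-likelihood of the gold (diagonal) pairs
    return -np.mean(np.diag(logp))

rng = np.random.default_rng(0)
B, d = 4, 16
q = rng.normal(size=(B, d))
# toy check: with gold contexts identical to the questions, the diagonal
# dominates and the loss is well below the uniform baseline of log(B)
loss = in_batch_contrastive_loss(q, q.copy(), tau=np.sqrt(d))
```

Note the `tau=np.sqrt(d)` argument, which mirrors the relevance score scaling described above.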

Training
In this section, we discuss different approaches to train the retriever. In all the approaches, we initialize the parameters of both the question and context encoders using BERT weights as implemented in Megatron-LM (Shoeybi et al., 2019). We also experimented with random initialization but it vastly underperformed BERT initialization.

Supervised Training
In the supervised setting, human-annotated questions, answers, and sometimes contexts are provided. If the context is not included, a common approach is to use distant supervision (Mintz et al., 2009) to obtain the context document. Specifically, we select as the context the top-ranked document from the evidence, according to BM25 (Robertson and Zaragoza, 2009), that contains the answer. We also select other top-ranked documents that do not contain the answer as additional hard-negative examples. This approach to training a neural retriever was popularized by Karpukhin et al. (2020).
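A minimal sketch of this distant-supervision selection, assuming the BM25 ranking is already available as a list of passages ordered best-first (function and variable names are hypothetical):

```python
def distant_supervision_pairs(answer, ranked_passages):
    """Pick the top-ranked passage containing the answer string as the
    positive context, and the top-ranked passage NOT containing it as a
    hard-negative example. `ranked_passages` is assumed BM25-sorted."""
    positive = next(
        (p for p in ranked_passages if answer.lower() in p.lower()), None)
    hard_negative = next(
        (p for p in ranked_passages if answer.lower() not in p.lower()), None)
    return positive, hard_negative

ranked = [
    "Paris hosted the 1900 Games.",
    "The capital of France is Paris.",
    "Lyon is a city in France.",
]
pos, neg = distant_supervision_pairs("Paris", ranked)
# pos -> "Paris hosted the 1900 Games.", neg -> "Lyon is a city in France."
```

String containment is a crude stand-in for the answer-matching heuristics used in practice, but it conveys the selection logic.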

Unsupervised Training
Inverse Cloze Task (ICT): In this setup, we do not use human-annotated question-context pairs. Instead, the retriever is trained in an unsupervised manner: a randomly sampled sentence from a paragraph is treated as the query, while the remaining sentences serve as the context. This approach was first proposed by Lee et al. (2019).
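ICT example construction can be sketched as below; the `keep_prob` parameter reflects the detail reported in Appendix A that the query sentence is retained in the context with probability 0.1 (names are illustrative):

```python
import random

def make_ict_example(paragraph_sentences, keep_prob=0.1, rng=None):
    """Sample one sentence of a paragraph as the pseudo-query; the remaining
    sentences form the context. With probability keep_prob, the query
    sentence is kept in the context."""
    rng = rng or random.Random()
    i = rng.randrange(len(paragraph_sentences))
    query = paragraph_sentences[i]
    if rng.random() < keep_prob:
        context = list(paragraph_sentences)                    # keep query sentence
    else:
        context = paragraph_sentences[:i] + paragraph_sentences[i + 1:]
    return query, " ".join(context)

sents = ["Bowling is a target sport.",
         "Players roll a ball.",
         "Pins are the usual target."]
q, ctx = make_ict_example(sents, rng=random.Random(0))
```

The (query, context) pairs produced this way are then used exactly like supervised question-context pairs in the contrastive training above.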
Masked salient spans training: Guu et al. (2020) showed that an ICT-initialized retriever can be further improved by training it with an objective where the reader predicts masked salient spans, such as named entities, conditioned on the retrieved documents. In this work, we adopt the same approach. However, unlike Guu et al. (2020), who use BERT for the reader, we use a generative language model based on T5 (Raffel et al., 2020).

Proposed Approach: Unsupervised Pre-training and Supervised Finetuning
To improve retriever training, we propose unsupervised pre-training of the retriever followed by supervised finetuning. In this approach, we first pre-train the retriever weights with ICT training or masked salient spans training (Sec. 2.2.2). After pre-training, we finetune the retriever with supervised training (Sec. 2.2.1).

End-to-End Retriever and Reader Training
In this section, we explore two supervised approaches to training the reader and retriever components end-to-end using task-specific data.
In the first approach, the reader considers each retrieved document separately (Sec. 3.1) while in the second approach, the reader takes as input all retrieved documents together (Sec. 3.2). These approaches are designed such that when predicting the answer conditioned on the question, the learning process improves both the reader and retriever.
Background and notation: In end-to-end training, the trainable components consist of the retriever (φ) and reader (θ) parameters. For the retriever, we use the dual-encoder architecture and train it as discussed previously in Sec. 2.3. Our reader is a generative model designed according to the sequence-to-sequence modeling paradigm (Sutskever et al., 2014); specifically, we use a pre-trained T5 model. The inputs to the training process are questions (q) and their answers (a), both in string form. Given a question, the retriever first obtains the k most relevant context documents (K) from the evidence (Z):

K = top-k_{z ∈ Z} p(z | q; φ).    (3)

The reader then takes the question and one or more context documents (z_i) as input to predict the answer, the likelihood of which is defined as

p(a | q, z_i; θ) = Π_{t=1}^{N} p(a_t | a_{<t}, q, z_i; θ),    (4)

where N is the number of answer tokens. Next, we describe the two proposed approaches. A block diagram illustrating the end-to-end training process is shown in Figure 2.

Approach 1: Individual Top-k
In this approach, similar to Guu et al. (2020), the reader's likelihood is first computed conditioned on the question and each retrieved document. The marginal likelihood is then defined as the weighted average of the individual likelihoods:

p(a | q, Z; θ, φ) = Σ_{z_i ∈ K} p(a | q, z_i; θ) p(z_i | q, Z; φ),    (5)

where p(z_i | q, Z; φ) is computed using Eq. 2, except that the normalization is done over K instead of Z. The final loss is defined as the negative marginal log-likelihood:

L(θ, φ) = − log p(a | q, Z; θ, φ).    (6)

We note that the RAG model (Lewis et al., 2020c) proposed a similar approach, with two main differences. First, while we update all the parameters of the retriever (both the query and context encoders), RAG updates just the query encoder. Second, we use the T5 model as the reader, while RAG uses the BART model (Lewis et al., 2020b). These enhancements help us obtain substantial gains over the RAG model, which we discuss in Sec. 6.
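The marginal likelihood of Eq. 5 and the loss of Eq. 6 can be illustrated numerically with a small sketch (toy numbers; normalization is over the top-k set K, as stated above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def marginal_answer_likelihood(per_doc_answer_lik, retrieval_scores, tau):
    """Eq. 5: weight each document's answer likelihood p(a|q,z_i) by its
    retrieval probability p(z_i|q,Z), normalized over the top-k set K."""
    p_z = softmax(np.asarray(retrieval_scores) / tau)
    return float(np.dot(p_z, per_doc_answer_lik))

# toy values: three retrieved documents with answer likelihoods and scores
lik = marginal_answer_likelihood(
    per_doc_answer_lik=[0.9, 0.1, 0.2],
    retrieval_scores=[5.0, 1.0, 0.5],
    tau=2.0,
)
loss = -np.log(lik)   # Eq. 6: negative marginal log-likelihood
```

Because the marginal is a convex combination, `lik` always lies between the smallest and largest per-document likelihoods, weighted toward the highest-scoring document.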

Approach 2: Joint Top-k
In this approach, similar to Lewis et al. (2020a), the likelihood is defined as the reader's likelihood conditioned on the question, all the retrieved documents, and the retrieval scores:

p(a | q, Z; θ, φ) = Π_{t=1}^{N} p(a_t | a_{<t}, q, K; θ, φ).    (7)

As the T5 reader consists of separate encoder and decoder modules, it provides the flexibility to customize the input or output of the encoder. We concatenate each retrieved document with the question and feed them as input to the encoder, which computes their hidden representations. Next, we stack the hidden representations of all the retrieved documents, which the decoder jointly attends to during encoder-decoder attention, allowing a more powerful form of information aggregation from multiple retrieved documents. We also add the retriever similarity score as a bias to the encoder-decoder attention, as it helps facilitate end-to-end training and enables the reader to pay higher attention to the relevant documents. The interaction score during encoder-decoder attention is computed as

Q K^T / √d + λ s(q, z_i),    (8)

where Q is the query vector computed from the decoder's input, K is the key vector computed from the encoder's output, and λ is a trainable parameter.
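A sketch of one plausible form of this biased attention score, with toy NumPy tensors (shapes and names are illustrative; in the real model this bias is applied inside every decoder attention head):

```python
import numpy as np

def biased_attention_scores(Q, K, retrieval_scores, lam):
    """Per-document attention logits: scaled dot-product Q·K^T plus a
    learned bias lam * s(q, z_i), so the decoder can attend more to
    documents the retriever ranked highly. Illustrative shapes:
    Q: (d,) decoder query; K: (k_docs, n_tokens, d) stacked encoder keys;
    retrieval_scores: (k_docs,)."""
    d = Q.shape[-1]
    logits = K @ Q / np.sqrt(d)                        # (k_docs, n_tokens)
    return logits + lam * retrieval_scores[:, None]    # one bias per document

rng = np.random.default_rng(1)
Q = rng.normal(size=8)
K = rng.normal(size=(3, 5, 8))            # 3 retrieved docs, 5 tokens each
retr = np.array([2.0, 0.5, -1.0])
scores = biased_attention_scores(Q, K, retr, lam=0.3)
```

Every token of a highly-scored document receives the same additive boost, which is what lets gradients flow back into the retriever parameters.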
The final loss is defined according to Eq. 6. We further note that a similar approach for OpenQA was proposed by Izacard and Grave (2020), but it only optimizes the reader model and does not perform end-to-end training of the retriever.

Experimental Setup
In this section, we describe the datasets and model settings. For reproducibility, we provide training details and list the hyperparameters in Appendix A.

OpenQA Datasets
We perform experiments using two widely used QA datasets; their details are provided below, and their statistics are shown in Table 1.
Natural Questions (NQ): This corpus consists of real questions issued to the Google search engine, along with their long and short answer annotations from the top-ranked Wikipedia pages (Kwiatkowski et al., 2019). Following prior work (Karpukhin et al., 2020), we use the same subset of short-answer questions in our experiments, as it is better suited for OpenQA.
TriviaQA: This corpus consists of a collection of trivia questions and their answers scraped from multiple sources in the Web (Joshi et al., 2017).

Results: Retriever Training
In this section, we compare different approaches to train the retriever. Retrieval accuracy is evaluated using the top-k metric (k ∈ {1, 5, 20, 100}), with results shown in Table 2. We observe that incorporating relevance score scaling and longer training, up to 80 epochs, improves the top-5 and top-20 accuracy by 1.5-2 points. These results also suggest that the original DPR model was significantly undertrained and not fully optimized.
In addition to score scaling, we further include one additional hard-negative example per question-context pair (similar to DPR) and train the model for 80 epochs. Our results, consistent with those of DPR, show substantial additional gains in performance. These findings highlight that relevance score scaling, longer training, and including a hard-negative example are all essential to improving the supervised retriever's accuracy. These supervised training results constitute a very strong baseline, and we employ these settings in subsequent experiments.

Effect of Retriever Initialization
We first characterize the zero-shot retriever's performance when its weights are initialized with BERT, ICT, or masked salient spans pre-training (Table 3). Since unsupervised language models are known to perform poorly on information retrieval tasks, BERT initialization unsurprisingly leads to poor retrieval accuracy. ICT initialization, in contrast, is quite effective at providing a non-trivial zero-shot accuracy, which masked salient spans training further improves by more than 8 points. Both unsupervised approaches thus demonstrate their utility in effectively bootstrapping the retriever almost from scratch. We next empirically analyze our proposed approach of pre-training with ICT and masked salient spans followed by supervised finetuning. It provides absolute improvements of 2-3 points over the already strong supervised training results, with gains consistent across both datasets. These results highlight that even after finetuning the retriever with thousands of labeled examples, the discriminative properties learned during ICT and masked salient spans pre-training are not catastrophically forgotten. Another merit is that, being unsupervised, these methods can leverage large text collections to pre-train the retriever, a considerable advantage over data-augmentation methods that rely on the availability of human-annotated question-context pairs. Furthermore, when comparing ICT with masked salient spans initialization, we note that their accuracy gains are roughly similar.

Effect of Amount of Training Data
We study the effect on accuracy when the retriever is pre-trained with BERT, ICT, or masked salient spans and the amount of supervised training data is varied. We train the retriever with 1%, 2%, 5%, and 10-50% of NQ's training data and plot the top-20 accuracy in Figure 3. The results reveal that in the low-resource regime, masked salient spans pre-training is much more effective than ICT, consistently leading to large gains. As the fraction of training data increases beyond 40% towards a high-resource setup, the gains from salient spans pre-training saturate to those of ICT. We believe these findings have important implications for future research in OpenQA: with only a few hundred examples, the expensive masked salient span training is beneficial, while with thousands of training examples, ICT is just as effective as masked salient spans training.

Effect of End-to-End Training
For end-to-end training, the retriever weights are initialized with the previous best setting of ICT pre-training and supervised finetuning. The number of retrieved evidence documents for the reader is treated as a hyperparameter selected via performance on the dev set. The focus here is to analyze the effect on retrieval accuracy of updating the retriever weights using question-answer pairs in an end-to-end setting (Sec. 3). From the results in Table 4, we observe that for Individual Top-k, updating only the query encoder already improves retrieval accuracy. When the context encoder is also updated, the top-5 retrieval accuracy improves to 75%, a large gain of 8 points over the previous best DPR retriever. Larger models further improve performance, leading to new state-of-the-art results.
On the other hand, for Joint Top-k, updating the query encoder improves only the top-1 score and does not yield much gain for higher values of k. We also do not update the context encoder for Joint Top-k, as it did not yield improvements in our initial experiments. These results show that when the retriever is already well-initialized, the objective of the Individual Top-k method significantly improves retrieval accuracy, while the Joint Top-k method does not. As we show next, the usefulness of the Joint Top-k method lies in answer extraction.

Intuition for Retriever Score Scaling
Retrieval score scaling is used when computing the probability distribution over the retrieved documents according to Eq. 2, where the retrieval score is normalized by the scaling factor τ. To study the effect of τ on retrieval accuracy, we perform an ablation with different values of τ on the NQ retrieval task; the results are shown in Table 5. Here, we briefly explain the intuition behind the scaling factor. In our preliminary experiments on retriever training and end-to-end training without the scaling factor, we observed that a few of the top-k documents had very high similarity scores with the query, which in turn led to them being assigned very high retrieval probabilities. This produced a skewed probability distribution, with most of the mass centered on the top-1 or top-2 retrieved documents. A larger scaling factor results in a more even distribution of probability mass over the top-k documents, which in turn leads to better results in both retrieval accuracy and end-to-end training.
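This flattening effect of a larger τ can be demonstrated directly (toy scores; d = 768 is merely an illustrative hidden size):

```python
import numpy as np

def retrieval_probs(scores, tau):
    """Softmax over retrieval scores normalized by the scaling factor tau."""
    z = np.asarray(scores, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    """Shannon entropy: higher means a more even distribution."""
    return float(-(p * np.log(p)).sum())

scores = [40.0, 30.0, 28.0, 27.0]                 # one document dominates
p_small_tau = retrieval_probs(scores, tau=1.0)    # tau = 1 (previous work)
p_large_tau = retrieval_probs(scores, tau=np.sqrt(768))  # tau = sqrt(d)
```

With τ = 1 nearly all probability mass collapses onto the top document, whereas τ = √d spreads the mass far more evenly over the top-k set.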

Results: Answer Extraction
We next present the results of end-to-end training on answer extraction. To train the model, retriever weights are initialized with ICT pre-training and supervised finetuning while the reader is initialized with pre-trained T5 weights. The number of retrieved evidence documents for the reader is tuned on the dev set. Results are reported using the conventional Exact Match (EM) metric.
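As a concrete reference, a SQuAD-style EM implementation is sketched below; the exact normalization rules (lowercasing, stripping punctuation and articles) are an assumption on our part, since the text above only says the metric is the conventional EM:

```python
import re
import string

def normalize(text):
    """Assumed SQuAD-style answer normalization: lowercase, drop
    punctuation, drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g)
                     for g in gold_answers))

em = exact_match("The Eiffel Tower", ["Eiffel Tower", "eiffel tower."])
# em -> 1.0: normalization removes the article and punctuation
```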

Individual Top-k Approach
We compare our results, presented in Table 6, with recent related approaches in OpenQA. For the base configuration on NQ, our model outperforms both REALM and DPR by more than 4 points. For the large configuration, we compare with the RAG model (Lewis et al., 2020c), which our approach outperforms by 3.5+ points on NQ and 2.8 points on TriviaQA. Our improved results stem from a more accurate initial retriever, a stronger reader, and updating both the query and context encoders during training. Our analysis in Figure 4 reveals that updating the context encoder improves results for both the base and large configurations. Somewhat surprisingly, we also observe that the performance of the Individual Top-k approach is sensitive to the number of top-k documents and can even decrease as top-k increases. We leave an in-depth investigation of this as future work.

Joint Top-k Approach
We compare our results with the recent Fusion-in-Decoder (FiD) approach (Izacard and Grave, 2020), which also performs joint encoder-decoder attention. It uses DPR as the retriever and T5 as the reader, initialized with their open-source weights; however, unlike our approach, FiD finetunes only the reader weights. Our results in Table 7 show that for the base configuration, Joint Top-k outperforms the FiD model by 1 point on NQ, highlighting the significance of end-to-end training. For the large configuration, we obtain a gain of 0.7 points on TriviaQA.
Our analysis in Figure 5 shows that the EM scores improve with more retrieved documents (Table 7 reports the answer extraction results for the Joint Top-k approach). This highlights that, in contrast to Individual Top-k, Joint Top-k better aggregates information from multiple retrieved documents. It also signifies that with more retrieved documents, the utility of end-to-end training tends to diminish, which explains the lower gains in retrieval performance observed for Joint Top-k in Table 4.

Overall Comparison
Based on the discussions in Sec. 5.4 and Sec. 6, we remark that end-to-end training using the two approaches has a complementary effect on the retrieval accuracy and answer extraction. While the Individual Top-k approach helps to significantly improve the retrieval performance, the Joint Top-k approach is more useful for answer extraction.
Related Work
Yih et al. (2011) proposed a discriminative approach to train a retriever by learning dense representations of query and context documents based on word frequency. However, this approach was data-hungry and not scalable. Recent work (Karpukhin et al., 2020) addresses this by leveraging pre-trained BERT weights to train a dual-encoder retriever using smaller amounts of question-context pairs. In particular, Lee et al. (2019) first pre-train the retriever in an unsupervised manner using ICT and then jointly train the retriever and reader for OpenQA. On the other hand, Karpukhin et al. (2020) perform supervised training of the retriever using hard-negative examples, yielding impressive results on several retrieval benchmarks.
To improve the retrieval accuracy of the dual-encoder model, Chang et al. (2020) explore several paragraph-level pre-training strategies, including the application of ICT. They demonstrate the effectiveness of pre-training over sparse-retrieval approaches such as BM25. Their evidence consisted of the training documents, which was further increased to 1M documents for OpenQA. Our work differs from theirs in several ways. First, our OpenQA setup is more challenging, as the evidence consists of 21M documents. Second, we pre-train with two strategies, ICT and masked salient spans, and finetune using strong supervised methods, which leads to much improved results. Third, we further update the retriever with end-to-end training leveraging question-answer pairs, which further improves retrieval accuracy, leading to new state-of-the-art results.
A new line of work investigates task-specific pre-training of language models. For example, Guu et al. (2020) predict masked salient spans consisting of named entities to pre-train the reader and retriever components for OpenQA. Similarly, Lewis et al. (2020a) perform cross-lingual pre-training, where the objective is to predict a sequence from its paraphrases in different languages, demonstrating improved zero-shot performance on document translation tasks.

Conclusion
We propose approaches to improve the retrieval accuracy of the dual-encoder model for the OpenQA task. We first perform a systematic investigation of the importance of pre-training with the ICT and masked salient spans tasks for supervised training of the retriever. We then present two approaches for end-to-end training of the reader and retriever components in OpenQA: in one, the reader considers each retrieved document individually, while in the other, the reader considers all the retrieved documents jointly. Overall, these methods help achieve state-of-the-art results on both retrieval and answer extraction.

Broader Impact and Ethics Statement
To understand the ethical context of our work on open-domain question answering, it is important to consider the real-world use cases and the individuals who may interact with systems built on our proposed methods. Potential real-world applications include search engines and virtual assistants, where our techniques could improve question-answering ability. However, it is worth noting that our trained systems cannot be deployed off-the-shelf for such applications, given that our models were trained on the Natural Questions and TriviaQA datasets with the goal of matching those specific training data distributions. Real-world applications building on our work should be re-trained using a custom training dataset relevant to the kinds of queries that originate in practice.
Our system represents a prototype model for answering questions over Wikipedia and could be extended for use in sensitive contexts such as legal or health-care settings. However, extensive and robust quality-assurance testing would be needed, as our system was not designed to meet the criteria of those settings. More generally, social biases could be introduced by the training data. Since we did not control or regularize our model to remove such biases, we urge users to undertake the necessary quality-assurance testing to evaluate the extent to which such biases might be present, to understand how much these biases impact their trained system, and to modify their training data and procedures accordingly.

A Training Details
We provide the training details of all the experiments below. We use the same training settings for both the base and large model configurations and use the open-source Megatron-LM toolkit (Shoeybi et al., 2019) to implement the models. To train the models, we employed mixed-precision training (Micikevicius et al., 2018) and leveraged the distributed training features of the PyTorch framework (Li et al., 2020). All of our experiments were performed on the Selene cluster, which consists of NVIDIA A100 GPUs.

A.1 Language Models Training
We train BERT (Lan et al., 2020) and T5 (Raffel et al., 2020) language models from scratch; the hyperparameters for both the base and large configurations are detailed in Table 8. We used 32 GPUs to train the BERT-large (330M) model and 128 GPUs to train the T5-large (770M) model.

A.2 Retriever Training
Supervised: We use the Adam optimizer (Kingma and Ba, 2015), a batch size of 128, and a learning rate of 2e-5 with linear decay, and train for 80 epochs. Training was performed on 16 GPUs.
ICT training: We initialize the parameters of both the question and context encoders using BERT weights trained with Megatron-LM. We train the model on Wikipedia paragraphs with a maximum length of 256 tokens. We use a batch size of 4,096 and a learning rate of 1e-4 with linear decay, and train the model for 100,000 steps using the Adam optimizer. This corresponds to training for roughly 20 epochs over the Wikipedia dataset. We set the weight decay to 0.01 and the warmup ratio of the optimizer to 0.01. With a probability of 0.1, we also keep the query sentence in the context. We train the large ICT model using 128 GPUs.
Masked salient spans generative training: We initialize the retriever with ICT training and pre-train the T5 reader on an aggregated dataset from (Shoeybi et al., 2019). We use the pre-trained models provided by the Stanza toolkit (Qi et al., 2020) to segment Wikipedia paragraphs into sentences and extract named entities. The masked sentence is used as a query to retrieve evidence documents, with the help of which the reader predicts the masked words. The model is trained according to Eqs. 5 and 6. We train the model for 100,000 steps with the Adam optimizer using a learning rate of 2e-5 and a warmup ratio of 0.05. Similar to Guu et al. (2020), we compute the evidence embeddings asynchronously and update the evidence index every 500 steps. Training was performed on 240 GPUs.

A.3 End-to-End Supervised Training
As the performance of the ICT pre-trained retriever and masked salient spans pre-trained retriever is similar when all the training data is used (Sec. 5.2), we select the retriever pre-trained with ICT initialization and finetuned with supervised data. For the reader, we use a pre-trained T5 model. For all experiments, we train for 10 epochs using a batch size of 64, learning rate of 2e-5 with linear decay, and weight regularization of 0.1. For Individual Top-k approach, during training, the evidence embeddings index is refreshed after every 500 steps. The number of retrieved evidence documents for the reader is considered as a hyperparameter and is selected via performance on the dev set. Training of Individual Top-k was performed on 240 GPUs while training of Joint Top-k was performed on 64 GPUs.
For retrieving the top-k documents from our evidence (∼21M documents), we perform exact search. Specifically, we use the matrix multiplication and top-k functionality provided by the PyTorch framework. Matrix multiplication is highly optimized for GPU computation, and we observed that performing exact search was not a bottleneck during training. We therefore did not approximate the similarity search using LSH (Andoni et al., 2015) or efficient maximum inner product search (Shrivastava and Li, 2014).
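The exact search reduces to one matrix product followed by a top-k selection. A minimal sketch is given below; it uses NumPy for self-containment, whereas our implementation uses the equivalent `torch.matmul` and `torch.topk` operations on GPU:

```python
import numpy as np

def exact_topk(query_emb, evidence_embs, k):
    """Exact maximum inner product search over the evidence index.

    query_emb:     (d,) query embedding
    evidence_embs: (n, d) matrix of evidence embeddings
    Returns the indices and scores of the k highest-scoring documents,
    sorted by decreasing score.
    """
    scores = evidence_embs @ query_emb  # one matrix-vector product over the index
    topk_idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k candidates
    topk_idx = topk_idx[np.argsort(-scores[topk_idx])]  # sort the k hits
    return topk_idx, scores[topk_idx]
```

Because the whole index fits in accelerator memory, this brute-force scan over ∼21M embeddings remains fast enough that no approximate index is needed.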
NQ and TriviaQA Specific Details: For both datasets, during training we uniformly sample the target answer from the list of provided answers. For answer extraction, similar to (Guu et al., 2020), we do not append the title of the Wikipedia article to the corresponding top-k retrieved documents in the reader's input.

A.4 Individual Top-k Inference
During inference, the reader model first greedily generates an answer for each retrieved document. We then score each generated answer using Eq. 5 and finally select the answer with the highest likelihood score.
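This decode-then-rescore procedure can be sketched as follows; `generate` and `score` are placeholders standing in for the reader's greedy decoding and the Eq. 5 likelihood, respectively:

```python
def individual_topk_inference(question, retrieved_docs, generate, score):
    """Individual Top-k inference.

    For each retrieved document, greedily decode one candidate answer,
    then rescore every candidate and return the one with the highest
    likelihood.

    generate(question, doc) -> str          greedy decode from the reader
    score(question, doc, answer) -> float   likelihood score (Eq. 5)
    """
    # One candidate answer per retrieved document.
    candidates = [(doc, generate(question, doc)) for doc in retrieved_docs]
    # Keep the candidate with the highest likelihood score.
    _, best_answer = max(candidates, key=lambda da: score(question, da[0], da[1]))
    return best_answer
```

Note that the rescoring step lets a candidate decoded from a lower-ranked document win if its overall likelihood is higher.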

A.5 Example Outputs from Retriever
We present a few examples in Table 9.

B.1 For all reported experimental results
• The average runtime for each model or algorithm, or estimated energy cost: We provide the average runtime and compute used for training the different models in Appendix A. However, we want to highlight that our code was not carefully optimized to minimize runtime or to make optimal use of the hardware resources.
• The number of parameters in each model: We provide the number of parameters of each model in Sec. 4.2.
• Corresponding validation performance for each reported test result: Validation set performance is currently not reported in the main paper. However, we followed a rigorous experimentation protocol and selected the best models by their performance on the validation set. If the program committee or reviewers require the validation set performance, we will include it in the final version of the paper.
• A clear definition of the specific evaluation measure or statistics used to report results: Our evaluation metrics are standard and widely used by the question answering community. We provide their details in the main paper in Sec. 5 and Sec. 6.
Question from NQ test: what parts make up the peripheral nervous system
Answer: autonomic nervous system
Top-1 Document Retrieved by ICT + Supervised: . . . The connection between CNS and organs allows the system to be in two different functional states: sympathetic and parasympathetic. The peripheral nervous system is divided into the somatic nervous system, and the autonomic nervous system. The somatic nervous system is under voluntary control, and transmits signals from the brain to end organs such as muscles. The sensory nervous system is part of the somatic nervous system and transmits signals from senses such as taste and touch (including fine touch and gross touch) to the spinal cord and brain. . . .

Question from NQ test: who challenged the aristotelian model of a geocentric universe
Answer: Copernicus
Top-1 Document Retrieved by ICT + Supervised: . . . ("On the Revolutions of the Heavenly Spheres"), which posited that the Earth and the other planets instead revolved around the Sun. The geocentric system was still held for many years afterwards, as at the time the Copernican system did not offer better predictions than the geocentric system, and it posed problems for both natural philosophy and scripture. The Copernican system was no more accurate than Ptolemy's system, because it still used circular orbits. This was not altered until Johannes Kepler postulated that they were elliptical (Kepler's first law of planetary motion). . . .

Table 9: Examples of top-1 retrieved documents from the NQ test as outputted from the ICT + Supervised retriever. If the answer exists in the document, it is highlighted in bold.

B.2 For all results involving multiple experiments, such as hyperparameter search
• The exact number of training and evaluation runs: We provide training details for all models in Appendix A. Specifically, for the finetuning experiments, we train the models until convergence, which corresponds to 80 epochs for the retriever models and 10 epochs for the answer extraction models. We evaluate the model after each epoch on the validation set and save the best checkpoint according to its performance on the corresponding evaluation metric.
• Hyperparameter configurations for best-performing models: We provide the hyperparameter settings in Appendix A.
• The method of choosing hyperparameter values (e.g., uniform sampling, manual tuning, etc.) and the criterion used to select among them (e.g., accuracy): We performed manual hyperparameter tuning, including tuning the number of warmup steps for the Adam optimizer. We selected the best hyperparameters based on performance on the validation set.
• Summary statistics of the results (e.g. mean, variance, error bars, etc.): All of our experiments are compute-intensive, large-scale runs that use substantial CPU and GPU resources and take from tens of hours to several days. Due to these computational and time constraints, performing multiple runs for each experiment was not feasible. We therefore used the same seed value (1234) for all training runs, including both pre-training and finetuning experiments.

B.3 For all datasets used
• Details of train/validation/test splits: We use the standard training / dev / test splits whose details are provided in Sec. 4.
• Relevant statistics such as number of examples and label distributions: We provide dataset statistics details in Table 1.
• An explanation of any data that were excluded, and all pre-processing steps: We include the relevant details in Sec. 4.
• For natural language data, the name of the language(s): Our datasets are in English.
• A link to a downloadable version of the dataset or simulation environment: Both the NQ and TriviaQA datasets are open-source and widely used by the community. NQ is available at: https://ai.google.com/research/NaturalQuestions/download.
• For new data collected, a complete description of the data collection process, such as instructions to annotators and methods for quality control: This is not applicable to this work.