Efficient Passage Retrieval with Hashing for Open-domain Question Answering

Most state-of-the-art open-domain question answering systems use a neural retrieval model to encode passages into continuous vectors and extract them from a knowledge source. However, such retrieval models often require large memory to run because of the massive size of their passage index. In this paper, we introduce Binary Passage Retriever (BPR), a memory-efficient neural retrieval model that integrates a learning-to-hash technique into the state-of-the-art Dense Passage Retriever (DPR) to represent the passage index using compact binary codes rather than continuous vectors. BPR is trained with a multi-task objective over two tasks: efficient candidate generation based on binary codes and accurate reranking based on continuous vectors. Compared with DPR, BPR substantially reduces the memory cost from 65GB to 2GB without a loss of accuracy on two standard open-domain question answering benchmarks: Natural Questions and TriviaQA. Our code and trained models are available at https://github.com/studio-ousia/bpr.


Introduction
Open-domain question answering (QA) is the task of answering arbitrary factoid questions based on a knowledge source (e.g., Wikipedia). Recent state-of-the-art QA models are typically based on a two-stage retriever-reader approach (Chen et al., 2017), using a retriever that obtains a small number of relevant passages from a large knowledge source and a reader that processes these passages to produce an answer. Most recent successful retrievers encode questions and passages into a common continuous embedding space using two independent encoders (Lee et al., 2019; Karpukhin et al., 2020; Guu et al., 2020). Relevant passages are retrieved using a nearest neighbor search on the index containing the passage embeddings, with a question embedding as the query. These retrievers often outperform classical methods (e.g., BM25), but they incur a large memory cost due to the massive size of their passage index, which must be stored entirely in memory at runtime. For example, the index of a common knowledge source (e.g., Wikipedia) requires dozens of gigabytes. A reduction in the index size is critical for real-world QA that requires large knowledge sources such as scientific databases (e.g., PubMed) and web-scale corpora (e.g., Common Crawl).
In this paper, we introduce Binary Passage Retriever (BPR), which learns to hash continuous vectors into compact binary codes using a multi-task objective that simultaneously trains the encoders and hash functions in an end-to-end manner (see Figure 1). In particular, BPR integrates our learning-to-hash technique into the state-of-the-art Dense Passage Retriever (DPR) (Karpukhin et al., 2020) to drastically reduce the size of the passage index by storing it as binary codes. BPR computes binary codes by applying the sign function to continuous vectors. As the sign function is not compatible with back-propagation, we approximate it using the scaled tanh function during training. To improve search-time efficiency while maintaining accuracy, BPR is trained to obtain both binary codes and continuous embeddings for questions with multi-task learning over two tasks: (1) candidate generation based on the Hamming distance using the binary code of the question and (2) reranking based on the inner product using the continuous embedding of the question. The former task aims to detect a small number of candidate passages efficiently from the entire passage collection, and the latter aims to rerank the candidate passages accurately.
We conduct experiments using the Natural Questions (NQ) (Kwiatkowski et al., 2019) and TriviaQA (TQA) (Joshi et al., 2017) datasets. Compared with DPR, our BPR achieves similar QA accuracy and competitive retrieval performance with a substantially reduced memory cost from 65GB to 2GB. Furthermore, using an improved reader, we achieve results that are competitive with those of the current state of the art in open-domain QA. Our code and trained models are available at https://github.com/studio-ousia/bpr.

Related Work
Retrieval for Open-domain QA Many recent open-domain QA models depend on a retriever to select relevant passages from a knowledge source. Early work adopted sparse representations (Chen et al., 2017) for the retriever, whereas recent work (Lee et al., 2019; Guu et al., 2020; Karpukhin et al., 2020) has often adopted dense representations based on neural networks. Our work is an extension of DPR (Karpukhin et al., 2020), which has been used in recent state-of-the-art QA models (Lewis et al., 2020). A concurrent work attempted to reduce the memory cost of DPR using post-hoc product quantization with dimension reduction and filtering of passages; however, it observed a significant degradation in QA accuracy compared with its full model. We adopt the learning-to-hash method with our multi-task objective and substantially compress the index without losing accuracy.

Learning to Hash The objective of hashing is to reduce the memory and search-time cost of nearest neighbor search by representing data points using compact binary codes. Learning to hash (Wang et al., 2016, 2018) is a family of methods for learning hash functions in a data-dependent manner. Recently, many deep learning-to-hash methods have been proposed (Lai et al., 2015; Zhu et al., 2016; Li et al., 2016; Cao et al., 2017, 2018) that jointly learn feature representations and hash functions in an end-to-end manner. We follow Cao et al. (2017) in implementing our hash functions. Similar to our work, Xu and Li (2020) used learning to hash to reduce the computational cost of answer sentence selection, a task whose objective is to select an answer sentence from a limited number of candidates (up to 500 in their experiments). Our work differs from the aforementioned work in that we focus on efficient and scalable passage retrieval from a large knowledge source (21M Wikipedia passages in our experiments) using an effective multi-task approach.
In addition to hashing-based methods, improving approximate nearest neighbor search has been actively studied (Jégou et al., 2011; Malkov and Yashunin, 2020; Guo et al., 2020). We use Jégou et al. (2011) and Malkov and Yashunin (2020) as baselines in our experiments.

Model
Given a question and a large-scale passage collection such as Wikipedia, a retriever finds relevant passages that are subsequently processed by a reader. Our retriever is built on DPR (Karpukhin et al., 2020), a retriever based on BERT (Devlin et al., 2019). In this section, we first introduce DPR and then explain our model.

Dense Passage Retriever (DPR)
DPR uses two independent BERT encoders to encode question q and passage p into d-dimensional continuous embeddings:

e_q = Enc_q(q), e_p = Enc_p(p),

where e_q ∈ R^d and e_p ∈ R^d. We use the uncased version of BERT-base; therefore, d = 768. Each embedding is the output representation of the [CLS] token of the corresponding encoder. The input of the passage encoder is the concatenation of the passage title and text ([CLS] title [SEP] passage [SEP]). The relevance score of passage p given question q is computed as the inner product of the corresponding embeddings, ⟨e_q, e_p⟩.
Training Let {⟨q_i, p_i^+, p_{i,1}^-, …, p_{i,n}^-⟩}_{i=1}^m be m training instances, each consisting of a question q_i, a passage that answers the question (positive passage) p_i^+, and n passages that are irrelevant to the question (negative passages) p_{i,j}^-. The model is trained by minimizing the negative log-likelihood of the positive passage:

L_dpr = -log [ exp(⟨e_{q_i}, e_{p_i^+}⟩) / ( exp(⟨e_{q_i}, e_{p_i^+}⟩) + Σ_{j=1}^n exp(⟨e_{q_i}, e_{p_{i,j}^-}⟩) ) ].

Inference DPR creates a passage index by applying the passage encoder to each passage in the knowledge source. At runtime, it retrieves the top-k passages using maximum inner product search with the question embedding as the query.


Model Architecture
Figure 1 shows the architecture of BPR. BPR builds its passage index by computing a binary code for each passage in the knowledge source. To compute binary codes for questions and passages, we add a hash layer on top of the question and passage encoders of DPR. Given an embedding e ∈ R^d computed by an encoder, the hash layer computes its binary code as

h = sign(e),

where sign(·) is the sign function, which outputs 1 if its input is positive and -1 otherwise. However, the sign function is incompatible with back-propagation because its gradient is zero for all non-zero inputs and is ill-defined at zero. Inspired by Cao et al. (2017), we address this by approximating the sign function with the scaled tanh function during training:

h̃ = tanh(β e),

where β is a scaling parameter. As β increases, the function becomes increasingly non-smooth, and as β → ∞, it converges to the sign function. At each training step, we increase β by setting β = √(γ · step + 1), where step is the number of completed training steps. We set γ = 0.1 and examine the effects of changing it in Appendix B.
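As a rough illustration, the annealed binarization described above can be sketched in numpy (a minimal sketch; the function name and the step=None convention for inference are ours):

```python
import numpy as np

def hash_layer(e, step=None, gamma=0.1):
    """Binarize an embedding.

    At inference time (step=None), the sign function maps each dimension
    to +1/-1. During training, the differentiable surrogate
    tanh(beta * e) is used instead, with beta annealed toward infinity
    via beta = sqrt(gamma * step + 1).
    """
    if step is None:
        # sign(.), with the convention that sign(0) = 1
        return np.where(e >= 0, 1.0, -1.0)
    beta = np.sqrt(gamma * step + 1.0)
    return np.tanh(beta * e)

e = np.array([1.2, -0.3, 0.05, -2.0])
print(hash_layer(e))              # exact binary code: [ 1. -1.  1. -1.]
print(hash_layer(e, step=0))      # beta = 1: plain tanh, fully smooth
print(hash_layer(e, step=10**6))  # large beta: close to the binary code
```

Note how early steps give a smooth, trainable surrogate, while late steps approach the hard sign function used at inference.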

Two-stage Approach
To reduce the computational cost without losing accuracy, BPR splits the retrieval task into candidate generation and reranking stages. At the candidate generation stage, we efficiently obtain the top-l candidate passages using the Hamming distance between the binary code of the question, h_q, and that of each passage, h_p. We then rerank the l candidate passages using the inner product between the continuous embedding of the question, e_q, and the binary code of each passage, h_p, and select the top-k passages from the reranked candidates. We perform candidate generation with the binary code h_q for search-time efficiency, and reranking with the expressive continuous embedding e_q for accuracy. We set l = 1000 and describe the effects of using different l values in Appendix C.
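Because the Hamming distance between +/-1 codes is a linear function of their inner product, the two stages can be sketched with numpy as follows (a toy sketch with random data; l and k follow the paper, all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_passages = 768, 10_000

# Passage index: binary codes in {+1, -1}^d (packed into bits in practice).
h_p = np.sign(rng.standard_normal((num_passages, d)))
# Question representations: a binary code and a continuous embedding.
e_q = rng.standard_normal(d)
h_q = np.sign(e_q)

# Stage 1: candidate generation by Hamming distance. For +/-1 codes,
# hamming(h_q, h_p) = (d - <h_q, h_p>) / 2, so ranking by Hamming
# distance is equivalent to ranking by the binary inner product.
hamming = (d - h_p @ h_q) / 2
l = 1000
candidates = np.argsort(hamming)[:l]

# Stage 2: rerank the l candidates with the expressive continuous
# question embedding against the stored binary passage codes.
scores = h_p[candidates] @ e_q
k = 100
top_k = candidates[np.argsort(-scores)[:k]]
print(top_k[:5])
```

In a real index, stage 1 would operate on bit-packed codes with hardware popcount rather than dense +/-1 float arrays.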

Training
To compute effective representations for both the candidate generation and reranking stages, we combine the loss functions of the two tasks:

L = L_cand + L_rerank.

Task #1: Candidate Generation The objective of this task is to improve candidate generation using the ranking loss with the approximated hash code of the question, h̃_q, and that of each passage, h̃_p:

L_cand = Σ_{j=1}^n max(0, -(⟨h̃_{q_i}, h̃_{p_i^+}⟩ - ⟨h̃_{q_i}, h̃_{p_{i,j}^-}⟩) + α),

where α is a margin parameter. We set α = 2 and investigate the effects of selecting different α values and of using the cross-entropy loss instead of the ranking loss in Appendix D. Note that this loss function optimizes retrieval performance based on the Hamming distance, because the Hamming distance and the inner product can be used interchangeably for binary codes.

Task #2: Reranking We improve the reranking stage using the following loss function:

L_rerank = -log [ exp(⟨e_{q_i}, h̃_{p_i^+}⟩) / ( exp(⟨e_{q_i}, h̃_{p_i^+}⟩) + Σ_{j=1}^n exp(⟨e_{q_i}, h̃_{p_{i,j}^-}⟩) ) ].

This function is equivalent to L_dpr, except that h̃_p is used instead of e_p.
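For a single training instance, the combined objective can be sketched in numpy as follows (a sketch only; in training these quantities are differentiable tensors, and the function name is ours):

```python
import numpy as np

def bpr_loss(h_q, e_q, h_pos, h_negs, alpha=2.0):
    """Multi-task loss for one training instance (numpy sketch).

    h_q, h_pos, h_negs: approximated hash codes (scaled-tanh outputs)
    for the question, the positive passage, and the n negative passages.
    e_q: continuous question embedding.
    """
    # Task #1: ranking (hinge) loss on binary-code scores, summed over
    # the n negatives, with margin alpha.
    pos_cand = h_q @ h_pos            # scalar
    neg_cand = h_negs @ h_q           # shape (n,)
    l_cand = np.maximum(0.0, -(pos_cand - neg_cand) + alpha).sum()

    # Task #2: negative log-likelihood of the positive passage, scoring
    # the continuous question embedding against the passage codes.
    logits = np.concatenate(([e_q @ h_pos], h_negs @ e_q))
    logits = logits - logits.max()    # numerical stability
    l_rerank = -np.log(np.exp(logits[0]) / np.exp(logits).sum())
    return l_cand + l_rerank
```

Both terms are non-negative, so the combined loss is minimized only when the binary codes rank the positive above every negative by the margin and the continuous-vs-binary scores concentrate on the positive.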

Algorithms for Candidate Generation
To perform candidate generation, we test two standard algorithms: (1) linear scan based on efficient Hamming distance computation, and (2) hash table lookup, implemented by building a hash table that maps each binary code to the corresponding passages and querying it multiple times, increasing the Hamming radius until l passages are obtained.
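The hash table lookup strategy can be sketched as follows (a toy sketch with 4-bit codes and hypothetical helper names; BPR's codes are 768-bit, for which only small radii are ever probed in practice):

```python
from collections import defaultdict
from itertools import combinations

def build_table(codes):
    """Map each binary code (stored as an int bitmask) to its passage ids."""
    table = defaultdict(list)
    for pid, code in enumerate(codes):
        table[code].append(pid)
    return table

def lookup(table, query, num_bits, l):
    """Collect passages by probing codes of increasing Hamming radius."""
    results = []
    for radius in range(num_bits + 1):
        # probe every code that differs from the query in `radius` bits
        for flips in combinations(range(num_bits), radius):
            probe = query
            for b in flips:
                probe ^= 1 << b
            results.extend(table.get(probe, []))
        if len(results) >= l:
            break
    return results[:l]

codes = [0b1010, 0b1011, 0b0000, 0b1110]
table = build_table(codes)
print(lookup(table, 0b1010, num_bits=4, l=3))  # -> [0, 1, 3]
```

The exact match (radius 0) is found first, then the two codes at Hamming distance 1, after which probing stops because l passages have been collected.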

Experiments
Datasets We conduct experiments using the NQ and TQA datasets, with English Wikipedia as the knowledge source. We use the following preprocessed data available on the DPR website (https://github.com/facebookresearch/DPR): the Wikipedia corpus containing 21M passages, and the training/validation datasets for the retriever containing multiple positive, random negative, and hard negative passages for each question.
Baselines Following Karpukhin et al. (2020), we use DPR with linear scan and DPR with Hierarchical Navigable Small World (HNSW) graphs (Malkov and Yashunin, 2020), which build a multi-layer structure consisting of a hierarchical set of proximity graphs, as our primary baselines. We also apply two popular post-hoc quantization algorithms to the DPR passage index: simple locality sensitive hashing (LSH) (Neyshabur and Srebro, 2015) and product quantization (PQ) (Jégou et al., 2011). We configure these algorithms such that their passage representations have the same size as those of BPR: the number of bits per passage of LSH is set to 768, and the number of centroids and the code size of PQ are set to 96 and 8 bits, respectively.
Experimental settings Our experimental setup follows Karpukhin et al. (2020). We evaluate our model based on its top-k recall (the percentage of positive passages contained in the top-k retrieved passages), retrieval efficiency (index size and query time), and exact match (EM) QA accuracy, measured by combining our model with a reader. We use the same BERT-based reader as DPR, and our model is trained using the same method as DPR. We conduct experiments on servers with two Intel Xeon E5-2698 v4 CPUs and eight Nvidia V100 GPUs. The passage index is built using Faiss (Johnson et al., 2019). Further details are provided in Appendix A.

Main results
Table 1 presents the top-k recall (for k ∈ {1, 20, 100}), EM QA accuracy, index size, and query time achieved by BPR and the baselines on the NQ and TQA test sets. BPR achieves similar or even better performance than DPR, both in retrieval with k ≥ 20 and in EM accuracy, while substantially reducing the index size from 65GB to 2GB. BPR performs worse than DPR for k = 1, but recall at small k is typically less important because the reader usually produces an answer from k ≥ 20 passages. BPR significantly outperforms all quantization baselines. The query time of BPR is substantially shorter than that of DPR. Hash table lookup is faster than linear scan but requires slightly more storage. DPR+HNSW is faster than BPR but requires 151GB of storage.
Ablations Table 2 shows the results of our ablation study. Disabling the reranking clearly degrades performance, demonstrating the effectiveness of our two-stage approach. Disabling the candidate generation (treating all passages as candidates) results in the same performance as using only the top-1000 candidates, but significantly increases the query time due to the expensive inner product computation over all passage embeddings.
Comparison with State of the Art Table 3 presents the EM QA accuracy of BPR combined with state-of-the-art reader models. Here, we also report the results of our model using an improved reader based on ELECTRA-large (Clark et al., 2020) instead of BERT-base. Our improved model outperforms all models except the large model of Fusion-in-Decoder (FiD), which contains more than twice as many parameters as our model.

Conclusion
We introduced BPR, an extension of DPR based on a learning-to-hash technique and a novel two-stage approach. BPR reduces the computational cost of open-domain QA without a loss in accuracy.

A.2 Question Answering Datasets
We conduct experiments using the NQ and TQA datasets, with the same training, development, and test sets as Lee et al. (2019) and Karpukhin et al. (2020). A brief description of these datasets follows:

• NQ is a QA dataset whose questions are obtained from Google queries and whose answers comprise spans of English Wikipedia articles.

• TQA consists of trivia questions and their answers retrieved from the Web.

We use the preprocessed datasets available on the website of Karpukhin et al. (2020) (https://github.com/facebookresearch/DPR). The numbers of questions contained in these datasets are listed in Table 4. For each question, the dataset contains three types of passages: (1) positive passages selected based on gold-standard human annotations or distant supervision, (2) random negative passages selected randomly from all passages, and (3) hard negative passages selected based on the BM25 scores between the question and all passages.

A.3 Details of BPR
Our training configuration follows that of Karpukhin et al. (2020). In particular, for each question, we use one positive and one hard negative passage and create a mini-batch comprising 128 questions. We use in-batch negatives, wherein each positive passage in a mini-batch is treated as a negative passage for every question in the mini-batch other than the one it answers. Our model contains 220 million parameters and is trained for up to 40 epochs using Adam.
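The in-batch negatives objective can be sketched in numpy as follows (a sketch of the scoring and loss only; in training these are differentiable tensors, and the function name is ours):

```python
import numpy as np

def in_batch_negative_loss(q_emb, p_emb):
    """NLL with in-batch negatives (numpy sketch).

    q_emb: (B, d) question embeddings.
    p_emb: (B, d) representations of each question's positive passage.
    Passage j serves as a negative for question i whenever i != j, so
    the gold passage for each question lies on the diagonal of the
    B x B score matrix.
    """
    scores = q_emb @ p_emb.T                      # (B, B) score matrix
    scores = scores - scores.max(axis=1, keepdims=True)  # stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_softmax.diagonal().mean()
```

This trick yields B - 1 negatives per question at essentially no extra encoding cost, since every passage in the batch is encoded anyway.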
For the hyperparameter search, we select the learning rate from {1e-5, 2e-5, 3e-5, 5e-5} based on the top-100 recall on the validation set of the NQ dataset; the number of hyperparameter search trials is therefore 4. The detailed hyperparameters are listed in Table 5.

A.4 Details of Reader
For each of the top-k passages retrieved by the retriever, the reader assigns a relevance score to the passage and selects the best answer span within it. The final answer is the selected span from the passage with the highest relevance score. Let P_i ∈ R^{q×d} (1 ≤ i ≤ k) be the BERT output representation for the i-th passage, where q is the maximum token length of the passage and d is the dimension of the output representation. The probability of a passage being selected and the probabilities of a token being the start or end position of an answer are computed as

P_score(i) = softmax(P̂⊤ w_score)_i, (8)
P_start,i(s) = softmax(P_i w_start)_s, (9)
P_end,i(t) = softmax(P_i w_end)_t, (10)

where P̂ = [P_1^[CLS], …, P_k^[CLS]] ∈ R^{d×k}, w_score ∈ R^d, w_start ∈ R^d, and w_end ∈ R^d, and P_i^[CLS] denotes the output representation of the [CLS] token of the i-th passage. The passage selection score of the i-th passage is given as P_score(i), and the score of the span from the s-th to the t-th token of the i-th passage is given as P_start,i(s) × P_end,i(t).
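The span selection described above can be sketched in numpy (a sketch assuming token 0 is the [CLS] position; helper names are ours, and for brevity it only scores spans within the single best passage):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

def best_answer(P, w_score, w_start, w_end):
    """Select the highest-scoring answer span (numpy sketch).

    P: (k, q, d) BERT output representations for k passages, where
    token 0 of each passage is its [CLS] token.
    Returns (passage index, start token, end token) of the best span.
    """
    # Passage selection: softmax over the k [CLS] representations.
    p_score = softmax(P[:, 0, :] @ w_score)
    best = int(np.argmax(p_score))
    # Start/end distributions over the tokens of the best passage.
    p_start = softmax(P[best] @ w_start)   # shape (q,)
    p_end = softmax(P[best] @ w_end)       # shape (q,)
    # Span (s, t) is scored as P_start(s) * P_end(t), with s <= t.
    span = np.triu(np.outer(p_start, p_end))
    s, t = np.unravel_index(np.argmax(span), span.shape)
    return best, int(s), int(t)
```

The upper-triangular mask enforces that the start position never follows the end position.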
During training, we sample one positive and multiple negative passages from the passages returned by the retriever. The model is trained to maximize the log-likelihood of the correct answer span in the positive passage, combined with the log-likelihood of the positive passage being selected. We use BERT-base or ELECTRA-large as our pretrained model. For the hyperparameter search, we select the learning rate from {1e-5, 2e-5, 3e-5, 5e-5} based on EM accuracy on the validation set of the NQ dataset; the number of hyperparameter search trials is therefore 4. Detailed hyperparameters are listed in Table 6.

B Effects of Scaling Parameter
To investigate how the scaling parameter γ affects performance, we test our model with γ ∈ {0.025, 0.05, 0.1, 0.2}. The retrieval performance on the validation set of the NQ dataset is shown in Table 7. Overall, the scaling parameter has a minor impact on performance. We select γ = 0.1 because it performs best.

C Effects of Number of Candidate Passages
We report the performance of our model while varying the number of candidate passages l in Table 8. Overall, BPR achieves similar performance in all settings. Increasing the number of candidate passages slightly improves the top-100 performance up to l = 1000.

D Effects of Loss of Task #1 with Various Settings
We investigate the effects of various settings of the loss function L_cand in Eq. (6). Instead of the ranking loss, we test the cross-entropy loss, similar to Eq. (2), with h̃_q and h̃_p used in place of e_q and e_p, respectively. We also test how the margin parameter α affects performance. As shown in Table 9, the cross-entropy loss clearly performs worse than the ranking loss, and changing α has a minor impact on performance. We select the ranking loss with α = 2.0 because of its superior top-20 and top-100 performance.