Learning Dense Representations of Phrases at Scale

Open-domain question answering can be reformulated as a phrase retrieval problem, without the need for processing documents on-demand during inference (Seo et al., 2019). However, current phrase retrieval models heavily depend on sparse representations and still underperform retriever-reader approaches. In this work, we show for the first time that we can learn dense representations of phrases alone that achieve much stronger performance in open-domain QA. We present an effective method to learn phrase representations from the supervision of reading comprehension tasks, coupled with novel negative sampling methods. We also propose a query-side fine-tuning strategy, which can support transfer learning and reduce the discrepancy between training and inference. On five popular open-domain QA datasets, our model DensePhrases improves over previous phrase retrieval models by 15%-25% absolute accuracy and matches the performance of state-of-the-art retriever-reader models. Our model is easy to parallelize due to pure dense representations and processes more than 10 questions per second on CPUs. Finally, we directly use our pre-indexed dense phrase representations for two slot filling tasks, showing the promise of utilizing DensePhrases as a dense knowledge base for downstream tasks.


Introduction
Open-domain question answering (QA) aims to provide answers to natural-language questions using a large text corpus (Voorhees et al., 1999; Ferrucci et al., 2010). While the dominant approach is a two-stage retriever-reader pipeline (Chen et al., 2017; Guu et al., 2020), we focus on a recent new paradigm based solely on phrase retrieval (Seo et al., 2018, 2019). Phrase retrieval highlights the use of phrase representations and finds answers purely through similarity search in the vector space of phrases. Without relying on an expensive reader model (e.g., a 12-layer BERT model) for processing text passages, it has demonstrated great runtime efficiency at inference time. Table 1 compares the two approaches in detail.
Despite this great promise, it remains a formidable challenge to build effective dense representations for every single phrase in a large corpus (e.g., Wikipedia). First, since phrase representations need to be decomposable from question representations, they are often less expressive than query-dependent representations; this challenge gives rise to the decomposability gap described in Seo et al. (2018, 2019). Second, it requires retrieving answers correctly out of ten billion phrases, more than four orders of magnitude more than the number of documents in Wikipedia. Consequently, this approach has heavily relied on sparse representations for locating relevant documents and paragraphs, while still falling behind retriever-reader models (Seo et al., 2019).

In this work, we investigate whether we can build fully dense phrase representations at scale for open-domain QA. First, we attribute the decomposability gap to the sparsity of training data. We close this gap by generating questions for every answer phrase and by distilling knowledge from query-dependent models (Section 3). Second, we use negative sampling strategies such as in-batch negatives (Henderson et al., 2017) to approximate global normalization. We also propose a novel method called pre-batch negatives, which leverages preceding mini-batches as negative examples to compensate for the need for large-batch training (Section 4). Lastly, for task-specific adaptation of our model, we propose query-side fine-tuning, which drastically improves the performance of phrase retrieval without re-building billions of phrase representations (Section 5).

Together, these improvements enable us to learn a much stronger phrase retrieval model without using any sparse representations. We evaluate our final model, DensePhrases, on five standard open-domain QA datasets and achieve much better accuracies than previous phrase retrieval models (Seo et al., 2019): a 15% to 25% absolute improvement on most datasets. Our model also matches the performance of state-of-the-art retriever-reader models (Guu et al., 2020). Because of the removal of sparse representations and careful design choices, we further reduce the storage footprint for the full English Wikipedia from 1.5TB to 320GB, while improving throughput to more than 10 questions per second on CPUs.

Finally, we envision DensePhrases as a neural interface for retrieving phrase-level knowledge from a large text corpus. As such, it can be integrated into other knowledge-intensive NLP tasks beyond question answering. To showcase this possibility, we demonstrate that we can directly use DensePhrases for fact extraction, without re-building billions of phrase representations. By fine-tuning the query representations alone on a small number of subject-relation-object triples, we achieve state-of-the-art performance on two slot filling tasks (Petroni et al., 2020), using only 5% of the training data.

Open-domain QA
We first formulate the task of open-domain question answering for a set of K documents D = {d_1, ..., d_K}. Following recent work (Chen et al., 2017), we treat all of English Wikipedia as D, hence K ≈ 5 × 10^6. However, most approaches, including ours, are generic and could be applied to other collections of documents. The task aims to provide an answer â for an input question q based on D. In this work, we focus on the extractive QA setting, where each answer is a segment of text, or a phrase, that can be found in D. Denote the set of phrases in D as S(D); each phrase s_k ∈ S(D) consists of contiguous words w_start(k), ..., w_end(k) in its document d_doc(k). In practice, we consider all phrases of up to L = 20 words in D, so S(D) comprises roughly 6 × 10^10 phrases. An extractive QA system returns a phrase ŝ = argmax_{s ∈ S(D)} f(s | D, q), where f is a scoring function. The system finally maps ŝ to an answer string â: TEXT(ŝ) = â. Evaluation is typically done by comparing the predicted answer â with a gold answer a* (after normalization).
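To make the scale of S(D) concrete, the candidate set can be sketched as all contiguous spans of at most L = 20 words. A minimal illustration (function and variable names are ours, not from the paper):

```python
# Hypothetical sketch: enumerating the candidate phrase set S(D) used in
# extractive open-domain QA. Each phrase s_k is a span w_start(k) ... w_end(k).

def enumerate_phrases(words, max_len=20):
    """Return all contiguous spans (start, end), inclusive, of up to max_len words."""
    spans = []
    for start in range(len(words)):
        for end in range(start, min(start + max_len, len(words))):
            spans.append((start, end))
    return spans

doc = "Prince Charles is the heir apparent to the British throne".split()
phrases = enumerate_phrases(doc, max_len=20)
# A document of m words yields up to m * L spans (fewer near the end),
# which is how |S(D)| reaches ~6e10 over all of English Wikipedia.
```

For this 10-word toy document every span fits under the length cap, giving 10·11/2 = 55 candidates; at Wikipedia scale the same enumeration is what makes query-agnostic indexing challenging.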

Retriever-reader A dominant paradigm in open-domain QA is the retriever-reader approach (Chen et al., 2017), which leverages a first-stage document retriever (f_retr) and reads only the top K' (K' ≪ K) documents with a reader model (f_read) to find the answer. The scoring function f(s | D, q) can be decomposed as:

f(s | D, q) = f_read(s | {d_{j_1}, ..., d_{j_K'}}, q),  (1)

where {j_1, ..., j_K'} ⊂ {1, ..., K} are the indices of the top K' documents under f_retr, and the score is 0 if s ∉ S({d_{j_1}, ..., d_{j_K'}}). The model can be easily adapted from documents to passages and has been studied extensively (Yang et al., 2019; Wang et al., 2019). However, this approach suffers from error propagation when incorrect documents are retrieved, and it can be slow because it usually requires running an expensive reader model on every retrieved document or passage at inference time.
Phrase retrieval Seo et al. (2019) introduce the phrase retrieval approach, which encodes phrase and question representations independently and performs similarity search over the phrase representations to find an answer. Its scoring function f is computed as:

f(s | D, q) = E_s(s, D) · E_q(q),  (2)

where E_s and E_q denote the phrase encoder and the question encoder respectively. Since the E_s(·) and E_q(·) representations are decomposable, this formulation supports maximum inner product search (MIPS) and improves the efficiency of open-domain QA models. Previous approaches (Seo et al., 2019) additionally rely on sparse representations to remain competitive.

Although we focus on the extractive QA setting, recent works propose to use a generative model as the reader (Izacard and Grave, 2020) or to learn a closed-book QA model, which predicts an answer directly without using an external knowledge source. The extractive setting provides two advantages: first, the model directly locates the source of the answer, which is more interpretable; second, phrase-level knowledge retrieval can be uniquely adapted to other NLP tasks, as we show in Section 6.3.
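The decomposability in Equation (2) is what turns answering into a pure similarity search: phrase vectors are pre-computed once, and a question vector only needs a dot product against the index. A toy sketch (the vectors and encoder outputs below are made-up stand-ins, not real encoder outputs):

```python
# Minimal sketch of the phrase-retrieval scoring f(s | D, q) = E_s(s, D) . E_q(q).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Pretend these came from a pre-computed phrase encoder E_s:
phrase_index = {
    "Prince Charles": [0.9, 0.1, 0.3],
    "Buckingham Palace": [0.2, 0.8, 0.1],
}

def retrieve(query_vec, index):
    # Because E_s and E_q are decomposable, answering reduces to a
    # maximum inner product search over pre-indexed phrase vectors.
    return max(index, key=lambda s: dot(index[s], query_vec))

q = [1.0, 0.0, 0.2]  # hypothetical E_q("who is the heir apparent ...")
answer = retrieve(q, phrase_index)
```

In practice the index holds billions of vectors and the `max` is replaced by an approximate MIPS library, but the contract is the same: no reader model runs at query time.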

Technical Challenges
Although appealing, phrase retrieval poses several key technical challenges. The first challenge is the decomposition constraint between question and phrase encoders stated in Equation (2), which brings a significant degradation in performance. While a similar problem is observed in learning dual encoders for passage representations (Humeau et al., 2019; Khattab and Zaharia, 2020), phrase representations are even more difficult to learn because of their fine-grained nature. The second challenge arises from scale: compared to 5 million documents, or the 21 million text blocks used in previous work, each phrase representation has to be properly normalized over 60 billion dense representations. This normalization problem is also implied in Equation (2), where E_s(s, D) is defined over all the phrases in S(D). Lastly, since it is computationally expensive to build billions of phrase representations at Wikipedia scale, it is prohibitive to update the phrase representations once they are obtained. As a result, current phrase retrieval models often rely on their zero-shot ability.
In this work, we introduce DensePhrases to tackle these technical challenges, as illustrated in Figure 1. In the following, we first describe our query-agnostic model in a single-passage setting and address the decomposability gap (Section 3). Then we propose several normalization techniques for scaling phrase representations to the full collection of documents D (Section 4). Finally, we detail how we adapt our model for transfer learning (Section 5), without re-building phrase representations at scale.

Phrase-Indexed Question Answering
In this section, we start by learning query-agnostic phrase representations in a reading comprehension setting, in which a gold passage p is given for a question-answer pair (q, a*). Our goal is to build a strong reading comprehension model while enforcing the decomposability of query and phrase representations. This problem was first formulated by Seo et al. (2018) and dubbed the phrase-indexed question answering (PIQA) task. In the following, we first describe our base model (Section 3.1) and propose two new solutions to close the decomposability gap: tackling data sparsity via question generation (Section 3.2) and distillation from query-dependent models (Section 3.3).

Base Model
We first describe our base architecture, which consists of a phrase encoder E_s and a question encoder E_q. Given a passage p of m tokens w_1, ..., w_m, we consider phrases of up to L tokens; the full set of phrases is denoted by S(p). Each phrase s_k has start and end indices start(k) and end(k), and the gold phrase is s* ∈ S(p). Following previous work on phrase, or span, representations (Lee et al., 2017; Seo et al., 2018), we first use a pre-trained language model M_p to obtain contextualized word representations h_1, ..., h_m for the passage tokens. Then we represent each phrase s_k ∈ S(p) as the concatenation of the corresponding start and end vectors:

E_s(s_k) = [h_start(k), h_end(k)].  (3)

Constructing phrase representations from contextualized word representations has another great advantage: we can reduce the storage of phrase representations to that of word representations. We only need to save |W(D)| vectors (the total number of tokens in D), which is at least one order of magnitude smaller than |S(D)|.

Similarly, we need a question encoder E_q(·) that maps a question of n tokens w_1, ..., w_n to a vector of the same dimension as E_s(·). To do so, we apply two different pre-trained LMs, M_q,start and M_q,end, to q and obtain representations q_start and q_end pooled from the [CLS] token representations of M_q,start and M_q,end respectively. E_q(·) is simply their concatenation:

E_q(q) = [q_start, q_end].  (4)

In summary, we have three different LMs in total, all initialized from the same pre-trained LM. Since the start and end representations of phrases are produced by the same language model, we need two question representations q_start and q_end to differentiate the start and end positions. In our pilot experiments, we found that SpanBERT (Joshi et al., 2020) leads to superior performance compared to BERT. SpanBERT is designed to predict the information in an entire span from its two endpoints, so it is well suited to our phrase representations. In our final model, we use SpanBERT-base-cased as the base LM for E_s and E_q, hence d = 768.
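The storage argument above follows directly from Equation (3): any of the O(|W(D)| · L) phrase vectors can be materialized on the fly from two stored token vectors. A small sketch with made-up d = 2 vectors (d = 768 in the paper):

```python
def phrase_vector(token_vecs, start, end):
    """E_s(s_k) = [h_start(k), h_end(k)]: concatenation of the start and
    end token representations (Equation (3))."""
    return token_vecs[start] + token_vecs[end]  # list concatenation

# Hypothetical contextualized token vectors h_1 .. h_4 from the passage encoder:
h = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

v = phrase_vector(h, 1, 3)  # phrase spanning tokens 2..4 -> a 2d-dim vector
# Only the |W(D)| token vectors need to be stored; every phrase vector is
# reconstructed from exactly two of them.
```

This is why the index stores per-token vectors rather than per-phrase vectors, cutting storage by at least an order of magnitude.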
Single-passage training In the reading comprehension setting, we maximize the log-likelihood of the start and end positions of the gold phrase s*. The start probability distribution is computed as

P_start,i = exp(h_i · q_start) / Σ_{j=1}^{m} exp(h_j · q_start),  (5)

and P_end is defined analogously with q_end. We define L_start = − log P_start,start(s*) and L_end in a similar way. The final loss is their average:

L_single = (L_start + L_end) / 2.  (6)

Differences from DenSPI We deviate from DenSPI in the following ways: (1) Previous models split a hidden vector from a pre-trained LM into four vectors (start and end vectors plus two vectors for calculating a coherency score). We do not split vectors and remove the coherency scalars; we find that keeping the full output dimension of pre-trained LMs is beneficial for fully utilizing their representational capacity. (2) Previous models use a shared encoder for phrases and questions, whereas we use two separate language models for representing questions. (3) We use SpanBERT instead of BERT. See Table 5 for an ablation study.

Tackling Data Sparsity
The performance of query-agnostic models is consistently inferior to that of query-dependent models with cross-attention. We hypothesize that one key reason is that most reading comprehension datasets provide only a few annotated questions per passage, compared to the set of possible answer phrases. For instance, each passage in the training set of Natural Questions mostly has only one annotated question. Suppose we are given a training passage paired with the question q = "who is Queen Elizabeth II's heir apparent". While cross-attention models only need to represent the passage with respect to this question, our phrase encoder must take all the other phrases into account (e.g., s = Prince George), because their representations will be re-used for other questions (e.g., q = "who is the eldest child of the duke of cambridge").
Following this intuition, we propose to use a question generation (QG) model to generate questions for each training passage. We build the QG model on T5-large: the input is a passage p with the gold answer s* highlighted by surrounding special tags, and the model is trained to maximize the log-likelihood of the question words of q. Since we want to cover as many answer phrases as possible during inference, we extract all named entities in each training passage as candidate answers and generate a question for each of them. We keep a generated question-answer pair only when a strong query-dependent (QD) reading comprehension model (SpanBERT-large, 88.2 EM on SQuAD) gives a correct prediction on it. The remaining generated QA pairs {(q_1, s_1), (q_2, s_2), ..., (q_r, s_r)} are directly added to the original training set. These generated pairs help learn phrase representations aligned with those of corresponding questions, rather than biased toward the few annotated questions.
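The round-trip filtering step above can be sketched as a simple consistency check: keep a generated pair only if a strong reader answers the generated question with the intended answer. The QG model and reader below are toy callables, not the actual T5 or SpanBERT APIs:

```python
# Hedged sketch of QA-pair filtering: a generated (question, answer) pair
# survives only if a query-dependent reader reproduces the answer.

def filter_generated_pairs(pairs, reader):
    """pairs: iterable of (question, gold_answer); reader(q) -> predicted answer."""
    return [(q, a) for q, a in pairs if reader(q) == a]

# Toy stand-ins for the QG output and the SpanBERT-large reader:
generated = [
    ("who wrote hamlet", "Shakespeare"),
    ("what year was hamlet", "Shakespeare"),  # noisy pair: answer mismatch
]
toy_reader = lambda q: "Shakespeare" if "who wrote" in q else "1603"

kept = filter_generated_pairs(generated, toy_reader)
```

The surviving pairs are simply appended to the original training set, so the phrase encoder sees supervision for many more answer spans per passage.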

Distillation
Since query-dependent (QD) models with cross-attention are considered stronger models, we also propose improving our query-agnostic model by distilling knowledge from a cross-attention model (Hinton et al., 2015). We minimize the Kullback-Leibler (KL) divergence between the probability distributions from our phrase encoder and those from a query-dependent model; we use a SpanBERT-base QA model as the QD model. The distillation loss is computed as:

L_distill = (KL(P_start^qd ∥ P_start) + KL(P_end^qd ∥ P_end)) / 2,  (7)

where P_start (and P_end) is defined in Equation (5), and P_start^qd and P_end^qd denote the probability distributions used to predict the start and end indices in the query-dependent model.
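The distillation objective can be written out in a few lines. The averaging over start/end and the teacher-first KL direction follow the usual Hinton-style convention; treat the exact form as an assumption rather than the paper's verbatim implementation:

```python
import math

# Sketch of KL-based distillation between teacher (query-dependent) and
# student (phrase encoder) start/end distributions over passage positions.

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_loss(p_start_qd, p_start, p_end_qd, p_end):
    # Average the start- and end-position divergences, as in Equation (7).
    return 0.5 * (kl(p_start_qd, p_start) + kl(p_end_qd, p_end))

loss = distill_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
                    [0.1, 0.8, 0.1], [0.1, 0.7, 0.2])
# loss is zero exactly when the student matches the teacher.
```

In training, gradients flow only into the student's distributions; the teacher's outputs are fixed targets.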

Phrase Representations at Scale
Eventually, we need to build representations for billions of phrases, so a bigger challenge is to incorporate more phrases during training such that the representations are better normalized. While Seo et al. (2019) simply sample two negative questions from other passages based on question similarity, we propose two effective negative sampling strategies that are efficient to compute and highly useful in practice.

In-batch Negatives
We use in-batch negatives for our dense phrase representations, a technique that has also been shown to be effective in learning dense passage representations. Specifically, for the i-th example in a mini-batch of size B, we denote the hidden representations of the gold start and end positions (h_start(s*) and h_end(s*)) as g_start,i and g_end,i, and the question representation as [q_start,i, q_end,i]. Let G_start, G_end, Q_start, Q_end be B × d matrices whose rows are g_start,i, g_end,i, q_start,i, q_end,i respectively. We treat all gold phrases from other passages in the same mini-batch as negative examples. We compute S_start = Q_start G_start^T and S_end = Q_end G_end^T; the i-th rows of S_start and S_end each contain B scores, one positive score and B − 1 negative scores. Similar to Equation (5), the loss for the i-th example is:

L_neg,i = − (1/2) [ log ( exp(S_start,ii) / Σ_{j=1}^{B} exp(S_start,ij) ) + log ( exp(S_end,ii) / Σ_{j=1}^{B} exp(S_end,ij) ) ],  (8)

and the loss is summed over the B examples in the mini-batch. We also tried using other, non-gold phrases from other passages as negatives but did not find a meaningful improvement.
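The score matrix construction above is just a B × B matrix of inner products whose diagonal holds the positives. A toy sketch with d = 2 vectors (all values are illustrative; a real implementation would use batched matrix multiplies on GPU):

```python
import math

# Sketch of in-batch negatives: question i is scored against all B gold
# phrase vectors in the mini-batch; the diagonal entry is the positive.

def softmax_nll(scores, positive_idx):
    z = sum(math.exp(s) for s in scores)
    return -math.log(math.exp(scores[positive_idx]) / z)

def in_batch_loss(Q, G):
    # S = Q G^T: row i holds question i's scores over all B gold phrases.
    B = len(Q)
    S = [[sum(qk * gk for qk, gk in zip(Q[i], G[j])) for j in range(B)]
         for i in range(B)]
    return sum(softmax_nll(S[i], i) for i in range(B)) / B

Q_start = [[1.0, 0.0], [0.0, 1.0]]  # question start vectors
G_start = [[1.0, 0.0], [0.0, 1.0]]  # gold phrase start vectors
loss = in_batch_loss(Q_start, G_start)
```

The same computation is repeated with the end vectors and the two terms averaged, as in Equation (8); every gold phrase in the batch serves as a free negative for every other question.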

Pre-batch Negatives
In-batch negatives are highly effective and usually benefit from a large batch size. However, batch sizes are bounded by GPU memory, and it is challenging to increase them further. In this section, we propose a novel negative sampling method called pre-batch negatives, which effectively utilizes the representations from the preceding C mini-batches. In each iteration, we maintain a FIFO queue of C mini-batches to cache phrase representations (G_start, G_end). The cached phrase representations are then used as negative samples for the next iteration, providing B × C additional negative samples in total. This approach is inspired by the momentum contrast idea recently proposed in unsupervised visual representation learning (He et al., 2020). However, our approach differs in that we have independent encoders for phrases and questions and back-propagate to both during training, without a momentum update. These pre-batch negatives are used together with in-batch negatives, and the training loss is the same as Equation (8), except that gradients are not back-propagated to the cached pre-batch negatives.
In practice, we found that pre-batch negatives work well once the phrase encoder is warmed up with in-batch negatives. After the warm-up stage, we simply shift from in-batch negatives (B − 1 negatives) to in-batch and pre-batch negatives (a total of B × C + B − 1 negatives). For simplicity, we use L_neg to denote the loss over both in-batch and pre-batch negatives during training. Since we do not retain the computational graph of the cached representations for forward and backward propagation, the memory consumption of pre-batch negatives is manageable while allowing us to increase the number of negative samples. Empirically, we found that using a very large number of pre-batch negatives does not always help, since the cached phrase representations can easily become outdated.
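The cache mechanics reduce to a bounded FIFO queue of the last C mini-batches of gold phrase vectors. A minimal sketch (toy 1-dimensional vectors; in a real implementation the cached tensors would be detached so no gradients flow to them):

```python
from collections import deque

# Sketch of the pre-batch negative cache: a FIFO queue of the last C
# mini-batches of phrase vectors, reused as extra negatives.

C, B = 2, 3
cache = deque(maxlen=C)  # batches older than C fall off automatically

def negatives_for_step(current_gold_vecs):
    """Return cached negatives for this step, then enqueue the current batch."""
    cached = [v for batch in cache for v in batch]  # up to B * C extra negatives
    cache.append(list(current_gold_vecs))           # store for later iterations
    return cached

step1 = negatives_for_step([[1.0], [2.0], [3.0]])  # cache empty -> no extras
step2 = negatives_for_step([[4.0], [5.0], [6.0]])  # sees batch 1
step3 = negatives_for_step([[7.0], [8.0], [9.0]])  # sees the last C = 2 batches
```

Because only the vectors (not their computation graphs) are stored, the memory cost grows with B × C vectors rather than with C extra forward passes.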

Optimization, Indexing and Search
With the loss terms defined previously, we minimize the following loss function on a question answering dataset, together with the generated questions (Section 3.2):

L = λ1 L_single + λ2 L_distill + λ3 L_neg,  (9)

where λ1, λ2, λ3 determine the importance of each loss term. We found that setting λ1 = 1, λ2 = 2, and λ3 = 4 works well in practice. In our experiments (Section 6.2), we use reading comprehension datasets (where a gold passage p is provided), namely SQuAD (Rajpurkar et al., 2016) and Natural Questions, to train the phrase and question encoders.
Indexing After training the phrase encoder E_s, we encode all the phrases S(D) in the entire English Wikipedia D and store an index of the phrase dump. We segment each document d_i ∈ D into a set of natural paragraphs, from which we obtain token representations for each paragraph using E_s(·); the token representations are stacked into an index matrix H.

Search During inference, given a question q, we find the answer ŝ as follows (Figure 1):

ŝ = argmax_{s_(i,j)} (H q_start)_i + (H q_end)_j,  (10)

where s_(i,j) denotes a phrase with start and end indices i and j in the index H. We can compute the argmax of H q_start (or H q_end) efficiently by performing MIPS over H with q_start (or q_end). In practice, we search for the top-k start and top-k end positions separately and perform a constrained search over the start and end positions such that 1 ≤ i ≤ j < i + L ≤ |W(D)|. Since we share H between the start and end representations, q_start and q_end are also batched for MIPS to benefit from multi-threading. To avoid producing redundant answers, we keep only the best-scoring phrase when two phrases with the same normalized string are retrieved from the same paragraph.
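The constrained search step can be sketched as: take the top-k start and top-k end positions (returned by MIPS in practice), then pick the best valid (i, j) pair with i ≤ j < i + L. The scores below are hypothetical stand-ins for (H q_start)_i and (H q_end)_j:

```python
import heapq

# Sketch of the constrained top-k start/end search used at inference.

def search(start_scores, end_scores, k=2, L=3):
    top_starts = heapq.nlargest(k, range(len(start_scores)),
                                key=start_scores.__getitem__)
    top_ends = heapq.nlargest(k, range(len(end_scores)),
                              key=end_scores.__getitem__)
    best, best_span = float("-inf"), None
    for i in top_starts:
        for j in top_ends:
            if i <= j < i + L:  # valid phrase of at most L tokens
                score = start_scores[i] + end_scores[j]
                if score > best:
                    best, best_span = score, (i, j)
    return best_span

span = search([0.1, 0.9, 0.2, 0.3], [0.2, 0.1, 0.8, 0.4], k=2, L=3)
```

With k candidates per side, only k² pairs need the validity check, which is why the two MIPS queries can be answered independently and then combined cheaply.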

Query-side Fine-tuning
So far, we have created a phrase dump H and a question encoder E_q that can be directly used for question answering. In this section, we propose a novel method called query-side fine-tuning, which facilitates transfer learning to a new dataset by training the question encoder E_q to correctly retrieve the desired answer a* for a question q given H. There are several advantages to doing this: (1) It helps our model quickly adapt to new QA tasks without re-building billions of phrase representations. (2) Even for the question-answering datasets used to build H (SQuAD and NQ in our experiments), query-side fine-tuning can further improve performance because it reduces the discrepancy between training and inference.
(3) It also creates a possibility to adapt our DensePhrases to non-QA tasks when the query is written in a different format. In Section 6.3, we show the possibility of directly using DensePhrases for slot filling tasks by using a query such as (Michael Jackson, is a singer of, x).
In this regard, we can view our model as a knowledge base that can be accessed by many different types of queries and that returns phrase-level knowledge efficiently. Formally, we maximize the marginal log-likelihood of the gold answer a* for a question q, which resembles the weakly-supervised setting in open-domain QA (Min et al., 2019). The loss for query-side fine-tuning is computed as:

L_query = − log ( Σ_{s ∈ S̃(q), TEXT(s) = a*} exp(f(s | D, q)) / Σ_{s ∈ S̃(q)} exp(f(s | D, q)) ),  (11)

where f(s | D, q) is the score of the phrase s (Equation (2)) and S̃(q) denotes the top-k phrases for q (Equation (10)). In practice, we use k = 100 for query-side fine-tuning. Note that only the parameters of the question encoder E_q are updated.
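Equation (11) marginalizes over every retrieved phrase whose text matches the gold answer. A toy sketch (phrases, scores, and the string-match check are illustrative stand-ins; in practice TEXT(s) is compared after answer normalization):

```python
import math

# Sketch of the query-side fine-tuning loss: marginal log-likelihood of the
# gold answer string over the top-k retrieved phrases.

def query_side_loss(topk, gold_answer):
    """topk: list of (phrase_text, score f(s|D,q)) pairs for one question."""
    z = sum(math.exp(score) for _, score in topk)
    pos = sum(math.exp(score) for text, score in topk if text == gold_answer)
    return -math.log(pos / z)

# Two retrieved phrases share the gold string; both count toward the numerator.
topk = [("Prince Charles", 2.0), ("Prince George", 0.5), ("Prince Charles", 1.0)]
loss = query_side_loss(topk, "Prince Charles")
```

During training only E_q's parameters receive gradients; the phrase dump H stays frozen, which is what makes adaptation to a new dataset cheap.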

Setup
Datasets We use two reading comprehension datasets in which a gold passage is provided: SQuAD (Rajpurkar et al., 2016) and Natural Questions. For Natural Questions, we use the short answer as the ground-truth answer a* and its long answer as the gold passage p (Appendix D). We train our phrase representations on these two datasets and also report the performance of our query-agnostic models in the reading comprehension setting.
We evaluate our approach on five popular open-domain QA datasets: Natural Questions, WebQuestions (Berant et al., 2013), CuratedTREC (Baudiš and Šedivý, 2015), TriviaQA (Joshi et al., 2017), and SQuAD (Rajpurkar et al., 2016); data statistics are provided in Table D.3. Although many questions in SQuAD are context-dependent, we evaluate our model on SQuAD for comparison with previous phrase retrieval models (Seo et al., 2019), which were mainly trained and evaluated on SQuAD. Finally, we also evaluate our model on two slot filling tasks, to show how DensePhrases can be adapted to other knowledge-intensive NLP tasks. We use two slot filling datasets from the KILT benchmark (Petroni et al., 2020): T-REx (Elsahar et al., 2018) and zero-shot relation extraction (Levy et al., 2017). Each query is provided in the form of "{subject entity} [SEP] {relation}", and the answer is the object entity.

We then perform query-side fine-tuning of our model on each dataset using Equation (11). (The quality of questions generated by a QG model trained on Natural Questions is worse due to the ambiguity of information-seeking questions.) While we use a single 48GB GPU (Quadro RTX 8000) for training the phrase encoders with Equation (9), query-side fine-tuning is relatively cheap and uses a single 12GB GPU (TITAN Xp). For slot filling, we use the same set of hyperparameters used for query-side fine-tuning on the open-domain QA datasets, and we use the phrase dump obtained from DensePhrases (NQ + SQuAD). To see how rapidly our model adapts to the new query types, we train on randomly sampled sets of 5K or 10K training examples. See Appendix E for details on the hyperparameters for each task.

Experiments: Question Answering
Baselines For reading comprehension, we report scores of query-agnostic models, including DenSPI (Seo et al., 2019), DenSPI + Sparc, DeFormer (Cao et al., 2020), and DilBERT (Siblini et al., 2020); the latter two only allow late interaction (cross-attention) in the last few layers of BERT. We report their results based on interaction in the last layer only, which most closely resembles fully query-agnostic models.
For open-domain QA, we report the scores of extractive open-domain QA models, including DrQA (Chen et al., 2017), BERT + BM25, ORQA, REALM (Guu et al., 2020), and DPR-Multi. We also show the performance of previous phrase retrieval models: DenSPI and DenSPI + Sparc. In Appendix A, we provide a thorough analysis of the computational complexity of each open-domain QA model.

Table 4: Slot filling results on the test sets of T-REx and zero-shot relation extraction (ZsRE) in the KILT benchmark (Petroni et al., 2020). We report KILT-AC and KILT-F1 (denoted as Acc and F1 in the table), which consider both span-level accuracy and correct retrieval of evidence documents. We consider two settings, using 5K and 10K training examples respectively.

Reading comprehension Experimental results on the reading comprehension datasets are shown in Table 3. Among query-agnostic models, our model achieves the best performance of 78.3 EM on SQuAD, improving over the previous dense phrase retrieval model (DenSPI) by 4.7%. Although it still lags behind query-dependent models, the gap has been greatly reduced, and the model serves as a strong starting point for open-domain QA.

Open-domain QA
Experimental results on open-domain QA are summarized in Table 2. Without using any sparse representations, DensePhrases outperforms previous phrase retrieval models by a large margin, a 15%-25% absolute improvement on all datasets except SQuAD. Since previous models only used SQuAD to train the phrase model and performed zero-shot prediction on the other datasets, we add one more experiment training the previous model on C_phrase = {NQ, SQuAD} for a fairer comparison. However, this only increases the result from 14.5% to 16.5% on Natural Questions, demonstrating that simply adding more training datasets is not enough for learning phrase representations. Our performance is also competitive with recent retriever-reader models (Guu et al., 2020), while running much faster during inference (Table 1). We could also use distantly-supervised examples from TriviaQA, WebQuestions, and TREC for training our phrase representations, as the DPR-multi model does, and we leave this to future work. Finally, we find that using C_phrase = {SQuAD} or {NQ, SQuAD} does not make much difference after query-side fine-tuning on most datasets, except that including NQ in C_phrase brings a large gain of 9.7% on the NQ evaluation.

Table 5: Ablation of DensePhrases on the development set of SQuAD. Bb: BERT-base, Sb: SpanBERT-base, Bl: BERT-large. Share: whether question and phrase encoders are shared. Split: whether the full hidden vectors are kept or split into start and end vectors. DenSPI (Seo et al., 2019) also included a coherency scalar; see their paper for more details.

Slot filling Table 4 summarizes the results of our model on the two slot filling datasets, along with the scores of baseline models provided by Petroni et al. (2020). The only extractive baseline is DPR + BERT, which performs poorly in zero-shot relation extraction. In contrast, our model achieves competitive performance on all datasets and state-of-the-art performance on two datasets using only 5K training samples (less than 5% of the training data). This showcases how DensePhrases can easily be leveraged for knowledge-intensive NLP tasks.

Ablation study Table 5 shows the ablation results of our model on SQuAD. We observe that not sharing the phrase and question encoders and using the full output dimension together improve performance by 2%. Using the stronger pre-trained LM SpanBERT leads to another 1.3% improvement. Augmenting the training set with generated questions and performing distillation from query-dependent models further improve performance, up to EM = 78.3. We also tried adding the generated questions to the training of the query-dependent model and found only a 0.3% improvement (SpanBERT-base), which supports our hypothesis that data sparsity is a bottleneck for query-agnostic models.

Table 6: Effect of in-batch negatives and pre-batch negatives on the development set of Natural Questions. B: batch size, C: number of preceding mini-batches used in pre-batch negatives. We report EM of our model with smaller corpora such as D_small (all the gold passages in the development set of NQ) and {p} (a single passage).

Effect of Batch Negatives
We further evaluate the effectiveness of various negative sampling methods introduced in Section 4.
Since it is computationally expensive to test each setting at the full Wikipedia scale, we use a smaller text corpus D_small, consisting of all the gold passages in the development set of Natural Questions, for the ablation study. We also report performance in the reading comprehension setting, where D = {p} consists of a single gold passage. Empirically, we find that results are generally well correlated as we gradually increase |D|, and we encourage interested readers to experiment with these settings for model development. The results are summarized in Table 6. While a larger batch size (B = 84) is beneficial for in-batch negatives, the number of preceding batches in pre-batch negatives is optimal at C = 2. Somewhat surprisingly, pre-batch negatives also improve performance when D = {p}.

Effect of Query-side Fine-tuning
Query-side fine-tuning further trains the question encoder E_q against the phrase dump H and enables us to fine-tune DensePhrases on different types of questions. We use three different phrase encoders, each trained on a different training dataset C_phrase. We summarize the results in Table 7. For the datasets that were not used for training the phrase encoders (TQA, WQ, TREC), we observe a 15% to 20% improvement after query-side fine-tuning, compared to zero-shot prediction. Even for the datasets that were used (NQ, SQuAD), it leads to significant improvements (e.g., +8.3% on NQ for C_phrase = {NQ}), clearly demonstrating that it effectively reduces the discrepancy between training and inference.

Table 7: Effect of query-side fine-tuning in DensePhrases on each test set. We report EM of each model before and after query-side fine-tuning.

Conclusion
In this study, we show that we can learn dense representations of phrases at the Wikipedia scale, which are readily retrievable for open-domain QA and other knowledge-intensive NLP tasks. We tackle the decomposability gap by mitigating data sparsity, and we introduce two batch-negative techniques for normalizing billions of phrase representations. We also introduce query-side fine-tuning, which adapts our model to any type of query with a single 12GB GPU. As a result, we achieve much stronger performance on five popular open-domain QA datasets compared to previous phrase retrieval models, while significantly reducing the storage footprint and improving latency. We also achieve strong performance on two slot filling datasets using only a small number of training examples, showing the promise of utilizing DensePhrases as a dense knowledge base.

A Complexity Analysis
We describe the resources and time spent during inference (Table 1 and A.1) and indexing (Table A.1). With our limited GPU resources (24GB × 4), indexing all phrase representations takes about 20 hours. We also largely reduce the storage from 1,547GB to 320GB by (1) removing sparse representations and (2) using our sharing and split strategy. See Appendix B for details on the reduction of the storage footprint and Appendix C for the specification of the server used for the benchmark.

B Reducing Storage Footprint
As shown in Table 1, we have reduced the storage footprint from 1,547GB to 320GB. We detail how we reduce the storage footprint beyond the several techniques introduced by Seo et al. (2019). First, following Seo et al. (2019), we apply a linear transformation to the passage token representations to obtain a set of filter logits, which can be used to filter out many token representations from W(D). This filter layer is supervised with a binary cross-entropy loss on the gold start/end positions (trained together with Equation (9)). We tune the threshold for the filter logits on the reading comprehension development set to the point where performance does not drop significantly while filtering as many tokens as possible.
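The filtering step can be sketched as a single linear projection followed by thresholding. The function and parameter names below are illustrative, and the threshold stands in for the value tuned on the development set.

```python
import numpy as np

def filter_tokens(token_reps, w, b, threshold):
    """Keep only token representations whose filter logit exceeds a
    tuned threshold. token_reps: [T, d] token vectors for a passage;
    (w, b): the linear filter layer's weight [d] and bias (hypothetical
    parameters); returns the surviving vectors and the boolean mask."""
    logits = token_reps @ w + b   # [T] one filter logit per token
    keep = logits > threshold     # tokens likely to start/end an answer
    return token_reps[keep], keep
```

At training time, these logits would be supervised with binary cross-entropy, with positive labels on gold start/end positions; at indexing time, only the kept vectors are written to disk.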
Second, in our architecture, we use a base model (SpanBERT-base) for a smaller dimension of token representations (d = 768) and do not use any sparse representations, such as tf-idf or contextualized sparse representations. We also use scalar quantization to store float32 vectors as int8 during indexing.
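Scalar quantization of this kind can be sketched as a symmetric float32-to-int8 codec; this is an illustration of the general technique, and the actual codec used for indexing may differ in its details.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric scalar quantization: map float32 values to int8 using a
    single scale so that the largest magnitude maps to 127."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 values from the int8 codes."""
    return q.astype(np.float32) * scale
```

This trades a small amount of reconstruction error (at most half a quantization step per value) for a 4x reduction in bytes per dimension, which compounds across billions of phrase vectors.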
Lastly, since the inference in Equation (10) is purely based on MIPS, we do not have to keep the original start and end vectors, which take about 500GB. However, when we perform query-side fine-tuning, we need the original start and end vectors to compute Equation (11), since the MIPS index only returns the top-k scores and their indices, not the vectors themselves.
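The point that MIPS returns only scores and indices, never the vectors themselves, can be illustrated with an exact (brute-force) search sketch; a real system would use an approximate index, but the return signature is the same.

```python
import numpy as np

def mips_topk(index_matrix, query, k):
    """Exact maximum inner product search over a phrase matrix.
    Returns the top-k (scores, indices) pairs -- but not the vectors,
    mirroring why the original start/end vectors must be kept on disk
    when gradients through Equation (11) require them."""
    scores = index_matrix @ query              # [N] inner products
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k candidates
    idx = idx[np.argsort(-scores[idx])]        # sort descending by score
    return scores[idx], idx
```

During query-side fine-tuning, the returned indices are used to look up the stored original vectors, which are then used to recompute the scores with gradients enabled.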

C Server Specifications for Benchmark
To compare the complexity of open-domain QA models, we install all models in Table 1 on the same server. For DPR, due to its large memory consumption, we use a similar server with a 24GB GPU (TITAN RTX). For all models, we use 1,000 randomly sampled questions from the Natural Questions development set for the speed benchmark and measure #Q/sec. We set the batch size to 64 for all models except BERTSerini, ORQA, and REALM, whose open-source implementations do not support a batch size larger than 1. #Q/sec for DPR includes retrieving passages and running a reader model; the batch size for the reader model is set to 8 to fit in the 24GB GPU (the retriever batch size is still 64). For other hyperparameters, we use the default settings of each model. We also exclude the time and the number of questions in the first five iterations, which serve to warm up each model. Note that despite our effort to match the environment of each model, their latency can be affected by various settings in their implementations, such as the choice of library (PyTorch vs. TensorFlow).
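The benchmarking protocol above (batched queries, warm-up exclusion, #Q/sec as the metric) can be sketched as follows; `model_fn` stands in for any model's batched inference call.

```python
import time

def benchmark_qps(model_fn, questions, batch_size=64, warmup_iters=5):
    """Measure questions per second for a batched model, excluding the
    first few iterations so one-time setup costs (CUDA context, caches)
    do not distort the measurement."""
    batches = [questions[i:i + batch_size]
               for i in range(0, len(questions), batch_size)]
    n_questions, elapsed = 0, 0.0
    for i, batch in enumerate(batches):
        start = time.perf_counter()
        model_fn(batch)                    # run inference on one batch
        dt = time.perf_counter() - start
        if i >= warmup_iters:              # skip warm-up iterations
            n_questions += len(batch)
            elapsed += dt
    return n_questions / elapsed if elapsed > 0 else float("inf")
```

With 1,000 questions and a batch size of 64, this yields roughly 16 batches, of which the first five are discarded, matching the protocol described above.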

D Pre-processing for Single-Passage Training
We use two reading comprehension datasets (SQuAD and Natural Questions) for training our model on Equation (9). For SQuAD, we use the original dataset provided by the authors (Rajpurkar et al., 2016). For Natural Questions (Kwiatkowski et al., 2019), we use the pre-processed version provided by Asai et al. (2019). We also match the gold passages in Natural Questions to the paragraphs in Wikipedia whenever possible. Since we want to measure how the performance of our model changes as the number of tokens grows, we follow the same train/dev/test split used in Natural Questions-Open for the reading comprehension setting as well.

E Hyperparameters
We use the Adam optimizer (Kingma and Ba, 2015) in all our experiments. For training our phrase and question encoders with Equation (9), we use a learning rate of 3e-5 and clip the norm of the gradient at 1. We use a batch size of 84 and train each model for 4 epochs on all datasets, applying the pre-batch negative loss in the last two epochs. We use spaCy for extracting named entities in each training passage, which are used to generate questions. The number of generated questions is 327,302 for SQuAD and 1,126,354 for Natural Questions. The number of preceding batches C is set to 2. For the query-side fine-tuning with Equation (11), we use a learning rate of 3e-5 and clip the norm of the gradient at 1. We use a batch size of 12 and train each model for 10 epochs on all datasets. The top-k in Equation (11) is set to 100. Using the development set, we select the best-performing model for each dataset, which is then evaluated on the corresponding test set. Since SpanBERT only supports cased models, we also truecase (Lita et al., 2003) the questions that are originally provided in lowercase (Natural Questions and WebQuestions).
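For reference, the hyperparameters listed above can be collected into a single configuration. This is a hypothetical consolidation for readability; the key names are our own, not those of the released code.

```python
# Hyperparameters from the section above, gathered into two dicts
# (key names are illustrative, values are from the text).
PHRASE_ENCODER_CFG = {
    "optimizer": "Adam",
    "lr": 3e-5,
    "max_grad_norm": 1.0,
    "batch_size": 84,
    "epochs": 4,
    "prebatch_C": 2,            # number of preceding batches cached
    "prebatch_loss_epochs": [3, 4],  # pre-batch loss in the last two epochs
}
QUERY_SIDE_FT_CFG = {
    "optimizer": "Adam",
    "lr": 3e-5,
    "max_grad_norm": 1.0,
    "batch_size": 12,
    "epochs": 10,
    "top_k": 100,               # top-k in Equation (11)
}
```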