Generation-Augmented Retrieval for Open-Domain Question Answering

We propose Generation-Augmented Retrieval (GAR) for answering open-domain questions, which augments a query through text generation of heuristically discovered relevant contexts without external resources as supervision. We demonstrate that the generated contexts substantially enrich the semantics of the queries and GAR with sparse representations (BM25) achieves comparable or better performance than state-of-the-art dense retrieval methods such as DPR. We show that generating diverse contexts for a query is beneficial as fusing their results consistently yields better retrieval accuracy. Moreover, as sparse and dense representations are often complementary, GAR can be easily combined with DPR to achieve even better performance. GAR achieves state-of-the-art performance on Natural Questions and TriviaQA datasets under the extractive QA setup when equipped with an extractive reader, and consistently outperforms other retrieval methods when the same generative reader is used.


Introduction
Classic retrieval methods such as TF-IDF and BM25 use sparse representations to measure lexical overlap. These sparse methods are lightweight and efficient, but they cannot perform semantic matching and fail to retrieve relevant passages without explicit token overlap. To tackle the lexical mismatch problem, the traditional approach is query expansion (QE), which expands a query with relevant terms using, e.g., relevance models with (pseudo) relevance feedback (Lavrenko and Croft, 2001; Abdul-Jaleel et al., 2004). More recently, methods based on dense representations (Huang et al., 2013; Guu et al., 2020) learn to embed queries and passages into a latent vector space, in which text semantics beyond lexical overlap can be measured. These methods can retrieve semantically relevant but lexically different passages and often achieve better performance than the sparse methods. However, the dense models are more computationally expensive and may suffer from information loss, as they condense the entire text sequence into a fixed-size vector that does not guarantee exact matching (Luan et al., 2020).
In this paper, we propose a novel query expansion method for information retrieval, named Generation-Augmented Retrieval (GAR). At a high level, GAR augments the semantics of a query with relevant contexts (expansion terms) through text generation of a pre-trained language model (LM). For example, by prompting a pre-trained LM to generate the title of a relevant passage given a query and appending the generated title to the query, the generation-augmented query becomes semantically richer and thus it is easier to retrieve the relevant passage. Intuitively, the generated contexts of a query explicitly "express" the semantics of the search intent that are not explicitly present in the original query. As a result, GAR with sparse representations can achieve comparable or even better performance than existing approaches with dense representations of the original queries, while being much more lightweight and efficient.
We evaluate the effectiveness of GAR on open-domain question answering (QA), which aims to answer factoid questions without a pre-specified domain and has numerous real-world applications (Kwiatkowski et al., 2019). A large collection of documents (e.g., Wikipedia) is often used to find information pertaining to the questions. One of the most common approaches to open-domain QA uses a retriever-reader architecture (Chen et al., 2017), which first retrieves a small subset of the documents using the question as the query, and then reads the retrieved documents to extract (or generate) an answer. The retriever is crucial, as it is infeasible to examine every piece of information in the entire document collection (e.g., millions of Wikipedia passages), and the retrieval accuracy bounds the performance of the (extractive) reader.
Instead of using questions as queries directly, GAR uses a pre-trained LM to generate contexts relevant to a question and expands the query by appending the generated contexts. Specifically, we conduct sequence-to-sequence (Seq2Seq) learning with the question as input and various generation targets as output, such as the answer, the sentence in which the answer appears, and the title of a passage that contains the answer. We then append the generated contexts to the question to form the generation-augmented query for retrieval. We demonstrate that using multiple contexts from various generation targets is beneficial, as fusing the retrieval results of different generation-augmented queries consistently yields better performance.
We conduct extensive experiments on the Natural Questions (NQ) (Kwiatkowski et al., 2019) and TriviaQA (Trivia) (Joshi et al., 2017) datasets. The results reveal four major advantages of GAR. First, GAR, combined with BM25, achieves significant gains over the same BM25 model that uses the original queries or conventional query expansion methods. Second, GAR, combined with sparse representations (BM25), achieves comparable or even better performance than the current state-of-the-art retrieval methods, such as DPR, that use dense representations. Third, since GAR uses sparse representations to measure lexical overlap, it is complementary to DPR: by fusing the retrieval results of GAR and DPR, we obtain consistently better performance than that of either method used individually. Lastly, GAR outperforms DPR in end-to-end performance when the same extractive reader is used (EM = 41.6 vs. 41.5 on NQ, 62.7 vs. 57.9 on TriviaQA), setting new state-of-the-art results for extractive open-domain QA.

Contributions.
(1) We propose Generation-Augmented Retrieval (GAR), a novel query expansion method that augments queries with relevant contexts through text generation. (2) We demonstrate that using generation-augmented queries achieves significantly better retrieval and QA results than using the original queries alone or the baseline query expansion method. (3) We show that GAR, combined with a simple BM25 model, achieves new state-of-the-art performance on two benchmark datasets of extractive open-domain QA.

Related Work
Query / Document Expansion. Query expansion (QE) is widely used in information retrieval. GAR shares some merits with QE methods based on pseudo relevance feedback (Rocchio, 1971; Abdul-Jaleel et al., 2004; Lv and Zhai, 2010) in that both augment the queries with relevant contexts (terms). GAR is unique in that it expands the queries with information stored in pre-trained LMs rather than in the retrieved passages, and its expansion terms are learned through text generation.
There are also recent studies that expand queries (documents) with generative models. Notably, Yu et al. (2020) rewrite concise conversational queries to fully specified, context-independent queries by using continuous queries in the same search session as weak supervision. Alternatively, Nogueira et al. (2019) expand the documents by generating potential queries and appending them to the documents. However, these methods only use one type of generation target and (or) require additional resources such as search logs. Also, it is infeasible to expand every document with all possible queries that are potentially relevant. In contrast, GAR leverages various query contexts such as passage titles and sentences, which are complementary to each other and freely accessible.
Retrieval for Open-domain QA. Early open-domain QA methods (Chen et al., 2017) use sparse representations for retrieval, while more recent methods (Guu et al., 2020) leverage dense representations, e.g., BERT bi-encoders, and generally achieve better performance. GAR helps sparse retrieval methods achieve comparable or better performance than dense methods, while enjoying the simplicity and efficiency of sparse representations. GAR can also be used with dense representations to seek even better performance.

Generative QA. Generative QA generates answers through Seq2Seq learning instead of extracting answer spans. Recent studies on generative open-domain QA (Lewis et al., 2020; Izacard and Grave, 2020) are orthogonal to GAR in that they focus on improving the reading stage and directly reuse DPR as the retriever. Unlike generative QA, the goal of GAR is not to generate perfect answers to the questions but pertinent contexts that are helpful for retrieval. Another line of generative QA learns to generate answers from the question alone, without relevant passages as evidence, using pre-trained LMs (Roberts et al., 2020; Brown et al., 2020). GAR further confirms that one can retrieve factual knowledge from pre-trained LMs, which is not limited to the answers as in prior studies but also includes other relevant contexts.

Generation-Augmented Retrieval
Open-domain QA aims to answer factoid questions without pre-specified domains. We assume that a large collection of documents C (i.e., Wikipedia) is given as the resource to answer the questions, and that a retriever-reader architecture is used to tackle the task, where the retriever retrieves a small subset of the documents D ⊂ C and the reader reads the documents D to extract (or generate) an answer. Our goal is to improve the effectiveness and efficiency of the retriever and consequently improve the performance of the reader.
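To make this setup concrete, the sketch below outlines the retriever-reader interface assumed throughout the paper. It is a minimal illustration only: the class and function names (Retriever, Reader, answer_question) are placeholders and not part of any released implementation.

```python
from typing import List, Protocol


class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> List[str]:
        """Return the top-k passages retrieved from the corpus C for the query."""
        ...


class Reader(Protocol):
    def read(self, question: str, passages: List[str]) -> str:
        """Extract (or generate) an answer from the retrieved passages D."""
        ...


def answer_question(question: str, retriever: Retriever, reader: Reader, k: int = 100) -> str:
    # Retrieve a small subset D of the corpus with the question as the query,
    # then read D to produce the final answer.
    passages = retriever.retrieve(question, k)
    return reader.read(question, passages)
```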

Generation of Query Contexts
In GAR, queries can be augmented with various generated contexts in order to retrieve more relevant passages. For the task of open-domain QA where the query is a question, we take the following three contexts as the generation targets. We show in Sec. 5.3.2 that having multiple generation targets is helpful in that fusing their results consistently brings better retrieval accuracy.
Context 1: The default target (answer). The answer to the question is obviously useful for the retrieval of relevant passages that contain the answer itself. As shown in previous work (Roberts et al., 2020; Brown et al., 2020), pre-trained LMs are able to answer certain questions solely by taking the questions as input and generating answers. Instead of using the generated answers directly, GAR takes them as contexts of the question for retrieval; the advantage is that even if the generated answers are partially correct (or even incorrect), they may still benefit retrieval as long as they are relevant to the passages that contain the correct answers. We also observe that conducting retrieval with the generated answers alone as queries is ineffective since (1) some of the generated answers are rather irrelevant, and (2) a query with the correct answer alone (without the question) may retrieve false positive passages with unrelated contexts that happen to contain the answer, leading to potential issues in the following reading stage.
Context 2: Sentence containing the default target. The sentence that contains the answer is used as another generation target. Similar to using answers as the generation target, the generated sentences are still beneficial for retrieving relevant passages even if they do not contain the answers, as their semantics is highly related to the questions/answers (examples in Sec. 5.3.1). One can take the relevant sentences in the gold-standard passages (if any) or those in the positive passages of a retriever as the reference, depending on the trade-off between reference quality and diversity.
Context 3: Title of passage containing the default target. One can also use the titles of relevant passages as the generation target if available. Specifically, we retrieve Wikipedia passages using BM25 with the question as the query, and take the page titles of positive passages that contain the answers as the generation target. We observe that the page titles of positive passages are often entity names of interest, and sometimes (but not always) the answers to the questions. Intuitively, if GAR learns which Wikipedia pages the question is related to, the queries augmented by the generated titles would naturally have a better chance of retrieving relevant passages.
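The sketch below illustrates how the three generation targets above might be assembled into Seq2Seq training pairs (question as source, context as target). The field names are hypothetical, and joining multiple references with [SEP] follows the implementation details reported later; the helper itself is illustrative rather than the released preprocessing code.

```python
from typing import Dict, List


def build_generation_targets(example: Dict) -> List[Dict[str, str]]:
    """Build (question -> target) pairs for the three query-context generators.

    `example` is assumed (hypothetically) to provide the question, its answer(s),
    the sentence(s) containing an answer, and the title(s) of positive passages.
    Multiple references for the same target type are joined with [SEP].
    """
    question = example["question"]
    targets = {
        "answer": example["answers"],                 # Context 1: the answer itself
        "sentence": example["answer_sentences"],      # Context 2: sentence containing the answer
        "title": example["positive_passage_titles"],  # Context 3: title of a passage containing the answer
    }
    pairs = []
    for target_type, references in targets.items():
        if references:  # skip targets that are unavailable for this example
            pairs.append({
                "target_type": target_type,
                "source": question,
                "target": " [SEP] ".join(references),
            })
    return pairs
```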

Retrieval with Generation-Augmented Queries
After generating the context(s) of a query, we append the context(s) to the query as the generation-augmented query. If there are multiple query contexts, we conduct retrieval using queries with different generated contexts separately and then fuse their results (one-time retrieval with all the contexts appended performs slightly, but not significantly, worse). For simplicity, we fuse different retrieval results in a straightforward way: an equal number of retrieved passages are taken from each source. One may also use weighted or more sophisticated fusion strategies such as BPFusion, RRFFusion, or BordaCountFusion (tools for these strategies are available at https://github.com/joaopalotti/trectools). Next, one can use any off-the-shelf retrieval tool of interest for passage retrieval. Here, we use a simple BM25 model to demonstrate that GAR with sparse representations can already achieve comparable or better performance than state-of-the-art dense methods while being much more lightweight and efficient, closing the gap between sparse and dense retrieval methods.
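The following minimal sketch shows the straightforward fusion described above: an equal share of top-ranked passages is taken from each generation-augmented query's result list. Deduplication across sources and the function name are illustrative choices, not details prescribed by the method.

```python
from typing import Dict, List


def fuse_equal_share(rankings: Dict[str, List[str]], k: int) -> List[str]:
    """Fuse the retrieval results of different generation-augmented queries.

    `rankings` maps a context type (e.g., "answer", "sentence", "title") to its
    ranked list of passage ids; an equal number of passages is taken from each.
    """
    per_source = k // len(rankings)
    fused, seen = [], set()
    for ranked in rankings.values():
        taken = 0
        for pid in ranked:
            if taken >= per_source:
                break
            if pid not in seen:  # skip passages already contributed by another source
                fused.append(pid)
                seen.add(pid)
                taken += 1
    return fused
```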
Open-domain QA with GAR

Passage Reading
We largely follow the design of the extractive reader in DPR for a fair comparison, while virtually any existing QA reader can be incorporated into GAR. In particular, using recent generative readers (Izacard and Grave, 2020) may lead to better results.

Passage-level Span Voting
Many extractive QA methods (Chen et al., 2017; Min et al., 2019b; Guu et al., 2020) measure the probability of span extraction in different retrieved passages independently, even though their collective signals may provide more evidence in determining the correct answers. Therefore, we propose passage-level span voting, which aggregates the predictions of spans with the same surface form from different retrieved passages. Intuitively, if a text span is considered as the answer multiple times in different passages, it is more likely to be the correct answer. Specifically, let D = [d_1, d_2, ..., d_k] denote the list of retrieved passages with passage relevance scores D, and let S_i = [s_1, s_2, ..., s_N] denote the top-ranked N text spans in passage d_i with span relevance scores S_i. GAR calculates a normalized score for the j-th span in passage d_i as

p(S_i[j]) = softmax(D)[i] × softmax(S_i)[j],

and then sums the normalized scores of spans that share the same surface form across the retrieved passages, taking the span with the highest aggregated score as the final answer.
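A minimal sketch of the voting scheme is given below. It follows the normalized score defined above and sums the scores of spans sharing the same surface form across passages; the function signature and the use of NumPy are illustrative.

```python
from collections import defaultdict
from typing import Dict, List

import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())
    return z / z.sum()


def passage_level_span_voting(
    passage_scores: List[float],      # relevance scores of the k retrieved passages (D)
    span_texts: List[List[str]],      # top-N span strings per passage (normalized surface forms)
    span_scores: List[List[float]],   # corresponding span relevance scores per passage (S_i)
) -> str:
    """Aggregate span scores across passages and return the highest-voted span."""
    passage_probs = softmax(np.array(passage_scores))
    votes: Dict[str, float] = defaultdict(float)
    for i, (texts, scores) in enumerate(zip(span_texts, span_scores)):
        span_probs = softmax(np.array(scores))
        for j, text in enumerate(texts):
            # p(S_i[j]) = softmax(D)[i] * softmax(S_i)[j], summed over identical surface forms
            votes[text] += passage_probs[i] * span_probs[j]
    return max(votes, key=votes.get)
```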

Experimental Setup

Datasets. We conduct experiments on the open-domain QA versions of Natural Questions (NQ) (Kwiatkowski et al., 2019) and TriviaQA (Trivia) (Joshi et al., 2017). The statistics of the datasets are listed in Table 1.

Evaluation Metrics. Following prior studies, we use top-k retrieval accuracy to evaluate the performance of the retriever and the Exact Match (EM) score to measure the performance of the reader.
Top-k retrieval accuracy is defined as the proportion of questions for which the top-k retrieved passages contain (at least) one answer span, which is an upper bound of how many questions are "answerable" by an extractive reader.
Exact Match (EM) is the proportion of predicted answer spans that are exactly the same as (one of) the ground-truth answer(s), after string normalization such as removing articles and punctuation.
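The two metrics can be computed roughly as sketched below. The string normalization (lowercasing, removing articles and punctuation) follows common SQuAD-style practice and the substring check for top-k accuracy is a simplification, so treat this as an approximation of the evaluation scripts rather than the exact implementation.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, and collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def top_k_hit(answers: List[str], retrieved_passages: List[str], k: int) -> bool:
    """True if any of the top-k passages contains at least one answer span."""
    return any(
        normalize(answer) in normalize(passage)
        for passage in retrieved_passages[:k]
        for answer in answers
    )


def exact_match(prediction: str, answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized ground-truth answer."""
    return any(normalize(prediction) == normalize(answer) for answer in answers)
```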

Implementation Details
We use Anserini (Yang et al., 2017) for BM25 retrieval with its default parameters. We conduct grid search for the classic query expansion baseline RM3 (Abdul-Jaleel et al., 2004). We use BART-large (Lewis et al., 2019) to generate query contexts in GAR. When there are multiple desired targets (such as multiple answers or titles), we concatenate them with [SEP] tokens as the reference and remove the [SEP] tokens in the generation-augmented queries. The generators on different datasets are trained independently without additional training samples from other datasets (i.e., the single-dataset setting in DPR). GAR uses the same reader as DPR with largely the same hyperparameters, which is initialized with BERT-base and takes 100 (500) retrieved passages during training (inference).
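As a rough illustration of the generation step, the snippet below loads a BART generator with the Hugging Face transformers library, generates a query context for a question, and appends it to form the generation-augmented query. The model path and generation hyper-parameters (beam size, maximum length) are placeholders, not the exact settings used in the experiments.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Hypothetical path to a BART-large generator fine-tuned on one target type
# (answer, sentence, or title).
MODEL_PATH = "path/to/gar-title-generator"

tokenizer = BartTokenizer.from_pretrained(MODEL_PATH)
model = BartForConditionalGeneration.from_pretrained(MODEL_PATH)


def augment_query(question: str, max_length: int = 64) -> str:
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    output_ids = model.generate(
        **inputs,
        num_beams=4,
        max_length=max_length,
        early_stopping=True,
    )
    context = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Any [SEP] tokens in the generated text are removed before appending.
    context = context.replace("[SEP]", " ")
    # The generation-augmented query: original question followed by the generated context.
    return " ".join((question + " " + context).split())
```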

Experimental Results
We evaluate the effectiveness of GAR in three stages, namely the generation of query contexts (Sec. 5.3.1), the retrieval of relevant passages (Sec. 5.3.2), and the passage reading for open-domain QA (Sec. 5.3.3).

Query Context Generation
Automatic Evaluation. To evaluate the quality of the generated query contexts, we first conduct an automatic evaluation with ROUGE (Table 2). As suggested by the nontrivial ROUGE scores, GAR does learn to generate meaningful query contexts that could help the retrieval stage. We take the checkpoint with the best ROUGE-1 F1 score on the validation set, while observing that the retrieval accuracy of GAR is relatively insensitive to the checkpoint selection, since we do not use the generated contexts directly but treat them as augmentation of the queries for retrieval.
Case Studies. In Table 3, we show several examples of the generated query contexts and their gold-standard references. In the first example, the correct album release date appears in both the generated answer and sentence, and the generated title is the same as the Wikipedia page of the album. In the last two examples, the generated answers are wrong but fortunately, the generated sentences contain the correct answer and (or) other relevant information and the generated titles are highly related to the question as well.

Generation-Augmented Retrieval
Comparison with the state-of-the-art. We next evaluate the effectiveness of GAR for retrieval. In Table 4, we show the top-k retrieval accuracy of BM25, BM25 with query expansion (RM3) (Abdul-Jaleel et al., 2004), DPR, GAR (i.e., BM25 with generation-augmented queries), and GAR +DPR.
[Table 4: Top-k retrieval accuracy of sparse and dense methods on the test sets of NQ and Trivia. GAR helps BM25 to achieve comparable or better performance than DPR.]

On the NQ dataset, while BM25 clearly underperforms DPR regardless of the number of retrieved passages, the gap between GAR and DPR is significantly smaller and becomes negligible when k ≥ 100. When k ≥ 500, GAR is slightly better than DPR despite simply using BM25 for retrieval. In contrast, the classic query expansion method RM3, while showing marginal improvement over the vanilla BM25, does not achieve performance comparable to GAR or DPR. By fusing the results of GAR and DPR in the same way as described in Sec. 3.2, we further obtain consistently higher performance than both methods, with top-100 accuracy of 88.8% and top-1000 accuracy of 93.2%. On the Trivia dataset, the results are even more encouraging: GAR achieves consistently better retrieval accuracy than DPR when k ≥ 20. On the other hand, the difference between BM25 and BM25 +RM3 is negligible, which suggests that naively treating top-ranked passages as relevant (pseudo relevance feedback) for query expansion does not always work. Results for more cutoffs of k can be found in App. A.
Effectiveness of various query contexts. In Fig. 1, we show the performance of GAR when different query contexts are used to augment the queries.

Passage Reading with GAR
We show the comparison of end-to-end QA performance in Table 5. GAR achieves the state-of-the-art performance among extractive methods on both the NQ and Trivia datasets, despite being much more lightweight and computationally efficient. Since the best performing generative methods (Izacard and Grave, 2020) rely on the retrieval results of DPR as model input, we believe that it is a low-hanging fruit to replace their input with GAR +DPR and further boost the performance. GAR outperforms the standard BM25 significantly, which indicates the effectiveness of generation-augmented queries. We also observe that, perhaps surprisingly, BM25 performs reasonably well when we take 500 passages during reader inference instead of 100 as in prior work, especially on the Trivia dataset, outperforming many recent state-of-the-art methods.

Efficiency of GAR
GAR is efficient and scalable since it uses sparse representations (BM25) for retrieval. The only overhead is on the generation of query contexts and the retrieval with generation-augmented (thus longer) queries, whose computational complexity is significantly lower than other methods with comparable retrieval accuracy (Table 6).  The training time of the generator in GAR is 3 to 6 hours on 1 V100 GPU depending on the generation target. As a comparison, REALM (Guu et al., 2020) uses 64 TPUs to train for 200k steps during pre-training alone and DPR , although more efficient, still takes about 24 hours to train with 8 V100 GPUs. To build indices of the Wikipedia passages (21 million), GAR only takes around 30 min with 35 CPUs, while DPR takes 8.8 hours on 8 GPUs to generate the dense representations and another 8.5 hours to build the FAISS index (Johnson et al., 2017). For retriever inference, GAR takes less than 1 min to retrieve 1,000 passages for the test set of NQ with answer/title-augmented queries and 2 min with sentence-augmented queries using 35 CPUs (i.e., less than 4 min in total). In contrast, DPR takes about 30 min on 1 V100 GPU.

Discussions
Despite the promising results of GAR, we note that there is still much room for improvement in future work. First of all, the current results are obtained without extensive hyper-parameter tuning for each module. In terms of methodology, for query context generation, we will explore multi-task learning to further reduce the computational cost and examine whether different contexts can mutually enhance each other when generated by the same generator.
For passage retrieval, one can adopt more advanced fusion techniques based on both the ranking and score of the passages. As the generator and retriever are largely independent now, it is also interesting to study how to jointly optimize generation and retrieval such that the generator is aware of the retriever and generates query contexts more beneficial for the retrieval stage.
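One example of such a ranking-based technique is reciprocal rank fusion (RRF), sketched below; it is shown only as an illustration of the more advanced fusion strategies discussed above and is not the fusion used in the experiments.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of passage ids with reciprocal rank fusion.

    Each passage receives sum_r 1 / (k + rank_r); k = 60 is the commonly used constant.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, pid in enumerate(ranked, start=1):
            scores[pid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```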
Beyond open-domain QA, GAR also has great potential for other tasks that involve text matching, such as conversation utterance selection (Lowe et al., 2015; Dinan et al., 2020) or broad search in information retrieval (Nguyen et al., 2016; Craswell et al., 2020). The default generation target is always available for different (supervised) tasks. For example, for conversation utterance selection, one can use the reference utterance as the default target and then match the concatenation of the conversation history and the generated utterance with the provided utterance candidates; for article search, the default target could be (part of) the ground-truth article itself. Other generation targets are more task-specific and can be designed as long as they can be fetched from the latent knowledge inside pre-trained LMs and are helpful for further text retrieval (matching).

Conclusion
In this work, we propose Generation-Augmented Retrieval (GAR) and demonstrate on open-domain QA that the relevant contexts generated by pre-trained LMs can significantly enrich the query semantics and improve retrieval accuracy. Remarkably, GAR with sparse representations performs similarly to or better than state-of-the-art methods based on the dense representations of the original queries. Furthermore, GAR equipped with an extractive reader achieves the state-of-the-art end-to-end performance on extractive open-domain QA.

A Details of Retrieval Performance
We show the detailed results of top-k retrieval accuracy of the compared methods in Figs. 2 and 3.

[Figure 3: Top-k retrieval accuracy on the test set of Trivia. GAR achieves better performance than DPR when k ≥ 5.]