Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

Generative models for open domain question answering have proven to be competitive, without resorting to external knowledge. While promising, this approach requires models with billions of parameters, which are expensive to train and query. In this paper, we investigate how much these models can benefit from retrieving text passages, potentially containing evidence. We obtain state-of-the-art results on the Natural Questions and TriviaQA open benchmarks. Interestingly, we observe that the performance of this method significantly improves when increasing the number of retrieved passages. This is evidence that sequence-to-sequence models offer a flexible framework to efficiently aggregate and combine evidence from multiple passages.


Introduction
Recently, several works have shown that factual information can be extracted from large scale language models trained on vast quantities of data (Radford et al., 2019; Petroni et al., 2019; Jiang et al., 2019; Talmor et al., 2019). Building on that observation and the advances in pretraining of natural language processing models, Roberts et al. (2020) introduced a generative model for open domain question answering. Without relying on external knowledge, this method obtained competitive results on several benchmarks. However, it requires models containing billions of parameters, since all the information needs to be stored in the weights. This makes models expensive to query and train. In this paper, we investigate how much this method could benefit from having access to an external source of knowledge, such as Wikipedia.
Retrieval based approaches were previously considered in the context of open domain question answering with extractive models (Chen et al., 2017). In that case, systems start by retrieving support documents, before extracting the answer from these documents. Different retrieval techniques have been considered, either using sparse representations based on TF/IDF or using dense embeddings (Guu et al., 2020; Karpukhin et al., 2020). The models which extract the answers are often based on contextualized word representations such as ELMo or BERT (Peters et al., 2018; Devlin et al., 2019), and predict a span as answer.
Aggregating and combining evidence from multiple passages is not straightforward when using extractive models, and multiple techniques have been proposed to address this limitation (Clark and Gardner, 2018; Min et al., 2019a).
In this paper, we explore a simple approach having the best of both worlds, by building on the exciting developments in generative modeling and retrieval for open domain question answering. This method proceeds in two steps, by first retrieving supporting passages using either sparse or dense representations. Then, a sequence-to-sequence model generates the answer, taking as input the retrieved passages in addition to the question. While conceptually simple, this method sets new state-of-the-art results on the TriviaQA and NaturalQuestions benchmarks. In particular, we show that the performance of our method significantly improves when the number of retrieved passages increases. We believe that this is evidence that generative models are good at combining evidence from multiple passages, compared to extractive ones.

Related work
Open domain question answering is the task of answering general domain questions, in which the evidence is not given as input to the system. While being a longstanding problem in natural language processing (Voorhees, 1999), this task has recently regained interest following the work by Chen et al. (2017). In that version of the problem, strong supervision is available to the learning system, in the form of spans corresponding to answers. Chen et al. (2017) proposed to solve the problem by first retrieving support documents from Wikipedia, before extracting the answer from the retrieved documents. Different methods were proposed to tackle the setting where no gold spans are given to the system, but only the correct answer. Clark and Gardner (2018) proposed to use a global normalization over all the spans corresponding to the answer, which was later applied to BERT based models (Wang et al., 2019). Min et al. (2019a) introduced a method based on hard expectation-maximization to tackle noisy supervision from this setting. Wang et al. (2018b) described a technique to aggregate answers from different paragraphs, using confidence and coverage scores.
Passage retrieval is an important step in open domain question answering, and is an active area of research to improve QA systems. Initially, sparse representations based on TF/IDF were used to retrieve support documents (Chen et al., 2017). Lee et al. (2018) introduced a supervised learning method to rerank paragraphs based on BiLSTM, while Wang et al. (2018a) trained a ranking system with reinforcement learning. A second approach to improve the retrieval step of QA systems is to use additional information such as the Wikipedia or Wikidata graphs (Min et al., 2019b; Asai et al., 2020). Recently, multiple works have shown that retrieval systems entirely based on dense representations and approximate nearest neighbors are competitive with traditional approaches. Such models can be trained using weak supervision in the form of question-answer pairs (Karpukhin et al., 2020), or pretrained using a cloze task and finetuned end-to-end (Guu et al., 2020; Lee et al., 2019).
Generative question answering was mostly considered in previous work for datasets that require generating answers, such as NarrativeQA (Kočiskỳ et al., 2018), CoQA (Reddy et al., 2019) or ELI5 (Fan et al., 2019). These datasets were generated in a way that answers do not correspond to spans in support documents, thus requiring abstractive models. Raffel et al. (2020) showed that generative models are competitive for reading comprehension tasks such as SQuAD (Rajpurkar et al., 2016), where answers are spans. Roberts et al. (2020) proposed to use large pretrained generative models, without using additional knowledge, for open domain question answering. Closest to our work, Min et al. (2020) and Lewis et al. (2020b) introduced retrieval augmented generative models for open domain question answering. Our approach differs from these works by how the generative model processes the retrieved passages. This allows the model to scale to large numbers of documents, and to benefit from this large amount of evidence.

Method
In this section, we describe our approach to open domain question answering. It proceeds in two steps, first retrieving support passages before processing them with a sequence-to-sequence model.

Table 1 (rows as extracted; column headers not recoverable, "-" marks a missing value):
(Wang et al., 2019)                     -     -     -     53.0  60.9
Path Retriever (Asai et al., 2020)      31.7  -     -     56.5  63.8
Graph Retriever (Min et al., 2019b)     34.7  55.8  -     -     -
Hard EM (Min et al., 2019a)             28.8  50.9  -     -     -
ORQA (Lee et al., 2019)                 31.3  45.1  -     20.2  -
REALM (Guu et al., 2020)                40.4  -     -     -     -
DPR (Karpukhin et al., 2020)            41.5  57.9  -     36.7  -
SpanSeqGen (Min et al., 2020)           42.5  -     -     -     -
RAG (Lewis et al., 2020b)               44

Retrieval. For the retrieval of support passages, we consider two methods: BM25 (Robertson et al., 1995) and DPR (Karpukhin et al., 2020). In BM25, passages are represented as bags of words, and the ranking function is based on term and inverse document frequencies. We use the implementation from Apache Lucene 1 with default parameters, and tokenize questions and passages with SpaCy. 2 In DPR, passages and questions are represented as dense vector representations, computed using two BERT networks. The ranking function is the dot product between the query and passage representations. Retrieval is performed using approximate nearest neighbors with the FAISS library. 3
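The two ranking functions described above can be illustrated with a minimal sketch. It assumes a textbook BM25 formula with k1 = 1.2 and b = 0.75 rather than Lucene's exact implementation, pre-tokenized inputs, and a brute-force search in place of approximate nearest neighbors; the function names `bm25_scores`, `dot_product_scores` and `top_k` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, passages_tokens, k1=1.2, b=0.75):
    """Score every passage against the query with a textbook BM25 formula."""
    n_passages = len(passages_tokens)
    avg_len = sum(len(p) for p in passages_tokens) / n_passages
    # Document frequency: number of passages that contain each term.
    doc_freq = Counter()
    for tokens in passages_tokens:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in passages_tokens:
        term_freq = Counter(tokens)
        # Length normalization: longer-than-average passages are penalized.
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in set(query_tokens):
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_passages - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


def dot_product_scores(query_vec, passage_vecs):
    """Dense ranking: inner product between query and passage vectors."""
    return [sum(q * p for q, p in zip(query_vec, vec)) for vec in passage_vecs]


def top_k(scores, k):
    """Indices of the k highest-scoring passages, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy example (made-up passages, whitespace tokenization).
passages = [
    "paris is the capital of france".split(),
    "berlin is the capital of germany".split(),
    "the eiffel tower is in paris".split(),
]
query = "capital of france".split()
best_sparse = top_k(bm25_scores(query, passages), k=2)
```

In a real system the dense vectors would come from the two BERT encoders, and the exhaustive `top_k` scan would be replaced by an index that supports approximate search.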

Model
Reading. Our generative model for open domain QA is based on a sequence-to-sequence network, pretrained on unsupervised data, such as T5 or BART (Raffel et al., 2020; Lewis et al., 2020a).
The model takes as input the question, as well as the support passages, and generates the answer. More precisely, each retrieved passage and its title are concatenated with the question, and processed independently from other passages by the encoder. We add special tokens question:, title: and context: before the question, title and text of each passage. Finally, the decoder performs attention over the concatenation of the resulting representations of all the retrieved passages. The model thus performs evidence fusion in the decoder only, and we refer to it as Fusion-in-Decoder.
1 lucene.apache.org
2 spacy.io
3 github.com/facebookresearch/faiss
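As an illustration of the input layout and of the fusion step, the following sketch builds one encoder input per passage and then concatenates per-passage outputs into a single sequence. The exact spacing around the question:, title: and context: markers, the dictionary keys and the function names are assumptions made for this example; the lists of numbers stand in for encoder hidden states.

```python
def format_passage_input(question, title, text):
    """Build one encoder input from the question and a single passage."""
    # Marker spacing is an assumption; the paper only names the three markers.
    return f"question: {question} title: {title} context: {text}"


def build_encoder_inputs(question, passages):
    """Return one input string per passage; each is encoded separately."""
    return [format_passage_input(question, p["title"], p["text"]) for p in passages]


def fuse_encoder_outputs(per_passage_outputs):
    """Concatenate per-passage token representations along the sequence axis.

    The decoder then attends over this single, long sequence.
    """
    fused = []
    for outputs in per_passage_outputs:
        fused.extend(outputs)
    return fused


# Toy example with two made-up passages.
inputs = build_encoder_inputs(
    "who wrote hamlet",
    [
        {"title": "Hamlet", "text": "Hamlet is a tragedy by William Shakespeare."},
        {"title": "Macbeth", "text": "Macbeth is a tragedy by William Shakespeare."},
    ],
)
```

Because each string repeats the question, every passage can be encoded without seeing the others, and evidence is only combined once the decoder attends over the fused sequence.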
By processing passages independently in the encoder, but jointly in the decoder, this method differs from Min et al. (2020) and Lewis et al. (2020b). Processing passages independently in the encoder allows the model to scale to a large number of contexts, as it only performs self-attention over one context at a time. This means that the computation time of the model grows linearly with the number of passages, instead of quadratically. On the other hand, processing passages jointly in the decoder allows the model to better aggregate evidence from multiple passages.
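The linear versus quadratic growth can be made concrete by counting the query-key pairs scored in one encoder self-attention layer: n passages of L tokens cost n·L² pairs when encoded separately, against (n·L)² when concatenated first. The sketch below is a simplified cost model with illustrative sizes, and the function name `encoder_attention_pairs` is hypothetical.

```python
def encoder_attention_pairs(num_passages, tokens_per_passage, joint=False):
    """Count query-key pairs scored by one encoder self-attention layer."""
    if joint:
        # All passages concatenated into one sequence before encoding.
        total_tokens = num_passages * tokens_per_passage
        return total_tokens * total_tokens
    # Each passage encoded on its own: n independent L x L attention maps.
    return num_passages * tokens_per_passage ** 2


# Illustrative sizes: 100 passages of 250 tokens each.
independent = encoder_attention_pairs(100, 250)
concatenated = encoder_attention_pairs(100, 250, joint=True)
ratio = concatenated // independent
```

Under this simplified model the ratio between the two costs equals the number of passages, which is why encoding passages separately stays affordable as more passages are retrieved.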

Experiments
In this section, we report empirical evaluations of Fusion-in-Decoder for open domain QA.
Datasets. We consider the following datasets, and use the same setting as Lee et al. (2019):
• NaturalQuestions (Kwiatkowski et al., 2019) contains questions corresponding to Google search queries. The open-domain version of this dataset is obtained by discarding answers with more than 5 tokens.
• TriviaQA (Joshi et al., 2017) contains questions gathered from trivia and quiz-league
• SQuAD v1.1 (Rajpurkar et al., 2016) is a reading comprehension dataset. Given a paragraph extracted from Wikipedia, annotators were asked to write questions, for which the answer is a span from the corresponding paragraph.
Evaluation. Predicted answers are evaluated with the standard exact match metric (EM), as introduced by Rajpurkar et al. (2016). A generated answer is considered correct if it matches any answer of the list of acceptable answers after normalization. This normalization step consists of lowercasing and removing articles, punctuation and duplicated whitespace.
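The exact match metric described above can be sketched as follows. This is a minimal version that assumes English articles (a, an, the) and ASCII punctuation; the function names are chosen for this example and it is not the official evaluation script.

```python
import re
import string

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, then strip punctuation, articles and extra whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Splitting and re-joining collapses duplicated whitespace.
    return " ".join(text.split())


def exact_match(prediction, acceptable_answers):
    """True if the normalized prediction equals any normalized reference."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(ans) for ans in acceptable_answers)
```

For example, `exact_match("the Beatles", ["Beatles", "The Fab Four"])` holds because both sides normalize to the same string, while a prediction containing extra words, such as "Paris, France" against the reference "Paris", does not match.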
Technical details. We initialize our models with the pretrained T5 models (Raffel et al., 2020), available in the HuggingFace Transformers library. 4 We consider two model sizes, base and large, containing respectively 220M and 770M parameters. We fine-tune the models on each dataset independently, using Adam (Kingma and Ba, 2014) with a constant learning rate of 10^-4 and a dropout rate of 10%. We train the model for 10k gradient steps, with a batch size of 64, using 64 Tesla V100 32GB GPUs. We evaluate models every 500 steps and select the best one on the validation set based on the Exact Match score. During training on NaturalQuestions and SQuAD, we sample the target among the list of answers, while for TriviaQA, we use the unique human-generated answer. For TriviaQA, answers in uppercase are normalized by converting all letters to lowercase except the first letter of each word, using the title Python string method. For both training and testing, we retrieve 100 passages (unless said otherwise), and truncate them to 250 word pieces. Following the results of Karpukhin et al. (2020), passages are retrieved with DPR for NQ and TriviaQA, and with BM25 for SQuAD. We generate answers by using greedy decoding.
4 github.com/huggingface/transformers
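The settings and the target preparation described above can be summarized in a short sketch. The dictionary keys and function names are hypothetical and chosen only for this example; the values are those stated in the text, and applying title casing only to answers that are entirely uppercase is an assumed reading of the description.

```python
import random

# Hyperparameters as stated in the text; the key names are illustrative.
TRAINING_CONFIG = {
    "learning_rate": 1e-4,
    "dropout": 0.1,
    "total_steps": 10_000,
    "batch_size": 64,
    "eval_every_steps": 500,
    "num_passages": 100,
    "max_word_pieces_per_passage": 250,
    "decoding": "greedy",
}


def normalize_triviaqa_answer(answer):
    """Title-case answers written entirely in uppercase; leave others as-is."""
    return answer.title() if answer.isupper() else answer


def pick_training_target(answers, rng=random):
    """Sample one target among the acceptable answers (NQ and SQuAD setting)."""
    return rng.choice(answers)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled targets reproducible across runs.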
Comparison to state-of-the-art. In Table 1, we compare the results obtained by Fusion-in-Decoder with existing approaches for open domain question answering. We observe that while conceptually simple, this method outperforms existing work on the NaturalQuestions and TriviaQA benchmarks.
In particular, generative models seem to perform well when evidence from multiple passages needs to be aggregated, compared to extractive approaches. Our method also performs better than other generative models, showing that scaling to a large number of passages and processing them jointly leads to improvements in accuracy. We also observe that using additional knowledge in generative models through retrieval leads to important performance gains. On NaturalQuestions, the closed book T5 model obtains 36.6% accuracy with 11B parameters, while our approach obtains 44.1% with 770M parameters plus Wikipedia with BM25 retrieval. Both methods use roughly the same amount of memory to store information, indicating that text based explicit memories are competitive for knowledge retrieval tasks.
Scaling with number of passages. In Figure 3, we report the performance with respect to the number of retrieved passages. In particular, we observe that increasing the number of passages from 10 to 100 leads to a 6% improvement on TriviaQA and a 3.5% improvement on NaturalQuestions.
On the other hand, the performance of most extractive models seems to peak around 10 to 20 passages (Wang et al., 2019; Yang et al., 2019). We believe that this is evidence that sequence-to-sequence models are good at combining information from multiple passages.
Impact of the number of training passages. In the previous section, the model was trained and evaluated with the same number of passages. To reduce the training computational budget, a simple solution consists in training the model with fewer passages. In Table 2, we report the performance obtained by training with different numbers of passages, while testing with 100 passages. We observe that reducing the number of training passages leads to a decrease in accuracy. Further, we propose to finetune the previous models using 100 passages for 1000 steps. This reduces the accuracy gap, while using significantly less computational resources: we can reach 46.0 EM on NaturalQuestions, using 147 GPU hours, compared to 425 GPU hours when training on 100 passages.
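To put the two compute budgets quoted above side by side, the relative saving can be worked out directly from the reported GPU hours. The helper name `relative_saving` is hypothetical; only the two figures come from the text.

```python
def relative_saving(reduced_cost, full_cost):
    """Fraction of the full budget saved by the cheaper schedule."""
    return 1.0 - reduced_cost / full_cost


# GPU hours reported in the text for NaturalQuestions.
saving = relative_saving(147, 425)
speedup = 425 / 147
```

With these figures the cheaper schedule uses roughly a third of the budget, a saving of about 65%.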

Figure 1: A simple approach to open domain question answering. First, it retrieves support text passages from an external source of knowledge such as Wikipedia. Then, a generative encoder-decoder model produces the answer, conditioned on the question and the retrieved passages. This approach scales well with the number of retrieved passages, as the performance keeps improving when retrieving up to one hundred passages.

Figure 2: Architecture of the Fusion-in-Decoder method.

Figure 3: Performance of Fusion-in-Decoder (base) on validation sets as a function of the number of retrieved passages.

Following Lee et al. (2019), we use the validation set as the test set, and keep 10% of the training set for validation. We use the Wikipedia dumps from Dec. 20, 2018 for NQ and TriviaQA and from Dec. 21, 2016 for SQuAD. We apply the same preprocessing as Chen et al. (2017); Karpukhin et al. (2020), leading to passages of 100 words, which do not overlap.
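The chunking into non-overlapping 100-word passages can be sketched as below. This only illustrates the splitting step: it assumes whitespace-separated words and keeps a shorter final passage, both of which are choices made for this example, and it leaves out any article cleaning done by the cited preprocessing.

```python
def split_into_passages(text, words_per_passage=100):
    """Split an article into consecutive, non-overlapping word chunks."""
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# Toy article of 250 placeholder words.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
```

Each word ends up in exactly one passage, so joining the passages in order recovers the original word sequence.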

Table 1: Comparison to state-of-the-art. On TriviaQA, we report results on the open domain test set (left), and on the hidden test set (right; competitions.codalab.org/competitions/17208#results).

Table 2: Performance depending on the number of passages used during training. Exact Match scores are reported on dev sets.