UnitedQA: A Hybrid Approach for Open Domain Question Answering

To date, most recent work under the retrieval-reader framework for open-domain QA focuses exclusively on either the extractive or the generative reader. In this paper, we study a hybrid approach for leveraging the strengths of both models. We apply novel techniques to enhance both extractive and generative readers built upon recent pretrained neural language models, and find that proper training methods can provide large improvements over previous state-of-the-art models. We demonstrate that a simple hybrid approach that combines answers from both readers effectively takes advantage of both extractive and generative answer inference strategies and outperforms single models as well as homogeneous ensembles. Our approach outperforms previous state-of-the-art models by 3.3 and 2.7 points in exact match on NaturalQuestions and TriviaQA, respectively.


Introduction
Open-domain question answering (QA) has been a long-standing problem in natural language processing, information retrieval, and related fields. Compared with the recently popular "reading comprehension" QA tasks (Rajpurkar et al., 2016, 2018), the open-domain QA task, where systems are required to answer a question directly without being provided with the corresponding evidence, is less studied. In other words, the task evaluates a system's ability to effectively fetch relevant information, consolidate knowledge expressed across multiple sources, and produce valid answers for input questions. One typical framework for open-domain QA is the retrieval-reader framework (Chen et al., 2017; Guu et al., 2020; Karpukhin et al., 2020), where the relevant information is first retrieved from a large text corpus by an information retrieval module, and a neural answering module then navigates multiple passages for answer inference. In this work, we mainly focus on this setting for open-domain QA. More specifically, different from recent work (Guu et al., 2020; Karpukhin et al., 2020) on improving the retriever, we investigate potential improvements for the reading part.

[Figure 1: Prediction agreement among three generative readers (G-1 to G-3) and three extractive readers (E-1 to E-3).]
Within the retrieval-reader framework, there are two paradigms of reading, i.e., extractive (Karpukhin et al., 2020; Guu et al., 2020) and generative (Lewis et al., 2020b; Izacard and Grave, 2020) readers. Generally speaking, extractive readers extract contiguous spans from the retrieved passages, while generative ones decode answer strings conditioned on the question and the retrieved context. The two types of readers thus adopt different answer inference strategies. However, to date, most existing work on open-domain QA uses either an extractive reader or a generative reader exclusively. In this work, we hypothesize that a hybrid reader combining the extractive and generative readers can be a better option for open-domain QA. As shown in Figure 1, compared with the prediction agreement among only generative or only extractive readers (top-left and bottom-right), the cross prediction agreement between extractive and generative readers (bottom-left) is low (below 50%). This indicates that the answers produced by the two types of models differ and can be complementary to each other. Therefore, in this work, we study a simple hybrid approach, UnitedQA, to combine answers from both extractive and generative readers.
For open-domain QA, one of the main challenges for the reader model is to produce answers from a noisy collection of retrieved documents. In other words, reading comprehension with evidence returned by information retrieval systems constitutes a weakly-supervised QA setting due to the noise in the heuristics-based labeling (Chen et al., 2017). To address this issue, recent work based on either extractive or generative readers (Guu et al., 2020; Karpukhin et al., 2020; Lewis et al., 2020a; Izacard and Grave, 2020) resorts to different large-scale pre-trained neural language models. Thanks to self-supervised learning over vast amounts of text, these language models have been shown to encode world knowledge in their parameters and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. Built upon recent state-of-the-art pre-trained neural language models, i.e., T5 (Raffel et al., 2019) and ELECTRA (Clark et al., 2020), UnitedQA further incorporates techniques to improve and stabilize model training for both extractive and generative readers. Specifically, we consider posterior differential regularization (Cheng et al., 2020b) and distant supervision assumptions (Cheng et al., 2020a) to enhance the extractive reader. For the generative reader, we incorporate attention bias (Lewis et al., 2020a) into T5-FID (Izacard and Grave, 2020), and improve unconstrained generation training with adversarial training (Ju et al., 2019; Jiang et al., 2020).
Our experimental results highlight the benefits of the hybrid approach, i.e., combining extractive and generative readers. Built on both improved readers, UnitedQA sets new state-of-the-art results on two popular open-domain QA datasets, reaching 54.7 and 70.3 exact match on NaturalQuestions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), respectively. It is worth noting that UnitedQA not only outperforms each single model but also brings more pronounced improvements than homogeneous ensembles of either extractive or generative readers. Additionally, our improved extractive model, UnitedQA-E, outperforms previous state-of-the-art extractive models of a similar size and generative models of a much larger size (>3x). Based on our analyses of both component readers of UnitedQA, the extractive reader is found to be better at generalizing to unseen cases, while the generative reader excels at temporal reasoning.

Method: UnitedQA
In this section, we present the overall pipeline of the UnitedQA system, which consists of three components: Retrieval, Reading, and Re-ranking. First, the retrieval module fetches a list of relevant passages from a Wikipedia dump for a given question. Second, the hybrid reader module produces answer candidates from the set of retrieved passages. Lastly, the re-ranking module combines the answer candidates with linear interpolation and produces the final answer.

Retrieval. Following Karpukhin et al. (2020), we consider two methods, BM25 and dense passage retrieval (DPR), for retrieving the support passages for a given question. For BM25, passages are encoded as bags of words (BOW), and ranking is based on term and inverse document frequencies. For DPR, passages and questions are represented as dense vectors produced by two separate BERT models, and the relevance score is computed as the dot product between the question and passage vectors (see the sketch below). In this paper, we use the same implementation as Karpukhin et al. (2020). Specifically, the English Wikipedia dump from Dec. 20, 2018 is used as the source of documents for retrieval, with semi-structured data, such as tables and lists, removed. Each document is split into disjoint 100-word passages as the basic retrieval unit. The top-100 passages are then passed to the readers.

Reading. We combine the generative reader and the extractive reader to produce answer candidates over the retrieved passages. Here, we only give a high-level description of our approach; more details regarding our improved extractive and generative models are presented in section 3 and section 4, respectively.
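Before describing the readers, here is a minimal PyTorch sketch of the DPR relevance scoring from the retrieval step above; the function name and tensor shapes are our own illustration, not the released implementation:

```python
import torch

def dpr_scores(question_vec: torch.Tensor, passage_vecs: torch.Tensor) -> torch.Tensor:
    """Dense relevance scores as dot products.

    question_vec: [d] question embedding from the question BERT encoder.
    passage_vecs: [num_passages, d] passage embeddings from the passage encoder.
    """
    return passage_vecs @ question_vec  # [num_passages]

# Hypothetical usage: keep the top-100 passages for reading.
# top100 = dpr_scores(q_vec, p_vecs).topk(100).indices
```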
The generative reader is based on a sequence-to-sequence model pre-trained in a forward-generation fashion on a large corpus, i.e., T5 (Raffel et al., 2019). Similar to Izacard and Grave (2020), the model takes the question and all of its relevant passages as input, and then generates the answer string token by token. Specifically, each retrieved passage is concatenated with the corresponding question to form an encoder input, and the decoder performs reasoning over the concatenation of all encoded evidence through an attention mechanism.
Following state-of-the-art extractive QA models (Lee et al., 2019; Karpukhin et al., 2020), our extractive reader is based on a Transformer neural network pre-trained with a cloze-style self-supervised objective, i.e., ELECTRA (Clark et al., 2020). A given question and a support passage are jointly encoded into neural text representations. These representations are then used to define scores or probabilities of possible answer begin and end positions, which are in turn used to define probabilities over possible answer spans.
Finally, answer string probabilities are obtained by aggregating over all answer spans matching the string across the entire set of support passages.

Datasets
We use two representative QA datasets and adopt the same training/dev/test splits as in previous work (Lee et al., 2019; Karpukhin et al., 2020). In the following, we give an overview of each dataset and refer interested readers to the original papers for more details.

NaturalQuestions (Kwiatkowski et al., 2019) is composed of questions posed by real users to Google Search, each with answers identified by human annotators in Wikipedia. The open-domain version of NaturalQuestions (Lee et al., 2019) only considers questions with short answers, i.e., answers with fewer than five tokens. In NaturalQuestions, the questions are considered more information-seeking, given that the askers did not already know the answers beforehand. The dataset has driven several recent advances in open-domain QA (Lee et al., 2019; Karpukhin et al., 2020; Min et al., 2019; Guu et al., 2020). In addition, we use another evaluation set, the dev set recently introduced by the EfficientQA competition, which is constructed in the same way as the original NaturalQuestions dataset.

TriviaQA (Joshi et al., 2017) contains trivia question-answer pairs scraped from the web. Different from NaturalQuestions, the questions here are written with known answers in mind. Specifically, the unfiltered set (Joshi et al., 2017) has been used for developing open-domain QA models (Lee et al., 2019; Karpukhin et al., 2020; Min et al., 2019; Guu et al., 2020).

Experiments: Improved Extractive Model
In this section, we explore different approaches to improving the extractive model for open-domain QA. We mainly consider recent advances in 1) improving NLP model robustness (Cheng et al., 2020b), 2) extractive QA with weak supervision (Min et al., 2019; Cheng et al., 2020a), and 3) enhanced textual representations (Clark et al., 2020).

Extractive Reader
Different from Karpukhin et al. (2020), where the extractive prediction is based on two decoupled probabilities, i.e., passage selection and span extraction (Wang et al., 2019), we use a single probability space over token positions of answer spans across all retrieved passages, as in Cheng et al. (2020a). Since the task is to predict an answer string rather than a particular mention for a given question, one potential benefit of the single-probability-space approach is that it allows aggregating information across answer spans corresponding to the same string during inference.

Given a question $q$ and a set of $K$ retrieved passages $p_1, \ldots, p_K$, the text encoder produces contextualized representations for each question-passage pair $(q, p_k)$ formatted as "[CLS] question [SEP] passage [SEP]". Specifically, for the $i$-th token in passage $p_k$, the final hidden vector $\mathbf{h}_i^k \in \mathbb{R}^d$ is used as the contextualized token embedding, where $d$ is the vector dimension. The span-begin score for the $i$-th token of passage $p_k$ is computed as

$$s_b(i_k) = \mathbf{w}_b^\top \mathbf{h}_i^k,$$

where $\mathbf{w}_b \in \mathbb{R}^d$ is a learnable vector. Thus, the probability of an answer span starting at position $i$ of passage $p_k$ is

$$P_b(i_k) = \frac{\exp\big(s_b(i_k)\big)}{Z_b},$$

where $Z_b$ is the normalizing factor computed by summing $\exp\big(s_b(\cdot)\big)$ over all token positions of all $K$ passages. The span-end score $s_e(j_k)$, the probability $P_e(j_k)$ for an end position $j$ in passage $p_k$, and the normalizing factor $Z_e$ are defined in the same way. The probability of an answer span $(i_k, j_k)$ is then

$$P(i_k, j_k) = P_b(i_k) \, P_e(j_k).$$

During training, either the marginal log-likelihood (MML) of all correct answer spans (Karpukhin et al., 2020) or the log-likelihood of the most likely correct span (HardEM) (Min et al., 2019) is maximized.
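To make this formulation concrete, the following is a minimal PyTorch sketch of the single-probability-space scoring and the two training objectives; the function names and the flattened indexing of gold spans are our own illustration, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def span_log_probs(hidden, w_b, w_e):
    """Begin/end log-probabilities over a single probability space.

    hidden: [K, L, d] contextualized token embeddings for K passages of length L.
    w_b, w_e: [d] learnable scoring vectors for begin and end positions.
    Normalization runs jointly over all K * L positions, not per passage.
    """
    log_pb = F.log_softmax((hidden @ w_b).reshape(-1), dim=0)  # [K * L]
    log_pe = F.log_softmax((hidden @ w_e).reshape(-1), dim=0)  # [K * L]
    return log_pb, log_pe

def mml_loss(log_pb, log_pe, begins, ends):
    """Marginal log-likelihood over all correct (begin, end) pairs.

    begins, ends: LongTensors of flattened positions (k * L + i) of gold spans.
    """
    span_ll = log_pb[begins] + log_pe[ends]
    return -torch.logsumexp(span_ll, dim=0)

def hardem_loss(log_pb, log_pe, begins, ends):
    """Log-likelihood of only the most likely correct span (HardEM)."""
    span_ll = log_pb[begins] + log_pe[ends]
    return -span_ll.max()
```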

Improvement Methods
In addition to better textual representations from ELECTRA (Clark et al., 2020), we consider two methods for improving the training of the extractive reader. To cope with the label noise inherent in distantly supervised open-domain QA (Chen et al., 2017), we investigate the recently developed posterior differential regularization (PDR) (Cheng et al., 2020b) to improve the robustness of the extractive reader. As an extension of recent methods for improving local model smoothness (Miyato et al., 2018; Sokolić et al., 2017), PDR regularizes the difference between the posteriors on clean and noisy inputs with respect to the family of f-divergences (Csiszár and Shields, 2004). Different from Cheng et al. (2020b), where only the clean-supervision setting is considered, we apply PDR to the weakly supervised open-domain QA scenario. Since it is computationally expensive to enumerate all possible spans, we apply two separate regularization terms to the begin and end position probabilities, respectively, at the multi-passage level.
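As a rough illustration, the sketch below regularizes the begin (or end) posterior against a randomly perturbed copy of the input embeddings; the random-noise perturbation and the symmetric KL standing in for the f-divergence are our own assumptions, and `model` is a hypothetical callable mapping input embeddings to position logits:

```python
import torch
import torch.nn.functional as F

def pdr_term(model, inputs_embeds, attention_mask, noise_eps=1e-3):
    """Posterior differential regularization sketch for one position distribution.

    `model` is assumed to map input embeddings to begin (or end) position
    logits over all passages. We penalize the divergence between posteriors
    on clean and randomly perturbed embeddings; the paper's exact choice of
    perturbation and f-divergence may differ.
    """
    clean_logits = model(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
    noise = noise_eps * torch.randn_like(inputs_embeds)
    noisy_logits = model(inputs_embeds=inputs_embeds + noise, attention_mask=attention_mask)

    log_p = F.log_softmax(clean_logits, dim=-1)
    log_q = F.log_softmax(noisy_logits, dim=-1)
    # Symmetric KL between the clean and noisy posteriors.
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
    return kl_pq + kl_qp  # added to the main loss with a weighting coefficient
```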

Main results
First, we compare our improved extractive models to two recent models, REALM (Guu et al., 2020) and RAG (Lewis et al., 2020b), which are first pre-trained with different retrieval-augmented objectives and then fine-tuned for open-domain QA. In addition, we include as baselines DPR (Karpukhin et al., 2020) and T5-FID (Izacard and Grave, 2020), both of which are based on the same retriever as ours. Table 1 shows results for our improved extractive models based on ELECTRA-base (UnitedQA-E base) and ELECTRA-large (UnitedQA-E large), respectively, along with recent state-of-the-art models.
Compared with the recent state-of-the-art extractive model (DPR), our base model delivers a pronounced 15% relative improvement on both NaturalQuestions (+6.2 absolute) and TriviaQA (+8.4 absolute). More importantly, UnitedQA-E base achieves comparable or even better performance than generative models of larger size, i.e., RAG and T5-FID base. Finally, by using a larger text encoder (ELECTRA-large), our extractive model sets new state-of-the-art results for both NaturalQuestions and TriviaQA.

Ablation Study
In Table 2, we present ablation experiments on the effectiveness of different textual representations and training methods for the extractive model. Compared with the multi-objective formulation using two MML objectives (Cheng et al., 2020a), we find that a new multi-objective combining HardEM at the multi-passage level with MML at the passage level is more effective for open-domain QA. In addition to the multi-objective training, there is a noticeable improvement brought by the regularization method (PDR), which indicates the importance of proper regularization for training with noisy supervision. Last but not least, the large improvement of ELECTRA over BERT indicates the importance of the underlying pre-trained text representations.


Experiments: Improved Generative Model

Generative Reader
The model architecture of our generative reader is based on T5-Fusion-in-Decoder (Izacard and Grave, 2020). Given a question $q$ and a set of $K$ retrieved passages $p_1, \ldots, p_K$, the encoder encodes each $(q, p_k)$ pair independently and produces contextualized representations $\mathbf{h}_i^k \in \mathbb{R}^d$ for the $i$-th token of the $k$-th pair. The decoder then performs attention over the concatenation of the representations of all retrieved passages and generates the answer string.
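The following is a minimal sketch of this fusion-in-decoder encoding, assuming a generic `encoder` callable that maps token ids to hidden states; input formatting details (special prefixes, padding) are omitted:

```python
import torch

def fid_encode(encoder, question_ids, passage_ids_list):
    """Fusion-in-Decoder encoding sketch.

    encoder: callable mapping token ids [1, L] to hidden states [1, L, d].
    Each (question, passage_k) pair is encoded independently; the decoder
    later attends over the concatenation of all encoder outputs.
    """
    encoded = []
    for passage_ids in passage_ids_list:
        pair_ids = torch.cat([question_ids, passage_ids], dim=-1)  # [1, Lq + Lp]
        encoded.append(encoder(pair_ids))                          # [1, L, d]
    # One long memory over all passages for the decoder's cross-attention.
    return torch.cat(encoded, dim=1)                               # [1, sum(L), d]
```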

Decoder Attention Bias
The decoder in the T5 transformer model adopts a cross-attention mechanism to compute attention scores between the decoded answer tokens and all retrieved passage tokens. Specifically, let $\mathbf{q}_i \in \mathbb{R}^d$ be the query vector of the $i$-th decoding token, and $\mathbf{m}_j^k \in \mathbb{R}^d$ be the key vector of the $j$-th token in the $k$-th retrieved passage. The standard multi-head attention score $s_{i,j}^k$ is computed as

$$s_{i,j}^k = \mathbf{q}_i^\top \mathbf{m}_j^k.$$

To take the ranking information of the retrieved passages into account, we revise this score by incorporating an attention bias term:

$$s_{i,j}^k = \mathbf{q}_i^\top \mathbf{m}_j^k + b_k,$$

where $\mathbf{b}_k \in \mathbb{R}^{|\mathrm{Head}|}$ is a trainable cross-attention bias vector in the decoder module, with one entry per attention head, associated with the $k$-th retrieved passage.
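A shape-level sketch of the biased cross-attention follows; the tensor layouts are illustrative choices of our own (the actual T5 implementation organizes heads and scaling differently):

```python
import torch

def biased_cross_attention_scores(q, m, bias):
    """Cross-attention with a per-passage, per-head rank bias.

    q:    [H, Lq, d]     decoder query vectors for H heads.
    m:    [K, H, Lp, d]  encoder key vectors for K retrieved passages.
    bias: [K, H]         trainable bias indexed by retrieval rank k and head h,
                         shared across token positions.
    """
    K, H, Lp, d = m.shape
    scores = torch.einsum("hqd,khpd->khqp", q, m)   # [K, H, Lq, Lp]
    scores = scores + bias[:, :, None, None]        # add the rank bias per head
    # Softmax over all passage tokens jointly (flatten K and Lp).
    flat = scores.permute(1, 2, 0, 3).reshape(H, q.shape[1], K * Lp)
    return torch.softmax(flat, dim=-1)
```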

Adversarial Training
The generative reader is trained by maximizing a sequence-to-sequence objective for a training sample $(x, y)$, where $x$ denotes the input consisting of the question paired with all retrieved passages, $x = \big((q, p_1), \ldots, (q, p_K)\big)$, and $y = (y_1, \ldots, y_N)$ is the answer string:

$$L(x, y; \theta) = \log p_\theta(y \mid x) = \sum_{n=1}^{N} \log p_\theta(y_n \mid y_{<n}, x).$$

Adversarial training creates adversarial examples by adding small perturbations to the embedding layer. Assume the word(-piece) embedding layer is parameterized by a matrix $V \in \mathbb{R}^{|V| \times d}$, where $|V|$ is the vocabulary size and $d$ is the embedding dimension. The adversarial embedding matrix $\hat{V}$ is obtained by perturbing $V$ along the normalized gradient, against the objective:

$$\hat{V} = V - \epsilon \cdot \mathrm{SG}\!\left(\frac{\nabla_V L(x, y; \theta)}{\lVert \nabla_V L(x, y; \theta) \rVert}\right),$$

where $\mathrm{SG}(\cdot)$ is the stop-gradient operation and $\epsilon$ controls the perturbation size. We replace the original $V$ in the model parameters $\theta$ with $\hat{V}$, obtaining $\hat{\theta}$, and compute the adversarial loss as

$$L_{adv}(x, y) = L(x, y; \hat{\theta}).$$

The overall training objective is the sum of the two losses:

$$L_{total} = L(x, y; \theta) + L_{adv}(x, y).$$
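A minimal PyTorch sketch of this procedure follows, where `model(batch)` is assumed to return the negative log-likelihood loss (so perturbing along the loss gradient is adversarial) and `model.embed.weight` is the shared word-embedding matrix; both interfaces are our own assumptions:

```python
import torch

def adversarial_step(model, batch, optimizer, eps=1e-3):
    """One training step with an embedding-matrix perturbation (sketch)."""
    optimizer.zero_grad()
    emb = model.embed.weight

    loss = model(batch)                      # clean loss: -log p(y|x)
    grad = torch.autograd.grad(loss, emb, retain_graph=True)[0]
    # Normalized ascent direction on the loss; detach() acts as stop-gradient.
    delta = eps * (grad / (grad.norm() + 1e-12)).detach()

    emb.data.add_(delta)                     # move embeddings to the adversarial point
    adv_loss = model(batch)                  # adversarial loss at the perturbed point
    (loss + adv_loss).backward()             # overall objective: sum of both losses
    emb.data.sub_(delta)                     # restore the original embeddings

    optimizer.step()
```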

Experiments: Hybrid Model
In this part, we study whether it is advantageous to combine the generative and extractive readers in a hybrid fashion. Specifically, we leverage the improved extractive and generative models from section 3 and section 4, respectively. As in the previous experiments, the same retriever is used to fetch the top-100 passages per question for both extractive and generative readers. To evaluate the advantage of the hybrid of extractive and generative models (UnitedQA), we include two homogeneous ensemble baselines, one consisting of only extractive readers (UnitedQA-E++) and the other of only generative readers (UnitedQA-G++). Each model is trained independently with different random seeds. In our study, a simple linear interpolation over all model predictions is used to produce the final answer. Specifically, we combine three models for each ensemble case. For the homogeneous ensembles, the majority prediction is used. For the hybrid of extractive and generative readers, we select a three-model combination from the set of three generative and three extractive models based on their performance on the dev set. In addition, we use a scalar a to weight extractive predictions and 1 - a to weight generative predictions; the answer with the highest weighted vote is predicted as the final answer. The results are summarized in the lower part of Table 1.
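As an illustration, the following sketch implements the weighted vote just described; the lower-cased answer-string normalization is our own simplification:

```python
from collections import Counter

def combine_answers(extractive_preds, generative_preds, a=0.5):
    """Weighted vote over answer strings from both reader families (sketch).

    extractive_preds / generative_preds: lists of answer strings, one per model.
    a: interpolation weight for extractive votes; generative votes get 1 - a.
    """
    votes = Counter()
    for ans in extractive_preds:
        votes[ans.strip().lower()] += a
    for ans in generative_preds:
        votes[ans.strip().lower()] += 1.0 - a
    return max(votes, key=votes.get)

# Hypothetical usage with three models per family:
# final = combine_answers(["mount everest", "everest", "mount everest"],
#                         ["mount everest", "k2", "mount everest"], a=0.6)
```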
As expected, all ensemble models show an improvement over their single model counterparts.
However, it is worth noting that the two homogeneous ensemble baselines, UnitedQA-E++ and UnitedQA-G++, only provide marginal improvements over the corresponding best single models. The significant improvement brought by our proposed hybrid approach indicates the benefit of combining extractive and generative readers for open-domain QA. Moreover, we evaluate our hybrid model trained on NaturalQuestions directly on the dev and test sets introduced by the EfficientQA competition. As shown in Table 3, UnitedQA again outperforms both single models and homogeneous ensembles on the dev set, and is the best-performing system on the EfficientQA leaderboard.

Analysis
Given the advantage of the hybrid approach to open-domain QA, we study in detail the behavior of our state-of-the-art extractive and generative readers (i.e., UnitedQA-E and UnitedQA-G). First, we evaluate the impact of the retrieval model. Second, following Lewis et al. (2020c), we conduct a breakdown evaluation of the readers to investigate what drives their overall performance, i.e., the extent of memorization versus generalization. Lastly, we carry out a manual inspection of the prediction errors made by the extractive and generative models, respectively.

Impact of Retrieval Recall. Here, we vary the number of retrieved passages during inference and report the end-to-end exact match scores of UnitedQA-E and UnitedQA-G along with the corresponding top-k retrieval accuracy. The results are summarized in Table 4. As expected, when the number of retrieved passages increases, both top-k retrieval accuracy and end-to-end QA performance improve. However, there is a noticeable gap between the improvement in retrieval recall and the improvement in the corresponding end-to-end QA performance, especially for the extractive reader. This is likely caused by the additional noise introduced along with the improved retrieval recall. Specifically, only about half of the retriever improvement is effectively utilized by the extractive model, while the generative model benefits more from retrieving additional passages. This suggests that, by attending over the concatenation of all passages, the generative model is more effective at de-noising than the extractive model.
Breakdown Evaluation. Following Lewis et al. (2020c), we carry out a breakdown evaluation of model performance over the NaturalQuestions and TriviaQA test sets. Given their superior performance, we again only consider our improved extractive and generative models, i.e., UnitedQA-E large and UnitedQA-G, respectively. The evaluation is summarized in Table 5. In comparison to their corresponding overall performance, both the extractive and generative models achieve much better results on the "Overlap" categories (i.e., "Question Overlap" and "Answer Overlap") for both NaturalQuestions and TriviaQA, which indicates that both models perform well at question and answer memorization. In contrast, there is a pronounced performance drop for both models on the "Answer Overlap Only" category, where a certain amount of relevance inference capability is required to succeed. Lastly, both extractive and generative models suffer significant performance degradation on the "No Overlap" category, which most directly evaluates generalization. Nevertheless, the extractive model demonstrates better QA generalization by achieving better performance on the "No Overlap" category on both datasets.

Error Analysis. Here, we analyze the prediction errors made by the extractive and generative models based on automatic evaluation. For this study, we use the dev set introduced by the EfficientQA competition, which is constructed in the same way as the original NaturalQuestions dataset. Specifically, we group prediction errors into three categories: 1) common prediction errors made by both the extractive and generative models, 2) prediction errors made only by the extractive model, and 3) prediction errors made only by the generative model. In the following, we first carry out a manual inspection of the common errors and then compare the errors specific to the extractive and generative models. First of all, according to the automatic evaluation, 29% of the consensus predictions made by both the extractive and generative models are errors. Based on 30 randomly selected examples, we find that around 30% of those predictions are actually valid answers, as shown in the top part of Table 6. In addition to predictions that are answers at a different granularity or semantically equivalent ones, some of those prediction errors are likely caused by ambiguity in the questions. In the example given in Table 6, depending on the intended specificity, the model prediction is also a valid answer. This highlights a limitation of the current evaluation metric, which does not accurately estimate the capabilities of existing open-domain QA systems. As shown in the bottom part of Table 6, most representative errors are due to confusion between related concepts, entities, or events that are frequently mentioned together with the corresponding gold answers.
Next, all questions from the dev set are categorized by their WH question word, i.e., what, which, when, who, how, and where. We then report, for both extractive and generative models, the relative performance change of each WH category with respect to the corresponding overall prediction accuracy in Figure 2. First, both extractive and generative models achieve the best performance on entity-related who questions, which is likely the result of the high ratio of questions and answers of this type seen during training. In contrast, the answers to what questions can play a much richer syntactic role in context, making it more difficult for both extractive and generative models to perform well.

Related Work
Open-domain Question Answering. Open-domain QA requires a system to answer questions based on evidence retrieved from a large corpus such as Wikipedia (Voorhees, 1999; Chen et al., 2017). Recent progress has been made towards improving evidence retrieval through both sparse vector models like TF-IDF or BM25 (Chen et al., 2017; Min et al., 2019) and dense vector models based on BERT (Lee et al., 2019; Karpukhin et al., 2020; Guu et al., 2020). Generally, dense representations complement sparse vector methods for passage retrieval, as they can assign high similarity to semantically related text pairs even without exact lexical overlap. Unlike most existing work, which focuses on a pipeline model, Lee et al. (2019) propose a pre-training objective for jointly training both the retrieval encoder and the reader. Their approach outperforms most recent pipeline methods on multiple open-domain QA datasets, and is further extended by Guu et al. (2020), who asynchronously re-index the passages during training. Instead, in this work, we focus on developing a hybrid approach for open-domain QA.
By simply combining answer predictions from our improved extractive and generative models, our UnitedQA achieves significant improvements over recent state-of-the-art models.
Reading Comprehension with Noisy Labels. There has been a line of work on improving distantly-supervised reading comprehension models by developing learning methods and model architectures that better handle noisy labels. Most of this work focuses on document-level QA, where all paragraphs share the same document context. Clark and Gardner (2018) propose a paragraph-pair ranking objective for learning with multiple paragraphs so that the model can distinguish relevant paragraphs from irrelevant ones. In (Lin et al., 2018), a coarse-to-fine model is proposed to handle label noise by aggregating information from relevant paragraphs and then extracting answers from selected ones. Min et al. (2019) propose a hard EM learning scheme where only the passage-level loss is considered for document-level QA. More recently, different probabilistic assumptions with corresponding training and inference methods are examined in (Cheng et al., 2020a), again for document-level QA with distant supervision. In our work, we further combine the multi-objective formulation proposed in (Cheng et al., 2020a) with the hard EM learning of Min et al. (2019), and apply it to the open-domain setting, where the support passages are retrieved from a large corpus and typically come from different documents.

Conclusion
In this study, we propose a hybrid model for opendomain QA, called UnitedQA, which combines the strengths of extractive and generative readers. We validate the effectiveness of UnitedQA on two popular open-domain QA benchmarks, NaturalQuestions and TriviaQA. Our results show that the proposed UnitedQA model significantly outperforms single extractive and generative models as well as their corresponding homogeneous ensembles, and creates new state-of-the-art on both benchmarks. We also perform a comprehensive empirical study to investigate the relative contributions of different components of our model and the techniques we use to improve the readers.