Multi-Task Dense Retrieval via Model Uncertainty Fusion for Open-Domain Question Answering

Multi-task dense retrieval models can be used to retrieve documents from a common corpus (e.g., Wikipedia) for different open-domain question-answering (QA) tasks. However, Karpukhin et al. (2020) show that jointly learning different QA tasks with one dense model is not always beneficial due to corpus inconsistency. For example, SQuAD focuses on only a small set of Wikipedia articles, while datasets like NQ and Trivia cover more entries, and joint training on their union can cause performance degradation. To solve this problem, we propose to train individual dense passage retrievers (DPR) for different tasks and aggregate their predictions at test time, using uncertainty estimation as weights to indicate how likely a specific query belongs to each expert's expertise. Our method reaches state-of-the-art performance on 5 benchmark QA datasets, with up to a 10% improvement in top-100 accuracy compared to a joint-training multi-task DPR on SQuAD. We also show that our method handles corpus inconsistency better than the joint-training DPR on a mixed subset of different QA datasets. Code and data are available at https://github.com/alexlimh/DPR_MUF.


Introduction
Open-domain question answering requires finding answers to given questions from a large collection of documents (Voorhees and Tice, 2000). Therefore, a first-stage retrieval component that selects a set of potentially answer-containing documents is often involved before the second-stage reading comprehension model (Chen et al., 2017). Traditional term-matching methods such as tf-idf and BM25 (Robertson and Zaragoza, 2009), which leverage an inverted index to construct sparse textual representations, have established strong baselines for first-stage retrieval.

* Correspondence to: Minghan Li <alexlimh23@gmail.com>

Table 1: Comparison of joint training and model fusion in terms of task flexibility, training speed, inference speed, and storage space.

Recently, neural-based dense retrievers (Seo et al., 2019; Guu et al., 2020) have been shown to achieve better performance in open-domain question answering, but they often fail to generalize outside of the training data distribution. A standard solution known as joint training, which learns a single dense retriever on the union of different datasets (Maillard et al., 2021; Wang et al., 2021), addresses this to a certain extent. However, Karpukhin et al. (2020) show that data from different tasks might conflict with each other, where joint training on their union can cause performance degradation. For example, SQuAD (Rajpurkar et al., 2016) focuses on only a small set of Wikipedia documents, while datasets like NQ (Kwiatkowski et al., 2019) and Trivia (Joshi et al., 2017) cover more entries. Therefore, careful data re-balancing and hyperparameter search are required during training.
In this paper, we propose another solution to multi-task learning, which trains multiple DPR experts on different datasets separately and aggregates their predictions at test time. This is also known as model fusion (Hoang et al., 2019), which differs from a mixture of experts (Shazeer et al., 2017) in that it does not need to learn a gating function on the joint dataset. Model fusion makes it easier to incorporate new data for continual learning without introducing conflicts, as each expert trains on an independent task. In addition, these experts can be trained in parallel to speed up the learning process.
However, the challenge now becomes how to aggregate different experts' predictions without hurting their in-distribution performance. We propose model uncertainty estimation (Loquercio et al., 2020) as a dynamic weighting scheme, which helps each expert identify whether a question belongs to its expertise. Intuitively, a model that overfits to a training distribution should be more uncertain about out-of-domain data than in-domain data. For example, the question "How many episodes in Season 2 of Breaking Bad?" might get a high uncertainty score from an expert trained on a medical QA dataset.
In practice, we leverage ensemble uncertainty, where we train an ensemble of small neural networks for each pre-trained DPR expert (Lakshminarayanan et al., 2017). Specifically, we represent model uncertainty as the mutual information (Poole et al., 2019) between the ensemble's predictions and its parameters. For each question, we retrieve a set of top-k documents using each DPR expert and then use its corresponding ensemble to compute the uncertainty score of the question. Finally, we aggregate all the experts' predictions into a normalized weighted sum and rerank the retrieved documents. Fig. 1 demonstrates a simplified pipeline of our algorithm, and Tbl. 1 compares the joint-training and model fusion solutions.
Extensive experiments show that our final fusion model not only outperforms individual specialists on 5 open-domain QA datasets but also surpasses the joint-training, multi-task DPR model, with up to a 10% improvement in top-100 accuracy on SQuAD. Finally, our method manages to handle corpus conflicts on a mixed subset of different QA tasks, even outperforming an oracle model that uses Bayesian optimization (Frazier, 2018).

Related Work
Retrieval and QA Traditional retrieval methods such as tf-idf and BM25 generate sparse, high-dimensional vectors (Robertson and Zaragoza, 2009) and have been proven effective in various QA tasks (Chen et al., 2017; Yang et al., 2019; Min et al., 2019).

Figure 1: An illustration of model uncertainty fusion of 3 DPR experts, each with an ensemble of 3 fully-connected neural networks. Given a query, each DPR expert first retrieves top-k documents, followed by uncertainty estimation using the corresponding ensemble. The weighted sum of predictions is then used to rerank the union of the retrieved documents.

Recently, neural retrievers have made huge progress in open-domain question answering (Seo et al., 2019; Guu et al., 2020). In particular, the dense passage retriever (DPR) (Karpukhin et al., 2020) is a popular approach that learns separate question and document representations from task-specific training data. Lewis et al. (2020b) and Izacard and Grave (2021) further show that question generation using models such as BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020) can be incorporated into DPR's training. Multi-task DPR (Wang et al., 2021) trains jointly on an extensive selection of retrieval datasets, which leads to better performance on downstream knowledge-intensive tasks.
Uncertainty Estimation Uncertainty estimation has wide applications in areas such as building safe AI systems (e.g., anomaly detection) (Amodei et al., 2016), especially for systems that include neural networks. Bayesian Neural Networks (BNNs) use probability distributions to represent the parameters of a neural net (MacKay, 1992; Neal, 2012). Despite their theoretical appeal, BNNs have difficulty scaling to large numbers of parameters and data points, and methods such as MCMC (Neal, 2012) only work well in small-scale settings. To adapt to modern networks' size, Gal and Ghahramani (2016) propose Monte Carlo dropout, which estimates model uncertainty by applying Dropout (Srivastava et al., 2014) at test time.
Another simple way to estimate uncertainty is ensembling, which aggregates the predictions of individual ensemble members; different weight initializations, data sampling, and regularization schemes are applied to encourage diversity in the ensemble (Lakshminarayanan et al., 2017; Snoek et al., 2019; Gustafsson et al., 2020; Pearce et al., 2020). Despite its simplicity, the ensembling approach scales well to large neural networks and massive datasets while providing trustworthy uncertainty estimation.

Dense Passage Retrieval
Retrieval/Inference Given a collection of documents {d_1, d_2, ..., d_n} and a question-answering task, DPR (Karpukhin et al., 2020) encodes questions and documents using a bi-encoder structure, where the encoders f_Q(·) and f_D(·) are independent functions that map a question/document into a low-dimensional, real-valued vector. Specifically, the similarity s between a question q and a document d is defined by the dot product between their encoded vectors v_q = f_Q(q) and v_d = f_D(d):

s(q, d) = v_q^T v_d,    (1)

which is used as the ranking score. Both f_Q and f_D use BERT (Devlin et al., 2019) as the backbone model and the [CLS] vector as the output representation.
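The bi-encoder scoring of Eq. (1) can be sketched in a few lines of NumPy. Here `encode_stub` is a hypothetical stand-in for the BERT-based encoders f_Q and f_D, which in the real system output [CLS] vectors:

```python
import numpy as np

def encode_stub(texts, dim=8, seed=0):
    # Hypothetical stand-in for the BERT encoders f_Q / f_D: the real
    # system maps each text to its [CLS] vector; here we use random vectors.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(texts), dim))

def similarity(v_q, v_d):
    # Eq. (1): the ranking score is the dot product of question and
    # document vectors, computed here for all question/document pairs.
    return v_q @ v_d.T
```

Retrieval then amounts to taking the top-k documents by each row of the score matrix.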
Training As pointed out by Karpukhin et al. (2020), training the encoders such that Eq. (1) becomes a good ranking function is essentially a metric learning problem (Kulis, 2012). Formally, let D be the random variable (r.v.) of documents, Q be the r.v. of questions, and C be the r.v. of the set of retrieved documents. Given a specific question q, let d+ be the positive context that contains answers for q. The probability of retrieving d+ from the candidate set C is modeled with a softmax over the similarity scores:

p(d+ | q, C) = exp(λ s(q, d+)) / Σ_{d ∈ C} exp(λ s(q, d)),    (2)

where λ is the inverse temperature coefficient that controls the sharpness of the softmax distribution, which is often set to 1 during training. The negative log-likelihood objective based on Eq. (2) is:

L = -log p(d+ | q, C).    (3)

The single DPR expert and the joint-training DPR model follow the same training scheme. In the next section, we describe how the second option, model uncertainty fusion, is implemented.
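Eqs. (2) and (3) amount to a temperature-scaled softmax cross-entropy over the candidate set, sketched here in NumPy (a minimal sketch, not the authors' training code):

```python
import numpy as np

def nll_loss(scores, pos_idx, lam=1.0):
    # scores: dot-product scores s(q, d) for every document in the
    # candidate set C; pos_idx: index of the positive context d+.
    # Eq. (2): p(d | q, C) = softmax(lam * s); lam is the inverse temperature.
    z = lam * np.asarray(scores, dtype=float)
    z -= z.max()                               # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    # Eq. (3): negative log-likelihood of the positive context.
    return -log_probs[pos_idx]
```

In training, `scores` would contain the positive context plus in-batch and hard negatives.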

Multi-Task Model Fusion
Given m question-answering tasks and m independent experts, the goal of multi-task model fusion is to find the optimal set of weights {w^(i)}_{i=1}^m to combine all experts' predictions for each question. We use DPR as the expert model.

Ensemble Uncertainty Estimation
There are mainly two types of uncertainty: model uncertainty and data uncertainty (Malinin and Gales, 2018). Data uncertainty is often caused by mislabelling or missing features, while model uncertainty measures the confidence of the model's predictions given the training data, which is often used to identify whether a sample is within the training domain. Therefore, we use model uncertainty for weighting the experts' predictions, such that we know whether a question belongs to an expert's expertise.
As mentioned in Section 2, there are many ways to represent model uncertainty. In this work, we consider ensemble uncertainty due to its effectiveness and simplicity. The intuition is simple: an ensemble trained on a single domain will "agree" on similar predictions for in-domain samples and "disagree" otherwise. The disagreement becomes more obvious if the functional space of the ensemble is complex enough, e.g., the space of neural networks. To quantify such uncertainty or "disagreement", we use the Mutual Information (MI) (Poole et al., 2019) between the ensemble's predictions and its parameters as a proxy. For each DPR expert, we build an ensemble of m classifiers, each modeling a softmax distribution over the retrieved documents as in Eq. (2), where Θ denotes the r.v. of the ensemble parameters, θ_i denotes the parameters of the i-th ensemble member, and C_DPR denotes the collection of contexts retrieved by the DPR expert. For simplicity, we use p(D | θ_i, q, C_DPR) as shorthand for this distribution. The mutual information I between D and Θ given a question q and a collection of contexts C_DPR is:

I(D; Θ | q, C_DPR) = H(E_Θ[p(D | Θ, q, C_DPR)]) - E_Θ[H(p(D | Θ, q, C_DPR))]
                   ≈ H((1/m) Σ_i p(D | θ_i, q, C_DPR)) - (1/m) Σ_i H(p(D | θ_i, q, C_DPR)),    (4)

where H(·) denotes the entropy operator, E[·] denotes the expectation operator, and the expectations over Θ are approximated by Monte-Carlo simulation over the m ensemble members.
The approximated mutual information is upper-bounded by the log of the number of ensemble members m:

I(D; Θ | q, C_DPR) ≤ H(Θ) ≤ log m,

which gives us a bounded uncertainty estimate for a given domain. We normalize the mutual information and transform it into a confidence score w for weighting the DPR expert's prediction for q:

w = 1 - I(D; Θ | q, C_DPR) / log m.    (5)
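Eqs. (4) and (5) can be computed directly from the members' softmax outputs over the retrieved documents. The sketch below assumes each member's distribution is given as a row of a NumPy array:

```python
import numpy as np

def entropy(p, eps=1e-12):
    # H(p), with a small epsilon to avoid log(0).
    return -(p * np.log(p + eps)).sum(axis=-1)

def ensemble_confidence(member_probs):
    # member_probs: (m, k) array; row i is p(D | theta_i, q, C_DPR), the
    # i-th member's softmax over the top-k retrieved documents.
    m = member_probs.shape[0]
    # Eq. (4): I(D; Theta) ~= H(mean prediction) - mean of member entropies.
    mutual_info = entropy(member_probs.mean(axis=0)) - entropy(member_probs).mean()
    # Eq. (5): normalize by the upper bound log m and flip into a confidence.
    return 1.0 - max(mutual_info, 0.0) / np.log(m)
```

Full agreement among the members yields a confidence of 1; members concentrating on entirely different documents yield a confidence near 0.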

Model Uncertainty Fusion
Given m DPR experts trained on m different tasks separately, we first encode the questions and documents into dense vectors {v_q^(i), v_d^(i)} for all m experts, where the superscript denotes the expert's id. We then build an ensemble of small neural networks for each expert, each taking the corresponding expert's dense representations as input. During inference, given a new question q, we first retrieve m sets of top-k documents {C^(i)}_{i=1}^m, one per expert. We then calculate the weights {w^(i)}_{i=1}^m for the question according to Eq. (5) using each expert's ensemble. Finally, we re-rank the union of the m sets of top-k documents using the uncertainty-weighted sum of each expert's scores. The final score S(q, d_j) of a document d_j given question q is:

S(q, d_j) = Σ_{i=1}^m w^(i) s^(i)(q, d_j),    (6)

where s^(i) is the i-th expert's similarity score from Eq. (1). If we do not have a score from the i-th expert for a document d, we use the minimum of {s^(i)_j}_{j=1}^k as that expert's ranking score for d. Fig. 1 visualizes this retrieval fusion process during inference.
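Putting Eq. (6) and the min-score fallback together, the fusion step can be sketched as follows (doc ids and scores are illustrative):

```python
def fuse_rankings(expert_scores, weights):
    # expert_scores: one dict {doc_id: s_i(q, d)} per expert, holding that
    # expert's scores for its own top-k retrieved documents.
    # weights: per-expert confidences w_i from Eq. (5).
    union = set().union(*(scores.keys() for scores in expert_scores))
    fused = {}
    for doc in union:
        total = 0.0
        for w, scores in zip(weights, expert_scores):
            # Eq. (6), with the expert's minimum retrieved score as the
            # fallback when the document is missing from its top-k list.
            total += w * scores.get(doc, min(scores.values()))
        fused[doc] = total
    # Re-rank the union of retrieved documents by fused score.
    return sorted(fused, key=fused.get, reverse=True)
```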

Uncertainty Calibration
Despite its simplicity and effectiveness, one drawback of ensemble uncertainty is that it has no closed-form expression, and the predictions of different ensembles might have different ranges (Pearce et al., 2020). Therefore, the ensemble uncertainty needs to be calibrated before fusion, such that the confidence of an expert matches its prediction accuracy. We use the Expected Calibration Error (ECE) (Guo et al., 2017) as the metric, and search for the best inverse temperature in Eq. (3) on the dev set of each expert to minimize the ECE. As ECE is defined in terms of "confidence", which corresponds to w in Eq. (5), we use "confidence" instead of "uncertainty" in the following. We partition the samples in the dev set into T equally-spaced bins and take the weighted average of the confidence-accuracy difference:

ECE = Σ_{i=1}^T (|B_i| / N) |acc(B_i) - conf(B_i)|,    (7)

where B_i is the i-th bin and N is the number of samples. Functions conf(B_i) and acc(B_i) are the average confidence and top-1 accuracy within the i-th bin, respectively. Each confidence score is computed by an ensemble according to Eq. (5).
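Eq. (7) can be sketched as follows, binning dev-set questions by their confidence score and comparing average confidence against top-1 accuracy within each bin:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: per-question confidence w from Eq. (5); correct: whether
    # the top-1 retrieved document was a positive context.
    conf = np.asarray(confidences, dtype=float)
    acc = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Half-open bins; the last bin also includes conf == 1.0.
        mask = (conf >= lo) & (conf < hi)
        if i == n_bins - 1:
            mask |= conf == hi
        if mask.any():
            # Eq. (7): |B_i|/N * |acc(B_i) - conf(B_i)|.
            ece += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return ece
```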

Experimental Setup
We follow the DPR paper (Karpukhin et al., 2020) to train and evaluate our dense retrievers. We replicate their results on all benchmark datasets, with a maximum score difference of 1% between our numbers and theirs. This work focuses only on retrieval accuracy, as we only improve the retriever. We also perform a sensitivity analysis for the ensemble and visualize the uncertainty of our fusion approach. More details are provided in Appendix A.

Models and Training
We first train independent DPR models on the training sets of NQ, TriviaQA, WQ, CuratedTREC, and SQuAD-1.1 separately, following Karpukhin et al. (2020). We then encode the training sets into dense vectors as the input to the ensembles. We train an ensemble of 20 2-layer fully-connected neural networks with 512 units for 100 epochs. We optimize the objective function in Eq. (3) with a learning rate of 2e-05 using Adam (Kingma and Ba, 2015). We use different sub-batches and weight initializations to train each ensemble member to encourage diversity. The rest of the hyperparameter settings remain the same as in Karpukhin et al. (2020).
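The diversity recipe (per-member weight initialization and per-member sub-batches) can be sketched as below. This is a simplified stand-in, not the authors' training code: we use scikit-learn's `MLPClassifier` in place of the paper's 2-layer, 512-unit networks trained with Eq. (3), and `X`, `y` stand in for the pre-computed dense representations and their labels:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_ensemble(X, y, n_members=20, hidden=(512, 512), seed=0):
    # Each ensemble member gets its own weight initialization (random_state)
    # and its own bootstrap sub-batch of the data to encourage diversity.
    rng = np.random.default_rng(seed)
    members = []
    for i in range(n_members):
        idx = rng.choice(len(X), size=len(X), replace=True)  # sub-batch
        net = MLPClassifier(hidden_layer_sizes=hidden, max_iter=100,
                            random_state=i)
        net.fit(X[idx], y[idx])
        members.append(net)
    return members
```

At inference, each member's `predict_proba` output plays the role of p(D | θ_i, q, C_DPR).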
Inference: Retrieval Fusion Given a question q during inference, a set of top-k documents is first retrieved by each DPR expert. For each expert, we use the corresponding ensemble to score the expert's retrieved documents and obtain a collection of dot-product scores. We then apply a softmax activation to the dot products, yielding a collection of distributions over the retrieved documents. We calculate the normalized mutual information between the ensemble's predictions and its parameters as in Eq. (4) and Eq. (5), and use it as the expert's weight for the question, as described in Eq. (6).
In addition, we calibrate each ensemble's uncertainty prediction individually using the expected calibration error (ECE) (Guo et al., 2017) according to Eq. (7), as ensembles from different domains might have different ranges of uncertainty. We find the lowest ECE score is achieved with the inverse temperature λ of the softmax activation in Eq. (3) set to 1e-3. Finally, we normalize the calibrated uncertainty and re-rank the union of retrieved documents using the uncertainty-weighted sum of experts' scores. For documents that lack some experts' scores, we use the minimum of the missing expert's predictions as the ranking score.
6 Results and Analysis

Benchmark Dataset Retrieval
Tbl. 2 shows retrieval performance using different types of DPR models on 5 benchmark datasets. We briefly describe each configuration below.

DPR-Single-domain: A single DPR model trained and tested on the same domain.
DPR-Single-worst: A single DPR model trained on one domain and transferred zero-shot to the target test set, achieving the worst performance among all experts.

DPR-Multi (w/o SQuAD):
A multi-task DPR model trained on the joint dataset of {NQ, Trivia, WQ, and Trec} without the SQuAD dataset, as implemented in Karpukhin et al. (2020).

DPR-MUF:
Our model uncertainty fusion method using experts from {NQ, Trivia, WQ, Trec, and SQuAD}, which is our main approach.

DPR-MUF (w/o SQuAD):
Our model uncertainty fusion method using experts from {NQ, Trivia, WQ, and Trec} without SQuAD to align with DPR-Multi (w/o SQuAD).

DPR-MUF (w/o domain):
Our model uncertainty fusion method using all experts except DPR-Single-domain, to investigate out-of-domain generalization.
We can see from Tbl. 2 that our model uncertainty fusion method (DPR-MUF) achieves the best top-20/100 accuracy on almost all benchmark QA datasets except Trivia. The original multi-task DPR model does not include SQuAD for joint training because "SQuAD is limited to a small set of Wikipedia documents and thus introduces unwanted bias" (Karpukhin et al., 2020). In comparison, our DPR-MUF, which includes the SQuAD dataset, significantly improves performance on SQuAD as well as on the other datasets. In addition, we find that our DPR-MUF (w/o SQuAD) not only manages to beat the joint-training DPR trained on {NQ, Trivia, WQ, and Trec}, but also outperforms it on SQuAD by a large margin (10% in top-100 accuracy). We also test the fusion of experts without the one trained on the target domain, i.e., DPR-MUF (w/o domain), whose performance remains at a reasonable level.
One interesting result in the experiments is that DPR-MUF without the CuratedTrec/WQ expert outperforms the CuratedTrec/WQ experts on their own domain test sets. We suspect that the CuratedTrec and WQ datasets are too small and might be covered by other datasets. Therefore, it is not surprising that the CuratedTrec and WQ experts trained on small data regimes fail to outperform the larger expert union.

Mixed-Dataset Retrieval
In real-world applications, the retriever often needs to deal with questions from different sources instead of a single task. To test the ability to retrieve for out-of-distribution questions, we evenly sample 5 subsets of 3,000 test questions each from 4 benchmark datasets (NQ, Trivia, WQ, SQuAD). We average top-20/100 accuracy over the 5 subsets as the final accuracy. In addition, we design two oracle models which serve as references:

DPR-Oracle-Indicator: A mixture of experts that knows which domain each question comes from and uses the corresponding expert for retrieval.
DPR-Oracle-Bayesian: A mixture of experts that uses Bayesian optimization (Frazier, 2018) to search for the weights. We initialize the weights with the indicator function and use scikit-optimize to search for the optimal weights for 50 iterations per question. This process is not guaranteed to find the best set of weights, as Bayesian optimization does not always find the global optimum.
Although it is not the exact oracle, this is the best model we could find as exhaustive search is impractical due to its exponential time complexity.
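For intuition, the per-question weight search can be sketched with a simple random search over the weight simplex. This is a hypothetical stand-in for scikit-optimize's Gaussian-process search, not the oracle's actual implementation, and `score_fn` is an assumed callback measuring retrieval quality under a candidate set of fused weights:

```python
import numpy as np

def search_weights(score_fn, n_experts, n_calls=50, seed=0):
    # Stand-in for the Bayesian-optimization oracle: sample candidate
    # expert weights from the simplex and keep the best-scoring set.
    rng = np.random.default_rng(seed)
    best_w, best_score = None, -np.inf
    for _ in range(n_calls):
        w = rng.dirichlet(np.ones(n_experts))  # weights sum to 1
        score = score_fn(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

An exhaustive search over weight combinations would be exponential in m, which is why a budgeted search (50 calls per question) is used instead.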
Tbl. 3 shows that all single retrievers suffer severe performance degradation on the randomly mixed dataset, which is expected as they only specialize in their own domains. In contrast, the two multi-task models, DPR-Multi (w/o SQuAD) and DPR-MUF (w/o SQuAD), manage to maintain high scores on the random mixes, approaching the performance of the two oracle models. Moreover, DPR-MUF even outperforms both the indicator oracle and the Bayesian oracle, suggesting the benefit of using uncertainty to fuse the predictions of multiple experts from different domains.

DPR-BM25 Hybrid Retrieval
Karpukhin et al. (2020) show that DPR can be combined with BM25 to further improve retrieval performance. Follow-up work further fine-tunes the parameters of BM25 and obtains better accuracy using the Pyserini IR toolkit. We follow that experimental setting, where we re-rank the union of the top-1000 passages retrieved by DPR and BM25 separately, using the weighted sum of the two scores as the ranking value. We search for the optimal weights for BM25 and DPR on the dev set of each QA dataset. Tbl. 4 shows the top-k accuracy of hybrid retrievers combining different DPR models with BM25. Our model uncertainty fusion method manages to outperform single DPR experts and the multi-task, joint-training DPR on all benchmark QA datasets. Specifically, DPR-MUF (w/o SQuAD) has the best performance on NQ and Trivia, while DPR-MUF, which includes all experts, has the best performance on WQ, CuratedTrec, and SQuAD. We conjecture this is because NQ and Trivia are much larger, and therefore the SQuAD expert might conflict more with BM25.

Ensemble Sensitivity and Latency
In this section, we analyze how sensitive the retrieval performance of the uncertainty fusion method is w.r.t. the ensemble size. Fig. 2 shows the top-100 accuracy and the relative latency for different ensemble sizes. The accuracy increases as the ensemble grows until it reaches 20 members, after which it plateaus or decreases. We conjecture this is because the functional space of the ensemble is not complex enough, as we only use a 2-layer neural network with 512 units as the individual component. Therefore, there are only limited ways for the model to overfit the training sets, resulting in saturated diversity w.r.t. the ensemble size. However, we find that overall these results are good enough while having reasonable latency. The latency (ms/question) of the model is measured relative to a standard DPR model and mainly consists of the ensemble's forward inference time. We evaluate inference speed on a server with an Intel Xeon CPU E5-2699 v4 @ 2.20GHz. In summary, retrieval accuracy is stable w.r.t. the ensemble size, and one can choose the ensemble size to trade off between accuracy and latency for different application scenarios.

Uncertainty Visualization
We visualize the model uncertainty in this section for better understanding. Fig. 3 shows 5 ensemble predictions of the top-20 documents on 4 samples from NQ with different uncertainty scores. Each strip in a subplot represents one ensemble member and all members in the same subplot share the same documents retrieved by the DPR expert on NQ. As we use small inverse temperature λ (1e-3) for the softmax distribution in Eq. (3), the probability mass of each distribution mainly concentrates on the top-1 document, which is the tallest bar in each strip. If the top-1 predictions from different ensemble members overlap at the same document, we say these members "agree" with each other and therefore the overall ensemble has low uncertainty. The overlap is quantified by Eq. (4) in practice. As we can see from Fig. 3, the ensemble has full uncertainty (1.0/1.0) when their top-1 predictions do not overlap at all, and has zero uncertainty when its members' predictions completely overlap. In other cases, the more overlap or "agreement" on the top-1 prediction, the less uncertain the ensemble is.

Space-Speed-Flexibility Trade-off
Despite the promising results shown in the previous sections, the model uncertainty fusion method also has drawbacks in open-domain question answering. Currently, all experts are individually trained in their own domains but share a common corpus. That is, if we have m experts, the index size grows by a factor of m compared to a single multi-task, joint-training model. However, we argue that there is no free lunch, as the joint-training model suffers from other problems such as the data conflicts mentioned before, as well as catastrophic forgetting: if new tasks are added, the joint-training model usually needs to be re-trained on the union of all tasks to maintain performance on previous tasks, while our model only needs to train on the new task's data, and the new expert can be directly added to the current set of models. Therefore, both methods have their pros and cons depending on the application scenario, and it is up to users to weigh the space-speed-flexibility trade-off. For memory and efficiency issues, possible solutions would be either learning a shared, query-agnostic index for all experts or leveraging model compression methods to reduce the size of the expert models.

Conclusions
In this paper, we propose a model fusion approach for multi-task dense retrieval. Instead of training a single DPR model on the union of datasets from different distributions, we leverage model uncertainty to merge different DPR experts' predictions at test time. For each expert, we train an ensemble of small neural networks on top of the pre-trained expert's dense representations and use the mutual information between the ensemble's parameters and predictions as the weight, which can be interpreted as the "disagreement" among the ensemble. We compare our model uncertainty fusion approach with single specialists and the multi-task, joint-training DPR model on 5 benchmark QA datasets, as well as on random mixes of these datasets to test out-of-distribution performance. Extensive experiments show that our method outperforms these approaches in terms of top-20/100 accuracy on most datasets, and it can also be combined with sparse retrieval methods such as BM25 for further gains. Our proposed method is simple to implement and effective, while enjoying the benefits of continual learning, faster training as the experts can be trained in parallel, and the flexibility to combine experts from different domains.
For future research directions, one could leverage model compression techniques to reduce the index size, or knowledge distillation to learn a single student model from the experts. Finally, learning a question-agnostic document index can further save storage space and enhance inference speed for this model fusion method.
Inference During inference, we encode all the passages into dense vectors using the passage encoder and index them using FAISS (Johnson et al., 2021), which is an efficient, open-source library for vector searching and indexing that can scale to millions of vectors.

A.2 Uncertainty Weight Distribution
Section 6 shows that weighting the retrieval results from different experts leads to better generalization. In this section, we inspect the weight distribution over experts given a question, to see whether the fusion weights form a sharp distribution (i.e., mainly using a single expert per question) or a more scattered one (i.e., a rather even mixture of experts). It turns out that both our uncertainty fusion method and the Bayesian oracle from Section 6.2 have rather scattered weights for most questions. Fig. 5 shows the weight distributions over experts for some example questions from the NQ, Trivia, SQuAD, and WQ datasets. The distribution of the Bayesian oracle differs slightly from that of the uncertainty fusion method, which we conjecture is because we initialize the Bayesian optimization weights with the indicator function for faster search; this yields solutions whose probability mass often concentrates more on the domain's expert.

Figure 5: Weight distributions of the DPR-MUF model and the Bayesian oracle on some example queries from the NQ, Trivia, SQuAD, and WQ datasets. Both methods include independent experts trained on {NQ, Trivia, SQuAD, WQ, and Trec}. Despite differences in their weight distributions, both methods have scattered distributions over the experts' predictions, which shows that fusing different experts' retrieval results indeed helps with generalization.