Task-Aware Specialization for Efficient and Robust Dense Retrieval for Open-Domain Question Answering

Given their effectiveness on knowledge-intensive natural language processing tasks, dense retrieval models have become increasingly popular. Specifically, the de-facto architecture for open-domain question answering uses two isomorphic encoders that are initialized from the same pretrained model but separately parameterized for questions and passages. This bi-encoder architecture is parameter-inefficient in that there is no parameter sharing between encoders. Further, recent studies show that such dense retrievers underperform BM25 in various settings. We thus propose a new architecture, Task-Aware Specialization for dEnse Retrieval (TASER), which enables parameter sharing by interleaving shared and specialized blocks in a single encoder. Our experiments on five question answering datasets show that TASER can achieve superior accuracy, surpassing BM25, while using about 60% of the parameters of bi-encoder dense retrievers. In out-of-domain evaluations, TASER is also empirically more robust than bi-encoder dense retrievers. Our code is available at https://github.com/microsoft/taser.


Introduction
Empowered by learnable neural representations built upon pretrained language models, the dense retrieval framework has become increasingly popular for fetching external knowledge in various natural language processing tasks (Guu et al., 2020, inter alia). For open-domain question answering (ODQA), the de-facto dense retriever is the bi-encoder architecture (Karpukhin et al., 2020), consisting of a question encoder and a passage encoder. Typically, the two encoders are isomorphic but separately parameterized, as they are initialized from the same pretrained model and then fine-tuned independently on the task.
Despite its popularity, this bi-encoder architecture with fully decoupled parameterization has some open issues. First, from the efficiency perspective, the bi-encoder parameterization results in a scaling bottleneck for both training and inference. Second, empirical results from recent studies show that such bi-encoder dense retrievers underperform their sparse counterpart BM25 (Robertson and Walker, 1994) in various settings. For example, prior analyses suggest the inferior performance on SQuAD (Rajpurkar et al., 2016) is partially due to the high lexical overlap between questions and passages, which gives BM25 a clear advantage. Sciavolino et al. (2021) also find that bi-encoder dense retrievers are more sensitive to distribution shift than BM25, resulting in poor generalization on questions with rare entities.
In this paper, we develop Task-Aware Specialization for dEnse Retrieval, TASER, as a more parameter-efficient and robust architecture. Instead of using two isomorphic and fully decoupled Transformer (Vaswani et al., 2017) encoders, TASER interleaves shared encoder blocks with specialized ones in a single encoder, motivated by the recent success of Mixture-of-Experts (MoE) in scaling up Transformers (Fedus et al., 2021). For a shared encoder block, the entire network is used to encode both questions and passages. For a specialized encoder block, some sub-networks are task-specific and activated only for certain encoding tasks. To choose among task-specific sub-networks, TASER uses a task-dependent routing mechanism that can be either deterministic or learned in an end-to-end fashion.
We carry out both in-domain and out-of-domain evaluations of TASER. For the in-domain evaluation, we use five popular ODQA datasets. Our best model outperforms BM25 and existing bi-encoder dense retrievers while using far fewer parameters. It is worth noting that TASER can effectively close the performance gap on SQuAD between dense retrievers and BM25. One interesting finding from our experiments is that excluding SQuAD from the multi-set training is unnecessary, a suggestion made in Karpukhin et al. (2020) and adopted by most follow-up work. Our out-of-domain evaluation experiments use EntityQuestions (Sciavolino et al., 2021) and BEIR (Thakur et al., 2021). Consistent improvements over the doubly parameterized bi-encoder dense retriever are observed in these zero-shot evaluations as well.

Figure 1: The dense retrieval architectures using a bi-encoder (left) and task-aware specialization (right). The question and passage Transformer blocks in the bi-encoder are isomorphic to the shared Transformer blocks in TASER. A specialized Transformer block consists of several expert FFN sub-layers and a router. The router is used to choose among expert FFN sub-layers based on the input. Only the deterministic routing Det-R is shown in the figure, which has two expert FFN sub-layers (a Q-FFN for questions and a P-FFN for passages).
The main contributions of this work are as follows. First, we develop TASER as a novel parameter-efficient dense retrieval framework for ODQA and study both deterministic and learned specialization strategies. Using five popular ODQA datasets, we demonstrate that TASER can significantly improve the parameter efficiency of dense retrieval models. Similar improvements over bi-encoder dense retrievers are also observed in out-of-domain evaluations. Finally, TASER achieves new state-of-the-art results for dense retrievers, together with a roughly 40% reduction in model size.

Background
In this section, we provide the necessary background on the bi-encoder architecture for dense passage retrieval (Karpukhin et al., 2020), which is widely used in ODQA and is the primary baseline model in our experiments.
As illustrated in the left part of Figure 1, the bi-encoder architecture consists of a question encoder and a passage encoder, both of which are usually Transformer encoders (Vaswani et al., 2017). A Transformer encoder is built up from a stack of Transformer blocks. Each block consists of a multi-head self-attention (MHA) sub-layer and a feed-forward network (FFN) sub-layer, with residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) applied to both sub-layers. Given an input vector h ∈ R^d, the FFN sub-layer produces an output vector as follows:

FFN(h) = W_2 max(0, W_1 h + b_1) + b_2,    (1)

where W_1 ∈ R^{m×d}, W_2 ∈ R^{d×m}, b_1 ∈ R^m, and b_2 ∈ R^d are learnable parameters. For a sequence of N tokens, each Transformer block produces N corresponding vectors, together with a vector for the special prefix token [CLS], which can be used as the representation of the sequence. We refer readers to Vaswani et al. (2017) for other details about the Transformer. Typically, the question encoder and passage encoder are initialized from a pretrained language model such as BERT (Devlin et al., 2019), but they are parameterized separately, i.e., their parameters will differ after training.
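As a concrete illustration, Equation 1 can be written in a few lines of PyTorch. This is a minimal sketch; the class name and the dimension defaults (chosen to match BERT-base) are ours, not part of any released implementation.

import torch
import torch.nn as nn

class FFN(nn.Module):
    # Position-wise FFN sub-layer of Equation 1: FFN(h) = W2 max(0, W1 h + b1) + b2.
    def __init__(self, d: int = 768, m: int = 3072):
        super().__init__()
        self.w1 = nn.Linear(d, m)  # W1 in R^{m x d} and b1 in R^m
        self.w2 = nn.Linear(m, d)  # W2 in R^{d x m} and b2 in R^d

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.relu(self.w1(h)))  # max(0, .) is the ReLU activation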
The bi-encoder model independently encodes questions and passages into d-dimensional vectors, using the final output vectors for [CLS] from the corresponding encoders, denoted as q ∈ R^d and p ∈ R^d, respectively. The relevance between a question and a passage can then be measured in the vector space using the dot product, i.e.,

sim(q, p) = q^T p.    (2)

During training, the model is optimized with a contrastive learning objective,

L_sim = −log [ exp(sim(q, p⁺)) / ( exp(sim(q, p⁺)) + Σ_{p′∈P} exp(sim(q, p′)) ) ],    (3)

where p⁺ is the relevant (positive) passage for the given question, and P is the set of irrelevant (negative) passages. During inference, all passages are pre-converted into vectors using the passage encoder. Then, each incoming question is encoded using the question encoder, and a top-K list of most relevant passages is retrieved based on their relevance scores with respect to the question.

Although the bi-encoder dense retrieval architecture has achieved impressive results on ODQA, little work has attempted to improve its parameter efficiency. Further, compared to the sparse vector space model BM25 (Robertson and Walker, 1994), such bi-encoder dense retrievers sometimes suffer from inferior generalization performance, e.g., when the training data is extremely biased (Lebret et al., 2016) or when there is a distribution shift (Sciavolino et al., 2021). In this paper, we conjecture that the unstable generalization performance is partially related to the sheer number of learnable parameters in the model. Therefore, we develop a task-aware specialization architecture for dense retrieval with parameter sharing between the question and passage encoders, which turns out to improve both parameter efficiency and generalization performance.
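Equations 2 and 3 amount to a cross-entropy loss over dot-product scores. The following is a minimal PyTorch sketch (tensor names are ours; the in-batch negative sharing used in practice is omitted for clarity):

import torch
import torch.nn.functional as F

def contrastive_loss(q, p_pos, p_neg):
    # q: (B, d) question vectors; p_pos: (B, d) positive passage vectors;
    # p_neg: (B, N, d) negative passage vectors for each question.
    pos = (q * p_pos).sum(-1, keepdim=True)     # sim(q, p+) = q^T p+, shape (B, 1)
    neg = torch.einsum("bd,bnd->bn", q, p_neg)  # sim(q, p') for all p' in P, shape (B, N)
    logits = torch.cat([pos, neg], dim=-1)      # the positive sits at index 0
    target = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)      # L_sim of Equation 3, averaged over the batch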

Task-Aware Specialization for Dense Retrieval
In this section, we first describe the model architecture of TASER and different specialization strategies in §3.1. Then, in §3.2, we describe the two training paradigms considered in our experiments, i.e., single-set and multi-set training, following Karpukhin et al. (2020). The details about hard negative mining, an effective technique used by Xiong et al. (2020) and Qu et al. (2021) to improve model performance, are provided in §3.3.

Model Architecture
As shown in the right part of Figure 1, TASER interleaves shared Transformer blocks with specialized ones. The shared Transformer block is identical to the Transformer block used in the bi-encoder architecture, but the entire block is shared between questions and passages. In the specialized block, we apply MoE-style task-aware specialization to the FFN sub-layer, following Fedus et al. (2021). Specifically, the specialized block uses multiple expert FFN sub-layers in parallel, each with its own set of parameters, and a router is used to choose among these expert FFN sub-layers. TASER uses one specialized Transformer block after every T shared Transformer blocks in the stack, starting with a shared one at the bottom. Our preliminary study indicates that the model performance is not sensitive to the choice of T, so we use T = 2 for all experiments in this paper. Also, all specialized Transformer blocks use the same number of expert FFN sub-layers for simplicity.
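To make the specialized block concrete, here is a minimal PyTorch sketch with deterministic routing (Det-R, described below). The class and argument names are ours, and details such as pre- versus post-layer-normalization may differ from the actual implementation.

import torch
import torch.nn as nn

class SpecializedBlock(nn.Module):
    # Transformer block whose FFN sub-layer is replaced by per-task expert FFNs
    # with a deterministic router: a Q-FFN for questions, a P-FFN for passages.
    def __init__(self, d: int = 768, m: int = 3072, n_heads: int = 12):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        ffn = lambda: nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, d))
        self.q_ffn, self.p_ffn = ffn(), ffn()  # one expert per task, separately parameterized

    def forward(self, x: torch.Tensor, is_question: bool) -> torch.Tensor:
        a, _ = self.mha(x, x, x)                            # multi-head self-attention sub-layer
        x = self.ln1(x + a)                                 # residual connection + layer norm
        expert = self.q_ffn if is_question else self.p_ffn  # deterministic routing
        return self.ln2(x + expert(x))

With T = 2, the encoder stack would then alternate two shared blocks with one such specialized block, bottom to top.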
In TASER, the router always routes the input to a single expert FFN sub-layer, following Fedus et al. (2021). We introduce three routing mechanisms: deterministic routing (Det-R), sequence-based routing (Seq-R), and token-based routing (Tok-R). Both Seq-R and Tok-R are learned jointly with the task-specific objective.
For Det-R, only two expert FFN sub-layers are needed for ODQA retrieval, one for questions and one for passages. In this case, the router determines the expert FFN sub-layer based on whether the input is a question or a passage.
For Seq-R and Tok-R, the router uses a parameterized routing function

route(u) = GumbelSoftmax(A^T u + c),    (4)

where GumbelSoftmax (Jang et al., 2016) outputs an I-dimensional one-hot vector based on the linear projection parameterized by A ∈ R^{d×I} and c ∈ R^I, I is the number of expert FFN sub-layers in the specialized Transformer block, and u ∈ R^d is the input of the routing function. Here, the routing function is jointly learned with all other parameters using the discrete reparameterization trick. For Seq-R, routing is performed at the sequence level, and all tokens in a sequence share the same u, which is the FFN input vector h_[CLS] representing the special prefix token [CLS]. For Tok-R, the router independently routes each token, i.e., for the j-th token in the sequence, u is set to the corresponding FFN input vector h_j.

Similar to the bi-encoder architecture, TASER with Det-R is trained using the contrastive learning objective L_sim defined in Equation 3. For Seq-R and Tok-R, to avoid routing all inputs to the same expert FFN sub-layer, we further apply the entropic regularization

L_ent = Σ_{i=1}^{I} P(i) log P(i),    (5)

where P(i) = Softmax(A^T u + c)_i is the probability of the i-th expert FFN sub-layer being selected. Hence, the joint training objective is

L = L_sim + β L_ent,    (6)

where β is a scalar hyperparameter. In our work, we fix β = 0.01.
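A minimal sketch of the learned router and the entropic regularizer follows (names are ours; for Seq-R, u would be h_[CLS], and for Tok-R it would be each token's h_j):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedRouter(nn.Module):
    # Seq-R/Tok-R router: a linear projection followed by Gumbel-Softmax, which
    # emits an I-dimensional one-hot expert choice (Equation 4).
    def __init__(self, d: int = 768, n_experts: int = 2):
        super().__init__()
        self.proj = nn.Linear(d, n_experts)  # plays the role of A and c

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # hard=True returns a one-hot sample; gradients flow through the soft
        # relaxation (the discrete reparameterization trick).
        return F.gumbel_softmax(self.proj(u), tau=1.0, hard=True)

def entropic_regularizer(logits: torch.Tensor) -> torch.Tensor:
    # L_ent = sum_i P(i) log P(i) (Equation 5); minimizing it raises the entropy
    # of the routing distribution, discouraging collapse onto a single expert.
    p = F.softmax(logits, dim=-1)
    return (p * torch.log(p + 1e-9)).sum(-1).mean()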

Single-Set and Multi-Set Training
In this paper, we consider two training paradigms, single-set and multi-set, following Karpukhin et al. (2020). In the single-set training, a model is trained using only a single dataset and evaluated on the same dataset. Such models are dataset-specific and may not perform well on other datasets.
In the multi-set training, a model is trained by combining training data from multiple datasets, with the goal of obtaining a model that works well across the board. Notably, Karpukhin et al. (2020) suggest excluding SQuAD (Rajpurkar et al., 2016) from the training data in the multi-set setting, since it is postulated to negatively impact the overall performance due to its low performance in the single-set setting. This suggestion is adopted by follow-up work such as Ma et al. (2021), Gao and Callan (2022), and Yang et al. (2021). In our experiments, we show that including SQuAD in the multi-set training data does not hurt the performance (§4.3). In fact, the average performance is better due to the improved performance on SQuAD. Therefore, we recommend that future work include SQuAD during multi-set training.

Hard Negative Mining
Recall that in Equation 3 the objective L_sim requires a set of negative passages P for each question. There are several ways to construct P. In Karpukhin et al. (2020), the best setting uses two negative passages per question: one is the top passage retrieved by BM25 that does not contain the answer but matches most question tokens, and the other is chosen from the gold positive passages of other questions in the same mini-batch. Recent work shows that mining harder negative examples with iterative training can lead to better performance (Xiong et al., 2020; Qu et al., 2021). Hence, in this paper, we also train TASER with hard negative mining. Specifically, we first train a TASER model with the same negative passages P_1 as Karpukhin et al. (2020). Then, we use this model to construct P_2 by retrieving the top-100 ranked passages for each question, excluding the gold passage. In the single-set training, we combine P_1 and P_2 to train the final model. In the multi-set training, only P_2 is used to train the final model for efficiency considerations.
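The mining step for P_2 can be summarized by the following sketch. The helper names (model.encode_question, index.search) are hypothetical stand-ins; in practice the index would be a dense nearest-neighbor index (e.g., FAISS) built over pre-encoded passages.

def mine_hard_negatives(model, questions, positives, index, k=100):
    # Build P2: the top-k passages retrieved by the first-round TASER model
    # for each question, excluding the gold (positive) passage.
    p2 = {}
    for q in questions:
        q_vec = model.encode_question(q.text)  # hypothetical helper
        hits = index.search(q_vec, k + 1)      # fetch one extra in case the gold is retrieved
        p2[q.id] = [p for p in hits if p.id != positives[q.id].id][:k]
    return p2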

Experiments
In this section, we first describe the datasets and evaluation metrics used in our experiments in §4.1. In §4.2, we compare different TASER variants. Our main in-domain evaluation results are provided in §4.3. Finally, the out-of-domain evaluation experiments are discussed in §4.4.

Datasets and Evaluation Metrics
For the in-domain evaluation, we use five popular ODQA datasets: NaturalQuestions (NQ; Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), SQuAD (Rajpurkar et al., 2016), WebQuestions (WebQ; Berant et al., 2013), and CuratedTrec (TREC; Baudiš and Šedivý, 2015). All data splits and the Wikipedia collection for retrieval used in our experiments are the same as in Karpukhin et al. (2020); a brief description of each dataset is also provided in Section 4.2 of that work. The top-K retrieval accuracy, denoted as R@K, is widely used for these datasets; it evaluates whether any gold answer string is contained in the top K retrieved passages. Following previous work, we report both R@20 and R@100.
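Per question, R@K reduces to a string-containment check. A simplified sketch (attribute names are ours, and the answer-string normalization used by official evaluation scripts is omitted):

def recall_at_k(retrieved, gold_answers, k=20):
    # Top-K retrieval accuracy (R@K) for one question: 1.0 if any gold answer
    # string occurs in the text of the top-k retrieved passages, else 0.0.
    return float(any(ans in p.text for p in retrieved[:k] for ans in gold_answers))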
For the out-of-domain evaluation, we use two benchmarks: EntityQuestions (Sciavolino et al., 2021) and BEIR (Thakur et al., 2021). Different from the five ODQA datasets above, EntityQuestions consists of entity-centric questions covering a broader set of entities with varying frequencies in Wikipedia. Hence, it can be used to measure the model's sensitivity to entity distribution shift. We also report R@20 and R@100 for EntityQuestions. BEIR is a benchmark for zero-shot evaluation of retrieval systems, constructed from a diverse set of text retrieval datasets. Due to license issues, we cannot obtain all datasets in BEIR. Thus, we identify four interesting datasets and report results on them: ArguAna (Wachsmuth et al., 2018), DBPedia (Hasibi et al., 2017), FEVER (Thorne et al., 2018), and HotpotQA (Yang et al., 2018). In addition to entity distribution shift, these datasets can reflect the model's generalization performance with respect to richer query types and document index shifts. For example, DBPedia contains single-hop questions whereas HotpotQA has multi-hop questions, and queries in FEVER and ArguAna are short claims and long arguments, respectively. Different from the ODQA experiments, which exclusively use Wikipedia as the document index, DBPedia articles are used for DBPedia and arguments from idebate.org are used for ArguAna. Further, DBPedia has three levels of relevance, unlike the binary relevance used in the other datasets. Following the recommendation of BEIR, we use the normalized discounted cumulative gain for the top-10 hits (nDCG@10) as the evaluation metric for these datasets.
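For reference, a simplified single-query nDCG@k computation is sketched below. This uses one common linear-gain DCG formulation; BEIR's official evaluation relies on pytrec_eval, whose exact conventions may differ.

import math

def ndcg_at_k(ranked_rels, all_rels, k=10):
    # ranked_rels: graded relevance labels of the system's ranked hits;
    # all_rels: labels of all judged documents, used to form the ideal ranking.
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(ranked_rels[:k]))
    ideal = sorted(all_rels, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0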

Comparing TASER Variants
In this part, we compare the different TASER variants discussed in §3.1 by evaluating their performance on NQ under the single-set training setting. We use the bi-encoder dense passage retriever (DPR) from Karpukhin et al. (2020) as our baseline. All models, including DPR, are initialized from BERT-base (Devlin et al., 2019). All TASER models are fine-tuned for up to 40 epochs with Adam (Kingma and Ba, 2014) using a learning rate chosen from {3e-5, 5e-5}. Model selection is performed on the development set following Karpukhin et al. (2020). Results are summarized in Table 1. TASER-Shared is a variant without any task-aware specialization, i.e., there is a single expert FFN sub-layer in the specialized Transformer block and the router is a no-op. As shown in Table 1, it outperforms DPR while using only 50% of the parameters.
Task-aware specialization brings extra improvements with little increase in model size. Comparing the two learned routing mechanisms, Seq-R achieves slightly better results than Tok-R, indicating that specializing FFNs based on sequence-level features, such as the sequence type, is more effective for ODQA dense retrieval. This is consistent with the positive results for Det-R, which consists of two expert FFNs specialized for questions and passages, respectively. We also find that adding more expert FFNs does not necessarily bring extra gains, and I = 2 is sufficient for NQ. Consistent with the results on DPR, the hard negative mining described in §3.3 can further boost the Det-R model's performance by 3.0 points in test set R@20. Since Det-R achieves the best R@20, our subsequent experiments focus on this simple and effective specialization strategy. In the remainder of the paper, we drop the subscript and simply use TASER to denote models using Det-R.

In-Domain Evaluation
We carry out in-domain evaluations on the five ODQA datasets under the multi-set training setting. Besides BERT-base, coCondenser-Wiki (Gao and Callan, 2022) is also used to initialize TASER models. All TASER models use hard negatives mined from NQ, TriviaQA and WebQ. We combine NQ and TriviaQA development sets for model selection. Other training details are the same as in §4.2.
We also present results of hybrid models that linearly combine the dense retrieval scores with the BM25 scores, sim(q, p) + α · BM25(q, p).
We search the weight α in the range [0.5, 2.0] with an interval of 0.1 based on the combined development set mentioned above. Unlike Ma et al. (2021), we use a single α for all five datasets instead of dataset-specific weights, so that the resulting hybrid retriever still complies with the multi-set setting in a strict sense. The same score normalization technique described in Ma et al. (2021) is used.
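The hybrid scoring can be sketched as follows. The dict-based interface is ours, and treating a passage missing from one candidate list as scoring 0 there is a simplification (the normalization of Ma et al. (2021) is omitted):

def hybrid_rerank(dense_scores, bm25_scores, alpha, k=100):
    # Score every candidate from either list with sim(q, p) + alpha * BM25(q, p)
    # and keep the top-k. Inputs map passage id -> score for one question.
    ids = set(dense_scores) | set(bm25_scores)
    hybrid = {i: dense_scores.get(i, 0.0) + alpha * bm25_scores.get(i, 0.0) for i in ids}
    return sorted(ids, key=hybrid.get, reverse=True)[:k]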

Similar to Ma et al. (2021), we separately retrieve K candidates from TASER and BM25, and then retain the top K based on the hybrid scores, though we use a smaller K = 100. Results for different models are shown in Table 2, and the corresponding model sizes are reported in Table 3. BM25 serves as a strong baseline. Several dense models with single-set training are also included for comparison. For multi-set training, all prior work excludes SQuAD from training, following the suggestion of Karpukhin et al. (2020). We instead train models using all five datasets. Specifically, we show that including SQuAD in the multi-set training does not hurt the overall performance while significantly improving the performance on SQuAD; the overall R@100 is also improved. Based on these new results, we suggest that future work include SQuAD in the multi-set training as well.
Comparing models initialized from BERT-base, TASER significantly outperforms DPR (Karpukhin et al., 2020) and xMoCo (Yang et al., 2021) across the board, using only around 60% of the parameters. SPAR (Chen et al., 2022) is also initialized from BERT-base, but it augments DPR with another dense lexical model trained on either Wikipedia or PAQ, which doubles the model size (Table 3). TASER is mostly on par with SPAR-Wiki and SPAR-PAQ, except on SQuAD, but its model size is about a quarter of SPAR's. DPR-PAQ (Oguz et al., 2021) is only evaluated on NQ under the single-set setting, and it uses RoBERTa-large (Liu et al., 2019) for initialization, which has even more parameters.
The coCondenser model (Gao and Callan, 2022) combines the Condenser pre-training architecture (Gao and Callan, 2021), which produces information-rich [CLS] embeddings, with a corpus-level contrastive learning objective. It significantly outperforms DPR models initialized from BERT-base in the single-set training setting. We show that using coCondenser-Wiki for initialization is also beneficial for TASER under the multi-set setting, especially on SQuAD, where R@20 and R@100 are improved by 3.2 and 2.2 points, respectively. Notably, SQuAD is the only dataset among the five where DPR underperforms BM25, due to the higher lexical overlap between questions and passages. TASER effectively narrows this performance gap, suggesting that task-aware specialization indeed makes dense retrieval models more robust in terms of in-domain generalization. Further, TASER results surpass BM25 on all five datasets, and they are either on par with or better than state-of-the-art dense-only retriever models, demonstrating its superior parameter efficiency.
Consistent with previous work, combining BM25 with dense models can further boost performance, particularly on SQuAD. However, the improvement is more pronounced for DPR than for the TASER variants, indicating that TASER is able to capture more lexical overlap features. Finally, TASER + BM25 sets new state-of-the-art performance on all five ODQA datasets.

Out-of-Domain Evaluation
Here, we use two benchmarks to evaluate the out-of-domain (OOD) generalization ability of the two TASER variants from Table 2 (initialized from BERT-base and coCondenser-Wiki, respectively). EntityQuestions (Sciavolino et al., 2021) is used to measure the model's sensitivity to entity distribution shift, as DPR has been found to perform poorly on entity-centric questions containing rare entities. BEIR (Thakur et al., 2021) is used to study the model's generalization ability on other genres of information retrieval tasks. Specifically, we focus on four datasets from BEIR, i.e., ArguAna (Wachsmuth et al., 2018), DBPedia (Hasibi et al., 2017), FEVER (Thorne et al., 2018), and HotpotQA (Yang et al., 2018). DPR underperforms BM25 on all these datasets.

Table 4 presents the results for EntityQuestions. We report macro R@20 scores, which are used in Sciavolino et al. (2021), as well as micro R@20 and R@100 scores, which are used in Chen et al. (2022). As we can see, both TASER variants outperform the doubly parameterized DPR-Multi. Similar to the in-domain evaluation results, TASER can effectively reduce the performance gap between dense retrievers and BM25. These results further support our hypothesis that more parameter sharing can improve the robustness of dense retrievers.
Results on BEIR are reported in Table 5. Similarly, we observe that the TASER models consistently improve over DPR-Multi across the board. It is worth mentioning that our dense models can actually match the performance of BM25 on ArguAna and DBPedia. Interestingly, coCondenser pre-training has mixed results here: the coCondenser-initialized TASER is only better than the BERT-base one on HotpotQA, and on par or worse on the other datasets, including NQ. The flipped trend on the in-domain dataset NQ indicates some discrepancy between nDCG@10 and the R@K used in Table 2. We leave further investigation to future work.

Related Work
Passage retrieval is an important component of ODQA. Early systems (Chen et al., 2017; Yang et al., 2019; Nie et al., 2019; Min et al., 2019; Wolfson et al., 2020) are dominated by sparse vector space models like TF-IDF (Jones, 1972) and BM25 (Robertson and Walker, 1994). Recent seminal work on dense retrieval demonstrates its effectiveness using Transformer-based bi-encoder models, either by continual pre-training with an inverse cloze task (Lee et al., 2019) or by careful fine-tuning (Karpukhin et al., 2020).
One line of follow-up work improves dense retrieval models via various continual pre-training approaches. Guu et al. (2020) jointly pre-train the retriever and the knowledge-augmented encoder on a language modeling task. Chang et al. (2020) introduce body first selection and wiki link prediction as extra pre-training tasks, and Izacard et al. (2021) propose to use independent cropping as well. Gao and Callan (2022) combine the Condenser pre-training architecture (Gao and Callan, 2021), which produces information-rich [CLS] vectors, with a corpus-level contrastive learning objective. Oguz et al. (2021) propose domain-matched pre-training, using synthetic question-answer pairs generated from Wikipedia pages.
To improve on the in-batch negatives used during fine-tuning in prior work (Karpukhin et al., 2020; Luan et al., 2021), Xiong et al. (2020) iteratively construct global negatives using the bi-encoder dense retriever being optimized, and Qu et al. (2021) propose cross-batch negatives and denoise hard negatives using a less efficient but more accurate cross-encoder. Yang et al. (2021) focus on improving the contrastive learning objective by extending momentum contrastive learning (He et al., 2020) to the bi-encoder architecture. Motivated by the success of augmenting dense models with sparse models, Chen et al. (2022) combine the dense retriever with a dense lexical model that mimics sparse retrievers.
All of the above work focuses on improving the accuracy of bi-encoder dense retrievers, whereas our work tackles the parameter efficiency issue. MoE (Jacobs et al., 1991) has recently been used to scale up neural models, achieving impressive results on language modeling and machine translation (Shazeer et al., 2017). The Switch Transformer developed by Fedus et al. (2021) brings MoE techniques to the Transformer (Vaswani et al., 2017). Our work is the first to apply MoE to dense retrieval models, demonstrating its effectiveness on ODQA.
Unlike most bi-encoder dense retrievers, which measure the similarity between a question and a passage using their corresponding [CLS] vectors, ColBERT (Khattab and Zaharia, 2020) develops a late-interaction paradigm and measures similarity via a MaxSim operator that computes the maximum similarity between a token in one sequence and all tokens in the other sequence. This architecture has shown promising results in ODQA (Khattab et al., 2021) and on the BEIR benchmark (Santhanam et al., 2022). Our work instead focuses on improving the underlying text encoders, and the MaxSim operator introduced by ColBERT can be applied on top of TASER.

Conclusion
In this work, we propose a new parameterization framework, TASER, for improving the efficiency and robustness of dense retrieval for ODQA. To improve parameter efficiency, TASER interleaves shared encoder blocks with specialized ones in a single encoder, where some sub-networks are task-specific. As the specialized sub-networks are sparsely activated for different encoding tasks, TASER provides better parameter efficiency at almost no additional computation cost. Extensive experiments on five popular ODQA datasets and two out-of-domain retrieval benchmarks show that TASER substantially outperforms existing fully supervised bi-encoder dense retrievers in both in-domain and out-of-domain evaluations. Similar to bi-encoder models, advanced techniques such as iterative hard negative mining and ensembling with BM25 can be applied to TASER to achieve further improvements in retrieval performance. On all five ODQA datasets, TASER models achieve state-of-the-art results while using far fewer parameters.

Limitations
In this section, we point out several limitations in this work.
First, our in-domain evaluation experiments focus on passage retrieval for ODQA. While dense retrievers are mostly successful in ODQA, they can also be used in other types of retrieval tasks, which may have different input and output formats. For example, the KILT benchmark (Petroni et al., 2021) provides several knowledge-intensive tasks other than ODQA. The performance of TASER models trained on such retrieval tasks remains unknown.
Second, compared with traditional sparse vector space models like TF-IDF and BM25, the cost of training is an inherent issue of dense retrievers. Although TASER significantly reduces the number of model parameters, the training cost remains high.
Third, in our experiments, we show that the learned routing does not outperform the deterministic routing. This may suggest that a better architecture and/or training algorithm for learned routing is needed to fully unleash the power of MoE.
Last, as observed in §4.4, there is still a gap between TASER and BM25 in OOD evaluation. How to close this gap remains a critical topic for future work on dense retrievers.