Large Dual Encoders Are Generalizable Retrievers

It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot-product between a query vector and a passage vector, is too limited compared to models with fine-grained interactions between the query and the passage. In this paper, we challenge this belief by scaling up the size of the dual encoder model while keeping the bottleneck layer as a single dot-product with a fixed size. With multi-stage training, scaling up the model size brings significant improvement on a variety of retrieval tasks, especially for out-of-domain generalization. We further analyze the impact of the bottleneck layer and demonstrate diminishing improvement when scaling up the embedding size. Experimental results show that our dual encoders, Generalizable T5-based dense Retrievers (GTR), outperform previous sparse and dense retrievers on the BEIR dataset significantly. Most surprisingly, our ablation study finds that GTR is very data efficient, as it only needs 10% of MS Marco supervised data to match the out-of-domain performance of using all supervised data.


Introduction
Typical neural retrieval models follow a dual encoder paradigm (Gillick et al., 2018; Yang et al., 2020; Karpukhin et al., 2020). In this setup, queries and documents are encoded separately into a shared fixed-dimensional embedding space, where relevant queries and documents are represented in each other's proximity. Then, approximate nearest neighbor search (Vanderkam et al., 2013; Johnson et al., 2021) is applied to efficiently retrieve relevant documents given an encoded input query.
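The mechanics of this paradigm can be sketched in a few lines. The encoder below is a random-projection stand-in (a hypothetical placeholder, not a trained dual encoder such as the T5 models used in this paper); it serves only to illustrate the dot-product scoring and nearest-neighbor lookup described above.

```python
import numpy as np

def encode(texts, dim=768, seed=0):
    # Stand-in encoder: maps each text to a random unit vector.
    # A real system would run a trained dual encoder here.
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(len(texts), dim))
    # L2-normalize so the dot product equals cosine similarity.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def retrieve(query_emb, doc_embs, k=3):
    # Exhaustive dot-product scoring; production systems swap this
    # for approximate nearest-neighbor search over millions of docs.
    scores = doc_embs @ query_emb
    top = np.argsort(-scores)[:k]
    return top, scores[top]

docs = ["passage one", "passage two", "passage three", "passage four"]
doc_embs = encode(docs)
query_emb = encode(["example query"], seed=1)[0]
top_ids, top_scores = retrieve(query_emb, doc_embs, k=2)
```

Because document embeddings are computed once and indexed offline, only the query must be encoded at search time; this separability is what makes the dual encoder attractive at scale.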
While dual encoders are popular neural retrievers, the expressiveness of the model is limited by a bottleneck layer consisting of only a simple dot-product between query and passage embeddings. Lu et al. (2021) and Khattab and Zaharia (2020) argued that the dot-product (or cosine similarity) between the embeddings might not be powerful enough to capture the semantic relevance. Similarly, Thakur et al. (2021) suggested that dual encoder models have "issues for out-of-distribution data" and that models with more interactions between queries and documents have better generalization ability.
In this paper, we challenge this belief by scaling up the dual encoder model size while keeping the bottleneck as a single dot-product with a fixed size. Note that scaling up a dual encoder is different from scaling up pretrained language models, such as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020), because of the presence of the bottleneck layer. While increasing the model size can greatly increase model capacity, for dual encoders with a fixed bottleneck embedding size, the interactions between queries and documents are still limited by a simple dot-product.
To test this hypothesis, we take advantage of the T5 architecture and checkpoints to build encoders of up to 5 billion parameters, while keeping the bottleneck embedding dimension at 768 in all configurations, as illustrated in Figure 2. Following Ni et al. (2021), we build dual encoders by taking the encoder part of T5. To effectively leverage the power of large models, we collect two billion web question-answer pairs as generic pre-training data. By combining pre-training on generic data with fine-tuning on MS Marco (Nguyen et al., 2016), we are able to train large-scale dual encoder retrieval models. We call the resulting models Generalizable T5-based dense Retrievers (GTR). We assess the zero-shot performance of GTR on the BEIR benchmark (Thakur et al., 2021), which comprises 18 information retrieval tasks across 9 domains. First, we show that scaling up leads to better generalization despite the fixed single dot-product bottleneck. Second, pre-training on community question-answer pairs and fine-tuning on human-curated data are both important to fully utilize the power of the scaled-up model. In addition, with scaling and pre-training, we find GTR to be highly data efficient in terms of human-annotated queries: it only needs 10% of MS Marco to match the overall out-of-domain performance.

Dual Encoder and dense retrieval
Classic retrieval models, such as BM25 (Robertson and Zaragoza, 2009), rely on lexical overlap: term frequency, inverse document frequency, and document length. To allow semantic matching between queries and documents, dense retrieval models, such as dual encoders (Yih et al., 2011; Gillick et al., 2019; Karpukhin et al., 2020), were introduced, where both queries and documents are embedded into low-dimensional dense representations.
A critical challenge for dual encoders is that performance can be bounded by the dot-product similarity function. As such, there is growing interest in applying lightweight interaction layers to replace the single dot-product. Luan et al. (2020) propose a multi-vector encoding model that represents a document as a set of vectors and computes the relevance score as the maximum inner product over this set. ColBERT (Khattab and Zaharia, 2020) learns embeddings for each token and then uses a "MaxSim" operation to select the best candidate. While these models can achieve significant improvement, the dual encoder is still the most popular approach in practice due to its simplicity and ability to scale. In this paper, we take a step back and show that the performance of single dot-product based methods can be improved significantly.

BEIR generalization task
We use BEIR, a heterogeneous benchmark, for zero-shot retrieval evaluation. BEIR has 18 information retrieval datasets across 9 domains, including Bio-Medical, Finance, News, Twitter, Wikipedia, StackExchange, Quora, Scientific, and Misc. The majority of the datasets have binary query relevance labels. The other datasets have 3-level or 5-level relevance judgments. We refer readers to BEIR (Thakur et al., 2021) for more details.

T5 dual encoder
We adopt the dual encoder framework and follow prior work (Xiong et al., 2020; Hofstätter et al., 2021) to initialize from pre-trained language models. We choose the pre-trained T5 model family as our backbone encoder because it provides off-the-shelf pre-trained models with capacities ranging from millions to billions of parameters (Raffel et al., 2020; Xue et al., 2020, 2021). We illustrate the architectures of our models in Figure 2.
Let D = {(q_i, p_i^+)} be the training set of paired examples, where q_i is a query and p_i^+ is a ground-truth relevant passage. Following Ni et al. (2021), we encode q_i and p_i^+ into embeddings by feeding them to the T5 encoder and taking the mean pooling of the encoder output. In all experiments, we fix the size of the output embeddings to 768.
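The mean-pooling step can be sketched as follows: token-level encoder outputs are averaged over non-padding tokens into a single vector, so the bottleneck dimension stays at d_model (768 here) regardless of input length. This is a minimal NumPy illustration, not the paper's JAX implementation.

```python
import numpy as np

def mean_pool(token_states, mask):
    # token_states: (seq_len, d_model) encoder outputs.
    # mask: (seq_len,) with 1 for real tokens, 0 for padding.
    mask = mask[:, None].astype(float)
    # Sum only over real tokens, then divide by their count.
    return (token_states * mask).sum(axis=0) / mask.sum()

states = np.ones((5, 768))          # 5 tokens, d_model = 768
mask = np.array([1, 1, 1, 0, 0])    # last two positions are padding
embedding = mean_pool(states, mask)  # shape: (768,)
```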
We train the model using an in-batch sampled softmax loss (Henderson et al., 2017):

\mathcal{L} = -\log \frac{e^{\mathrm{sim}(q_i, p_i^+)/\tau}}{\sum_{j \in B} e^{\mathrm{sim}(q_i, p_j^+)/\tau}}, \quad (1)

where sim is cosine similarity, B is a mini-batch of examples, and τ is the softmax temperature.
In addition to in-batch negatives, we also support additional hard negatives p_j^-:

\mathcal{L} = -\log \frac{e^{\mathrm{sim}(q_i, p_i^+)/\tau}}{\sum_{j \in B} \left( e^{\mathrm{sim}(q_i, p_j^+)/\tau} + e^{\mathrm{sim}(q_i, p_j^-)/\tau} \right)}. \quad (2)

We apply this loss function bi-directionally (Yang et al., 2019), matching both question-to-document and document-to-question.
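The in-batch sampled softmax loss above (with optional hard negatives) can be sketched in NumPy; the paper's actual implementation is in JAX. Embeddings are assumed L2-normalized so a dot product equals cosine similarity.

```python
import numpy as np

def in_batch_softmax_loss(q, p_pos, p_neg=None, tau=0.01):
    # q, p_pos (and optional hard negatives p_neg): (B, d) arrays of
    # L2-normalized embeddings. Row i of `scores` holds
    # sim(q_i, p_j+)/tau for every j in the batch, so off-diagonal
    # entries serve as in-batch negatives.
    scores = q @ p_pos.T / tau                        # (B, B)
    if p_neg is not None:
        scores = np.concatenate([scores, q @ p_neg.T / tau], axis=1)
    # Numerically stable log-sum-exp for the softmax denominator
    # (needed because tau = 0.01 makes the logits large).
    m = scores.max(axis=1, keepdims=True)
    log_z = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    pos = np.diag(q @ p_pos.T) / tau                  # sim(q_i, p_i+)/tau
    return float(np.mean(log_z - pos))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 32))
q /= np.linalg.norm(q, axis=1, keepdims=True)
loss = in_batch_softmax_loss(q, q)  # positives identical to queries
```

The paper applies this loss bi-directionally: the same expression is added with the roles of queries and passages swapped.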

Multi-stage training
As shown in Figure 3, we use a multi-stage training approach to achieve generalizable retrieval models.
The training process includes a pre-training stage on a web-mined corpus and a fine-tuning stage on search datasets. Although the web-mined corpus is not annotated, it contains a large amount of semi-structured data pairs (e.g., question-answer pairs) and can provide rich semantic relevance information to the model during pre-training. On the other hand, the search datasets are curated and well-annotated, and thus can benefit the fine-tuning stage.
Specifically, for dual encoder pre-training, we initialize the dual encoders from the T5 models and train on question-answer pairs collected from the Web. Recently, Sentence-T5 (Ni et al., 2021) explored different ways to extract strong text embeddings and achieved remarkable performance on SentEval and Semantic Textual Similarity tasks. We follow their setting to encode queries and passages via mean pooling over the T5 encoder outputs, and focus on dense retrieval tasks.
For fine-tuning, our aim is to adapt the model to retrieval using a high-quality search corpus so the model can learn to better match generic queries to documents. In this paper, we consider two datasets for fine-tuning: MS Marco (Nguyen et al., 2016) and Natural Questions (NQ) (Kwiatkowski et al., 2019).

Experimental setup

Training Data
Community QA. To leverage the power of large-scale models, we collect input-response pairs and question-answer pairs from online forums and QA websites, such as Reddit and StackOverflow. This results in 2 billion question-answer pairs that we use to pre-train the dual encoder.
MS Marco. The MS Marco dataset (Nguyen et al., 2016) includes 532K query and document pairs, which we use as search data for fine-tuning. This dataset is sampled from Bing search logs and covers a broad range of domains and concepts.
Natural Questions. In the fine-tuning stage, we also consider the Natural Questions dataset (Kwiatkowski et al., 2019), which has been widely used in related work (Karpukhin et al., 2020; Xiong et al., 2020). This dataset consists of 130K human-annotated query and passage pairs.

Configurations
We implement GTR models in JAX and train them on Cloud TPU-V3. We consider different sizes of the T5 transformer (Vaswani et al., 2017) architecture, including Base, Large, XL, and XXL; their numbers of parameters are listed in Table 1. Note that we only use the encoder of the T5 models, and thus the number of parameters is less than half that of the full model. We take the off-the-shelf checkpoints as the initial parameters and use the same SentencePiece vocabulary model. During pre-training and fine-tuning, we set the batch size to 2048 and the softmax temperature τ to 0.01. We use the Adafactor optimizer (Shazeer and Stern, 2018) with an initial learning rate of 1e-3 and linear decay. We train the model for 800K steps during pre-training and 20K steps during fine-tuning.
For fine-tuning, we use the hard negatives from RocketQA (Qu et al., 2021) with MS Marco and the hard negatives from Lu et al. (2021) with NQ. These hard negatives have been shown to lead to better retrieval performance. By default, we use the complete MS Marco or NQ datasets for fine-tuning.
When evaluating on the BEIR benchmark, we use sequences of 64 tokens for questions and 512 for documents in all datasets except Trec-News, Robust-04, and ArguAna. In particular, we set the document length to 768 for Trec-News and Robust-04, and the question length to 512 for ArguAna, in accordance with the average query and document lengths in these datasets.

Models for comparison
We consider various baselines, including sparse retrieval models (BM25 and DocT5Query) and dense retrieval models (DPR, ANCE, TAS-B, and GenQ) (Thakur et al., 2021). We conduct experiments on four different sizes of our GTR models: GTR-Base, GTR-Large, GTR-XL, and GTR-XXL. We also consider three different settings for GTR to investigate the effect of scaling up in different training stages:
• GTR: the full GTR models with both pre-training and fine-tuning.
• GTR-FT: only fine-tuned on MS Marco, without pre-training.
• GTR-PT: only pre-trained on CommunityQA, without fine-tuning.
We evaluate our models on BEIR (Thakur et al., 2021) as discussed in Section 2.2. Following BEIR, we consider two retrieval metrics: NDCG@10 and Recall@100. Due to space limitations, we report the Recall@100 results in Appendix A.
Figure 4: GTR-Base outperforms BM25 on 9 tasks, with larger GTR models continuing to improve on these 9 tasks. GTR-XXL catches up with or surpasses BM25 on another 5 tasks and only under-performs on the remaining 5.

Evaluation Results
We present three groups of experiments to study a) the in-domain performance on MS Marco, b) the out-of-domain generalization performance on BEIR, and c) the data efficiency.

Results on MS Marco
As shown in Table 3, the in-domain NDCG@10 scores on MS Marco improve consistently with scaling up. We observe a similar effect on other evaluation metrics, including MRR@10 and Recall@1000, and report the numbers in Table 7 of Appendix A.

Results on BEIR generalization tasks
Also as shown in Table 3, we observe a clear gain in out-of-domain (OOD) performance in terms of NDCG@10 as the model size increases. The GTR-Large model already outperforms the previous best dense retrieval model, TAS-B, as well as the best sparse model, DocT5Query. Scaling up to GTR-XXL leads to another jump in retrieval performance. On average, scaling up demonstrates an encouraging ascending trend that eventually makes GTR outperform all baseline methods on all evaluation metrics. This confirms that scaling up is a valid path towards generalizability. Previously, dual encoders failed to match the performance of BM25 for tasks that require better lexical matching capabilities. Thus, we investigate what kinds of tasks are improved by scaling up the model size. Figure 4 presents a detailed comparison of all sizes of GTR models against the BM25 baseline.
For tasks like NQ, where dual encoders have been previously shown to be more effective than BM25, increasing the model size continues to advance the performance of dual encoders.This suggests scaling up can further boost the head start of dense models over sparse models on these datasets.
For tasks like BioASQ and NFCorpus, where dual encoders previously struggled to match the performance of BM25, we find that scaling up consistently improves retrieval performance. In particular, for NFCorpus, our Base model under-performs BM25, but the XL model outperforms BM25 by 5.5% (0.343 vs. 0.325). This finding verifies our assumption that scaling up can further exploit the powerful semantic matching capabilities of dual encoder models and enable them to ultimately outperform BM25.

Data efficiency for large retrievers
To better understand the data efficiency of large dual encoders, we train GTR models using different proportions of MS Marco during fine-tuning. In particular, we sample a subset of the MS Marco training data by keeping only 10% of the training queries, along with their relevant (positive) passages and irrelevant (hard negative) passages.
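This query-level subsampling can be sketched as follows. The exact sampling procedure is not specified in detail, so a uniform random sample is an assumption here; the key point is that each kept query retains all of its positives and hard negatives.

```python
import random

def subsample_queries(examples, fraction=0.1, seed=0):
    # examples: list of (query, positives, hard_negatives) triples.
    # Sampling at the query level keeps each query's full set of
    # associated positive and hard-negative passages intact.
    rng = random.Random(seed)
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)

data = [(f"q{i}", [f"pos{i}"], [f"neg{i}"]) for i in range(100)]
subset = subsample_queries(data, fraction=0.1)  # 10 of 100 queries kept
```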
As shown in Table 4, using 10% of the training data reduces the in-domain performance of GTR on MS Marco, as expected. For OOD performance, however, we find some surprising results. The Large GTR-FT (fine-tuning only) model already shows comparable performance even when trained with only 10% of the data. Moreover, the full GTR models trained with less data manage to achieve comparable or even better OOD performance than those fine-tuned on the complete MS Marco dataset. This is strong evidence that 10% of the MS Marco dataset is sufficient for fine-tuning the full GTR models. This encouraging observation suggests that GTR models enjoy the benefit of data efficiency and can achieve domain adaptation with less fine-tuning data.

Ablation Study and Analysis
In this section we present ablations and analysis to further illustrate the effects of scaling up, the impact of fine-tuning and pre-training, and the GTR model's behavior.

Scaling up in different training stages
The first ablation study investigates how scaling up impacts dual encoder pre-training and fine-tuning. Results are presented in Table 5.
For fine-tuning-only models (GTR-FT), scaling up benefits both in-domain and OOD performance. This suggests that, even without pre-training, larger models are still better at learning retrieval signals during fine-tuning. For pre-training-only models (GTR-PT), we also observe a generally upward trend on zero-shot retrieval tasks. This indicates that scaled-up models are more capable of absorbing the semantic information in generic training data, and thus can improve generalization. Finally, with both pre-training and fine-tuning, the full GTR models consistently improve over GTR-FT at all sizes. This shows the power of combining scaling up with a generic pre-training stage.

Importance of the fine-tuning dataset
In Table 5, we compare GTR and GTR-PT on the BEIR benchmark to understand the importance of fine-tuning on MS Marco. The table shows a clear gap between GTR models before and after fine-tuning. This result demonstrates the necessity of leveraging a high-quality dataset (e.g., search data) to fine-tune the dual encoders.
Specifically, we compare fine-tuning GTR on NQ vs. MS Marco. NQ only covers Wikipedia documents and is much smaller than MS Marco. This allows us to investigate the performance of GTR when fine-tuned on a less generalizable dataset. Fine-tuning on NQ also gives a fair comparison with DPR (Karpukhin et al., 2020).
As shown in Table 6, the GTR-Base model fine-tuned on NQ outperforms the original DPR model that uses BERT-Base as the backbone encoder. This demonstrates the effectiveness of our pre-training on the Web dataset, as well as the hard negatives introduced by Lu et al. (2021) for NQ. Also, fine-tuning on NQ leads to inferior performance compared to fine-tuning on MS Marco, which is consistent with prior work (Thakur et al., 2021). However, the disadvantage of fine-tuning on NQ is alleviated as the model scales up. This shows that the benefit of scaling up holds for different fine-tuning datasets. Furthermore, when scaling up from Large to XL, we observe a more significant gain when fine-tuning with NQ than with MS Marco, indicating that scaling up helps more when using weaker fine-tuning data.
Figure 5: Comparison with Izacard et al. (2021) on NDCG@10. "CL" denotes their approach with contrastive learning on C4 and Wiki, while the others denote GTR with different sizes. Note that they only report results on 15 datasets of the BEIR benchmark.

Different pre-training strategies
Concurrently, Izacard et al. (2021) propose contrastive learning (CL) pre-training with data from the C4 and Wiki datasets in an unsupervised way. In particular, their pre-training data is constructed by randomly choosing two spans from a single document and applying word deletion or replacement to each span. In contrast, GTR uses Web-mined QA data for pre-training.
We compare the performance of GTR to their models to gain further insights into using different pre-training data and methods for dual encoders. As shown in Figure 5, at the Base size, models with our pre-training approach under-perform CL-Pretrain on over half of the datasets; as the model size increases, the GTR-Large and -XXL models show significant gains over CL-Pretrain. This demonstrates that scaling up can mitigate the disadvantage of using a potentially inferior pre-training method. Note that our pre-training is additive to CL-Pretrain, and we could leverage pre-training on C4 and Wiki to further improve the results. We leave this exploration to future work.

Document length vs. model capacity
Previously, Thakur et al. (2021) showed that models trained with cosine similarity prefer short documents while those trained with dot-product prefer long documents.To investigate whether scaling up affects this observation, we compute the median lengths of the top-10 retrieved documents for all queries and present the results in Figure 6.
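The length statistic behind Figure 6 can be computed as below. Whether the median is taken per query or over the pooled lengths is an assumption on our part; the pooled version is shown.

```python
import statistics

def median_retrieved_length(retrieved_docs_per_query):
    # retrieved_docs_per_query: one list per query containing the
    # text of its top-10 retrieved documents. Length is measured
    # in words, matching Figure 6.
    lengths = [
        len(doc.split())
        for docs in retrieved_docs_per_query
        for doc in docs
    ]
    return statistics.median(lengths)

runs = [["short doc", "a slightly longer document here"],
        ["one two three", "one two three four five"]]
```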
Though all GTR models are trained using cosine similarity, we find that scaling up the model size influences the lengths of retrieved documents. We observe an increasing trend in document length for DBPedia, Fever, HotpotQA, Signal-1M, Trec-News, and Webis-Touché2020 as models scale up. In particular, for Webis-Touché2020, the lengths of the retrieved documents grow drastically as the model scales up: the largest GTR-XXL retrieves documents that are on average twice as long as those of the smallest GTR-Base. This contributes to performance, since Thakur et al. (2021) show that the majority of relevant documents in Webis-Touché2020 are longer.
On the other hand, the only exception we observe is the Trec-Covid dataset, where the GTR-XXL model retrieves much shorter documents than its smaller counterparts. This may explain the inferior performance of GTR-XXL on Trec-Covid shown in Table 3 and Table 8. We leave exploring the effects of using dot-product as the similarity function for large dual encoders to future work.

Scaling up with different bottleneck size
This section presents a complementary study to reveal the interaction of scaling up model capacity and increasing the bottleneck embedding size.
Specifically, we set the bottleneck embedding size to {256, 768, 2048, 4096}. Given this significant increase in the hyperparameter space, we make two minor compromises to be frugal with computational resources: 1) we directly fine-tune on MS Marco without pre-training on the Community QA dataset; 2) we evaluate on six randomly selected OOD datasets: BioASQ, DBPedia-entity, NQ, HotpotQA, Fever, and SciFact.
We present NDCG@10 results in Figure 7. We observe that, under all choices of bottleneck embedding size, scaling up model capacity consistently improves performance. On the other hand, when we increase the bottleneck dimension, the performance gain is minimal from a dimensionality of 768 up to 4096, as similarly observed by Neelakantan et al. (2022). These observations indicate that, to enhance model generalizability under the single-vector dot-product scheme, scaling up the model size can be more effective than increasing the bottleneck dimensionality. We leave the exploration of better training objectives to improve the scaling behavior of the bottleneck dimensionality to future work.

Related Work
Neural information retrieval. Document retrieval is an important task in the NLP and information retrieval (IR) communities. Traditionally, lexical approaches that match queries and documents based on term overlap, such as TF-IDF and BM25 (Robertson and Zaragoza, 2009), have achieved great success on this task. Recently, neural approaches, which go beyond simple term matching, have been quickly adopted by the community, achieving state-of-the-art performance on multiple retrieval tasks, such as passage retrieval (Karpukhin et al., 2020), question answering (Ahmad et al., 2019), conversational question answering (Qu et al., 2020), and bitext retrieval (Feng et al., 2020).
Dual encoders for neural retrieval. Among neural retrievers, dual encoders have demonstrated strong performance compared to traditional sparse models, such as BM25, on a wide range of retrieval tasks (Karpukhin et al., 2020; Gillick et al., 2018). A key to the success of dual encoders is pre-trained language models, from which the dual encoders are initialized. Other techniques, such as negative mining (Xiong et al., 2020; Lu et al., 2021; Sachan et al., 2021) and large training batch sizes (Qu et al., 2021), are also highly effective. However, little work has examined the effect of the capacity of the backbone models.
Zero-shot neural retrieval. Recent work has shown great improvement in the zero-shot setting for dual encoders by leveraging distillation and synthetic data generation (Thakur et al., 2021; Hofstätter et al., 2021; Ma et al., 2020). Both of these techniques, as well as scaling up backbone models, are effective ways to close the gap between dual encoders and the upper bound of single dot-product approaches with fixed-dimension embeddings. On the other hand, multi-vector approaches introduce more interactions between dense embeddings and could also benefit from scaling up the backbone encoders. We hope that our exploration of scaling up model sizes for single dot-product based methods can lay the groundwork for multi-vector approaches and further push the frontier of neural information retrieval.

Inference latency
A caveat of scaling up is the increase in latency overhead. Therefore, we investigate the inference speed, in milliseconds (ms), for all GTR models with a batch size of 1 and an input length of 128. We find that the latency increases from 17 ms to 34 ms, 96 ms, and 349 ms across the four model sizes. To put this in context, the GTR-Base model has latency close to that of TAS-B, while the largest GTR-XXL model has latency similar to that of re-ranking models (Thakur et al., 2021). With recent work towards making large models efficient via sparsification, distillation, and prompt-tuning, we hope the inference time for large dual encoders can be significantly reduced in the future.
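A latency measurement like this can be sketched with a simple wall-clock benchmark; actual numbers depend on hardware, batching, and compilation, and the harness below is an illustrative assumption rather than the paper's measurement code. Warm-up runs are excluded so one-time costs such as JIT compilation do not inflate the estimate.

```python
import time

def measure_latency_ms(encode_fn, batch, n_warmup=3, n_runs=20):
    # Run the encoder a few times first to absorb one-time costs.
    for _ in range(n_warmup):
        encode_fn(batch)
    # Average wall-clock time per call over the timed runs.
    start = time.perf_counter()
    for _ in range(n_runs):
        encode_fn(batch)
    return (time.perf_counter() - start) / n_runs * 1000.0

# Example with a stand-in "encoder" that just sleeps for ~1 ms.
latency = measure_latency_ms(lambda b: time.sleep(0.001), ["query"])
```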

Conclusion
This paper presents the Generalizable T5-based dense Retriever (GTR), a scaled-up dual encoder model with a fixed-size dot-product bottleneck layer. We show that scaling up the model size brings significant improvement in retrieval performance across the board on the BEIR zero-shot retrieval benchmark, especially for out-of-domain generalization. The GTR-XXL model performs at the level of the state of the art on BEIR, outperforming many models that use earlier interactions between queries and documents. This sheds light on a research direction: continuing to enhance single-vector representation models through better backbone encoders. Moreover, our in-depth analysis reveals the impact of scaling up under different training stages, pre-training strategies, fine-tuning datasets, and bottleneck sizes, as well as how scaling up influences the lengths of retrieved documents. Our findings can inform future work and are an integral part of the joint effort to improve dual encoder models.

Limitations
In this work, we focus on standard dual encoder training and have not investigated other techniques such as distillation. Distillation has been shown to be a strong recipe for improving the out-of-domain performance of dense retrieval models (Santhanam et al., 2021; Formal et al., 2021). We hope to investigate whether the scaling effect could also benefit distillation when scaling up the student dual encoders. In addition, we focus only on English corpora and leave the exploration of scaling up dense retrievers for multilingual corpora to future work.

Figure 2 :
Figure 2: Architecture of Generalizable T5-based dense Retrievers. The research question we ask is: can scaling up dual encoder model size improve the retrieval performance while keeping the bottleneck layer as a single dot-product with a fixed size? Only the encoder is taken from the pre-trained T5 models, and the two towers of the dual encoder share parameters.

Figure 6 :
Figure 6: Median lengths (in words) of top-10 retrieved documents for all queries.

Figure 7 :
Figure 7: Average NDCG@10 for selected OOD datasets with different bottleneck and model sizes.

Table 1 :
Number of parameters in the GTR models.

Table 2 :
Dimensions of different models. Most dual encoder models set the embedding dimension to 768.

Table 3 :
NDCG@10 on BEIR. Best results are marked in bold. GTR models are pre-trained on the CommunityQA dataset and fine-tuned on the complete MS Marco dataset. GTR models achieve better NDCG as size increases from Base to XXL, outperforming the previous best sparse model, DocT5Query, and dense retrieval model, TAS-B.

Table 6 :
Comparisons of fine-tuning on MS Marco and NQ with average zero-shot NDCG@10.