SamToNe: Improving Contrastive Loss for Dual Encoder Retrieval Models with Same Tower Negatives

Dual encoders have been used for retrieval tasks and representation learning with good results. A standard way to train dual encoders is using a contrastive loss with in-batch negatives. In this work, we propose an improved contrastive learning objective by adding queries or documents from the same encoder towers to the negatives, which we name "contrastive loss with SAMe TOwer NEgatives" (SamToNe). By evaluating on question answering retrieval benchmarks from MS MARCO and MultiReQA, and heterogeneous zero-shot information retrieval benchmarks (BEIR), we demonstrate that SamToNe can effectively improve the retrieval quality for both symmetric and asymmetric dual encoders. By directly probing the embedding spaces of the two encoding towers via the t-SNE algorithm (van der Maaten and Hinton, 2008), we observe that SamToNe ensures the alignment between the embedding spaces from the two encoder towers. From the analysis of the embedding distance distributions of the top-1 retrieved results, we further explain the efficacy of the method from the perspective of regularisation.


Introduction
The dual encoder architecture applied to information retrieval has shown excellent performance in a wide range of tasks (Gillick et al., 2018; Karpukhin et al., 2020; Ni et al., 2021, 2022).
Recently, the Information Retrieval community has transitioned towards deep learning models that leverage large unsupervised corpus pre-training (Devlin et al., 2019; Raffel et al., 2020), which offers more powerful semantic and contextual representations for queries and documents. These models can be successfully applied to scoring tasks, e.g. Dehghani et al. (2017), or retrieval tasks, e.g. Gillick et al. (2018). In contrast, classic retrieval models, such as BM25 (Robertson and Zaragoza, 2009), rely on bag-of-words lexical overlap, term frequency heuristics, inverse document frequency, and document length. This type of retrieval model does not require any training and can generalize reasonably well, but it falls short of finding documents that have low term overlap but high semantic similarity.
A dual encoder (Gillick et al., 2018; Yang et al., 2020; Karpukhin et al., 2020; Reimers and Gurevych, 2019) consists of two encoding towers that map queries and documents, respectively, into a shared low-dimensional dense representation, namely the embedding space. The model is usually optimized with a contrastive loss (Chopra et al., 2005), which moves the embeddings of the queries and documents from the same positive example closer to each other, and the embeddings from negative examples farther apart. Training the dual encoder in batches makes it possible to use, for each question, the passages that answer all the other questions within the batch as negatives (Gillick et al., 2018), so-called "in-batch negatives". At indexing time, all the documents in a corpus are encoded via bulk inference and indexed. At retrieval time, a query is encoded and its most relevant documents are retrieved through nearest neighbour search (Vanderkam et al., 2013; Johnson et al., 2021) over the embedding space using a measure of similarity, e.g. the dot product or cosine distance of the embedding vectors.
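As a minimal sketch of this retrieval step (toy embeddings of our own, not the paper's model), nearest neighbour search by cosine similarity can be written as:

```python
import numpy as np

def retrieve_top_k(query_emb, doc_embs, k=2):
    """Return the indices of the k documents most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)                       # unit-normalize query
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)  # unit-normalize docs
    scores = d @ q                    # cosine similarity of every document to the query
    return np.argsort(-scores)[:k]    # highest-scoring documents first

# toy corpus: four 2-d document embeddings and one query
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.05])
print(retrieve_top_k(query, docs, k=2))  # the two documents pointing the same way
```

In practice, this brute-force scan is replaced by an approximate nearest neighbour index over the bulk-encoded corpus.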
Motivation. In this work, we consider two major types of dual encoder architectures: the "Symmetric Dual Encoder" (SDE), with parameters shared between the two encoder towers, and the "Asymmetric Dual Encoder" (ADE), with two distinctly parameterized encoder towers. Dong et al. (2022) demonstrated that sharing projection layers can significantly improve the performance of ADEs. They empirically explained the efficacy of SDE and ADE-SPL by claiming that the shared projection layers help map the embeddings of the two encoder towers into a coinciding parameter space.
By repeating this embedding space analysis on a variety of tasks, we find that ADE-SPL may not be enough to ensure that the embedding spaces from the two encoder towers coincide, as shown in Figure 1. This motivates us to further improve dual encoder retrieval quality beyond the architectural change explored in Dong et al. (2022). Although the projection layers are shared, our analyses suggest that an extra mechanism, beyond the standard contrastive loss with in-batch negatives, is required to ensure the adjacency of the embeddings of a ground-truth pair.

Contributions.
In this paper, we propose an improved training objective for dual encoder models: contrastive loss with Same Tower Negatives (SamToNe). In Section 3, we demonstrate its usefulness on a variety of information retrieval tasks, including both tasks with in-task fine-tuning and a zero-shot benchmark suite. Across all the tasks explored, SamToNe performs competitively compared to the traditional training setup, with a significant improvement on the metrics averaged across tasks. Finally, through an analysis of the produced embeddings, in Section 4, we further make evident the superiority of SamToNe from the perspective of regularisation.

Method
Dual Encoder Architecture. We follow the standard setup of information retrieval: given a query, q, and a corpus of retrieval candidates, P, the goal is to retrieve k relevant candidates, p_k ∈ P. A candidate can be a phrase, a sentence, a passage, or a document.
Recent research (Dong et al., 2022) demonstrated that sharing projection layers can significantly improve the performance of ADEs, and we use this shared projection layer for ADEs (ADE-SPL) throughout our experiments. Figure 2 illustrates the SDE and ADE-SPL architectures used in this work. Our dual encoders are initialized from pre-trained t5.1.1 encoders (Raffel et al., 2020). Following Ni et al. (2022) and Dong et al. (2022), we encode a query, q_i, or a candidate, p_i, by averaging the T5 encoder outputs and projecting them to the final embedding vector.
Contrastive Loss. A standard way to train a dual encoder model is optimizing an in-batch sampled softmax loss for contrastive learning (Henderson et al., 2017):

$$\mathcal{L} = -\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \log \frac{e^{\mathrm{sim}(q_i, p_i)/\tau}}{\sum_{j \in \mathcal{B}} e^{\mathrm{sim}(q_i, p_j)/\tau}}, \quad (1)$$

where sim is cosine similarity, B is a mini-batch of examples, and τ is the softmax temperature. Here p_i is the ground-truth relevant passage for the query q_i, and all the other passages p_j (j ≠ i) in the batch are treated as negative examples for contrastive learning. A bi-directional in-batch sampled softmax loss is commonly applied to improve the embedding quality of both towers, where the contrastive loss is computed for both query-to-passage and passage-to-query matching (Yang et al., 2019). We use the bi-directional loss throughout this work.
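As a sketch (our own toy NumPy implementation, not the authors' code), the bi-directional in-batch sampled softmax loss for a batch of paired embeddings could look like:

```python
import numpy as np

def in_batch_softmax_loss(Q, P, tau=0.05):
    """Bi-directional in-batch sampled softmax loss.

    Q, P: (B, d) L2-normalized query and passage embeddings, where row i of Q
    is paired with row i of P and all other rows act as in-batch negatives.
    """
    S = (Q @ P.T) / tau                                  # temperature-scaled similarities
    # query -> passage direction: diagonal entries are the positive pairs
    log_p_qp = np.diag(S) - np.log(np.exp(S).sum(axis=1))
    # passage -> query direction: sum over the other axis
    log_p_pq = np.diag(S) - np.log(np.exp(S).sum(axis=0))
    return -(log_p_qp.mean() + log_p_pq.mean())

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); Q /= np.linalg.norm(Q, axis=1, keepdims=True)
print(in_batch_softmax_loss(Q, Q.copy()))  # aligned pairs give a small loss
```

Mismatched pairs place the large diagonal terms off the numerator, so the loss grows as alignment degrades.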
Same Tower Negatives. The in-batch sampled softmax loss is a contrastive loss that only considers the contrastive estimation between the target example pair {q_i, p_i} and the in-batch sampled negative pairs {q_i, p_j} (j ≠ i).
One way to improve the quality of the retrieval is to improve the contrast among the embeddings of the queries. Therefore, we propose a novel contrastive loss using Same Tower Negatives, which we abbreviate as SamToNe:

$$\mathcal{L}_S = -\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \log \frac{e^{\mathrm{sim}(q_i, p_i)/\tau}}{\sum_{j \in \mathcal{B}} e^{\mathrm{sim}(q_i, p_j)/\tau} + \sum_{j \in \mathcal{B}, j \neq i} e^{\mathrm{sim}(q_i, q_j)/\tau}}, \quad (2)$$

where the second term in the denominator is the contribution from the same tower negatives. SamToNe can be interpreted as a regularized version of the in-batch sampled softmax loss, where the term $\sum_{j \in \mathcal{B}, j \neq i} e^{\mathrm{sim}(q_i, q_j)/\tau}$ acts as a regularizer. When the query embeddings are not well distributed, $\max_j \mathrm{sim}(q_i, q_j) \gg \max_j \mathrm{sim}(q_i, p_j)$, and the second term in the denominator dominates the contribution from the negative examples. Thus, it drives the separation of the query embeddings in contrastive learning. In Section 4, we provide empirical evidence of the effects of SamToNe as a regularizer of the embedding space. Ren et al. (2021) proposed an improved contrastive loss, PAIR, a hybrid loss whose additional term

$$-\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \log \frac{e^{\mathrm{sim}(q_i, p_i)/\tau}}{\sum_{j \in \mathcal{B}, j \neq i} e^{\mathrm{sim}(p_i, p_j)/\tau}} \quad (3)$$

penalizes the similarities between passages / documents. Although both SamToNe and PAIR penalize the similarities among same-tower inputs, there are two significant differences. Firstly, SamToNe is hyper-parameter free, while PAIR introduces a new hyper-parameter α; this is because SamToNe introduces the new term from an embedding-space regularization perspective (see Section 4 for a detailed analysis). Therefore, SamToNe can be easily applied to both query and document encoders (see Section 3.4), whereas PAIR needs to introduce yet another hyper-parameter to be applied to both. Secondly, single-stage training and a guaranteed improvement in embedding space quality make SamToNe much easier to use.
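A toy NumPy sketch of the SamToNe objective on the query side (our own illustration, assuming L2-normalized embeddings; not the authors' implementation):

```python
import numpy as np

def samtone_loss(Q, P, tau=0.05):
    """Contrastive loss with Same Tower Negatives, applied on the query side.

    The denominator augments the usual in-batch terms e^{sim(q_i,p_j)/tau}
    with same-tower terms e^{sim(q_i,q_j)/tau} for j != i.
    """
    B = Q.shape[0]
    qp = np.exp((Q @ P.T) / tau)            # query-passage similarity terms
    qq = np.exp((Q @ Q.T) / tau)            # query-query similarity terms
    same_tower = (qq * (1.0 - np.eye(B))).sum(axis=1)  # drop the q_i,q_i self-term
    denom = qp.sum(axis=1) + same_tower
    return -np.log(np.diag(qp) / denom).mean()

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); Q /= np.linalg.norm(Q, axis=1, keepdims=True)
P = rng.normal(size=(4, 8)); P /= np.linalg.norm(P, axis=1, keepdims=True)
print(samtone_loss(Q, P))
```

Relative to the plain in-batch loss, the only change is the extra same-tower sum in the denominator, which is what makes the objective hyper-parameter free.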

Question-Answering Retrieval Tasks
We evaluate SamToNe on 5 question-answering (QA) retrieval tasks, including MS MARCO (Nguyen et al., 2016) and MultiReQA (Guo et al., 2021). For MS MARCO, the retrieval candidates are relevant passages, and for the 4 tasks in MultiReQA, the retrieval candidates are answer sentences.
To make a fair comparison across the results of our experiments, the same fine-tuning hyper-parameters are applied to all our model variants. The models are optimized for 20,000 steps using the Adafactor optimizer (Shazeer and Stern, 2018), with softmax temperature τ = 0.01, batch size 512, and a linearly decaying learning rate starting from 10^-3 and reaching 0 at the final step. To compare SamToNe and PAIR, we use the hyper-parameter α = 0.1 for PAIR as reported in Ren et al. (2021), and keep all the other experimental setups identical. SamToNe is applied only on the query side, as it is more robust across different datasets. For experiments and analysis on applying SamToNe to both encoder towers, please refer to Section 3.4. We benchmark the fine-tuned models using precision at 1 (P@1) and mean reciprocal rank (MRR).
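Both metrics have straightforward definitions; a toy sketch (hypothetical ranked lists, not the paper's data):

```python
def precision_at_1(ranked_ids, gold_id):
    """1.0 if the top-ranked candidate is the gold answer, else 0.0."""
    return 1.0 if ranked_ids[0] == gold_id else 0.0

def reciprocal_rank(ranked_ids, gold_id):
    """1 / rank of the gold answer in the ranked list, 0.0 if absent."""
    for rank, cand in enumerate(ranked_ids, start=1):
        if cand == gold_id:
            return 1.0 / rank
    return 0.0

# two toy queries: gold answer ranked 1st for one, 3rd for the other
runs = [(["d1", "d2", "d3"], "d1"), (["d2", "d3", "d1"], "d1")]
p1 = sum(precision_at_1(r, g) for r, g in runs) / len(runs)
mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
print(p1, mrr)  # 0.5 and (1 + 1/3) / 2
```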
As shown in Table 1, SamToNe greatly improves the retrieval performance of both SDE and ADE-SPL models. Using SamToNe, ADE-SPL models can outperform SDE ones by a great margin, especially on TriviaQA and SearchQA. Relative to PAIR, SamToNe provides better performance across different datasets for both types of models.

BEIR Generalization Tasks
We further demonstrate the efficacy of dual encoders trained with SamToNe on BEIR (Thakur et al., 2021), a heterogeneous benchmark for zero-shot evaluations.
BEIR has 18 information retrieval datasets across 9 domains, including Bio-Medical, Finance, News, Twitter, Wikipedia, StackExchange, Quora, Scientific, and Misc. The majority of the datasets have binary query relevance labels; the others have 3-level or 5-level relevance judgements.
As BEIR evaluates generalization capabilities and SDEs are commonly used for general-purpose retrieval (Ni et al., 2021), we focus on evaluating the impact of SamToNe on BEIR using the SDE architecture. In this evaluation, we reuse the model fine-tuned on MS MARCO, as described in Section 3.1.
Evaluated with the same setting as GTR (Ni et al., 2021), SamToNe demonstrates strong performance on BEIR, as shown in Table 2 and Figure 4. On average, SamToNe improves NDCG@10 by 1.4% for the XXL-size SDE. SDEs trained with SamToNe significantly outperform BM25, a sparse retrieval method, and GTR, a dense retrieval method that shares the same architecture and model size as SDE but is fine-tuned on different corpora.

Applying SamToNe to Both Towers
Just as with the query tower, SamToNe can be applied to the document tower, which leads to better query-document alignment. However, it is common for the training data to contain a large fraction of duplicated documents for a diverse set of queries. For example, only 17% of the documents in the train split are unique for TriviaQA, compared to 98% for MS MARCO. For datasets with a low rate of unique documents, applying SamToNe on the document side will penalize sim(p_i, p_j) with p_i = p_j and may hinder performance, as shown in Table 3.
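One way to avoid penalizing identical documents (a hypothetical sketch of our own, not something the paper specifies) is to mask in-batch duplicates out of the same-tower term:

```python
import numpy as np

def same_tower_mask(doc_ids):
    """Boolean mask over (i, j): True where p_j is a valid same-tower negative.

    Excludes j == i and any j whose document duplicates document i, so that
    identical in-batch documents are never pushed apart by the loss.
    """
    ids = np.asarray(doc_ids)
    duplicate = ids[:, None] == ids[None, :]   # True where the documents coincide
    return ~duplicate                          # keep only genuinely distinct pairs

# documents 0 and 2 are duplicates, so neither may serve as the other's negative
print(same_tower_mask(["a", "b", "a", "c"]).astype(int))
```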

Embedding Space Analysis
As shown in the top row of Figure 1, for MS MARCO and SearchQA, ADE-SPL generates two connected but topologically separable embedding spaces. An extra mechanism, beyond the shared projection layers, is required to ensure the adjacency of the embeddings from a ground-truth pair. SamToNe is proposed as the "force" drawing the embeddings of each ground-truth training pair together. Its efficacy is illustrated in the bottom half of Figure 1.

SamToNe: an Embedding Distance Regularizer
To further understand SamToNe's role as a regularizer of embedding distances, we evaluate the distribution of the distances between the embeddings of the queries and their top-1 retrieval results on the test sets of MS MARCO and SearchQA. The embedding distance is measured by cosine similarity, which ranges over [-1.0, 1.0], with 1.0 meaning perfect alignment. As shown in Figure 5, SamToNe drastically shifts the distribution of the (query, top-1 retrieval result) pairs towards 1.0, demonstrating its regularizing effect over the embedding distances.
By placing the regularizing query-query similarity terms $e^{\mathrm{sim}(q_i, q_j)/\tau}$ and the standard in-batch negative query-document similarity terms $e^{\mathrm{sim}(q_i, p_j)/\tau}$ together in the denominator with the same weight, SamToNe pushes the similarity ratio between query-query and query-document pairs, $\mathrm{sim}(q_i, q_j)/\mathrm{sim}(q_i, p_j)$, to be centered around 1.0. This is a self-balancing regularization effect: the query and document spaces are made to closely overlap, and the embeddings of a positive pair are more likely to be located in the same region of the embedding space.
To empirically illustrate this effect, we plot histograms of the $\mathrm{sim}(q_i, q_j)/\mathrm{sim}(q_i, p_j)$ ratios for randomly selected i and j in Figure 6. The regularization effect only appears when SamToNe is used, not when PAIR (Ren et al., 2021) is, because the self-balancing effect does not exist in a hybrid loss such as PAIR.
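As a toy sketch (random embeddings of our own, not the paper's trained models), the ratio statistic behind these histograms can be computed directly from the two similarity matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.normal(size=(64, 16)); Q /= np.linalg.norm(Q, axis=1, keepdims=True)
P = rng.normal(size=(64, 16)); P /= np.linalg.norm(P, axis=1, keepdims=True)

qq = Q @ Q.T                         # query-query cosine similarities
qp = Q @ P.T                         # query-document cosine similarities
i, j = np.triu_indices(64, k=1)      # all pairs with i != j
ratios = qq[i, j] / qp[i, j]         # sim(q_i, q_j) / sim(q_i, p_j)
print(np.median(ratios))             # compare against 1.0, the self-balanced value
```

For a trained model, one would substitute the encoder outputs for the random matrices and plot the histogram of `ratios`.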

Conclusions
Evaluating on QA retrieval tasks and zero-shot generalization benchmarks, we demonstrate that training with SamToNe can significantly improve dual encoder retrieval quality. With t-SNE maps of query and document embeddings, we show that the embedding spaces from the two encoding towers of models trained with SamToNe are better aligned. Through the distributions of similarity distances between the embeddings of queries and their nearest neighbours, we empirically explain the efficacy of SamToNe from a regularisation perspective. In general, we recommend using SamToNe to train dual encoders for information retrieval tasks.

Limitations
Same tower negatives can be applied to other contrastive losses, e.g. the triplet loss (Chechik et al., 2010). As we focus on improving the most popular method for training dual encoder models, i.e. the in-batch sampled softmax loss, we leave the application of same tower negatives to other types of contrastive losses as future work.
While SamToNe has proven effective at improving the training of dual encoders, its efficacy may depend on the diversity of the queries used as inputs. For datasets with a large portion of similar queries in the training set, one might need to use masking or other techniques to remove them from the negative computation. Such techniques can also improve the efficacy of SamToNe when applied to both the query and document towers, where SamToNe is currently known to hinder performance on datasets with a low rate of unique documents, as discussed in Section 3.4.
We leave the in-depth exploration of the aforementioned considerations for future work.

Figure 1 :
Figure 1: Embedding space analyses on MS MARCO and SearchQA show that sharing a projection layer in Asymmetric Dual Encoders (ADE-SPL) (Dong et al., 2022) may not guarantee that the embeddings from the two encoder towers are in coinciding parameter spaces. However, SamToNe can effectively achieve that.

Figure 2 :
Figure 2: The dual encoder architectures, where the blue components are shared between two encoding paths.

Figure 3 :
Figure 3: The impact of model sizes on the performance of different dual encoder architectures, measured by MRR on the eval set of MS MARCO (left) and SearchQA (right).


Figure 5 :
Figure 5: Distributions of cosine similarities between the embeddings of the queries and their nearest neighbour documents, for different models trained with or without SamToNe.

Figure 6 :
Figure 6: Distributions of query-query to query-document similarity ratios for different losses on SearchQA. SamToNe is applied to both the query and document sides, and it pushes the ratio to be centered around 1.

Table 1 :
Precision at 1 (P@1)(%) and Mean Reciprocal Rank (MRR)(%) on QA retrieval tasks. The best-performing models for each task and metric are highlighted in bold.

Table 2 :
NDCG@10 for zero-shot evaluation on the BEIR benchmark after fine-tuning on MS MARCO. The best-performing models for each task are highlighted in bold, while the best scores between SDE and SDE w/ SamToNe are underscored.

Table 3 :
Precision at 1 (P@1)(%) and Mean Reciprocal Rank (MRR)(%) when comparing ADE-SPL (t5.1.1-large size) trained without SamToNe and with SamToNe applied to the query tower (uni-directional) or to both towers (bi-directional). The best-performing models for each task and metric are highlighted in bold.