SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval

Sampling proper negatives from a large document pool is vital to effectively training a dense retrieval model. However, existing negative sampling strategies suffer from the uninformative or false negative problem. In this work, we show empirically that, according to the measured relevance scores, the negatives ranked around the positives are generally more informative and less likely to be false negatives. Intuitively, these negatives are neither too hard (\emph{possibly false negatives}) nor too easy (\emph{uninformative}); they are the ambiguous negatives that deserve more attention during training. We therefore propose a simple ambiguous negatives sampling method, SimANS, which incorporates a new sampling probability distribution to sample more ambiguous negatives. Extensive experiments on four public datasets and one industry dataset show the effectiveness of our approach. Our code and models are publicly available at \url{https://github.com/microsoft/SimXNS}.


Introduction
Dense text retrieval, which uses low-dimensional vectors to represent queries and documents and to measure their relevance, has become a popular topic (Karpukhin et al., 2020; Luan et al., 2021) for both researchers and practitioners. It can improve various downstream applications, e.g., web search (Brickley et al., 2019; Qiu et al., 2022) and question answering (Izacard and Grave, 2021). A key challenge in training a dense text retrieval model is how to select appropriate negatives from a large document pool (i.e., negative sampling), as most existing methods use a contrastive loss (Karpukhin et al., 2020; Xiong et al., 2021) to encourage the model to rank positive documents higher than negatives. However, the commonly used negative sampling strategies, namely random negative sampling (Luan et al., 2021; Karpukhin et al., 2020), which uses random documents in the same batch, and top-k hard negative sampling (Xiong et al., 2021; Zhan et al., 2021), which uses an auxiliary retriever to obtain the top-k documents, have their limitations. Random negative sampling tends to select uninformative negatives that are easy to distinguish from positives and provide little useful signal (Xiong et al., 2021), while top-k hard negative sampling may include false negatives (Qu et al., 2021), degrading model performance.
Motivated by these problems, we propose to sample the ambiguous negatives that are neither too easy (uninformative) nor too hard (potentially false negatives). Our approach is inspired by an empirical observation from experiments (in §3) that use gradients to assess the impact of data instances on deep models (Koh and Liang, 2017; Pruthi et al., 2020): according to the relevance scores measured by the dense retrieval model, negatives that rank lower are mostly uninformative, as their gradient means are close to zero, while negatives that rank higher are likely to be false negatives, as their gradient variances are significantly higher than expected. Both types of negatives are detrimental to the convergence of deep matching models (Xiong et al., 2021; Qu et al., 2021). Interestingly, we find that the negatives ranked around positive examples tend to have relatively larger gradient means and smaller variances, indicating that they are informative and have a lower risk of being false negatives, and are thus probably high-quality ambiguous negatives.
Based on these insights, we propose a Simple Ambiguous Negative Sampling method, namely SimANS, for improving dense text retrieval. Our main idea is to design a sampling probability distribution that assigns higher probabilities to ambiguous negatives and lower probabilities to likely false and uninformative negatives, based on the differences between the relevance scores of positives and candidate negatives. We also introduce two hyper-parameters to adjust the peak and density of the sampling probability distribution. Our approach is simple and flexible: it can easily be applied to various dense retrieval models and combined with other effective techniques, e.g., knowledge distillation (Qu et al., 2021) and adversarial training (Zhang et al., 2021).
To validate the effectiveness of SimANS, we conduct extensive experiments on four public datasets and one industrial dataset collected from Bing search logs. Experimental results show that SimANS improves the performance of competitive baselines, including state-of-the-art methods.

Preliminary
Dense Text Retrieval. Given a query q, the dense text retrieval task aims to retrieve the most relevant top-k documents $\{d_i\}_{i=1}^{k}$ from a large candidate pool D. To achieve this, the dual-encoder architecture is widely used due to its efficiency (Reimers and Gurevych, 2019; Karpukhin et al., 2020). It consists of a query encoder $E_q$ and a document encoder $E_d$ that map the query q and a document d into low-dimensional dense vectors $h_q$ and $h_d$, respectively. The semantic relevance score of q and d is then computed as their dot product:
$$s(q, d) = \langle h_q, h_d \rangle. \quad (1)$$
Recent works mostly adopt pre-trained language models (PLMs) (Devlin et al., 2019) as the two encoders, and use the representation of the [CLS] token as the dense vector.
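As a concrete illustration, the dual-encoder scoring of Eq. (1) can be sketched in a few lines of Python. The toy `encode` function below is only a stand-in for the PLM encoders $E_q$ and $E_d$ (a hashed bag-of-words projection, not part of the paper); only the final dot product mirrors the actual formulation.

```python
import numpy as np

def encode(texts, dim=64, vocab=4096, seed=0):
    # Stand-in for a PLM encoder (the [CLS] vector of E_q or E_d).
    # A real system would run a pre-trained language model; here we use a
    # fixed random projection of hashed tokens purely for illustration.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((vocab, dim)) / np.sqrt(dim)
    vecs = []
    for t in texts:
        ids = [hash(tok) % vocab for tok in t.lower().split()]
        vecs.append(proj[ids].mean(axis=0))
    return np.stack(vecs)

# Eq. (1): the relevance score s(q, d) is the dot product <h_q, h_d>.
h_q = encode(["what is dense retrieval"])
h_d = encode(["dense retrieval maps text to vectors",
              "a recipe for cooking pasta"])
scores = h_q @ h_d.T   # shape (1, 2): one score per candidate document
```

In a real system the candidate scores would be computed against millions of pre-encoded documents via an ANN index rather than a dense matrix product.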
Training with Negative Sampling. The training objective of dense text retrieval is to pull together the representations of the query q and its relevant documents $D^+$ (as positives), while pushing apart the irrelevant ones $D \setminus D^+$ (as negatives). However, the irrelevant documents come from a large document pool, which would lead to millions of negatives. To reduce this intractable training cost, negative sampling is widely used. Previous works either randomly sample negatives (Karpukhin et al., 2020), or select the top-k hard negatives ranked by BM25 or by the dense retrieval model itself (Xiong et al., 2021; Qu et al., 2021), denoted as $D^-$. The optimization objective can then be formulated as:
$$\min_{\theta} \sum_{q} \sum_{d^+ \in D^+} \sum_{d^- \in D^-} L\big(s(q, d^+), s(q, d^-)\big), \quad (2)$$
where $L(\cdot)$ is the loss function.
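A minimal sketch of such a contrastive objective is the standard softmax cross-entropy over one positive and its sampled negatives (one common instantiation of $L(\cdot)$; the exact loss varies across the cited methods):

```python
import numpy as np

def contrastive_loss(s_pos, s_negs):
    # Softmax cross-entropy over one positive and the sampled negatives:
    # L = -log( exp(s(q,d+)) / (exp(s(q,d+)) + sum_j exp(s(q,d_j^-))) ).
    logits = np.concatenate(([s_pos], np.asarray(s_negs, dtype=float)))
    logits -= logits.max()                 # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

loss_easy = contrastive_loss(5.0, [0.1, -0.3])  # positive well separated
loss_hard = contrastive_loss(5.0, [4.9, 5.1])   # ambiguous/hard negatives
```

The loss is small when the positive already outscores every sampled negative, and large when negatives score close to (or above) the positive, which is exactly why the choice of negatives drives training.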

Motivation Study
We first analyze the uninformative and false negative problems from the perspective of gradients. Then, we perform an empirical study of how the gradients of negatives change w.r.t. their ranks under the relevance scores measured by a dense retrieval model, and find that the gradients of negatives ranked near positives have relatively larger means and smaller variances.

Analysis for Gradients of Negatives
Existing dense retrieval methods (Karpukhin et al., 2020; Xiong et al., 2021) commonly incorporate the binary cross entropy (BCE) loss to compute gradients, where the relevance scores of a positive and the sampled negatives are normalized by the softmax function. In this way, the gradient of the loss l(q, d) w.r.t. the model parameters θ contributed by a document d can be written as
$$\nabla_\theta\, l(q, d) = \big(s_n(q, d) - \mathbb{1}[d \in D^+]\big) \cdot \nabla_\theta\, s(q, d),$$
where $s_n(q, d)$ is the softmax-normalized value of $s(q, d)$ and lies within [0, 1]. Based on this, we review the gradients of uninformative and false negatives.

Uninformative negatives can be easily distinguished by dense retrieval models, and are more likely to be selected by random sampling (Xiong et al., 2021). As their normalized relevance scores are usually rather small, i.e., $s_n(q, d) \to 0$, their gradient means are bounded near zero, i.e., $\nabla_\theta l(q, d) \to 0$. Such near-zero gradients are uninformative and contribute little to model convergence.

False negatives are usually semantically similar to positives, and are more likely to be selected by top-k hard negative sampling (Qu et al., 2021). For the gradients of a false negative and a positive, the right terms $\nabla_\theta s(q, d)$ may be similar, while the left terms are greater than zero and less than zero, respectively. As a result, the variance of the gradients becomes larger, which can make the optimization of the parameters unstable. Furthermore, existing works (Katharopoulos and Fleuret, 2018; Johnson and Guestrin, 2018) have shown that such gradient statistics can be used to estimate the influence of training instances on the model.

Experimental Setup. We use AR2 (Zhang et al., 2021) as the retrieval model and investigate its gradients on the development set of the MS-MARCO Passage Ranking dataset (Nguyen et al., 2016). Concretely, for each query, we rank all negatives according to their relevance scores, and compute the means and variances of the gradients of all negatives at the same rank. To better show the tendency w.r.t. the ranks of relevance scores, we normalize the means and variances of the gradients by dividing by the maximum values, and only report results for the top 200 ranked negatives.
Results and Findings. As shown in Figure 1, the mean and variance of the gradients gradually decrease as the rank increases. Despite that, the gradient means of the top 200 negatives are still within the same order of magnitude ($1.0 \to 0.25$), while the gradient variances of the top 10 ranked negatives are significantly larger than the others. The reason is that higher-ranking negatives have a larger probability of being false negatives. Besides, a surprising finding is that the mean rank of positives approximates the boundary of the high-gradient-variance region, and the negatives near it produce relatively larger gradient means and lower gradient variances. This means that they are high-quality ambiguous negatives that balance informativeness against the risk of being false negatives. Therefore, it is promising to rely on the relevance scores of positives and candidate negatives to devise more effective negative sampling methods for training dense retrieval models.
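The per-document gradient factor discussed above can be sketched numerically. Under softmax normalization, the scalar $s_n(q, d)$ (minus one for the positive) multiplies each document's score gradient; the toy scores below are illustrative, not taken from the experiments:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Relevance scores for one positive (index 0) and three negatives:
# an ambiguous one (4.8), a very hard / possibly false one (5.2),
# and an easy one (-3.0).
scores = np.array([5.0, 4.8, 5.2, -3.0])
s_n = softmax(scores)

# The scalar factor multiplying each document's gradient:
# (s_n - 1) for the positive, s_n for the negatives.
left = s_n.copy()
left[0] -= 1.0
```

The easy negative's factor is essentially zero (an uninformative gradient), while the too-hard negative's factor rivals the positive's in magnitude but with opposite sign, which is the source of the high gradient variance.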

Approach
Based on the findings in §3, we conjecture that the ambiguous negatives ranked near positives according to relevance scores are high-quality negatives, as they are neither too easy (uninformative) nor too hard (possibly false negatives). Therefore, we propose a simple ambiguous negative sampling method, namely SimANS.

Ambiguous Negative Sampling
To focus on sampling ambiguous negatives, we design a new sampling probability distribution that estimates the influence of each negative using the dense retrieval model. In what follows, we first devise a general sampling distribution and then propose a simple and efficient implementation of it.
General Sampling Distribution. We draw the following conclusions from our results about how to choose a good sampling probability distribution for negatives: (1) negatives that are clearly irrelevant and have low relevance scores should be sampled less frequently; (2) negatives that are highly relevant and have high relevance scores should also be sampled less frequently, because they are more likely to be positives in disguise; (3) negatives whose relevance scores are similar to those of the positives should be sampled more frequently, because they provide useful information and have a lower chance of being false negatives. We propose a general formula for the negative sampling probability that reflects these principles:
$$p_i \propto f\big(\big(s(q, d_i) - \bar{s}(q, d^+) - b\big)^2\big),\ \ d_i \in D \setminus D^+, \quad (3)$$
where $f(\cdot)$ is a function that determines the shape of the probability distribution, b is a hyper-parameter that controls the peak of the distribution, and $\bar{s}(q, d^+)$ is the mean relevance score between the query and all its positives. $f(\cdot)$ should be a monotonically decreasing function (e.g., $e^{-x}$). In this way, negatives with relevance scores close to those of the positives are assigned larger probabilities, while others with smaller or larger scores are penalized with smaller probabilities. Such a distribution satisfies the three required characteristics.
Simple Negative Sampling Distribution. We rely on several empirical priors to determine a simple and efficient implementation of the above sampling probability distribution. Generally, the relevance scores of positives and negatives are bounded by the norms of the dense vectors, hence they are mostly of the same order of magnitude. To ensure that the probabilities of ambiguous negatives are significantly larger than those of the others, we choose the exponential function to implement $f(\cdot)$. As a large proportion of the negatives in $D \setminus D^+$ are uninformative, their smaller relevance scores lead to near-zero probabilities under the exponential function. Therefore, we can reduce the computational cost by narrowing the negative candidates to the top-k ranked negatives $D^-$. In addition, to further reduce the cost, we replace the mean relevance score of all positives $\bar{s}(q, d^+)$ with the score of a randomly sampled positive $s(q, \tilde{d}^+)$. Finally, we can reformulate the sampling probability distribution in Eq. (3) as:
$$p_i \propto \exp\big(-a\,\big(s(q, d_i) - s(q, \tilde{d}^+) - b\big)^2\big),\ \ d_i \in D^-, \quad (4)$$
where a is a hyper-parameter that controls the density of the distribution, $\tilde{d}^+ \in D^+$ is a randomly sampled positive, and $D^-$ is the set of top-k ranked negatives. In this way, the complexity of computing the sampling probability distribution is reduced to O(k), where $k \ll |D|$; we set k to 100.
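Eq. (4) is straightforward to implement. The sketch below computes the SimANS sampling probabilities for a set of candidate negatives; the scores are made up for illustration:

```python
import numpy as np

def simans_probs(neg_scores, pos_score, a=0.5, b=0.0):
    # Eq. (4): p_i ∝ exp(-a * (s(q, d_i) - s(q, d~+) - b)^2).
    # a controls the density and b the peak of the distribution
    # (a = 0.5, b = 0 worked best on NQ in the paper's tuning).
    logits = -a * (np.asarray(neg_scores, dtype=float) - pos_score - b) ** 2
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Candidates scoring close to the positive (score 5.0) get the most mass;
# a clearly harder candidate (6.5) and easy ones (2.0, -1.0) are down-weighted.
p = simans_probs([6.5, 5.1, 4.9, 2.0, -1.0], pos_score=5.0)
```

The distribution peaks at candidates whose scores are within b of the positive's, which is exactly the "ambiguous" band identified in §3.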

Overview and Discussion
Overview. Given a mini-batch, SimANS takes three major steps to obtain the ambiguous negatives. The first step is the same as in previous top-k hard negative sampling methods (Xiong et al., 2021; Qu et al., 2021): select the top-k ranked negatives $D^-$ from the candidate pool $D \setminus D^+$ using an ANN search tool (e.g., FAISS (Johnson et al., 2019)). Second, we compute the sampling probabilities for all the top-k negatives using Eq. (4); to reduce the time cost, we can pre-compute them in the first step. Finally, we sample the ambiguous negatives according to their sampling probabilities. We present the overall algorithm in Algorithm 1.
Algorithm 1: SimANS
Input: Queries and their positive documents {(q, D+)}, document pool D, pre-learned dense retrieval model M.
1. Build the ANN index on D using M.
2. Retrieve the top-k ranked negatives D− for each query, with their relevance scores {s(q, d_i)}, from D.
3. Compute the relevance scores of each query and its positive documents {s(q, D+)}.
4. Generate the sampling probabilities {p_i} of the retrieved top-k negatives for each query using Eq. (4).
5. while M has not converged do
6.     Sample a batch from {(q, D+, D−)}.
7.     Sample ambiguous negatives for each instance in the batch according to {p_i}.
8.     Optimize the parameters of M using the batch and the sampled negatives.
9. end

Note that our proposed SimANS is a negative sampling method and is applicable to a variety of dense retrieval methods.
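The steps of Algorithm 1 can be sketched as follows. This is a hypothetical skeleton, not the released implementation: `model` is assumed to expose `score(q, d)` and `train_step(q, pos, negs)`, and top-k retrieval is done by brute force where a real system would use an ANN index such as FAISS.

```python
import numpy as np

def simans_training_loop(model, queries, doc_pool, k=100, n_neg=15,
                         a=0.5, b=0.0, epochs=1, rng=None):
    rng = rng or np.random.default_rng(0)
    # Steps 1-4: retrieve top-k negatives and pre-compute sampling
    # probabilities once, before the training loop starts.
    cached = []
    for q, positives in queries:
        scores = np.array([model.score(q, d) for d in doc_pool])
        order = np.argsort(-scores)
        top_k = [i for i in order if doc_pool[i] not in positives][:k]
        pos = positives[rng.integers(len(positives))]   # randomly sampled d~+
        logits = -a * (scores[top_k] - model.score(q, pos) - b) ** 2
        p = np.exp(logits - logits.max())
        p /= p.sum()
        cached.append((q, pos, top_k, p))
    # Steps 5-9: sample ambiguous negatives and update the model.
    for _ in range(epochs):
        for q, pos, top_k, p in cached:
            idx = rng.choice(len(top_k), size=min(n_neg, len(top_k)),
                             replace=False, p=p)
            negs = [doc_pool[top_k[i]] for i in idx]
            model.train_step(q, pos, negs)
    return model
```

Pre-computing the probabilities outside the loop is what keeps the added training latency small (§"Impact of the Sampled Negative Ratio").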
Relationship with Other Methods. SimANS aims to sample the ambiguous negatives that rank close to the positives according to relevance scores, to improve the training of dense retrieval models. It is a general framework that subsumes several previous negative sampling methods:

• Random negative sampling draws negatives from a large document pool with equal probability for each one. Our method reduces to it by setting $b = s(q, d_i) - s(q, \tilde{d}^+)$ (so that every candidate receives the same probability) and letting $D^-$ include all documents in the pool. However, most documents in the pool are irrelevant to the query and yield near-zero gradients: they are easy to sample but not useful for training.
• Top-k hard negative sampling utilizes an auxiliary retriever (e.g., BM25 (Karpukhin et al., 2020) or DPR (Xiong et al., 2021)) to rank all negative candidates and picks the top-k ones as negatives. By setting $b = -s(q, \tilde{d}^+)$ and $a = -\infty$, our method likewise assigns extremely large probabilities to the top-k negatives. However, the top-ranked negatives have a higher risk of being false negatives, which is harmful to convergence.
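These limiting cases can be checked numerically with the distribution of Eq. (4). The first check uses $a \to 0$, which also flattens the distribution toward uniform sampling over the candidate pool (our observation, complementary to the per-candidate b reduction above); the second mimics top-k selection with a strongly negative a:

```python
import numpy as np

def simans_probs(neg_scores, pos_score, a=0.5, b=0.0):
    # Eq. (4): p_i ∝ exp(-a * (s(q, d_i) - s(q, d~+) - b)^2).
    logits = -a * (np.asarray(neg_scores, dtype=float) - pos_score - b) ** 2
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

scores, s_pos = [6.5, 5.1, 4.9, 2.0], 5.0

# a -> 0 flattens the distribution: every candidate is equally likely,
# recovering (within the candidate pool) plain random sampling.
p_uniform = simans_probs(scores, s_pos, a=0.0)

# b = -s(q, d~+) with a strongly negative `a` concentrates almost all mass
# on the highest-scoring candidate, mimicking top-k hard negative selection.
p_topk = simans_probs(scores, s_pos, a=-5.0, b=-s_pos)
```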

Experimental Setting
We extensively evaluate SimANS through experiments on three public passage retrieval datasets (NQ, TQ and MS Pas), the MS-MARCO Document Ranking dataset (MS Doc), and an industry dataset collected from Bing search logs; see Appendix B for details.

Further Analysis
Applying SimANS to Other Models. Since SimANS is a general negative sampling strategy, it can be applied to a variety of dense retrieval methods. In this part, we implement SimANS on two representative methods, ANCE (Xiong et al., 2021) and RocketQA (Qu et al., 2021), which adopt effective techniques such as asynchronous index refresh and knowledge distillation, respectively. We only replace the negative sampling strategies in these methods with SimANS and conduct experiments on the TQ and NQ datasets. As shown in Table 5, our approach consistently improves the performance of both methods. This shows that SimANS generalizes to dense retrieval methods with different techniques and can provide more high-quality negatives to improve their performance.
Variation Study. Our proposed SimANS incorporates a new negative sampling probability distribution based on the differences between the query-document relevance scores of positives and negative candidates. To verify the effectiveness of this distribution, we design two variations of SimANS: (1) Doc-Sim, which uses the document-document relevance scores between positives and negative candidates in place of the query-document relevance scores; (2) Nearest-K, which directly picks the top-k nearest negatives according to the differences of the query-document relevance scores instead of sampling. We implement these variations on AR2 and conduct experiments on the development set of the MS Pas dataset. As shown in Table 6, SimANS outperforms all these variations, which indicates the effectiveness of our ambiguous negative sampling probability distribution. Doc-Sim is likely to select false negatives that are semantically similar to positives, hurting model performance. Nearest-K always selects the same fixed negatives, which may cause overfitting.
Parameter Tuning. SimANS has two important hyper-parameters, a and b, which control the density and the peak of the sampling probability distribution, respectively. Here, we investigate how the performance of SimANS on AR2 changes w.r.t. different a and b on the NQ dataset. As shown in Figure 2, our approach achieves the best performance when a = 0.5 and b = 0. This indicates that when the maximum point of the distribution has the same relevance score as the positive, the negative sampling probability distribution produces more high-quality negatives. Moreover, we notice that the model performance is not very sensitive to the two hyper-parameters as long as they are set within a proper range.
Impact of the Sampled Negative Ratio. We investigate the impact of the sampled negative ratio 1:k on the retrieval performance and per-batch training latency of SimANS on AR2. As shown in Table 7, as the number of sampled negatives increases, the performance improves consistently while the training latency increases. Besides, SimANS only slightly increases the training latency of AR2, because we can pre-compute the sampling probabilities before training, which avoids time-consuming computation during training.
Performance w.r.t. Training Steps. Our approach continues training model parameters that have been pre-trained by the original dense retrieval method. Here, we investigate the performance changes of the dense retrieval method before and after applying SimANS w.r.t. the training steps. We conduct experiments on AR2 and show the Hit@1 metric on the NQ dataset in Figure 3. First, we can see that as the number of training steps increases, the performance of AR2 on the training and test sets improves simultaneously. After applying SimANS, the performance improves further, especially on the training set ($0.777 \to 0.791$). This indicates that our approach improves the fitting of the training set, and that this improvement also generalizes to the test set.

Conclusion
We investigated the gradient statistics of negative documents w.r.t. their relevance ranking for dense text retrieval. We discovered that negative documents with relatively large gradient means and small gradient variances are more likely to be ambiguous negatives, which are informative and less likely to be false negatives. Based on this insight, we proposed SimANS, a simple negative sampling method that balances the difficulty of negative examples by adjusting their sampling probabilities. SimANS improved the performance of various dense retrieval models on four public datasets and one industrial dataset. In the future, we plan to apply our method to other information retrieval tasks, such as personalized recommendation, and to develop better pre-training schemes for dense text retrieval.

A Illustration of Ambiguous Negatives
We illustrate the distribution of the dense embeddings of a query together with its positive document and its too-easy, too-hard and ambiguous negatives in Figure 4. Too-hard negatives have a higher risk of being false negatives, and their dense embeddings lie close to those of the query and the positive. If we learn to push them away, the distance between the embeddings of the query and the positive may also be enlarged, which is harmful to the goal of pulling the query and its positives together. Conversely, too-easy negatives lie rather far from the query, so it is unnecessary to push them even further away. In comparison, the ambiguous negatives lie at distances similar to that of the positive, composing the circular boundary of the pool of hard negatives that need to be learned (i.e., pushed away). In this way, our SimANS can be seen as always sampling the borderline hard negatives from the document pool. By learning to push them away, we can narrow the circular boundary of hard negatives, which helps gradually achieve the goal of pulling the query and positives together while pushing apart negatives.

B More Details on Datasets
We conduct experiments on five datasets: three passage retrieval datasets, Natural Questions (NQ) (Kwiatkowski et al., 2019), Trivia QA (TQ) (Joshi et al., 2017) and MS-MARCO Passage Ranking (MS Pas) (Nguyen et al., 2016); a document retrieval dataset, MS-MARCO Document Ranking (MS Doc) (Nguyen et al., 2016); and a real-world industry dataset, Bing. NQ and TQ are open-domain question answering datasets collected from Google search logs and authored by trivia enthusiasts, respectively. In these two datasets, each question is paired with an answer span and several golden passages from Wikipedia articles. Following existing works (Zhang et al., 2021; Sachan et al., 2021), we adopt Recall@k (R@k) as the evaluation metric, which measures whether the top-k ranked documents include the answer span. MS Pas and MS Doc consist of real questions collected from Bing search logs, where each question is paired with several web passages and documents, respectively. As the labels of their test sets are not available, we follow existing works (Ren et al., 2021b; Zhan et al., 2021) in reporting results on their development sets, and adopt MRR@10, R@50 and R@1k for MS Pas, and MRR@10 and R@100 for MS Doc. Bing is collected from Bing search logs, where each example consists of a historical user query and several documents that the user clicked. These documents are real-world web pages and may contain hyperlinks and multiple languages. We select Hit@5, Hit@20 and Hit@100 for evaluation.
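For clarity, minimal versions of these metrics might look as follows (simplified: R@k here checks document identity, whereas the QA datasets check answer-span containment):

```python
def recall_at_k(ranked_docs, golden, k):
    # R@k: 1 if any golden document appears in the top-k ranked list.
    return int(any(d in golden for d in ranked_docs[:k]))

def mrr_at_k(ranked_docs, golden, k=10):
    # MRR@k: reciprocal rank of the first relevant document in the top k,
    # or 0 if none of the top-k documents is relevant.
    for rank, d in enumerate(ranked_docs[:k], start=1):
        if d in golden:
            return 1.0 / rank
    return 0.0
```

Both are averaged over all queries in the evaluation set to produce the numbers reported in the tables.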

C More Details on Baselines
We compare our approach with a variety of methods, including sparse and dense retrieval models.
• BM25 (Yang et al., 2017) is a widely-used sparse retriever based on exact matching.
• Joint and Individual top-k (Sachan et al., 2021) train the dense retrieval model in an end-to-end manner.
• AR2 (Zhang et al., 2021) incorporates an adversarial framework to jointly train the retriever and the ranker. As it has achieved state-of-the-art performance on most datasets, we implement our approach on top of it to verify our method's effectiveness.

D Experimental Details
Implementation Details on Public Datasets. For the three passage retrieval tasks, we follow the experimental settings of AR2 (Zhang et al., 2021) and select ERNIE-2.0-base (Sun et al., 2020) as the backbone model. For the MS Doc dataset, we leverage the model parameters of STAR (Zhan et al., 2021) to initialize AR2, and then train AR2 with the same hyper-parameters as STAR until convergence. Next, we continue training the AR2 model parameters with our proposed SimANS, where we set (a, b) to (0.5, 1.0), (0.5, 0), (0.5, 0) and (0.5, 0) for the NQ, TQ, MS Pas and MS Doc datasets, respectively. The learning rate is 1e-5 for NQ and 5e-6 for the other datasets. The batch size is 256 for MS Pas and MS Doc and 64 for NQ and TQ, and the sampling ratio of positives to negatives is 1:15. All other hyper-parameter settings are the same as in AR2. All experiments in this work are conducted on 8 NVIDIA Tesla A100 GPUs.
Implementation Details on the Bing Industry Dataset. For the industry dataset, Bing, we adopt mBERT-base (Devlin et al., 2019) as the backbone of the query and document encoders, to handle multilingual queries and documents. The parameters of the baseline model are trained with randomly sampled negatives using the InfoNCE loss (Karpukhin et al., 2020), denoted Baseline+Random Neg, with a positive-to-negative sampling ratio of 1:5. The learning rate is 1e-5, the batch size is 128 and the number of training steps is 100,000. As a comparison, we implement the top-k negative sampling strategy on the baseline model, denoted Baseline+Top-k Neg, where we utilize the baseline model to rank the documents and select the top 5 documents that do not contain the query as hard negatives. For our approach, denoted Baseline+SimANS, we continue training the Baseline+Top-k Neg model, but apply SimANS to sample 5 negatives from the top 100 ranked documents. We set a to 1 and b to 0, and reuse the other hyper-parameters of the Baseline+Top-k Neg model.

E Case Study
In this part, we show four examples of the sampling probability distributions generated by SimANS. These four examples are randomly selected from the training set of the MS Pas dataset. As shown in Figure 5, SimANS indeed assigns larger probabilities to the negatives that rank near the positive, while penalizing the higher-ranking and lower-ranking ones, which may be false negatives and uninformative negatives, respectively. Furthermore, in Figure 5b, where the positive is ranked in first place, our approach behaves similarly to the top-k negative sampling method, assigning larger probabilities to the higher-ranking hard negatives.

F Related Work
Recent years have witnessed the remarkable performance of dense retrieval methods in text retrieval tasks (Zhan et al., 2020; Hong et al., 2022; Ram et al., 2022; Zhou et al., 2022b). Different from traditional sparse retrieval methods (e.g., TF-IDF and BM25), dense retrieval approaches typically map queries and documents into low-dimensional dense vectors, and then use vector distance metrics (e.g., cosine similarity) for retrieval.
To learn an effective dense retrieval model, it is key to sample high-quality negatives, paired with the given query and positives, for training. Early works (Karpukhin et al., 2020; Min et al., 2020) mostly rely on in-batch random negatives and hard negatives sampled by BM25. Later, a series of works (Qu et al., 2021; Xiong et al., 2021) found that sampling the top-k examples ranked by the dense retriever itself as hard negatives is more helpful for improving the retriever. Among them, several methods (Xiong et al., 2021; Zhan et al., 2021) adopt a dynamic sampling strategy that actively re-samples top-k hard negatives at intervals during training. However, these top-k negative sampling strategies easily select higher-ranking false negatives for training. To alleviate this, previous works have incorporated knowledge distillation (Qu et al., 2021; Ren et al., 2021b; Lu et al., 2022), pre-training (Zhou et al., 2022a; Xu et al., 2022) and other denoising techniques (Mao et al., 2022; Hofstätter et al., 2021). Despite their effectiveness, these methods mostly rely on complicated training strategies or auxiliary models.
In this work, we propose a simple but effective sampling method that weights the negative candidates by the differences between their relevance scores and those of the positives. As a result, ambiguous negatives with relevance scores similar to the positives receive larger sampling probabilities, while the too-hard (potential false negatives) and too-easy (uninformative) negatives are penalized with smaller probabilities.

Figure 2: Performance comparison w.r.t. hyper-parameters a and b on the NQ dataset.

Figure 3: Hit@1 of AR2+SimANS on the training and test sets of NQ w.r.t. training steps.

Figure 4: An example of the dense embedding distribution of a query with its positive document and its too-easy, too-hard and ambiguous negatives.

Figure 5: Illustration of four sampling probability distributions over the top 50 ranked negatives generated by SimANS on the training set of MS Pas.

Table 1: Statistics of the five text retrieval datasets.

We simply evaluate the last checkpoint after training and report the results on the development set. As shown in Table 4, after applying top-k hard negative sampling, the performance of the baseline model is improved by a large margin.
Industry Dataset. For the Bing industry dataset, we adopt a dual-encoder mBERT (Devlin et al., 2019) as the baseline model to deal with multilingual queries and documents, and implement different negative sampling strategies on it.

Table 2: Performance on the test sets of NQ and TQ, and the development set of MS Pas. Baseline results are from the original papers. The best and second-best methods are marked in bold and underlined, respectively.

Table 3: Performance on the MS Doc development set.

Method | MRR@10 | R@100
BM25 | 0.279 | 0.807
DPR (Karpukhin et al., 2020) | 0.320 | 0.864
ANCE (Xiong et al., 2021) | 0.377 | 0.894
STAR (Zhan et al., 2021) | 0.390 | 0.913
ADORE (Zhan et al., 2021) | 0.405 | 0.919
AR2 (Zhang et al., 2021) | 0.418 | 0.914
AR2+SimANS | 0.431 | 0.923

Table 4: Experimental results on the Bing industry dataset.

Table 5: Retrieval performance of applying our method to other baselines on the TQ and NQ datasets.

Table 6: Variation study of our method on AR2 on the MS Pas development set.

Table 7: Retrieval performance and training latency w.r.t. different sampled negative ratios on the NQ dataset.