Clustering-Aware Negative Sampling for Unsupervised Sentence Representation

Contrastive learning has been widely studied in sentence representation learning. However, earlier works mainly focus on the construction of positive examples, while in-batch samples are often simply treated as negative examples. This approach overlooks the importance of selecting appropriate negative examples, potentially leading to a scarcity of hard negatives and the inclusion of false negatives. To address these issues, we propose ClusterNS (Clustering-aware Negative Sampling), a novel method that incorporates cluster information into contrastive learning for unsupervised sentence representation learning. We apply a modified K-means clustering algorithm to supply hard negatives and recognize in-batch false negatives during training, aiming to solve the two issues in one unified framework. Experiments on semantic textual similarity (STS) tasks demonstrate that our proposed ClusterNS compares favorably with baselines in unsupervised sentence representation learning. Our code has been made publicly available.


Introduction
Learning sentence representation is one of the fundamental tasks in natural language processing and has been widely studied (Kiros et al., 2015; Hill et al., 2016; Cer et al., 2018; Reimers and Gurevych, 2019). Reimers and Gurevych (2019) show that sentence embeddings produced by BERT (Devlin et al., 2019) are even worse than GloVe embeddings (Pennington et al., 2014), attracting more research on sentence representation with pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019). Li et al. (2020a) and Ethayarajh (2019) further find that PLM embeddings suffer from anisotropy, motivating more researchers to study this issue (Su et al., 2021; Gao et al., 2021). Besides, Gao et al. (2021) show that contrastive learning (CL) is able to bring significant improvement to sentence representation. As pointed out by Wang and Isola (2020), contrastive learning improves the uniformity and alignment of embeddings, thus mitigating the anisotropy issue.

Figure 1: An example of in-batch negatives, a hard negative (in blue dotted box) and a false negative (in red dotted box). Cosine similarity is calculated with SimCSE (Gao et al., 2021). In-batch negatives may include false negatives, while lacking hard negatives.
Most previous works on contrastive learning concentrate on the construction of positive examples (Kim et al., 2021; Giorgi et al., 2021; Wu et al., 2020; Yan et al., 2021; Gao et al., 2021; Wu et al., 2022) and simply treat all other in-batch samples as negatives, which is sub-optimal. We show an example in Figure 1. In this work, we view sentences having higher similarity with the anchor sample as hard negatives, which means they are difficult to distinguish from positive samples. When all the negatives are sampled uniformly, the impact of hard negatives is ignored. In addition, the negative samples vary in their similarity to the anchor sample, and some may be incorrectly labeled (i.e., false negatives) and pushed away in the semantic space.
Recently, quite a few researchers have demonstrated that hard negatives are important for contrastive learning (Zhang et al., 2022a; Kalantidis et al., 2020; Xuan et al., 2020). However, it is not trivial to obtain enough hard negatives through sampling in the unsupervised setting. Admittedly, they can be obtained through retrieval (Wang et al., 2022b) or fine-grained data augmentation (Wang et al., 2022a), but these processes are usually time-consuming. Incorrectly pushing away false negatives in the semantic space is another problem in unsupervised scenarios, because all negatives are treated equally. In fact, in-batch negatives are quite diverse in terms of their similarity to the anchor samples. Therefore, false negatives do exist in the batches, and auxiliary models may be required to identify them (Zhou et al., 2022). In sum, we view these issues as the major obstacles to further improving the performance of contrastive learning in unsupervised scenarios.
Since the issues mentioned above are closely connected with similarity, a reasonable differentiation of negatives based on similarity is the key. Meanwhile, clustering is a natural and simple way of grouping samples into various clusters without supervision. Therefore, in this paper, we propose a new negative sampling method called ClusterNS for unsupervised sentence embedding learning, which combines clustering with contrastive learning. Specifically, for each mini-batch during training, we cluster the samples with the K-means algorithm (Hartigan and Wong, 1979), and for each sample, we select its nearest neighboring centroid (i.e., the nearest cluster center other than its own) as the hard negative. We then treat other sentences belonging to the same cluster as false negatives. Instead of directly taking them as positive samples, we constrain them with a bidirectional margin loss. Since the continuously updating sentence embeddings and the large size of the training dataset pose efficiency challenges for clustering, we modify the K-means algorithm to make it more suitable for training unsupervised sentence representations.
Overall, our proposed negative sampling approach is simple and can easily be plugged into existing methods, boosting their performance. For example, we improve SimCSE and PromptBERT by 1.41/0.59 points on RoBERTa-base, and by 0.78/0.88 points on BERT-large, respectively. The main contributions of this paper are summarized as follows:
• We propose a novel method for unsupervised sentence representation learning, leveraging clustering to solve the hard negative and false negative problems in one unified framework.
• We modify K-means clustering for unsupervised sentence representation, making it more efficient and achieving better results.
• Experiments on STS tasks demonstrate our evident improvement over baselines, and we reach 79.74 for RoBERTa-base, the best result with this model.

Negative Sampling
In-batch negative sampling is a common strategy in unsupervised contrastive learning, which may have the limitations mentioned above. To fix these issues, Zhang et al. (2022a) and Kalantidis et al. (2020) synthesize hard negatives by mixing positives with in-batch negatives.

Preliminaries
Our clustering-based negative sampling method for unsupervised sentence representation can be easily integrated with contrastive learning approaches such as SimCSE (Gao et al., 2021) or PromptBERT (Jiang et al., 2022). An illustration of ClusterNS and the original contrastive learning framework is shown in Figure 2. For a sentence x_i in a mini-batch {x_i}_{i=1}^{N} (N samples per mini-batch), SimCSE uses dropout (Srivastava et al., 2014) and PromptBERT uses different prompt-based templates to obtain its positive example x_i^+. Both then treat the other samples in the mini-batch as "default" negatives and apply the InfoNCE loss (Oord et al., 2018) in Eq. (1), where τ is the temperature coefficient:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\sum_{i=1}^{N} \log \frac{e^{\cos(x_i,\, x_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\cos(x_i,\, x_j^{+})/\tau}} \qquad (1)$$
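For concreteness, below is a minimal PyTorch sketch of the in-batch InfoNCE objective in Eq. (1). The function name, variable names and the dropout-based positive construction are illustrative assumptions in the spirit of SimCSE, not the released ClusterNS code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(h, h_pos, tau=0.05):
    """In-batch InfoNCE loss (Eq. 1).

    h:     (N, d) embeddings of the anchor sentences x_i
    h_pos: (N, d) embeddings of their positives x_i^+ (e.g. a second
           dropout-perturbed forward pass, as in SimCSE)
    """
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    # Cosine similarity of every anchor with every positive: (N, N).
    sim = h @ h_pos.t() / tau
    # The diagonal holds cos(x_i, x_i^+); the other columns act as in-batch negatives.
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(sim, labels)
```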

Boosting Negative Sampling
The main contribution of this work is to improve the negative sampling method with clustering. To be more specific, we combine clustering with contrastive learning in the training process, recognizing false negatives in the mini-batch and providing additional hard negatives based on the clustering result. The clustering procedure will be introduced in detail in Section 3.3; for the moment, we assume the samples in each mini-batch have been properly clustered into K clusters with centroids. For each sample x_i, we sort the K centroids c_{i1}, ..., c_{iK} in descending order by their cosine similarity cos(x_i, c_{ij}) with x_i. In this case, c_{i1} and c_{iK} are the nearest and farthest centroids to x_i, respectively. Therefore, x_i is most similar to c_{i1} and belongs to cluster C_{i1}. We define the set x_i^* of in-batch samples (other than x_i) whose elements belong to C_{i1}, the same cluster as x_i.
Hard Negatives Zhang et al. (2022a) show that hard negatives bring stronger gradient signals, which are helpful for further training. The critical question is how to discover or even produce such negatives. In our method, the introduced centroids c can be viewed as hard negative candidates. After clustering and sorting by similarity, we obtain rough groups within the mini-batch. For the sample x_i, we pick the centroid c_{i2} as its hard negative, because c_{i2} has the highest similarity with x_i among all centroids belonging to a different cluster (i.e., excluding c_{i1}, the centroid of the cluster that x_i belongs to).
In this way, all the samples have proper centroids as their hard negatives, and the training objective L_cl is as follows:

$$\mathcal{L}_{cl} = -\sum_{i=1}^{N} \log \frac{e^{\cos(x_i,\, x_i^{+})/\tau}}{\sum_{j=1}^{N} \left( e^{\cos(x_i,\, x_j^{+})/\tau} + \mu\, e^{\cos(x_i,\, x_j^{-})/\tau} \right)} \qquad (2)$$

where x_j^- is the hard negative corresponding to x_j and μ is the weight of the hard negative term. Note that c_{i1} is even more similar to x_i than c_{i2} and is therefore another candidate for the hard negative; we compare the different choices in the ablation study described in Section 4.4.
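The sketch below illustrates how the sorted centroids could supply one hard negative per sample and how the extra term enters the denominator of Eq. (2). The function name, the normalization choices and the exact placement of the weight μ are assumptions made for illustration, not the official implementation.

```python
import math
import torch
import torch.nn.functional as F

def cl_loss_with_hard_negatives(h, h_pos, centroids, tau=0.05, mu=1.0):
    """Contrastive loss with clustering-based hard negatives (Eq. 2).

    h:         (N, d) anchor embeddings
    h_pos:     (N, d) positive embeddings
    centroids: (K, d) current cluster centroids
    """
    h, h_pos = F.normalize(h, dim=-1), F.normalize(h_pos, dim=-1)
    c = F.normalize(centroids, dim=-1)

    # Sort centroids by similarity to each anchor: column 0 is the anchor's own
    # cluster (nearest), column 1 is the nearest centroid of a different cluster.
    sim_to_c = h @ c.t()                       # (N, K)
    order = sim_to_c.argsort(dim=1, descending=True)
    h_hard = c[order[:, 1]]                    # (N, d) hard negatives c_{i2}

    pos_sim = (h * h_pos).sum(-1) / tau        # (N,)   cos(x_i, x_i^+)
    batch_sim = h @ h_pos.t() / tau            # (N, N) in-batch terms
    hard_sim = h @ h_hard.t() / tau            # (N, N) hard-negative terms

    # Denominator of Eq. (2): log sum of both groups, with mu weighting the hard term.
    denom = torch.logsumexp(
        torch.cat([batch_sim, hard_sim + math.log(mu)], dim=1), dim=1
    )
    return (denom - pos_sim).mean()
```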

False Negatives
For sample x_i, we aim to 1) recognize the false negatives in the mini-batch and 2) prevent them from being pushed away incorrectly in the semantic space. For the former, we treat the elements of x_i^* as false negatives, since they belong to the same cluster as x_i and share higher similarity with it. For the latter, it is unreliable to directly use them as positives, since the labels are missing in the unsupervised setting. However, the similarity relations between the anchor sample and the others can be summarized intuitively as in Eq. (3):

$$\cos(x_i, x_j) < \cos(x_i, x_i^{*}) < \cos(x_i, x_i^{+}), \quad x_j \notin x_i^{*} \cup \{x_i\} \qquad (3)$$

that is, false negatives have higher similarity with the anchor than normal negatives but lower similarity than the positives. Inspired by Wang et al. (2022a), we introduce the bidirectional margin loss (BML) to model the similarity between the false negative candidates and the anchor:

$$\Delta_{x_i} = \cos(x_i, x_i^{*}) - \cos(x_i, x_i^{+}) \qquad (4)$$

$$\mathcal{L}_{bml} = \mathrm{ReLU}(\Delta_{x_i} + \alpha) + \mathrm{ReLU}(-\Delta_{x_i} - \beta) \qquad (5)$$

The BML loss aims to keep cos(x_i, x_i^*) in an appropriate range by constraining Δ_{x_i} to the interval [−β, −α]. Accordingly, we find the potential false negatives in the mini-batch and treat them differently. Combining Eq. (2) and Eq. (5), we obtain the final training objective:

$$\mathcal{L} = \mathcal{L}_{cl} + \lambda\, \mathcal{L}_{bml} \qquad (6)$$

where λ is a hyperparameter.
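A minimal sketch of the BML constraint in Eqs. (4)-(5) is shown below, assuming cluster assignments for the mini-batch are already available. The margin values α and β and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def bml_loss(h, h_pos, cluster_ids, alpha=0.1, beta=0.3):
    """Bidirectional margin loss (Eq. 5) for in-batch false negatives.

    Constrains Delta_i = cos(x_i, x_i^*) - cos(x_i, x_i^+) to lie in [-beta, -alpha]
    for every intra-cluster pair (x_i, x_i^*). alpha/beta values are illustrative.
    """
    h, h_pos = F.normalize(h, dim=-1), F.normalize(h_pos, dim=-1)
    pos_sim = (h * h_pos).sum(-1, keepdim=True)     # (N, 1) cos(x_i, x_i^+)
    pair_sim = h @ h.t()                            # (N, N) cos(x_i, x_j)

    # False negative candidates: other samples assigned to the same cluster.
    same_cluster = cluster_ids.unsqueeze(0) == cluster_ids.unsqueeze(1)
    same_cluster.fill_diagonal_(False)

    delta = pair_sim - pos_sim                      # (N, N) Delta for every pair
    penalty = F.relu(delta + alpha) + F.relu(-delta - beta)
    if same_cluster.any():
        return penalty[same_cluster].mean()
    return h.new_zeros(())                          # no intra-cluster pairs in this batch
```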

In-Batch Clustering
K-means clustering is the base method we use, but we need to overcome computational challenges during training: clustering the whole large training corpus is very inefficient, yet clustering must be performed frequently because the embeddings are continuously updated. Therefore, we design the training process with clustering in Algorithm 1. Briefly speaking, we use cosine similarity as the distance metric, cluster each mini-batch, and update the centroids with momentum at each step.

Centroids Initialization We show the initialization in lines 3-5 of Algorithm 1. Clustering is not performed at the very beginning, since the high initial similarity of embeddings harms the performance. Instead, we start clustering a few steps after training starts, similar to a warm-up process. When initializing, as line 4 shows, we select K samples as initial centroids heuristically: each newly selected centroid should be the sample least similar to the last selected centroid.
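The initialization heuristic might be sketched as follows; the threshold-based trigger, the σ value and the function name are assumptions used only to make the procedure concrete.

```python
import torch
import torch.nn.functional as F

def init_centroids(h, k, sigma=0.4):
    """Heuristically pick K in-batch samples as initial centroids.

    Clustering only starts once the average in-batch similarity has dropped
    below a threshold sigma (a warm-up-like condition); the exact trigger and
    sigma value here are illustrative assumptions.
    """
    h = F.normalize(h, dim=-1)
    sim = h @ h.t()                        # (N, N) pairwise cosine similarities
    if sim.mean() > sigma:
        return None                        # too early: embeddings still too similar

    idx = [0]                              # start from an arbitrary sample
    for _ in range(k - 1):
        # Next centroid: the sample least similar to the last chosen centroid.
        candidates = sim[idx[-1]].clone()
        candidates[idx] = float("inf")     # never re-pick an already chosen sample
        idx.append(candidates.argmin().item())
    return h[idx].clone()                  # (K, d) initial centroids
```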
Clustering and Updating We now describe line 7 in detail. First, we assign each sample to the cluster whose centroid has the highest cosine similarity with it. After the assignment, we compute a new centroid embedding for each cluster by averaging the embeddings of all samples in the cluster with Eq. (7), and then update the centroid in a momentum style with Eq. (8):

$$\hat{c}_i = \frac{1}{N_i} \sum_{x_j \in C_i} x_j \qquad (7)$$

$$c_i \leftarrow \gamma\, c_i + (1 - \gamma)\, \hat{c}_i \qquad (8)$$

where γ is the momentum hyperparameter and N_i denotes the number of elements in cluster C_i. Finally, based on the clustering results, we calculate the loss and optimize the model step by step (lines 9-12). The method can be integrated with other contrastive learning models while maintaining high efficiency.
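A sketch of the assignment and momentum update in Eqs. (7)-(8) is given below; the momentum value and the normalization details are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cluster_and_update(h, centroids, gamma=0.9):
    """Assign the mini-batch to clusters and update centroids (Eqs. 7-8).

    h:         (N, d) current sentence embeddings
    centroids: (K, d) centroids from the previous step
    gamma:     momentum coefficient (value here is an assumption)
    Returns each sample's cluster id and the updated centroids.
    """
    h = F.normalize(h, dim=-1)
    centroids = F.normalize(centroids, dim=-1)
    cluster_ids = (h @ centroids.t()).argmax(dim=1)   # nearest-centroid assignment

    new_centroids = centroids.clone()
    for k in range(centroids.size(0)):
        members = h[cluster_ids == k]
        if members.shape[0] == 0:
            continue                                  # keep empty clusters unchanged
        mean_emb = members.mean(dim=0)                # Eq. (7): average of cluster members
        new_centroids[k] = gamma * centroids[k] + (1 - gamma) * mean_emb  # Eq. (8)
    return cluster_ids, new_centroids
```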

Implementation Details
Our code is implemented in PyTorch and Huggingface Transformers. The experiments are run on a single 32GB Nvidia Tesla V100 GPU or four 24GB Nvidia RTX 3090 GPUs. Our models are built on SimCSE (Gao et al., 2021) and PromptBERT (Jiang et al., 2022), and are named Non-Prompt ClusterNS and Prompt-based ClusterNS, respectively. We use BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) as pre-trained language models for evaluation, training for 1 epoch and evaluating every 125 steps on the STS-B development set. We also apply early stopping to avoid overfitting. Hyperparameter settings and more training details are listed in Appendix A.
The overall results are shown in Table 1, and our conclusions are as follows. Compared with the two baseline models, SimCSE and PromptBERT, all ClusterNS models achieve higher performance, indicating their effectiveness and the importance of negative sampling. Among non-prompt models, ClusterNS surpasses MixCSE and DCLR on BERT-large, and among prompt-based models, ClusterNS also surpasses ConPVP and SNCSE on BERT-large and RoBERTa-base. All these models improve negative samples through sampling or construction, demonstrating our models' strong competitiveness. Finally, Prompt-based ClusterNS achieves state-of-the-art performance of 79.74, the best result for models with RoBERTa-base.

Ablation Study
Our proposed method focuses on two issues: producing hard negatives and processing false negatives. To verify their contributions, we conduct ablation studies by removing each of the two components, evaluated on the test sets of the STS tasks with the Non-prompt BERT and RoBERTa models. As mentioned in Section 3.2, we also replace the hard negatives with the most similar centroids to verify our choice of hard negatives (named repl. harder negative), and replace the centroids used for both hard and false negatives with random clusters to verify our choice of cluster centroids (named repl. random clusters). The results are shown in Table 2. We observe that removing either component or replacing any part of the model leads to inferior performance: 1) Providing hard negatives yields more improvement, since we create high-similarity samples by leveraging clustering. 2) Processing false negatives alone (without hard negatives) even harms the performance further, indicating that providing virtual hard negatives is much easier than distinguishing real false negatives. 3) Replacing the hard negative with the most similar centroid also degrades the performance: since that centroid belongs to the identical cluster, the candidate hard negative could actually be a positive sample. 4) Random clusters are also worse, indicating that the selection of clusters does matter. We discuss more hyperparameter settings in Appendix E.

Analysis
To obtain more insights into how clustering helps the training process, we visualize how the similarity of diverse sentence pairs varies during training (after clustering initialization) in the Non-Prompt ClusterNS-RoBERTa-base model, and analyze the results in detail.

In-Batch Similarity
We visualize the average similarity of positive, in-batch negative and hard negative sentence pairs in Figure 3. We observe that the similarity of in-batch negatives drops rapidly as training progresses, indicating that in-batch negatives struggle to provide a useful gradient signal. The hard negatives provided by our method maintain higher similarity, which properly addresses this issue. Also note that the similarity of hard negative pairs is still much smaller than that of positive pairs, which avoids confusing the model.

Clustering Similarity
Furthermore, we also visualize the similarity related to clustering. In Figure 4, we show the average similarity of sample-nearest centroid pairs, sample-hard negative pairs, sample-false negative (intra-cluster member) pairs, inter-centroid pairs and in-batch negative pairs. To answer the question of what makes a good hard negative, we experiment with different similarity levels. We define a threshold σ on the average similarity of in-batch sentence pairs at which the centroids are initialized. Since the similarity of hard negative pairs depends on σ, we adjust the similarity level with various σ settings.
We show the results in Figure 5 and Table 3. As we set the threshold σ smaller, clustering begins later and the hard negatives get higher similarity (with the anchor sample), meaning that starting clustering too early leads to less optimal hard negative candidates. The best performance is achieved at σ = 0.4, the middle similarity level, which verifies the finding in our ablation study, i.e., the best hard negatives are not the most similar samples.
In Figure 4, the similarity of false negative pairs is much smaller compared with positive pairs, which shows the distinction between positive and false negative samples. False negatives are usually regarded as positive samples in supervised learning, but they are difficult to recognize precisely in the unsupervised setting. We argue that the false negatives retrieved by our method share similar topics with the anchors, leading to higher similarity than "normal" negatives and lower similarity than positives. We use Eq. (5) to constrain false negatives based on this hypothesis. We also present case studies and experiments to support it in Appendix C.
To verify our choice of the BML loss, we conduct experiments on different processing strategies for false negatives. We compare the BML loss with two common strategies: using false negatives as positives, and masking all the false negatives. Results in Table 4 demonstrate the superiority of the BML loss.

Clustering Evaluation
We also evaluate the quality of sentence embeddings through clustering. We first use the DBpedia dataset (Brümmer et al., 2016), an ontology classification dataset extracted from Wikipedia that consists of 14 classes in total. We run K-means clustering (K=14) on the sentence embeddings of DBpedia and take the adjusted mutual information (AMI) score as the evaluation metric, following Li et al. (2020b). The results in Table 5 show that both sentence embedding models improve the AMI score over the vanilla RoBERTa encoder, indicating that clustering performance is positively correlated with the quality of sentence embeddings. ClusterNS achieves a higher AMI score than SimCSE, verifying its effectiveness. Furthermore, we follow Zhang et al. (2021a) to conduct a more comprehensive evaluation of short text clustering on 8 datasets (https://github.com/rashadulrakib/short-text-clusteringenhancement), including AgNews (AG) (Zhang and LeCun, 2015), Biomedical (Bio) (Xu et al., 2017), SearchSnippets (SS) (Phan et al., 2008), StackOverflow (SO) (Xu et al., 2017), GoogleNews (G-T, G-S, G-TS) (Yin and Wang, 2016) and Tweet (Yin and Wang, 2016). We perform K-means clustering on the sentence embeddings and take the clustering accuracy as the evaluation metric. Results are shown in Table 6. Our ClusterNS models achieve higher performance than SimCSE with both backbones, with an overall improvement of 4.34 for BERT-base. Both the main experiments and the two clustering evaluations show the improvement of our method over the baselines and verify the effectiveness of the improved negative sampling. More details about the evaluation metrics are given in Appendix D.
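The DBpedia evaluation can be reproduced along the following lines with scikit-learn; the embedding extraction and dataset loading are omitted, and the helper name is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

def evaluate_ami(embeddings: np.ndarray, labels: np.ndarray, n_clusters: int = 14) -> float:
    """Cluster sentence embeddings with K-means and score them with AMI.

    embeddings: (N, d) sentence embeddings (e.g. encoded DBpedia sentences)
    labels:     (N,) ground-truth class labels
    """
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    return adjusted_mutual_info_score(labels, pred)
```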

Alignment and Uniformity
To investigate how ClusterNS improves the sentence embeddings, we conduct further analyses on two widely used metrics in contrastive learning proposed by Wang and Isola (2020): alignment and uniformity. Alignment measures the expected distance between the embeddings of positive pairs:

$$\ell_{\mathrm{align}} = \mathop{\mathbb{E}}_{(x,\, x^{+}) \sim p_{\mathrm{pos}}} \left\| f(x) - f(x^{+}) \right\|^{2}$$

and uniformity measures the expected pairwise distance over all sentence pairs:

$$\ell_{\mathrm{uniform}} = \log \mathop{\mathbb{E}}_{x,\, y \,\sim\, p_{\mathrm{data}}} e^{-2 \left\| f(x) - f(y) \right\|^{2}}$$

Both metrics are better when the values are lower. We use the STS-B dataset to calculate alignment and uniformity, and consider sentence pairs with a score higher than 4 as positive pairs. We show the alignment and uniformity of different models in Figure 6, along with the average STS test results. We observe that ClusterNS strikes a balance between alignment and uniformity, improving the weaker metric at the expense of the stronger one to reach a better balance. For the non-prompt models, SimCSE has great uniformity but weaker alignment compared to vanilla BERT and RoBERTa, and ClusterNS optimizes the alignment. On the other hand, Prompt-based ClusterNS optimizes the uniformity, since PromptRoBERTa behaves in the opposite way to SimCSE. Besides, RoBERTa may suffer from more severe anisotropy than BERT, meaning that its sentence embeddings are squeezed into a more crowded part of the semantic space. Therefore, RoBERTa and PromptRoBERTa-untuned have extremely low alignment values but poor uniformity.
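Both metrics can be computed directly from normalized embeddings, for example with the sketch below, which follows Wang and Isola's definitions; the function names are ours.

```python
import torch
import torch.nn.functional as F

def alignment(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Mean squared L2 distance between normalized embeddings of positive pairs."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    return (x - y).norm(p=2, dim=1).pow(2).mean()

def uniformity(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Log of the mean Gaussian potential over all pairs of normalized embeddings."""
    x = F.normalize(x, dim=-1)
    return F.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```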

Conclusion
In this paper, we propose ClusterNS, a novel approach that focuses on improving negative sampling for contrastive learning in unsupervised sentence representation learning. We integrate clustering into the training process, using the clustering results to provide additional hard negatives and identify false negatives for each sample, and we use a bidirectional margin loss to constrain the false negatives. Our experiments on STS tasks show improvements over baseline models and demonstrate the effectiveness of ClusterNS. We hope this work demonstrates that it is valuable to pay more attention to negative sampling when applying contrastive learning to sentence representation.
pairwise and triple-wise perspective in angular space. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4892-4903, Dublin, Ireland. Association for Computational Linguistics.
Kun Zhou, Beichen Zhang, Xin Zhao, and Ji-Rong Wen. 2022. Debiased contrastive learning of unsupervised sentence representations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6120-6130, Dublin, Ireland. Association for Computational Linguistics.

A Training Details
We do a grid search for the hyperparameters and list the search space below. Our method has two main improvements, on hard negatives and false negatives, respectively. We apply both improvements to most of the models, and only one of them to the rest; the detailed information is listed in Table 7. More hyperparameter experiments are discussed in Appendix E.

B Transfer Tasks
Following previous works, we also evaluate our models on seven transfer tasks: MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), SUBJ (Pang and Lee, 2004), MPQA (Wiebe et al., 2005), SST-2 (Socher et al., 2013), TREC (Voorhees and Tice, 2000) and MRPC (Dolan and Brockett, 2005). We evaluate the Non-Prompt ClusterNS models and use the default configurations in the SentEval toolkit. Results are shown in Table 8. Most of our models achieve higher performance than SimCSE, and the auxiliary MLM task also works for our method.

C False Negative Details
We show the case study in Table 10. As mentioned in Section 5, our method is able to cluster sentences with similar topics such as religion and music, demonstrating that clustering captures higher-level semantics. However, intra-cluster sentences do not necessarily carry the same meaning, and thus they are not suitable to be used as positives directly.
We also show the variation of the false negative rate during training in Figure 7, which is equivalent to the percentage of samples in clusters with two or more elements (i.e., the intra-cluster members are false negatives of each other). We observe that the false negative rate stays high throughout training, which verifies the necessity of handling the false negatives specifically.
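Under this definition, the false negative rate can be computed from the in-batch cluster assignments as in the short sketch below (names are illustrative).

```python
import torch

def false_negative_rate(cluster_ids: torch.Tensor) -> float:
    """Fraction of in-batch samples that have at least one intra-cluster partner.

    Such samples live in clusters of size >= 2 and are therefore treated
    as false negatives of each other.
    """
    counts = torch.bincount(cluster_ids)          # size of each cluster
    in_shared_cluster = counts[cluster_ids] >= 2  # per-sample cluster size >= 2
    return in_shared_cluster.float().mean().item()
```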

D Clustering Evaluation Details
We use the adjusted mutual information (AMI) score or clustering accuracy to evaluate clustering performance. The AMI score measures the agreement between ground-truth labels and clustering results: two identical label assignments get an AMI score of 1, while two random label assignments are expected to get an AMI score of 0. Clustering accuracy measures the clustering agreement with the accuracy metric, which requires mapping the clustering results to ground-truth labels with the Hungarian algorithm in advance.
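A sketch of this mapping using the Hungarian algorithm from SciPy is shown below; the helper name is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Clustering accuracy: map predicted clusters to gold labels with the
    Hungarian algorithm, then measure plain accuracy."""
    n_classes = max(y_true.max(), y_pred.max()) + 1
    # Count matrix between predicted clusters (rows) and gold labels (columns).
    counts = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1
    # The Hungarian algorithm maximizes the total matched count.
    row_ind, col_ind = linear_sum_assignment(counts, maximize=True)
    return counts[row_ind, col_ind].sum() / y_true.size
```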

E Supplementary Experiments

E.1 Batch Size and Cluster Number
We use large batch sizes and a large cluster number K for our models in the main experiments. To show the necessity, we conduct a quantitative analysis comparing with small batch sizes and cluster numbers, and show the results in Figure 8 and Figure 9. Both smaller batch sizes and smaller cluster numbers perform worse. We attribute the performance degradation to three factors: 1) contrastive learning requires large batch sizes in general; 2) smaller cluster numbers lead to more coarse-grained clusters, weakening the clustering performance; and 3) small batch sizes further restrict the number of clusters.

E.2 Centroids Initialization
We initialize the cluster centroids locally, as mentioned in Section 3.3, while some other works adopt global initialization (Li et al., 2020b), taking the embeddings of the whole dataset to initialize the centroids. We compare the two initialization methods with Non-Prompt ClusterNS-RoBERTa-base in Table 9, and show the corresponding similarity variation in Figure 10.

Figure 2: Illustrations of the original contrastive learning framework and our ClusterNS framework. Compared with the naive framework, for the sample x_1 in the mini-batch, we provide the nearest neighboring centroid c_{12} as the hard negative, and regard samples in the same cluster (such as x_5) as false negatives, constraining them with the BML loss.

Figure 3: Variation in similarity of positive, in-batch negative and hard negative (Hard Neg.) pairs.

Figure 4: Variation in similarity of sample-nearest centroid pairs (Sample-centroid), sample-hard negative pairs, sample-false negative pairs (intra-cluster member pairs), inter-centroid pairs and in-batch negative pairs.

Figure 6: Alignment and uniformity for different sentence embedding models on the STS-B dataset. "untuned" means the models are not fine-tuned. We mainly use RoBERTa-base models. Lower values are better.

Figure 7: Variation of the false negative rate of Non-Prompt ClusterNS-RoBERTa-base during training.

Figure 8: Comparison of different batch sizes.

Figure 9: Comparison of different cluster numbers.

Figure 10: Variation in similarity of sample-nearest centroid pairs (Sample-centroid), sample-hard negative pairs, inter-centroid pairs, intra-cluster member pairs and in-batch negative pairs in global ClusterNS, corresponding to Figure 4.
Algorithm 1 Training with Clustering.
Input: Model parameters: θ; Training dataset: D; Total update steps: T; Warm-up steps: S
1: for t = 1 to T do
2:    Sample a mini-batch {x_i} from D and encode it
3:    if t = S then
4:       Select K samples as initial centroids heuristically
5:    end if
6:    if t ≥ S then
7:       Cluster the mini-batch and update the centroids with momentum (Eqs. 7-8)
8:    end if
9:    Select hard negatives c_{i2} and recognize false negatives x_i^* for each x_i
10:   Compute L_cl (Eq. 2)
11:   Compute L_bml (Eq. 5) and the total loss L (Eq. 6)
12:   Loss backward and optimize θ
13: end for

Table 1 :
Overall results on STS tasks (Spearman's correlation coefficient). All baseline results are taken from the original or related papers. We use the symbol * to mark our models. Best results are highlighted in bold.
Table 2 :
Ablation results of our methods (Non-prompt models) on the test sets of the STS tasks.

Table 3 :
Average results on the STS test sets under different thresholds σ, with/without L_bml. "Similarity" denotes the average similarity of hard negative pairs at the last checkpoint.

Table 4 :
Comparison of different false negative processing strategies on the STS test sets. "w/o BML loss" is the same as "w/o false negative" in Table 2.

Table 5 :
AMI score for K-means clustering (K=14) on the DBpedia dataset. We use Non-Prompt ClusterNS for comparison. Higher values are better.

Model   RoBERTa   SimCSE   ClusterNS
AMI     0.6926    0.7078   0.7355


Table 7 :
Hyperparameter settings indicating whether both improvements are applied to each model.

Table 9 :
Comparison of different centroid initialization methods with Non-Prompt ClusterNS-RoBERTa-base.

Table 10 :
Illustrative examples in clusters resulting from ClusterNS. Sentences with similar topics are grouped into clusters.

Example 1
#1: Jantroon as a word is derived from an Urdu word [UNK] which means Paradise.
#2: While the liturgical atmosphere changes from sorrow to joy at this service, the faithful continue to fast and the Paschal greeting, "Christ is risen!
#3: There is also a Methodist church and several small evangelical churches.
#4: Hindu Temple of Siouxland
#5: Eventually, the original marble gravestones had deteriorated, and the cemetery had become an eyesore.
#6: Reverend Frederick A. Cullen, pastor of Salem Methodist Episcopal Church, Harlem's largest congregation, and his wife, the former Carolyn Belle Mitchell, adopted the 15-year-old Countee Porter, although it may not have been official.
#7: The also include images of saints such as Saint Lawrence or Radegund.

Example 2
#1: Besides Bach, the trio recorded interpretations of compositions by Handel, Scarlatti, Vivaldi, Mozart, Beethoven, Chopin, Satie, Debussy, Ravel, and Schumann.
#2: Guitarist Jaxon has been credited for encouraging a heavier, hardcore punk-influenced musical style.
#3: Thus, in Arabic emphasis is synonymous with a secondary articulation involving retraction of the dorsum or root of the tongue, which has variously been
#4: MP from January, 2001 to date.
#5: The song ranked No.
#6: The tones originate from Brown's acoustic Martin guitar, which is set up through two preamplifiers which are connected to their own power amplifiers.