Exploring the Impact of Negative Samples of Contrastive Learning: A Case Study of Sentence Embedding

Contrastive learning is emerging as a powerful technique for extracting knowledge from unlabeled data. This technique requires a balanced mixture of two ingredients: positive (similar) and negative (dissimilar) samples. This is typically achieved by maintaining a queue of negative samples during training. Prior work in the area typically uses a fixed-length negative sample queue, but how the number of negative samples affects model performance remains unclear. This open question motivated our in-depth exploration. This paper presents a momentum contrastive learning model with a negative sample queue for sentence embedding, namely MoCoSE. We add a prediction layer to the online branch to make the model asymmetric, which, together with the EMA update mechanism of the target branch, prevents the model from collapsing. We define a maximum traceable distance metric, through which we learn to what extent text contrastive learning benefits from the historical information of negative samples. Our experiments find that the best results are obtained when the maximum traceable distance is within a certain range, demonstrating that there is an optimal range of historical information for a negative sample queue. We evaluate the proposed unsupervised MoCoSE on the semantic text similarity (STS) task and obtain an average Spearman's correlation of 77.27%. Source code is available here.


Introduction
In recent years, unsupervised learning has come to the fore in deep learning due to its ability to leverage large-scale unlabeled data. Various unsupervised contrastive models are emerging, continuously narrowing the gap between supervised and unsupervised learning. Contrastive learning suffers from the problem of model collapse, where the model converges to a constant output and all samples are mapped to a single point in the feature space. Negative samples are an effective way to solve this problem.
In computer vision, SimCLR (Chen et al., 2020) and MoCo from He et al. are known for using negative samples and achieving leading performance in contrastive learning. SimCLR applies different data augmentations (e.g., rotation, masking) to the same image to construct positive samples, and takes negative samples from the remaining images in the same batch. MoCo goes a step further by randomly selecting data from the entire unlabeled training set to build a first-in-first-out negative sample queue.
Recently in natural language processing, contrastive learning has been widely used for learning sentence embeddings. One of the current state-of-the-art unsupervised methods is SimCSE (Gao et al., 2021). Its core idea is to pull similar sentences closer in the embedding space while keeping dissimilar ones apart. SimCSE uses dropout masks as augmentation to construct positive text pairs, and negative samples are picked from the remaining sentences in the same batch. The dropout mask of the standard Transformer serves as a minimal form of data augmentation: it produces a small difference without changing the semantics, reducing the noise introduced by augmentation. However, the negative samples in SimCSE are selected from the same training batch, whose size is limited. Our further experiments show that SimCSE does not improve as the batch size increases, which motivates us to use a negative sample queue.
To better investigate the performance of contrastive learning on textual tasks, we build a contrastive model consisting of a two-branch structure and a negative sample queue, namely MoCoSE (Momentum Contrastive Sentence Embedding with negative sample queue). We also introduce the asymmetric structure from BYOL (Grill et al., 2020) by adding a prediction layer to the upper branch (i.e., the online branch). The lower branch (i.e., the target branch) is updated with the exponential moving average (EMA) method during training. We maintain a negative sample queue and update it using the output of the target branch. Unlike MoCo, which uses the negative queue directly, for research purposes we add an initialization process that starts with a much smaller negative queue, fills the entire queue during training, and then updates it normally. We test both character-level (e.g., typo, back translation, paraphrase) and vector-level (e.g., dropout, shuffle) data augmentations and find that for text contrastive learning, the best results are obtained by using FGSM and dropout as augmentations.
Using the proposed MoCoSE model, we design a series of experiments to explore contrastive learning for sentence embedding. We find that using different parts of the negative queue leads to different performance. To test how much text contrastive learning benefits from the historical information of the model, we propose a maximum traceable distance metric. The metric calculates how many update steps ago the negative samples in the queue were pushed in, and thus measures the historical information contained in the negative sample queue. We find that the best results are achieved when the maximum traceable distance is within a certain range, which is reflected in the uniformity and alignment of the learned text embedding. This means there is an optimal interval for the length of the negative sample queue in a text contrastive learning model.
Our main contributions are as follows:
1. We combine several advantages of frameworks from image contrastive learning to build a more generic unsupervised text contrastive model. We carry out a detailed study of this model to achieve better results on textual data.
2. We evaluate the role of the negative queue length and the historical information that the queue contains in text contrastive learning. By slicing the negative sample queue and using negative samples from different positions, we find that those near the middle of the queue provide better performance.
3. We define a metric called 'maximum traceable distance' (MTD) to help analyze the impact of the negative sample queue by combining the queue length, EMA parameter, and batch size. We find that changes in MTD are reflected in the uniformity and alignment of the learned text embedding.

Related Work
Contrastive Learning in CV Contrastive learning is a trending and effective unsupervised learning framework that was first applied to computer vision (Hadsell et al., 2006). The core idea is to make the features of images within the same category closer and the features of different categories farther apart. Most current work uses a two-branch structure. While influential works like SimCLR and MoCo use positive and negative sample pairs, BYOL (Grill et al., 2020) and SimSiam (Chen and He, 2021) achieve comparably strong results with only positive samples. BYOL finds that adding a prediction layer to the online branch to form an asymmetric structure, and updating the target branch with a momentum moving average, makes it possible to train the model using only positive samples while avoiding model collapse. SimSiam likewise explores asymmetric structures. Our work therefore introduces this asymmetric idea to text contrastive learning to prevent model collapse. In addition to the asymmetric structure and the EMA mechanism, some works avoid model collapse by merging the constraint into the loss function, like Barlow Twins (Zbontar et al., 2021), W-MSE (Ermolov et al., 2021), and ProtoNCE.
Contrastive Learning in NLP Since BERT (Devlin et al., 2018) redefined the state of the art in NLP, leveraging BERT to obtain better sentence representations has become a common task. A straightforward way to get a sentence embedding is to use the [CLS] token, owing to the Next Sentence Prediction task of BERT. However, the [CLS] embedding is non-smooth and anisotropic in semantic space, which is not conducive to STS tasks; this is known as the representation degradation problem (Gao et al., 2019). BERT-Flow (Li et al., 2020) and BERT-whitening (Su et al., 2021) address the degradation problem by post-processing the output of BERT. SimCSE found that a contrastive mechanism can also alleviate this problem.
Data augmentation is crucial for contrastive learning. In CLEAR, word and phrase deletion, phrase order switching, and synonym substitution serve as augmentations. CERT (Fang and Xie, 2020) mainly uses back-and-forth translation, and CLINE proposes synonym substitution for positive samples and antonym substitution for negative samples, and then minimizes the triplet loss between the positive and negative cases and the original text. ConSERT (Yan et al., 2021) uses adversarial attack, token shuffling, cutoff, and dropout as data augmentation. CLAE (Ho and Vasconcelos, 2020) also introduces the Fast Gradient Sign Method, an adversarial attack method, as text data augmentation. Several of these augmentations are also used in our work. The purpose of data augmentation is to create enough distinguishable positive and negative samples to allow the contrastive loss to learn the nature of the same data under different changes. Works such as Mitrovic et al. (2020) point out that longer negative sample queues do not always give the best performance. This also motivates us to study how the negative queue length affects text contrastive learning.
Figure 1 depicts the architecture of the proposed MoCoSE. In the embedding layer, two versions of the sentence embedding are generated through data augmentation (dropout = 0.1 and FGSM = 5e-9). The resulting two slightly different embeddings then go through the online and target branches to obtain the query and key vectors, respectively. The structure of the encoder, pooler, and projection is identical in the online and target branches. We add a prediction layer to the online branch to create asymmetry between the two branches. The pooler, projection, and prediction layers are all composed of several fully connected layers. Finally, the model calculates the contrastive loss between the query, the key, and the negative queue to update the online branch.
In the process, the key vector serves as the positive sample with respect to the query vector, while the samples from the queue serve as negative samples for the query. The target branch truncates the gradient and is updated with the EMA mechanism. The queue is a first-in-first-out collection of negative samples with size K, which means it sequentially stores the key vectors generated in the last few training steps.

Method
The PyTorch-style pseudo-code for training MoCoSE with the negative sample queue is shown in Algorithm 1 in Appendix A.3.
Data Augmentation Compared with SimCSE, we tried popular NLP methods such as paraphrasing, back translation, and adding typos, but experiments show that only adversarial attacks and dropout improve the results. We use FGSM (Fast Gradient Sign Method; Goodfellow et al., 2015) as the adversarial attack. In a white-box setting, FGSM first calculates the derivative of the model loss with respect to the input and uses a sign function to obtain the gradient direction. Then, after multiplying it by a step size, the resulting perturbation is added to the original input to obtain the sample under the FGSM attack:

$$x' = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x L(\theta, x)\right) \tag{1}$$

where x is the input to the embedding layer, θ is the online branch of the model, and L(·) is the contrastive loss computed from the query, the key, and the negative sample queue. ∇_x is the gradient computed through the network with respect to the input x, sign() is the sign function, and ε is the perturbation parameter, which controls how much noise is added.
EMA and Asymmetric Branches Our model uses the EMA mechanism to update the target branch. Formally, denoting the parameters of the online and target branches as θ_o and θ_t, and the EMA decay weight as η, we update θ_t by:

$$\theta_t \leftarrow \eta\,\theta_t + (1 - \eta)\,\theta_o \tag{2}$$

Experiments demonstrate that not using EMA leads to model collapse, meaning that the model does not converge during training. The prediction layer we add to the online branch makes the two branches asymmetric, which further prevents the model from collapsing. For more experimental details about the symmetric model structure without the EMA mechanism, please refer to Appendix A.2.
Negative Sample Queue The negative sample queue has been theoretically proven to be an effective means of preventing the model from collapsing. Specifically, both the queue and the prediction layer of the upper branch serve to disperse the output features of the upper and lower branches, thus ensuring that the contrastive loss obtains features with sufficient uniformity. We also set a buffer for the initialization of the queue: only a small portion of the queue is randomly initialized at the beginning, after which samples are enqueued and dequeued normally until the end of training.
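The perturbation and EMA steps described in this section can be illustrated with a minimal sketch. It assumes the gradient of the loss with respect to the input is already available as a list of floats (in practice it would come from automatic differentiation), and the function names `fgsm_perturb` and `ema_update` are hypothetical, not taken from the authors' code.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_perturb(x, grad, epsilon):
    """Add epsilon * sign(gradient) to every component of the input x.

    `grad` is assumed to be the gradient of the contrastive loss with
    respect to x, computed elsewhere.
    """
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]


def ema_update(target, online, eta):
    """Move the target parameters toward the online ones with decay weight eta."""
    return [eta * t + (1.0 - eta) * o for t, o in zip(target, online)]


# Toy usage: a 3-dimensional input and a 2-parameter "model".
perturbed = fgsm_perturb([1.0, 2.0, 3.0], [0.5, -0.2, 0.0], epsilon=0.1)
new_target = ema_update([0.0, 1.0], [1.0, 1.0], eta=0.85)
```

A larger `eta` keeps the target branch closer to its previous state, so the target lags further behind the online branch.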
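The contrastive loss is the InfoNCE loss:

$$L = -\log \frac{\exp(q \cdot k / \tau)}{\exp(q \cdot k / \tau) + \sum_{l} \exp(q \cdot l / \tau)} \tag{3}$$

where q refers to the query vectors obtained by the online branch, k refers to the key vectors obtained by the target branch, l denotes the negative samples in the queue, and τ is the temperature parameter.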
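As a rough illustration of how a loss over the query, key, and queued negatives can be computed, the sketch below implements a standard InfoNCE loss for a single query in plain Python. The cosine normalization and the default temperature `tau=0.05` are assumptions made for the example; the text above does not state the value used.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit L2 norm so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, key, negatives, tau=0.05):
    """InfoNCE loss for one query, one positive key and a list of negatives."""
    q = normalize(query)
    # The positive logit comes first, followed by one logit per queued negative.
    logits = [dot(q, normalize(key)) / tau]
    logits += [dot(q, normalize(n)) / tau for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[0]


easy = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
hard = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the key is close to the query and the negatives are far from it (`easy`), and large in the opposite case (`hard`).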

Settings
We train on a randomly selected corpus of 1 million sentences from the English Wikipedia, and we conduct experiments on seven standard semantic text similarity (STS) tasks, including STS 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), STSBenchmark (Cer et al., 2017) and SICK-Relatedness (Wijnholds and Moortgat, 2021). The SentEval toolbox is used to evaluate our model, and we use Spearman's correlation to measure performance. We start our training by loading pre-trained BERT checkpoints and use the [CLS] token embedding from the model output as the sentence embedding. In addition to the semantic similarity task, we also evaluate on seven transfer learning tasks to test the generalization performance of the model. For text augmentation, we tried several vector-level methods mentioned in ConSERT, including position shuffle, token dropout, and feature dropout. In addition, we tried several text-level methods from the nlpaug toolkit, including synonym replacement, typo, back translation, and paraphrase.
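For readers unfamiliar with the evaluation measure, Spearman's correlation is the Pearson correlation between the ranks of two score lists (here, predicted similarities and gold similarity labels). The sketch below is a plain-Python illustration, not the SentEval implementation; ties receive averaged ranks.

```python
import math


def ranks(values):
    """Return 1-based ranks of values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, gold):
    """Spearman's rank correlation between predicted and gold scores."""
    return pearson(ranks(predicted), ranks(gold))
```

Because only ranks matter, any monotonic rescaling of the predicted similarities leaves the score unchanged.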
Training Details The learning rate is set to 3e-5 for MoCoSE-BERT-base and 1e-5 for MoCoSE-BERT-large. We use a weight decay of 1e-6; the batch size is 64 for the base model and 32 for the large model. We validate the model every 100 steps and train for one epoch. The EMA decay weight η is increased from 0.75 to 0.95 following a cosine function. The negative queue size is 512. For more information, please refer to Appendix A.1.
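The exact form of the cosine schedule for η is not given here. One common choice, used for the momentum in BYOL, is sketched below under that assumption; `ema_decay_at` is a hypothetical helper name.

```python
import math


def ema_decay_at(step, total_steps, start=0.75, end=0.95):
    """Cosine schedule that raises the EMA decay weight from start to end.

    This follows the BYOL-style formula as an assumption; the schedule
    actually used for MoCoSE may differ in detail.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end - (end - start) * (math.cos(math.pi * progress) + 1.0) / 2.0


# Decay weight at the beginning, middle and end of a 1000-step run.
schedule = [ema_decay_at(s, 1000) for s in (0, 500, 1000)]
```

With this form the decay weight changes slowly near the start and end of training and fastest in the middle.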
The results on the STS tasks are shown in Table 1. Furthermore, we also evaluate the performance of MoCoSE on the seven transfer tasks provided by SentEval. As shown in Table 2, MoCoSE-BERT-base outperforms most of the previous unsupervised methods and is on par with SimCSE-BERT-base.

Empirical Study
To further explore the performance of the MoCo-like contrastive model on learning sentence embeddings, we set up the following ablation trials.

EMA Decay Weight
We use EMA to update the model parameters of the target branch and find that the EMA decay weight affects the performance of the model. The EMA decay weight affects the update process of the model, which in turn affects the vectors involved in the contrastive learning process. Therefore, we set different values of the EMA decay weight and train the model with the other hyperparameters held constant. As shown in Table 3 and Appendix A.5, the best result is obtained when the EMA decay weight is set to 0.85. Compared to the choice of EMA decay weight in CV (generally as large as 0.99), the value of 0.85 in our model is smaller, which means that the target branch is updated faster. We speculate that this is because the NLP model is more sensitive in the fine-tuning phase and its weights change more after each gradient step, so a faster update speed is needed.

Projection and Prediction
Several papers have shown (e.g. Section F.1 in BYOL (Grill et al., 2020)) that the structure of projection and prediction layers in a contrastive learning framework affects the performance of the model. We combine the structure of projection and prediction with different configurations and train them with the same hyperparameters. As shown in Table 4, the best results are obtained when the projection is 1 layer and the prediction has 2 layers. The experiments also show that the removal of projection layers degrades the performance of the model.

Data Augmentation
We investigate the effect of some widely used data augmentation methods on model performance. As shown in Table 5, cutoff and token shuffle do not improve, and even slightly hurt, the model's performance. Only the adversarial attack (FGSM) slightly improves the performance. Therefore, in our experiments, we add FGSM as a default data augmentation of our model in addition to dropout. Please refer to Appendix A.7 for results with more FGSM parameters. We speculate that token cutoff is detrimental to the results because it perturbs too strongly the vectors formed by the sentences passing through the embedding layer. Removing one word from the text may have a significant impact on the semantics. We tried two parameters, 0.1 and 0.01, for feature cutoff; with both, the results are at best the same as without feature cutoff, so we discard the feature cutoff method. More results can be found in Appendix A.6. Token shuffle is slightly, but not significantly, detrimental to the results of the model. This may be because BERT is not sensitive to token position. In our experiments, the sentence-level augmentation methods also fail to outperform dropout, FGSM, and position shuffle.
Among the data augmentation methods, only FGSM together with dropout improves the results. This may be because the adversarial attack slightly enlarges the difference between the two samples and therefore enables the model to learn a better representation from more difficult contrastive samples.

Predictor Mapping Dimension
The predictor maps the representation to a feature space of a certain dimension. We investigate the effect of the predictor mapping dimension on the model performance. Table 6.a shows that the predictor mapping dimension can seriously impair the performance of the model when it is small, and when the dimension rises to a suitable range or larger, it no longer has a significant impact on the model. This may be related to the intrinsic dimension of the representation, which leads to the loss of semantic information in the representation when the predictor dimension is smaller than the intrinsic dimension of the feature, compromising the model performance. We keep the dimension of the predictor consistent with the encoder in our experiments. More results can be found in Appendix A.8.

Batch Size
With a fixed queue size, we investigate the effect of batch size on model performance; the results are in Table 6. The model achieves its best performance when the batch size is 64. Surprisingly, the model performance does not improve with increasing batch size, which contradicts the general experience in image contrastive learning. This is one of our motivations for further exploring the effect of the number of negative samples on the model.

Size of Negative Sample Queue
The queue length determines the number of negative samples, which directly influences the performance of the model. We first test the effect of the negative sample queue size on model performance. With a queue size larger than 1024, the results become unstable and worse. We suppose this may be due to the random interference introduced into training by filling the initial negative sample queue. This interference degrades the model's performance as the initial negative sample queue becomes longer. To reduce the drawbacks caused by this randomness, we change the way the negative queue is initialized. We initialize a smaller negative queue, fill the queue to its set length in the first few updates, and then update it normally. According to our experiments, the model achieves the best results when the negative queue size is set to 512 and the smaller initial queue size is set to 128.
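The initialization scheme described above can be sketched with a bounded deque: a small random prefix is created first, the queue then grows as key batches are pushed, and once it reaches its maximum length the oldest entries are dropped. The Gaussian initialization and the toy dimension of 4 are assumptions made for illustration only.

```python
import random
from collections import deque


def init_queue(max_size, initial_size, dim, seed=0):
    """Create a FIFO queue holding only `initial_size` random vectors."""
    rng = random.Random(seed)
    queue = deque(maxlen=max_size)
    for _ in range(initial_size):
        queue.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return queue


def enqueue(queue, keys):
    """Push a batch of key vectors; a full deque drops its oldest entries."""
    queue.extend(keys)


# Queue length 512, initial size 128, batch size 64 (the best setting above).
queue = init_queue(max_size=512, initial_size=128, dim=4)
sizes = []
for step in range(8):
    batch = [[float(step)] * 4 for _ in range(64)]  # stand-in for key vectors
    enqueue(queue, batch)
    sizes.append(len(queue))
```

In this toy run the queue reaches its full length after six batches, and two further batches are enough to push out every randomly initialized vector.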
According to the experiments of MoCo, increasing the queue length improves model performance. However, as shown in Table 7, increasing the queue length with a fixed batch size decreases the performance of our model, which is not consistent with the observation in MoCo. We speculate that this may be because NLP models update faster, so larger queues store too much outdated feature information, which is detrimental to the performance of the model. Combined with the observed effect of batch size, we further conjecture that the effect of the negative sample queue on model performance is controlled by the model history information contained in the negative samples in the queue. See Appendix A.9 and A.10 for more results on the effect of the random initialization size and the queue length.
Since the queue is first-in-first-out, to test the hypothesis above, we slice the negative sample queue and use different parts of the queue in the loss calculation. Here, we set the negative queue length to 1024, the initial queue size to 128, and the batch size to 256. Thus, 256 negative samples are pushed into the queue at each iteration. We test the slices 0 ∼ 512, 256 ∼ 768, and 512 ∼ 1024, a concatenation of the slices 0 ∼ 256 and 768 ∼ 1024, and the entire negative sample queue. The experimental results are shown in Table 8.
The experiments show that the model performs best when using the middle part of the queue. We therefore conclude that increasing the queue length affects model performance not only because of the increased number of negative samples, but more because it provides historical information within a certain range.
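The slicing experiment can be pictured with the following sketch, which selects index ranges from a queue of length 1024. Integer indices stand in for the stored key vectors, and which end of the index range holds the oldest samples is not specified in the text, so no ordering is assumed here.

```python
def slice_queue(queue, ranges):
    """Concatenate the half-open [start, end) index ranges of the queue."""
    selected = []
    for start, end in ranges:
        selected.extend(queue[start:end])
    return selected


# Integers stand in for the 1024 key vectors stored in the queue.
queue = list(range(1024))

variants = {
    "0~512": [(0, 512)],
    "256~768": [(256, 768)],
    "512~1024": [(512, 1024)],
    "0~256 + 768~1024": [(0, 256), (768, 1024)],
    "all": [(0, 1024)],
}
negatives = {name: slice_queue(queue, spans) for name, spans in variants.items()}
```

Every partial variant uses 512 negatives, so differences between them come from which part of the history they cover rather than from the number of negatives.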

Maximum Traceable Distance Metric
To verify that historical information in the negative sample queue influences model performance, we define a Maximum Traceable Distance (MTD) metric d_trace to help explore the phenomenon:

$$d_{trace} = \frac{1}{1 - \eta} + \frac{\text{queue\_size}}{\text{batch\_size}} \tag{4}$$

Here η refers to the decay weight of EMA. d_trace calculates the number of update steps between the current online branch and the oldest negative samples in the queue. The first term of the formula represents the traceable distance between the target and online branches due to the EMA update mechanism. The second term represents the traceable distance between the negative samples in the queue and the current target branch due to the queue's first-in-first-out mechanism. The longer the traceable distance, the wider the temporal range of the historical information contained in the queue. We obtain different values of the traceable distance by jointly adjusting the decay weight, queue size, and batch size. As shown in Figure 2 and Figure 3, the best result of BERT-base is obtained when d_trace is around 14.67. The best result of BERT-large shows a similar phenomenon; see Appendix A.11 for details. This further demonstrates that in text contrastive learning, the historical information used should be neither too old nor too new, and that an appropriate traceable distance between the branches is also important. Some derivations of Eq. 4 can be found in Appendix A.12.

Figure 3: The batch size does not invalidate the traceable distance. The traceable distance needs to be maintained within a reasonable range even for different batch sizes. This explains why increasing only the batch size does not improve performance: doing so can move the distance into unsuitable regions.

However, for an image contrastive learning model like MoCo, experimental results suggest that a longer queue increases performance. We believe that this difference is due to the anisotropy (Zhang et al., 2020b) that is characteristic of text. Text is influenced by word frequency, which produces an anisotropic, uneven distribution, unlike the near-uniform distribution of pixels in image data. This phenomenon affects the computation of the cosine similarity (Wang and Isola, 2020), on which the InfoNCE loss that we use depends, and thus affects the performance of the model as learning steps accumulate. To test this hypothesis, we use alignment and uniformity to measure the distribution of the representations in space and monitor their values for different MTDs. As shown in Figure 4, a proper MTD allows the alignment and uniformity of the model to reach an optimal combination.
The change in MTD is reflected in the uniformity and alignment of the learned text embedding: when MTD increases or decreases beyond the suitable range, uniformity and alignment move away from their optimal combination region.
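As a quick numerical check, the sketch below computes the traceable distance under the assumption that it is the sum of the EMA term 1/(1 − η) and the queue term queue size / batch size. With the base-model settings reported above (η = 0.85, queue size 512, batch size 64) this reproduces the value of about 14.67. The function name is hypothetical.

```python
def max_traceable_distance(eta, queue_size, batch_size):
    """Sum of the EMA term 1 / (1 - eta) and the queue term queue_size / batch_size."""
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1)")
    return 1.0 / (1.0 - eta) + queue_size / batch_size


base = max_traceable_distance(eta=0.85, queue_size=512, batch_size=64)
print(round(base, 2))  # 14.67

# Doubling only the batch size shortens the queue term from 8 to 4 steps.
larger_batch = max_traceable_distance(eta=0.85, queue_size=512, batch_size=128)
```

Under this reading, changing the batch size alone shifts the distance, which matches the remark that batch size has to be adjusted together with the queue size and decay weight.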

Conclusion
In this work, we propose MoCoSE, which applies a MoCo-style contrastive learning model to the empirical study of sentence embedding. We conduct experiments on each component of the model to provide practical guidance for text contrastive learning. We further investigate the application of the negative sample queue to text contrastive learning and propose a maximum traceable distance metric to explain the relation between the queue size and model performance.
We train our MoCoSE model using a single NVIDIA RTX3090 GPU. Our training system runs Microsoft Windows 10 with CUDA toolkit 11.1. We use Python 3.8 and PyTorch v1.8. We build the model with Transformers 4.4.2 (Wolf et al., 2020) and Datasets 1.8.0 (Lhoest et al., 2021) from Huggingface. Following SimCSE, we preprocess the training data so that the stored data can be loaded directly during training. We compute the uniformity and alignment metrics of the embeddings on the STS-B dataset according to the method proposed by Wang and Isola (2020). The STS-B dataset is also preprocessed. We use the nlpaug toolkit in our data augmentation experiments. For synonym replacement, we use the 'ContextualWordEmbsAug' function with 'roberta-base' as the parameter. For typo, we use 'SpellingAug'; for back translation, we use 'BackTranslationAug' with the parameter 'facebook/wmt19-en-de'; and for paraphrase, we use 'ContextualWordEmbsForSentenceAug' with the parameter 'xlnet-base-cased'. All parameters listed here are the official default values.
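The alignment and uniformity metrics mentioned above can be sketched as follows, using the definitions of Wang and Isola (2020) with their common defaults α = 2 and t = 2. The inputs are assumed to be L2-normalized embeddings, and this plain-Python version is an illustration rather than the evaluation code used in the experiments.

```python
import math
from itertools import combinations


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(positive_pairs, alpha=2):
    """Mean distance (to the power alpha) between embeddings of positive pairs."""
    total = sum(squared_distance(x, y) ** (alpha / 2) for x, y in positive_pairs)
    return total / len(positive_pairs)


def uniformity(embeddings, t=2):
    """Log of the mean Gaussian potential over all distinct embedding pairs."""
    potentials = [
        math.exp(-t * squared_distance(a, b))
        for a, b in combinations(embeddings, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


# Toy example on the unit circle.
points = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
spread_out = uniformity(points)
clustered = uniformity([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
```

Lower values are better for both metrics: low alignment means positive pairs stay close, and low uniformity means the embeddings are spread over the hypersphere.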

A.2 Symmetric Two-branch Structure
We remove the online branch predictor and set the EMA decay weight to 0, i.e., we make the structure and weights of the two branches identical. As shown in Figure 5, the model clearly collapses in this setting. The model always works best at the very beginning, i.e., training actually hurts its performance. In addition, as training proceeds, the correlation coefficient of the model approaches 0, i.e., the predictions have no correlation with the actual labels. We observed this result over several runs, so we adopted a design with two structurally different branches plus EMA momentum updates. Subsequent experiments demonstrated that this allows the model to avoid collapsing.
We then add the predictor to the online branch while keeping the EMA decay weight at 0. The model again appears to collapse and oscillates dramatically in the late stage of training, as shown in Figure 6.

A.3 Pseudo-Code for Training MoCoSE
The PyTorch-style pseudo-code for training MoCoSE with the negative sample queue is shown in Algorithm 1.

A.4 Distribution of Singular Values
Similar to SimCSE, we plot the distribution of singular values of MoCoSE sentence embeddings with SimCSE and Bert for comparison. As illustrated in Figure 7, our method is able to alleviate the rapid decline of singular values compared to other methods, making the curve smoother, i.e., our model is able to make the sentence embedding more isotropic.

A.5 Experiment Details of EMA Hyperparameters
The details of the impact of the EMA decay weight are shown in Figure 8. We perform this experiment with all parameters held constant except for the EMA decay weight.

A.6 Details of Different Data Augmentations
We use only dropout as a baseline for the results of data augmentations. Then, we combine dropout with other data augmentation methods and study their effects on model performance. The results are shown in Figure 9.

A.7 Experiment Details of FGSM
We test the effect of the intensity of FGSM on model performance. We keep the other hyperparameters fixed and vary the FGSM parameter (1e-9, 5e-9, 1e-8, 5e-8). As seen in Table 9, the average results of the model are best when the FGSM parameter is 5e-9.

A.8 Dimension of Sentence Embedding
Both BERT-whitening (Su et al., 2021) and MoCo mention that the dimension of the embedding can have some impact on the performance of the model. Therefore, we also change the dimension of the sentence embedding in MoCoSE and train the model several times to observe the impact of the embedding dimension. Because of the queue structure of MoCoSE, we need to keep the dimension of the negative examples consistent with the sentence embedding when changing its dimension. As shown in Figure 10, a low embedding dimension causes considerable damage to the performance of the model, while once the dimension rises to a certain range, the performance stays steady.

A.9 Details of Random Initial Queue Size
We test the influence of random initialization size of the negative queue on the model performance when queue length and batch size are fixed. As seen in Figure 11, random initialization does have some impact on the model performance.

A.10 Queue Size and Initial Size
We explored the effect of different combinations of initial queue size and queue length on model performance. The detailed experimental results are shown in Figure 13. It can be seen that model performance relies heavily on the initial queue size. Yet, too large a queue size makes the model extremely unstable. This is quite different from the observation of the negative sample queue in image contrastive learning.

Figure 11 (x-axis: initial size; y-axis: correlation): The effect of the initial queue size on the model results when the queue length is 512 and the batch size is 64.

A.11 Maximum Traceable Distance in Bert-large
Figure 12 (x-axis: maximum traceable distance; y-axis: correlation): The relationship between MTD and correlation of MoCoSE-BERT-large. It can be seen that even for the large model, peaks occur within a certain MTD range.
We also train MoCoSE with different batch sizes and queue sizes on BERT-large. As shown in Figure 12, we observe the best performance of MoCoSE-BERT-large within an appropriate Maximum Traceable Distance range (around 22). Once again, this suggests that even on BERT-large, longer queues do not improve model performance indefinitely. It also implies that the history information contained in the negative sample queue needs to be kept within a certain range on BERT-large as well.

A.12 Proof of Maximum Traceable Distance
Here, we prove the first term of the formula for the Maximum Traceable Distance. Due to the EMA update mechanism, the weight of the target branch is a weighted sum of the online weights in the update history. The first term of the Maximum Traceable Distance calculates the weighted sum of the historical update steps given a certain EMA decay weight η. From the principle of the EMA mechanism, we get the following equation:

$$S_n = \sum_{i=0}^{n} (1 - \eta) \cdot \eta^{i} \cdot (i + 1)$$

S_n represents the update steps between the online and target branches due to the EMA mechanism. Since EMA represents a weighted sum, we need to compute S_n to obtain this weighted sum.
We calculate the limit of the second part as $\frac{1}{1-\eta}$. Since the limits of both parts exist, we can obtain the limit of S_n by the law of limit operations.
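One way to carry out this calculation is sketched below. It assumes that the sum runs from i = 0 to n and that 0 ≤ η < 1, and it may differ in presentation from the authors' own derivation. Writing $T_n = \sum_{i=0}^{n} (i+1)\,\eta^{i}$ so that $S_n = (1-\eta)\,T_n$, subtracting $\eta T_n$ from $T_n$ telescopes the coefficients:

```latex
\begin{aligned}
(1-\eta)\,T_n &= \sum_{i=0}^{n} (i+1)\,\eta^{i} - \sum_{i=0}^{n} (i+1)\,\eta^{i+1}
               = \sum_{i=0}^{n} \eta^{i} - (n+1)\,\eta^{n+1}, \\
S_n &= -(n+1)\,\eta^{n+1} + \frac{1-\eta^{n+1}}{1-\eta}, \\
\lim_{n\to\infty} S_n &= 0 + \frac{1}{1-\eta} = \frac{1}{1-\eta}.
\end{aligned}
```

The first part vanishes because $\eta^{n+1}$ decays geometrically while $n+1$ grows only linearly, and the second part is a geometric series whose limit is $\frac{1}{1-\eta}$. For η = 0.85 this gives about 6.67 update steps.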