Improving Contrastive Learning for Sentence Embeddings from Two Perspectives



Introduction
Sentence representation, which transforms sentence semantic information from discrete language space into dense vectors, is one of the most fundamental tasks in natural language processing, as it plays a central role in a wide range of downstream applications (e.g., information retrieval, semantic comparison, question answering, and language translation). Sentence representation has been constantly evolving (Pennington et al., 2014; Zhang et al., 2020; Carlsson et al., 2021), and it achieves even stronger performance when utilizing pre-trained language models (PLMs) (Devlin et al., 2019; Delobelle et al., 2020). Moreover, on top of PLMs, a number of post-processing strategies achieve even better performance. For example, Li et al. (2020) employ a flow-based model and Su et al. (2021) apply a whitening process to flatten the distribution of representations toward uniformity.
In this paper, we improve sentence embedding models from two perspectives: dropout noise and feature corruption. First, we empirically study the effects of dropout randomness on positive pairs and negative pairs in the CL-based objective. We find that modest dropout noise in positive pairs is beneficial to model performance, whereas dropout noise in negative pairs is harmful. We provide an explanation based on the principle of noise contrastive estimation (Gutmann and Hyvärinen, 2012) and the role of dropout in constructing positive pairs. Based on these findings, we propose a simple yet effective strategy, off-dropout, which turns off the dropout randomness in negative pairs to further improve performance.
Second, we revisit the issue of feature corruption in sentence embeddings and empirically study the well-known solution to this problem recently proposed by Zbontar et al. (2021) and Klein and Nabi (2022). Surprisingly, we find that this solution does not improve performance under the contrastive learning framework for sentence embeddings. We further analyze this finding and identify the reason as a rank bottleneck issue in the mini-batch embedding matrix. To tackle this issue, we propose a simple dimension-wise contrastive learning (DCL) objective that breaks the bottleneck and eventually enhances the baseline performance.
As a result, by combining the proposed off-dropout and DCL, we advance the SimCSE baseline by 1.9 points. Furthermore, our reproduced results show that we advance the current state-of-the-art model, DiffCSE (Chuang et al., 2022), by 1.4 points.
In general, our contribution is three-fold: 1. We, for the first time, point out that dropout noise from negative pairs has a side effect on model performance, and we propose an off-dropout sampling strategy to alleviate this side effect.
2. We identify the rank bottleneck in the current solution to the feature corruption problem and propose a novel dimension-wise CL objective to avoid the bottleneck.
3. Experimental results on standard benchmarks for sentence embeddings show that the combination of our proposed methods outperforms strong baselines by a margin and achieves a new state-of-the-art.
Related Work

Sentence Representation
Early studies of sentence representation leverage ideas from word2vec (Mikolov et al.). Semantic information can be captured by predicting a sentence from its surrounding sentences (Kiros et al., 2015; Hill et al., 2016; Logeswaran and Lee, 2018). Pagliardini et al. (2018) aggregate n-gram embeddings using a pooling strategy, which achieves strong results. With the development of large-scale pre-trained language models (Devlin et al., 2019; Liu et al., 2020), sentence representation methods have begun to exploit PLMs' strong language representation ability. For example, Reimers and Gurevych (2019) employ a siamese network with PLMs for supervised sentence representation, while Li et al. (2020) and Su et al. (2021) apply post-processing on top of PLM representations.
All the previous studies on sentence embeddings have concentrated on developing more intricate frameworks based on the SimCSE framework. These advancements include creating more efficient training samples, introducing advanced metrics, and incorporating additional training tasks. In contrast to these existing studies, our research aims to enhance the contrastive learning framework itself. Specifically, we address two issues: the problem of dropout noise in the representation, and the feature corruption caused by correlation between different dimensions of the representation.
The concept of contrastive learning is based on Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010), which involves maximizing the probability of target signals by comparing them with randomly sampled noise. While NCE uses nonlinear logistic regression to distinguish between observed data and artificially generated noise using the log-density function, contrastive learning utilizes the InfoNCE (Oord et al., 2018) objective to discriminate between positive similarities and similarities among negative samples within the batch.
The previous research on NCE and contrastive learning primarily concentrates on the noise arising from the sampling of negative examples.However, this study investigates the noise originating from dropout randomness and examines the impact of dropout randomness on sentence embeddings, considering both negative and positive examples.

Feature Corruption Issue
Feature corruption is a non-trivial problem in representation learning in which each dimension of the output representation is highly similar to the others. This issue hinders the capacity to convey complex information effectively, as the diversity of each dimension's values is constrained by such correlation.
Several studies (Li et al., 2020; Su et al., 2021) have attempted to address this issue by achieving a more independent embedding space through post-processing. However, as demonstrated in Wang et al. (2022), these post-processing methods primarily enhance performance for sentence pairs with low similarity and fail to improve performance for pairs with high similarity.
Recently, Zbontar et al. (2021) proposed BarlowTwins as a solution to this issue for images. Inspired by the redundancy-reduction principle of the neuroscientist H. Barlow, BarlowTwins minimizes redundancy between different dimensions, naturally reducing similarity across dimensions. Unlike post-processing methods, this approach addresses the problem in an end-to-end manner. Furthermore, a direct application of BarlowTwins to sentence embeddings (Klein and Nabi, 2022) achieves performance comparable to SimCSE.
In contrast to previous research that simply applies the BarlowTwins objective to the SimCSE framework, our study investigates the rank bottleneck issue of BarlowTwins in the context of sentence representation.We tackle this issue and improve the model's performance accordingly.

Improving Dropout Noise in CL
The SimCSE framework plays a central role in recent sentence embedding strategies. It is a simple contrastive learning framework that learns by identifying positive pairs among in-batch negatives. Specifically, for a given sentence $x_i$, let $f(\cdot)$ denote a pre-trained language model; it is used to generate two views $(z_i^1, z_i^2)$ of the identical sentence $x_i$ via different dropout patterns:

$$z_i^1 = f(x_i; \xi_i^1), \qquad z_i^2 = f(x_i; \xi_i^2), \qquad (1)$$

where $\xi_i^1$ and $\xi_i^2$ denote two samples of the dropout random variable $\xi$ (Srivastava et al., 2014).
SimCSE (Gao et al., 2021) aims to maximize the agreement between positive pairs $(z_i^1, z_i^2)$ and minimize it over the $N-1$ in-batch negatives $(z_i^1, z_j^2)$ using the InfoNCE objective (Oord et al., 2018):

$$\mathcal{L}_i = -\log \frac{e^{s(z_i^1, z_i^2)}}{\sum_{j=1}^{N} e^{s(z_i^1, z_j^2)}}, \qquad (2)$$

Here, $s(\cdot,\cdot)$ is the similarity measure between two inputs (i.e., $\mathrm{cos\_sim}(\cdot,\cdot)/\tau$, where $\tau$ is the temperature). In Equation (2), $(z_i^1, z_j^2)$ with $j \neq i$ is a negative pair, and the dropout random variable $\xi$ is used as an augmentation function for the positive pair $(z_i^1, z_i^2)$.
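As a concrete illustration, the objective in Eq. (2) can be sketched in a few lines of NumPy (a minimal sketch on a toy batch; the function name `info_nce` is our own and this is not the released SimCSE code):

```python
import numpy as np

def info_nce(z1, z2, tau=0.05):
    """SimCSE-style InfoNCE loss: z1, z2 are the two dropout views of a
    batch of N sentences, each of shape (N, D). Row i of z1 is positive
    with row i of z2 and negative with every other row of z2."""
    # cosine similarity scaled by temperature: s(., .) = cos_sim(., .) / tau
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                     # sim[i, j] = s(z_i^1, z_j^2)
    # log-softmax over each row; the positive pair sits on the diagonal
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Aligned views should incur a much smaller loss than unrelated ones, which is exactly the training signal SimCSE exploits.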

Dropout Noise in Negative Estimation
Empirical study on dropout noise In PLMs such as BERT, dropout is known to play an important role in training because of its regularization effect. In CL-based sentence embeddings, the training objective Eq. (2) involves 2 × N BERT structures, and thus the role of dropout in Eq. (2) might be more complex. This motivates us to study the effect of dropout.
As presented in Eq. (1), dropout is determined by the random variable $\xi$, and thus $z_i^1$ (or $z_i^2$) contains some noise due to $\xi$. To study the effect of dropout noise, we respectively add more noise (+Noise) or reduce some noise (-Noise) in $z_i^1$ (or $z_i^2$) and then study the final performance.
Specifically, to introduce more noise into $z_i^1$ (or $z_i^2$), we add a small Gaussian noise as follows:

$$z_i^{1,+} = z_i^1 + g_1, \qquad z_i^{2,+} = z_i^2 + g_2,$$

where $g_1$ and $g_2$ are Gaussian with mean 0 and variance 0.1. On the other hand, according to the Central Limit Theorem (Fischer), the average of $K$ samples converges to its expectation with $1/K$ of the original variance. Therefore, to reduce the noise in $z_i^1$ (or $z_i^2$), we can simply use the following mean sampling:

$$z_i^{1,-} = \frac{1}{K}\sum_{k=1}^{K} f(x_i; \xi_i^{1,k}), \qquad z_i^{2,-} = \frac{1}{K}\sum_{k=1}^{K} f(x_i; \xi_i^{2,k}),$$

where $\xi_i^{1,k}$ and $\xi_i^{2,k}$ are independently sampled from the dropout variable $\xi$, and thus $z_i^{1,-}$ contains less noise than $z_i^1$.
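The two noise manipulations above can be sketched as follows (a sketch under our own naming; `encode` stands in for one dropout-randomized forward pass $f(x; \xi)$ and is not a function from the paper):

```python
import numpy as np

def add_noise(z, variance=0.1, rng=None):
    """+Noise variant: perturb an embedding with Gaussian noise of
    mean 0 and the given variance, i.e. z^{+} = z + g."""
    rng = rng or np.random.default_rng()
    return z + rng.normal(0.0, np.sqrt(variance), size=z.shape)

def mean_sample(encode, x, K=8):
    """-Noise variant: average K independent dropout views of the same
    input; by the CLT the variance of the average shrinks by 1/K."""
    return np.mean([encode(x) for _ in range(K)], axis=0)
```

With a stochastic `encode`, the variance of `mean_sample(encode, x, K)` is roughly `1/K` of a single view's variance, which is the noise-reduction effect exploited for negative pairs.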
Experimental results and findings Since Eq. (2) contains the positive pair $(z_i^1, z_i^2)$ and negative pairs $(z_i^1, z_j^2)$, we conduct experiments to estimate the impact of the noise in positive and negative pairs individually. SimCSE+Pos+Noise is obtained by replacing the positive similarity $s(z_i^1, z_i^2)$ with $s(z_i^{1,+}, z_i^{2,+})$ in Eq. (2), and SimCSE+Neg+Noise is obtained by replacing the negative similarity $s(z_i^1, z_j^2)$ with $s(z_i^{1,+}, z_j^{2,+})$. Similarly, SimCSE+Pos-Noise applies $s(z_i^{1,-}, z_i^{2,-})$ as the replacement of the positive similarity, and SimCSE+Neg-Noise uses $s(z_i^{1,-}, z_j^{2,-})$ to replace the negative similarity. Table 1 shows that increasing the noise level for both positive and negative embeddings degrades performance, while reducing the noise level for negative embeddings helps model performance. In summary, we obtain the following findings: 1) modest noise in positive pairs is necessary to make CL successful, and reducing noise in positive pairs is harmful to performance; 2) model performance is related to the noise level of negative pairs: more noise degrades performance while less noise improves it.
Theoretical Explanation Contrastive learning compares the similarity of positive examples with that of negative ones. This idea is based on Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010), where the positive similarity score is the target signal that NCE tries to maximize, while the negative similarity score is the corresponding noise signal.
The InfoNCE loss in Eq. (2) follows Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010), which converges faster and performs better when the sample size is large, as theoretically analyzed in Gutmann and Hyvärinen (2012). In this sense, reducing the noise in embeddings via mean pooling over multiple embeddings implicitly increases the sample size with respect to the random variable $\xi$ and potentially leads to improved performance; i.e., replacing $z_i^1$ and $z_i^2$ with $z_i^{1,-}$ and $z_i^{2,-}$ involving $K$ samples (in both positive and negative pairs within Eq. (2)) through mean sampling may yield better performance.
However, enlarging the sample size affects positive and negative pairs differently. As shown in Table 1, reducing noise in positive pairs through mean sampling results in unsatisfactory performance, while it improves performance for negative pairs. The main reason is that, under the SimCSE framework, positive pairs require diversity to serve as informative pairs for contrastive learning, and this diversity is reduced by mean sampling. The training signal in Eq. (2) may become trivial if there is no diversity between $z_i^1$ and $z_i^2$ for a positive pair, because $s(z_i^1, z_i^2) > s(z_i^1, z_j^2)$ holds trivially when $z_i^1 = z_i^2$ and $i \neq j$. In summary, diversity is crucial for positive pairs, while minimizing noise is beneficial for negative pairs.

Our Solution: Off-Dropout Sampling
Mean sampling significantly reduces the variance and yields better performance. However, $K$-times average sampling incurs a time complexity overhead of $O(KN)$.
To address this overhead, we propose off-dropout sampling, which turns off dropout when sampling negative example representations. Off-dropout sampling produces representations with zero variance. At a high level, off-dropout sampling is empirically equivalent to the mean of infinitely many resamplings, as demonstrated by Hinton et al. (2012), which is also known as the weight scaling inference rule (Goodfellow et al., 2016). Therefore, off-dropout sampling provides an unbiased estimate of the representation with zero variance, and its sampling overhead is equal to that of default random sampling. Consequently, the InfoNCE objective with off-dropout sampling is:

$$\mathcal{L}_i = -\log \frac{e^{s(z_i^1, z_i^2)}}{e^{s(z_i^1, z_i^2)} + m \sum_{j \neq i} e^{s(\bar{z}_i, \bar{z}_j)}},$$

where $s(\bar{z}_i, \bar{z}_j)$ is the similarity between negative pairs, $\bar{z}_i, \bar{z}_j$ are representations sampled without dropout, and $m$ is a trade-off factor between positive and negative examples.
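The mechanism can be illustrated with a toy encoder (our own minimal stand-in, not the paper's implementation): dropout stays active for the two positive views, and is switched off, via the weight-scaling inference rule, when encoding negatives:

```python
import numpy as np

class ToyEncoder:
    """Minimal stand-in for f(.; xi): dropout is active in train mode
    and disabled (yielding the expectation over masks) in eval mode."""
    def __init__(self, p=0.1, seed=0):
        self.p, self.training = p, True
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.normal(size=(16, 16))

    def __call__(self, x):
        h = x @ self.W
        if self.training:
            # inverted dropout: scale kept units by 1/(1-p) so the
            # expectation over masks equals the eval-mode output
            mask = self.rng.random(h.shape) >= self.p
            return h * mask / (1.0 - self.p)
        return h

enc = ToyEncoder()
x = np.ones((4, 16))
z1, z2 = enc(x), enc(x)      # two stochastic views -> positive pair
enc.training = False
z_bar = enc(x)               # deterministic view for negative pairs
```

The eval-mode output is deterministic and matches the average over many dropout masks, which is why off-dropout sampling behaves like infinite-sample mean sampling at the cost of a single forward pass.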
It should be noted that reducing the noise in negatives is very different from hyperparameter tuning. In principle, we investigate the sample size and thereby examine whether current sentence embedding methods satisfy the large-sample-size requirement of the NCE principle. In practice, tuning the dropout rate changes the distribution of dropout patterns, which violates the principle of controlling variables. Therefore, our strategy of reducing the noise in negatives is fundamentally different from parameter tuning in both principle and practice.

Mitigating Feature Corruption
Feature Corruption Issue Feature corruption (Chen and He, 2021) refers to the issue that each dimension of the output representation has high similarity with the other dimensions. Such correlation between dimensions reduces the model's representation capability and undermines downstream performance (Zbontar et al., 2021; Klein and Nabi, 2022). Zbontar et al. (2021) propose BarlowTwins, a dimension-decorrelation objective used as an additive regularization to tackle this issue. BarlowTwins tackles feature corruption by minimizing the redundancy between dimensions and aims to produce dimensionally independent representations. Formally, given a cross-correlation matrix $C \in \mathbb{R}^{D \times D}$, its objective is:

$$\mathcal{L}_{BT} = \sum_{c} (1 - C_{cc})^2 + \beta \sum_{c} \sum_{d \neq c} C_{cd}^2, \qquad C_{cd} = \frac{\sum_{i=1}^{N} z_{i,c}^1\, z_{i,d}^2}{\sqrt{\sum_{i=1}^{N} (z_{i,c}^1)^2}\,\sqrt{\sum_{i=1}^{N} (z_{i,d}^2)^2}},$$

where $D$ is the total number of dimensions ($D = 768$ for the base model), $c, d$ are dimension indices, and $z_{i,c}^1, z_{i,d}^2$ are the corresponding dimension values of the representation of the $i$-th sentence from a mini-batch of size $N$. However, this objective does not yield gains over SimCSE when applied to sentence embeddings on STS tasks (Klein and Nabi, 2022).
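For reference, the BarlowTwins objective can be sketched in NumPy (a minimal sketch; `beta` is the standard off-diagonal weight from the original BarlowTwins formulation, and per-dimension standardization makes `C` a correlation matrix):

```python
import numpy as np

def barlow_twins(z1, z2, beta=5e-3):
    """BarlowTwins loss over the D x D cross-correlation matrix C built
    from two views z1, z2 of shape (N, D): pull the diagonal of C
    toward 1 and the off-diagonal entries toward 0."""
    # standardize each dimension across the batch so C holds correlations
    z1 = (z1 - z1.mean(axis=0)) / z1.std(axis=0)
    z2 = (z2 - z2.mean(axis=0)) / z2.std(axis=0)
    N = z1.shape[0]
    C = z1.T @ z2 / N
    on_diag = np.sum((1.0 - np.diag(C)) ** 2)
    off_diag = np.sum(C ** 2) - np.sum(np.diag(C) ** 2)
    return on_diag + beta * off_diag
```

Identical views give a diagonal of exactly 1, so the on-diagonal term vanishes and only residual cross-dimension correlation is penalized.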

Rank Bottleneck for BarlowTwins
BarlowTwins aims to achieve orthogonalization of all dimensions in the representation by maximizing the diagonal elements of the correlation matrix, denoted as $C = (C_{cd})$, while minimizing the non-diagonal elements. In linear algebra, a parametrized matrix can be optimized to become an orthogonal matrix only if there exists a parameter that makes the matrix full rank. However, both theoretically and empirically, we observe that $C$ is far from being a full-rank matrix: its rank is far below $D$.
From a theoretical standpoint, if the denominator of $C_{cd}$ remains constant for any $c$ and $d$, $C$ can be expressed as the product of a $D \times N$ matrix and an $N \times D$ matrix. In this case, the rank of $C$ is at most $\min(N, D)$. However, in the conventional SimCSE setting, $N$ is 64 and $D$ is 768. Consequently, the rank of $C$ is at most $N$, where $N \ll D$, for any parameter. From an empirical perspective, we randomly sample a batch of 64 sentences and compute the rank of their cross-correlation matrix; we observe that the rank of the SimCSE correlation matrix is 64. It is therefore impossible to optimize a rank-64 matrix into a rank-768 identity matrix using the BarlowTwins objective: the rank of the correlation matrix poses a bottleneck that prevents $C$ from being optimized into a full-rank matrix. This might explain why BarlowTwins does not perform well when applied on top of SimCSE, as demonstrated in Table 2.
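This rank argument is easy to check numerically (a self-contained illustration with random embeddings standing in for BERT outputs):

```python
import numpy as np

# The cross-correlation matrix is (up to per-dimension scaling) a product
# of a D x N and an N x D matrix, so rank(C) <= min(N, D) = 64,
# far below the D = 768 required for a full-rank identity target.
rng = np.random.default_rng(0)
N, D = 64, 768
Z1 = rng.normal(size=(N, D))   # view-1 embeddings of one mini-batch
Z2 = rng.normal(size=(N, D))   # view-2 embeddings
C = Z1.T @ Z2 / N              # D x D cross-correlation (unnormalized)
print(np.linalg.matrix_rank(C))   # 64
```

No choice of encoder parameters can raise this rank above the batch size, which is the bottleneck discussed above.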

Empirical Justification of the Rank Bottleneck
To verify the rank bottleneck hypothesis, one could adjust the batch size or reduce the total number of representation dimensions. However, increasing the batch size alters the number of in-batch negatives, while reducing the representation dimensions exacerbates the dimension bottleneck problem. Both methods modify the default settings of SimCSE and consequently affect its performance.
To address this, we conduct a straightforward experiment without altering the SimCSE framework settings. We maintain the original SimCSE settings but introduce $M$ artificial embeddings into each mini-batch embedding matrix when calculating the BarlowTwins loss. Thus, contrastive learning at the data level is still performed on the $N$ batch embeddings, while dimension-wise decorrelation is applied to the padded embedding matrix of size $N + M$. Consequently, we increase the rank of the correlation matrix by $M$ without modifying SimCSE.
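The padding step can be sketched as follows (the helper name `pad_embeddings` is our own; following the Table 2 caption, the artificial rows are drawn from a standard Gaussian, and we assume the same artificial rows are appended to both views so they correlate in the padded matrix):

```python
import numpy as np

def pad_embeddings(z, M, rng=None):
    """Append M artificial rows ~ N(0, 1) to a mini-batch embedding
    matrix of shape (N, D), raising the attainable rank of the
    cross-correlation matrix from at most N to at most N + M."""
    rng = rng or np.random.default_rng(0)   # fixed seed: same rows per view
    fake = rng.normal(size=(M, z.shape[1]))
    return np.concatenate([z, fake], axis=0)
```

Only the decorrelation term sees the padded matrix; the data-level InfoNCE term still operates on the original `N` rows, so the SimCSE setup itself is untouched.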
We employ this approach to train the model, and the results are presented in Table 2. The table illustrates that the performance of the BarlowTwins objective improves as the number of padded artificial embeddings increases. By introducing these artificial embeddings, we successfully overcome the rank bottleneck issue of the correlation matrix.

Our Solution: Dimension-Wise Contrastive Learning
Previous experiments have confirmed the existence of the rank bottleneck issue in the BarlowTwins objective and addressed it by padding artificial embeddings. However, optimizing parameters with a large number of artificial embeddings reduces training efficiency. Therefore, we propose a Dimension-wise Contrastive Learning (DCL) objective that naturally avoids the rank bottleneck issue. The DCL objective is defined as follows:

$$\mathcal{L}_{DCL} = -\sum_{c=1}^{D} \log \frac{e^{s(z_{\cdot,c}^1, z_{\cdot,c}^2)}}{\sum_{d=1}^{D} e^{s(z_{\cdot,c}^1, z_{\cdot,d}^2)}}, \qquad (5)$$

The term $s(z_{\cdot,c}^1, z_{\cdot,d}^2)$ calculates the cross-dimension similarity between the $c$-th and $d$-th dimensions. We use a dot product with batch normalization, scaled by a temperature $\tau_{DCL}$, to measure similarity:

$$s(z_{\cdot,c}^1, z_{\cdot,d}^2) = \mathrm{BN}(z_{\cdot,c}^1)^{\top} \mathrm{BN}(z_{\cdot,d}^2) / \tau_{DCL}.$$

The DCL objective represents dimension-wise contrastive learning. It improves upon the BarlowTwins objective in several ways: 1) intuitively, Eq. (5) is a relative optimization that can be more easily optimized than the absolute regression objective (Gutmann and Hyvärinen, 2012); 2) this relative optimization avoids the rank bottleneck issue by only requiring each dimension to be relatively more "self-similar" than the other dimensions, instead of requiring a full-rank identity matrix as the only optimal solution.
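A minimal NumPy sketch of Eq. (5) follows (our own naming; `dcl_loss` treats each column of the batch embedding matrix as one "dimension" example and contrasts matching columns of the two views against all other columns):

```python
import numpy as np

def dcl_loss(z1, z2, tau_dcl=5.0, eps=1e-5):
    """Dimension-wise contrastive loss: dimension c of view 1 should be
    most similar to dimension c of view 2, relative to all dimensions d.
    z1, z2 have shape (N, D); similarity is a dot product of
    batch-normalized columns, scaled by a temperature."""
    # batch-normalize each dimension (column) across the mini-batch
    b1 = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + eps)
    b2 = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + eps)
    sim = b1.T @ b2 / tau_dcl          # (D, D) cross-dimension similarities
    # softmax over dimensions d; the matching dimension is the diagonal
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Because each dimension only needs to win a relative comparison against the other dimensions, no full-rank correlation matrix is required for the loss to reach a useful optimum.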
By combining both proposed strategies with a trade-off factor $\lambda$, the final objective function for improving contrastive learning for sentence embeddings is:

$$\mathcal{L} = \mathcal{L}_{InfoNCE} + \lambda\, \mathcal{L}_{DCL}.$$

Experiments

Dataset
We use the default one million randomly sampled sentences from English Wikipedia for unsupervised training, as previous studies (Gao et al., 2021; Chuang et al., 2022; Zhang et al., 2022c; Wu et al., 2022) are all conducted on this corpus. We do not apply any data selection or sampling strategy during training.
During training, the contrastive temperature τ is set to 0.05, the same as SimCSE, and the trade-off ratio m is set to 0.9. For DCL, we set the temperature τ_DCL to 5 and the loss coefficient λ to 0.1. We train the model for one epoch with a learning rate of 3e-5 for the base model and 8e-6 for the large model, with the same batch size of 64 and sequence length of 32. The model is optimized by the Adam (Kingma and Ba, 2014) optimizer with default settings and without gradient accumulation.
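The reported hyperparameters can be collected in one place (the variable names below are our own; the values are those stated above):

```python
# Training hyperparameters as reported in the paper.
config = {
    "temperature_tau": 0.05,   # InfoNCE temperature, same as SimCSE
    "tradeoff_m": 0.9,         # positive/negative trade-off factor
    "tau_dcl": 5.0,            # DCL temperature
    "lambda_dcl": 0.1,         # DCL loss coefficient
    "lr_base": 3e-5,           # learning rate, base model
    "lr_large": 8e-6,          # learning rate, large model
    "batch_size": 64,
    "max_seq_len": 32,
    "epochs": 1,
    "optimizer": "Adam",       # default settings, no gradient accumulation
}
```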

Main Results
The evaluation results are shown in Table 3, where we also explore the individual contributions of the DCL objective and off-dropout sampling. Off-dropout sampling alone improves the sentence semantic representation to a 77.13% Spearman correlation score, while the DCL objective with the normal dropout-augmented negative contrastive term achieves 77.40%.

Ablation Study
We investigate the effect of the hyperparameters on the whole system on the STS-B development set with the BERT base model in Table 5. We search m in the range {0.5, 0.8, 0.9, 1, 1.1, 1.2}; the optimal value is 0.9. We search the aggregation weight λ for DCL within the range {0.02, 0.05, 0.1, 0.2, 0.5, 1}; the optimal value is 0.1. We search the DCL temperature in the range {1, 2, 5, 10, 20, 50}; the optimal DCL temperature is 5.
Following Gao et al. (2021), we also plot the alignment-uniformity joint plot in Appendix A, and we conduct a qualitative comparison on sentence retrieval tasks in Appendix B to further illustrate our improvement.

Comparing with post-processing We compare the single DCL objective with widely applied post-processing methods (i.e., whitening and the flow model). Table 4 shows that the DCL objective outperforms all the post-processing methods.
Robustness to other framework In Table 4, we introduce our method into the DiffCSE framework using officially released source code with our proposed methods.As a result, we further advance DiffCSE baseline by 1.4 points based on our reproduced results.
Runtime Efficiency We compare the training time of SimCSE and our proposed SimCSE++ in Table 6. Off-dropout sampling and DCL do not introduce noticeable running time overhead compared to the SimCSE baseline. Moreover, we observe that both SimCSE and our proposed SimCSE++ converge to their optimum within the first 5k training steps, which is around 30 minutes of training. Consequently, the overhead of our modification is negligible.

Conclusion
In this paper, we improve CL-based sentence embeddings from the perspectives of dropout noise and feature corruption.
The main findings are: 1) modest dropout noise in positive pairs is essential and reducing it is harmful, whereas reducing dropout noise in negative pairs is beneficial; 2) the well-known solution to feature corruption does not yield gains on sentence embeddings due to the rank bottleneck issue. Accordingly, we propose off-dropout sampling to eliminate dropout randomness from negative pairs and a dimension-wise CL objective to break the bottleneck and alleviate feature corruption, both of which outperform strong baselines by a margin.

Ethical Considerations
This study focuses on sentence representation, the objective of which is to achieve better performance on general-domain sentence similarity tasks. The training corpus and benchmark datasets are open source and do not contain any personally sensitive information, and we employ widely applied pre-trained language models with commonly used contrastive learning strategies, thereby having no impact on the political, social, or natural environment.

Limitations
The limitations consist of two aspects: for dropout noise, a novel sampling strategy for positive pairs is left unexplored; for DCL, it could be improved by applying more advanced data-wise contrastive learning strategies.

Table 1: Avg. performance on the STS benchmark when adding/reducing noise in positive and negative pairs.

Table 2: SimCSE performance with additive BarlowTwins objectives. We pad each mini-batch (batch size 64) embedding matrix with a group of artificial representations sampled from a standard Gaussian distribution.

Table 4: Block 1: DCL compared with post-processing methods; NLI is used without labels. Block 2: Post-processing methods on top of SimCSE lead to unsatisfying performance. Block 3: SimCSE++ is robust to a non-SimCSE framework. 1: Using the officially released source code, our method improves its performance with p < 0.005.

Table 5: Searching for the weight term m, the DCL objective weight λ, and the DCL temperature τ_DCL on the STS-B development set. STS-B dev scores: 82.16, 82.43, 83.77, 83.56, 83.37, 81.76.

Table 6: 1-epoch training time for SimCSE and our proposed SimCSE++.