WhitenedCSE: Whitening-based Contrastive Learning of Sentence Embeddings

This paper presents a whitening-based contrastive learning method for sentence embedding learning (WhitenedCSE), which combines contrastive learning with a novel shuffled group whitening. Generally, contrastive learning pulls distortions of a single sample (i.e., positive samples) close and pushes negative samples far away, correspondingly facilitating alignment and uniformity in the feature space. A popular alternative to the "pushing" operation is whitening the feature space, which scatters all the samples for uniformity. Since whitening and contrastive learning have large redundancy w.r.t. uniformity, they are usually used separately and do not easily work together. For the first time, this paper integrates whitening into the contrastive learning scheme and facilitates two benefits. 1) Better uniformity. We find that these two approaches are not totally redundant but actually have some complementarity due to their different uniformity mechanisms. 2) Better alignment. We randomly divide the feature into multiple groups along the channel axis and perform whitening independently within each group. By shuffling the group division, we derive multiple distortions of a single sample and thus increase the positive-sample diversity. Consequently, using multiple positive samples with enhanced diversity further improves contrastive learning due to better alignment. Extensive experiments on seven semantic textual similarity tasks show our method achieves consistent improvement over the contrastive learning baseline and sets a new state of the art, e.g., 78.78% (+2.53% based on BERT-base) Spearman correlation on STS tasks.


Introduction
This paper considers self-supervised sentence representation (embedding) learning. It is a fundamental task in natural language processing (NLP) and can benefit a wide range of downstream tasks (Qiao et al., 2016; Le and Mikolov, 2014; Lan et al., 2019; Logeswaran and Lee, 2018). Two characteristics matter for sentence embeddings, i.e., uniformity (of the overall feature distribution) and alignment (of the positive samples), according to common wisdom in deep representation learning (Wang and Isola, 2020). Alignment expects minimal distance between positive pairs, while uniformity expects the features to be uniformly distributed in the representation space overall. From this viewpoint, the popular masked language modeling (MLM) (Devlin et al., 2018; Liu et al., 2019; Brown et al., 2020b; Reimers and Gurevych, 2019) is not an optimal choice for sentence embedding: MLM methods do not explicitly enforce the objectives of uniformity and alignment and thus do not quite fit the objective of sentence representation learning.
1 Our code will be available at https://github.com/SupstarZh/WhitenedCSE.
Figure 1 (caption fragment): in (d), the positive samples after SGW (red) obtain higher diversity than the original BERT features (green); using these diverse positive samples for contrastive learning, the proposed WhitenedCSE achieves better alignment.
To improve the uniformity as well as the alignment, there are two popular approaches, i.e., contrastive learning and post-processing. 1) Contrastive learning methods (Yan et al., 2021; Kim et al., 2021) pull similar sentences close to each other and push dissimilar sentences far away in the latent feature space. Pulling similar sentences close directly enforces alignment, while pushing dissimilar sentences apart implicitly enforces uniformity (Wang and Isola, 2020). 2) In contrast, the post-processing methods mainly focus on improving the uniformity. They use normalizing flows (Li et al., 2020) or a whitening operation (Su et al., 2021) to project the already-learned representations into an isotropic space. In other words, these methods scatter all the samples across the feature space and thus improve the uniformity.
In this paper, we propose a whitening-based contrastive learning method for sentence representation learning (WhitenedCSE). For the first time, we integrate whitening into the contrastive learning scheme and demonstrate substantial improvement. Specifically, WhitenedCSE combines contrastive learning with a novel Shuffled Group Whitening (SGW). Given a backbone feature, SGW randomly divides the feature into multiple groups along the channel axis and performs whitening independently within each group. The whitened features are then fed into the contrastive loss for optimization.
Although canonical whitening (or group whitening) only benefits uniformity, SGW in WhitenedCSE improves not only the uniformity but also the alignment. We explain these two benefits in detail below: • Better uniformity. We notice that the pushing effect in contrastive learning and the scattering effect in whitening have large redundancy with respect to each other, because they both facilitate uniformity. This redundancy is arguably the reason why no prior literature tries to combine them. Against this background, our finding that these two approaches are not totally redundant but actually have some complementarity is non-trivial. We attribute this complementarity to their different uniformity mechanisms and discuss the differences in Section 3.2.3. In Fig. 1, we observe that while contrastive learning (Fig. 1 (b)) already improves the uniformity over the original BERT features (Fig. 1 (a)), applying whitening (Fig. 1 (c)) brings another round of uniformity improvement.
• Better alignment. In the proposed WhitenedCSE, SGW is featured for its shuffled grouping operation, i.e., randomly dividing a backbone feature into multiple groups before whitening. Therefore, given the same backbone feature, we may repeat SGW multiple times to get different grouping results, and thus different whitened features. These "duplicated" features differ from each other and therefore increase the diversity of positive samples, as shown in Fig. 1 (d). Using these diverse positive samples for contrastive learning, WhitenedCSE improves the alignment.
Another important advantage of SGW is that, since it is applied to the backbone features, it incurs only slight computational overhead for generating additional positive samples. This high efficiency allows WhitenedCSE to increase the number of positive samples in a mini-batch (beyond the common setting of 2) at little cost. Ablation study shows that the enlarged number of positive samples brings a further benefit.
Our contributions are summarized as follows: (1) We propose WhitenedCSE for the self-supervised sentence representation learning task. WhitenedCSE combines contrastive learning with a novel Shuffled Group Whitening (SGW).
(2) We show that through SGW, WhitenedCSE improves not only the uniformity but also the alignment. Moreover, SGW enables efficient multi-positive training, which is also beneficial.
(3) We evaluate our method on seven semantic textual similarity tasks. Experimental results show that WhitenedCSE brings consistent improvement over the contrastive learning baseline and sets a new state of the art.
Related Work

Sentence Representation Learning
As a fundamental task in natural language processing, sentence representation learning has been extensively studied. Early works were mainly based on bag-of-words models (Wu et al., 2010; Tsai, 2012) or context-prediction tasks (Kiros et al., 2015; Hill et al., 2016). Recently, with the advent of pretrained language models (Devlin et al., 2018; Liu et al., 2019; Brown et al., 2020a), many works tend to directly use PLMs, such as BERT (Devlin et al., 2018), to generate sentence representations. However, some studies (Ethayarajh, 2019; Yan et al., 2021) found that directly using the [CLS] representation or the average pooling of token embeddings at the last layer suffers from the anisotropy problem, i.e., the learned embeddings collapse into a small area. To alleviate this problem, BERT-flow (Li et al., 2020) adopts a normalizing flow transformation while BERT-Whitening (Su et al., 2021) adopts a whitening transformation; both transform the representation space into a smooth and isotropic space. Most recently, contrastive learning (Chen et al., 2020) has become a powerful tool to obtain sentence representations.

Contrastive Learning
Contrastive learning (Chen et al., 2020) has achieved great success in sentence representation learning (Yan et al., 2021; Kim et al., 2021). It pulls semantically similar samples together and pushes dissimilar samples away, which can be formulated as:

\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{e^{\mathrm{sim}(h_i, h_i^{+})/\tau} + \sum_{j} e^{\mathrm{sim}(h_i, h_j^{-})/\tau}},   (1)

where τ is a temperature hyperparameter, sim(·, ·) denotes cosine similarity, and h_i^+ and h_j^- are the positive sample and the negative samples of h_i, respectively. Recently, alignment and uniformity (Wang and Isola, 2020) have been proposed to measure the quality of representations. Alignment measures whether the distance between positive samples is small, while uniformity measures the dispersion of the embeddings in the vector space. A typical method, SimCSE (Gao et al., 2021), uses dropout as a feature-wise data augmentation to construct the positive sample and randomly samples negatives from the batch, achieving a good balance between alignment and uniformity. More recent works further improve the quality of sentence representations based on SimCSE, such as ESimCSE, MixCSE (Zhang et al., 2022a) and VaSCL (Zhang et al., 2021b), each of which proposes a new data augmentation strategy to construct the positive pair. Besides, DCLR (Zhou et al., 2022) focuses on optimizing the negative sampling strategy, and ArcCSE (Zhang et al., 2022b) optimizes the objective function. In this paper, we find that contrastive learning can be further combined with whitening to obtain better sentence representations.
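As a concrete illustration, the following is a minimal sketch of the in-batch contrastive loss of Eq. 1, assuming cosine similarity over L2-normalized embeddings and treating the other in-batch positives as negatives (as in SimCSE); the function and tensor names are illustrative rather than the paper's released code.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(h: torch.Tensor, h_pos: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Eq. 1 sketch: h and h_pos are (N, d) anchor and positive embeddings."""
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    # sim[i, j] = cosine similarity between anchor i and view j;
    # diagonal entries are the positive pairs, off-diagonals act as negatives.
    sim = h @ h_pos.t() / tau
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(sim, labels)
```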

Whitening Transformation
In computer vision, recent works (Ermolov et al., 2021; Zhang et al., 2021c; Hua et al., 2021) use the whitening transformation as an alternative to the "pushing negatives away" operation in contrastive learning: they disperse the data uniformly throughout the spherical feature space and then pull the positive samples together, which has achieved great success in unsupervised representation learning.
Whitening (a.k.a. sphering) is a common transformation that transforms a set of variables into a new set of isotropic variables, making the covariance matrix of the whitened variables equal to the identity matrix. In natural language processing, Su et al. (2021) use whitening as a post-processing method to alleviate the anisotropy problem in pretrained language models. In this paper, we use whitening as an explicit operation to further improve the uniformity of the representation space, and we further explore the potential of whitening to improve alignment, so as to obtain a better sentence representation model.

Methods
In this section, we first describe the overall architecture of WhitenedCSE and then present the details of all the modules, including the shuffled group whitening module and the new contrastive learning module.

General Framework
As shown in Fig. 2, WhitenedCSE has three major components: • A BERT-like encoder, which we use to extract features from input sentences; we take the [CLS] token as the native sentence representation.
• A shuffled group whitening (SGW) module, which we use as a complement to contrastive learning to further improve the uniformity and alignment of the representation space.
• A multi-positive contrastive module, in which we pull distortions of the same representation close and push the negative samples away in the latent feature space.
Specifically, given a batch of sentences X, WhitenedCSE uses the feature encoder f_θ(x_i, γ) to map them into a high-dimensional space, where γ is a random mask for dropout; we then take the [CLS] output as the native sentence representations. After this, we feed the native sentence representations into the shuffled group whitening (SGW) module, which randomly divides each sentence representation into multiple groups along the channel axis and then applies group whitening to each group. We repeat SGW multiple times to get different grouping results, and thus different whitened representations. These "duplicated" features are different from each other. Finally, we use a multi-positive contrastive loss to pull each representation and all of its corresponding augmentations together, and to push it away from the others. We discuss feasible loss functions in Section 3.3 and present our final form of the loss function.
Figure 2: Method overview. In the middle column, WhitenedCSE consists of three components, i.e., 1) a BERT-like encoder for generating the backbone features from input samples, 2) a Shuffled Group Whitening (SGW) module for scattering the backbone features and augmenting the positive feature diversity, and 3) a multi-positive contrastive loss for optimizing the features. In the left column, when the mini-batch flows through these three components sequentially, the feature space undergoes "anisotropy" → "good uniformity + augmented positives" → "pulling close the positives". The right column illustrates the SGW module in detail: SGW randomly shuffles the backbone feature along the channel axis and then divides the feature into multiple groups; afterwards, SGW whitens each group independently and re-shuffles the whitened feature. Given a single backbone feature, we repeat the SGW process several times so as to generate multiple positive features.

Preliminaries for Whitening
Given a batch of normalized sentence representations Z ∈ R^{N×d}, the whitening transformation can be formulated as:

H = W Z^{\top},   (2)

where H ∈ R^{d×N} holds the whitened embeddings and W ∈ R^{d×d} is the whitening matrix. We denote the covariance matrix of Z by Σ. The goal of whitening is to make the covariance matrix of the whitened embeddings, HH^{\top}, equal to the identity matrix I, i.e., WΣW^{\top} = I. There are many different whitening methods, such as PCA (Jégou and Chum, 2012) and ZCA (Bell and Sejnowski, 1997). Group whitening uses ZCA as its whitening method to prevent stochastic axis swapping (Huang et al., 2018).
ZCA Whitening. The whitening matrix of the ZCA whitening transformation can be formulated as:

W = U Λ^{-1/2} U^{\top},   (3)

where U ∈ R^{d×d} stacks the eigenvectors of Σ and Λ is the corresponding diagonal eigenvalue matrix; U and Λ are obtained by eigendecomposition. Therefore, Eq. 2 becomes:

H = U Λ^{-1/2} U^{\top} Z^{\top}.   (4)

Group Whitening. The whitening module needs a large batch size to obtain a reliable estimate of the full covariance matrix, whereas in NLP a large batch size can be detrimental to unsupervised contrastive learning. To address this problem, we use group whitening (Huang et al., 2018), which controls the extent of whitening by decorrelating smaller groups. Specifically, given sentence representations of dimension d, group whitening first divides them into k groups (Z_1, Z_2, ..., Z_k), i.e., Z_i ∈ R^{N×(d/k)}, and then applies whitening within each group. That is:

H_i = U_i Λ_i^{-1/2} U_i^{\top} Z_i^{\top},  i = 1, ..., k.   (5)
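To make Eqs. 2-5 concrete, here is a rough sketch of ZCA whitening and group whitening for a batch of features Z of shape (N, d); the eps regularizer and helper names are illustrative assumptions, not the paper's released code.

```python
import torch

def zca_whiten(Z: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """ZCA whitening (Eqs. 3-4) of an (N, d) feature matrix."""
    Z = Z - Z.mean(dim=0, keepdim=True)                  # center each channel
    cov = Z.t() @ Z / (Z.size(0) - 1)                    # (d, d) covariance estimate
    eigvals, U = torch.linalg.eigh(cov)                  # cov = U diag(eigvals) U^T
    W = U @ torch.diag((eigvals + eps).rsqrt()) @ U.t()  # W = U Λ^{-1/2} U^T
    return Z @ W.t()                                     # whitened features, still (N, d)

def group_whiten(Z: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Group whitening (Eq. 5): whiten each channel group independently."""
    groups = Z.chunk(num_groups, dim=1)
    return torch.cat([zca_whiten(g) for g in groups], dim=1)
```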

Shuffled Group Whitening
To further improve the quality of the sentence representation model, we propose shuffled group whitening (SGW). We randomly divide the feature into multiple groups along the channel axis and then perform ZCA whitening independently within each group. After whitening, we apply a re-shuffle operation to recover the features to their original arrangement. The process can be formulated as:

\tilde{Z} = Φ(Z),  H_i = U_i Λ_i^{-1/2} U_i^{\top} \tilde{Z}_i^{\top},  H = Φ^{-1}([H_1; ...; H_k]),   (6)

where Φ denotes a random permutation of the channel axis, \tilde{Z}_i is the i-th group of the shuffled features, and Φ^{-1} restores the original channel order. This brings two benefits. First, it removes the limitation that only adjacent channels can be placed in the same group, which allows better decorrelation and thus better uniformity in the representation space. Second, it introduces a disturbance to the samples, so we can use it as a data augmentation method. Specifically, by repeating SGW multiple times we obtain different grouping results and hence different whitened features. These "duplicated" features are different from each other and thus increase the diversity of positive samples.
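Below is a hedged sketch of SGW (Eq. 6) built on the group_whiten helper sketched above: shuffle the channel axis, whiten each group, then invert the permutation. The function names and the illustrative call at the end are assumptions, not the paper's released code.

```python
import torch

def shuffled_group_whiten(Z: torch.Tensor, num_groups: int) -> torch.Tensor:
    """SGW sketch (Eq. 6): random channel shuffle -> group whitening -> un-shuffle."""
    d = Z.size(1)
    perm = torch.randperm(d, device=Z.device)        # random channel permutation Φ
    inv_perm = torch.argsort(perm)                   # inverse permutation Φ^{-1}
    whitened = group_whiten(Z[:, perm], num_groups)  # whiten each shuffled group
    return whitened[:, inv_perm]                     # restore the original channel order

# Repeating SGW with fresh permutations yields multiple distorted positives, e.g.:
# positives = [shuffled_group_whiten(Z, num_groups=2) for _ in range(3)]
```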

Connection to Contrastive Learning
Our finding that whitening and contrastive learning are not totally redundant but actually complementary is non-trivial. Specifically, whitening decorrelates features through matrix decomposition and makes the variance of every feature dimension equal to 1, i.e., it projects the features onto a spherical space. The "pushing" operation in contrastive learning approaches a uniform spherical distribution step by step through learning/iteration. Therefore, conceptually, whitening and contrastive learning are redundant with respect to optimizing the uniformity of the representation space. However, contrastive learning achieves uniformity by enlarging the distance between an anchor and all of its negative samples, with no explicit separation among the negative samples themselves. Whitening, in contrast, disperses all samples uniformly, so the two are complementary: whitening supplements the "pushing" among negative samples that contrastive learning lacks.

Multi-Positive Contrastive Loss
The SGW module produces multiple positive samples; however, the original contrastive loss in Eq. 1 can only handle a single positive. We provide two possible options of contrastive loss that can accommodate multiple positives. Given m positive samples, the objective function can be formulated as:

\ell_i = -\sum_{p=1}^{m} \lambda_m \log \frac{e^{\mathrm{sim}(h_i, h^{+}_{i,p})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j)/\tau}},   (7)

\ell_i = \log\Big[1 + \sum_{p=1}^{m}\sum_{j=1}^{N} e^{\left(\mathrm{sim}(h_i, h_j) - \mathrm{sim}(h_i, h^{+}_{i,p})\right)/\tau}\Big],   (8)

where λ_m is a hyperparameter that controls the impact of each positive. Eq. 7 puts the summation over positives outside of the log, while Eq. 8 puts it inside the log. It should be noted that in Eq. 8 there is a negative sign before the positive-pair similarity, which makes the loss conduct hard mining over the positives: the sum \sum_{p=1}^{m} e^{-\mathrm{sim}(h_i, h^{+}_{i,p})/\tau} is mainly determined by the least similar positive, so the loss is committed to punishing the items with less similarity, which helps bring all positive samples closer to the anchor. In our framework, we adopt Eq. 7 as our final loss function because it achieves better performance.
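As an illustration, here is a sketch of the adopted multi-positive loss of Eq. 7 under the assumption of equal weights λ_m = 1/m; the tensor shapes and names are assumptions rather than the paper's released code.

```python
import torch
import torch.nn.functional as F

def multi_positive_loss(h: torch.Tensor, h_pos: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Eq. 7 sketch: h is (N, d) anchors, h_pos is (m, N, d) positives from repeated SGW."""
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    m, N, _ = h_pos.shape
    labels = torch.arange(N, device=h.device)
    loss = h.new_zeros(())
    for p in range(m):
        sim = h @ h_pos[p].t() / tau      # in-batch similarities for the p-th positive view
        loss = loss + F.cross_entropy(sim, labels)
    return loss / m                       # equal weights lambda_m = 1/m (an assumption)
```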

Experiment Setup
In this section, we evaluate our method on seven semantic textual similarity (STS) tasks and seven transfer tasks. We use the SentEval (Conneau and Kiela, 2018) toolkit for all tasks.
Among the compared methods, SimCSE may be viewed as our direct baseline, because WhitenedCSE can be obtained from SimCSE by adding SGW and replacing the dual-positive contrastive loss with a multi-positive contrastive loss. Therefore, we use SimCSE as the baseline in our ablation studies.
Implementation details. We use the output of the MLP layer on top of the [CLS] token as our sentence representation. The MLP layer consists of three components: a shuffled group whitening module, a 768 × 768 linear layer, and an activation layer. Following SimCSE, we use 1 × 10^6 randomly sampled sentences from English Wikipedia as our training corpus. We start from the pretrained checkpoints of BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019). At training time, we set the learning rate to 3e-5 and the batch size to 64. We train our model for 1 epoch with temperature τ = 0.05. For BERT-base and BERT-large, we set the group size to 384; for RoBERTa-base and RoBERTa-large, we set the group size to 256. We set the number of positives to 3 for all models. We evaluate the model every 125 training steps on the development set of STS-B and keep the best checkpoint for evaluation on the test sets. We conduct our experiments on two NVIDIA RTX 3090 GPUs.
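For concreteness, the sketch below mirrors the projection head described above (SGW, then a 768 × 768 linear layer, then an activation). The choice of Tanh, the mapping of the reported group size of 384 to two channel groups, and the class name are assumptions; it also reuses the shuffled_group_whiten helper sketched in Section 3.

```python
import torch
import torch.nn as nn

class WhitenedProjectionHead(nn.Module):
    """Sketch of the MLP head on top of [CLS]: SGW -> Linear(768, 768) -> activation."""

    def __init__(self, dim: int = 768, num_groups: int = 2):
        super().__init__()
        self.num_groups = num_groups   # 768 / 384 = 2 groups (one possible interpretation)
        self.linear = nn.Linear(dim, dim)
        self.activation = nn.Tanh()    # assumed activation, following SimCSE-style heads

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        # cls_embedding: (N, 768) [CLS] features from the BERT-like encoder
        whitened = shuffled_group_whiten(cls_embedding, self.num_groups)
        return self.activation(self.linear(whitened))
```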

STS tasks
We conduct experiments on 7 semantic textual similarity (STS) tasks, using the SentEval toolkit (Conneau and Kiela, 2018) and Spearman's correlation coefficient as our evaluation metric. Spearman's correlation uses a monotonic equation to evaluate the correlation of two statistical variables; it varies between -1 and 1, with 0 implying no correlation, and the closer the value is to 1, the closer the two variables are to a positive correlation.
Tab. 1 shows the evaluation results on the 7 STS tasks, from which we can see that WhitenedCSE achieves competitive performance. Compared with SimCSE, WhitenedCSE achieves 2.53 and 1.56 points of improvement based on BERT-base and BERT-large, respectively. It also raises the performance from 76.57% to 78.22% based on RoBERTa-base. Compared with recent works, WhitenedCSE also achieves the best performance on most of the STS tasks.
Table 2: Sentence embedding performance on transfer tasks. We use accuracy to measure the performance and report the best in bold.

Transfer tasks
We also conduct experiments on 7 transfer tasks and use the SentEval toolkit (Conneau and Kiela, 2018) for evaluation. For each task, we train a logistic regression classifier on top of the frozen sentence embeddings and test the accuracy on the downstream task. In our experiment settings, we do not include models with auxiliary tasks, e.g., masked language modeling, for a fair comparison.
Tab. 2 shows the evaluation results. Compared with the SimCSE baseline, WhitenedCSE achieves improvements of 0.59 and 0.33 points in average accuracy based on BERT-base and BERT-large, respectively. Compared with recent works, WhitenedCSE also achieves the best performance on most of the transfer tasks, which further demonstrates the effectiveness of our method.

Alignment and Uniformity
To further quantify the improvement of WhitenedCSE in uniformity and alignment, we follow SimCSE and use the alignment loss and uniformity loss (Wang and Isola, 2020) to measure the quality of representations. Alignment measures the expected distance between the embeddings of positive pairs:

\ell_{\mathrm{align}} = \mathbb{E}_{(x, x^{+}) \sim p_{\mathrm{pos}}} \left\| f(x) - f(x^{+}) \right\|^{2},   (9)

while uniformity measures how uniformly the embeddings are distributed in the representation space:

\ell_{\mathrm{uniform}} = \log \mathbb{E}_{x, y \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{data}}} e^{-2 \| f(x) - f(y) \|^{2}}.   (10)

We calculate the alignment loss and uniformity loss every 125 training steps on the STS-B development set. From Fig. 3, we can see that compared with SimCSE, WhitenedCSE performs better on both the alignment measure and the uniformity measure. We also find that the uniformity of our models is well optimized at the beginning and remains stable throughout the training process. This further confirms that our method improves the quality of sentence representations more effectively.
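For reference, the following sketch computes the two metrics of Eqs. 9-10 on L2-normalized embeddings; it follows the reference implementation accompanying Wang and Isola (2020), and the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def align_loss(x: torch.Tensor, y: torch.Tensor, alpha: int = 2) -> torch.Tensor:
    """Eq. 9: expected distance between embeddings of positive pairs x, y of shape (N, d)."""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x: torch.Tensor, t: int = 2) -> torch.Tensor:
    """Eq. 10: log of the average pairwise Gaussian potential between embeddings."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

# Usage sketch: embeddings should be L2-normalized first, e.g.
# x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
```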

Ablation Analysis
In this section, we further investigate the effectiveness of our proposed model, WhitenedCSE. For all experiments, we use BERT-base as our base model and evaluate WhitenedCSE on the STS tasks unless otherwise specified.

Shuffling augments the positive samples
We show theoretically and empirically that SGW can be regarded as an effective data augmentation. Different whitening transformations produce different whitened results, but all of them are representations of the same sample, so they can be regarded as positive samples of each other. In WhitenedCSE, we randomly shuffle the feature dimensions and divide the representation along the feature dimension into k groups. Since each time we use a different permutation, we obtain a different grouping of Z and a corresponding whitening matrix W. The whitened result can be written as a feature-wise disturbance Z* = Z + ε:

Z^{*} = U Λ^{-1/2} U^{\top} Z = Z + (U Λ^{-1/2} U^{\top} - I) Z.   (11)

Here, we treat ε = (U Λ^{-1/2} U^{\top} - I) Z as a perturbation on the feature dimension. Thus, in WhitenedCSE, we use SGW as a data augmentation to generate more diverse positive samples. From Tab. 4, we can see that shuffling plays a very important role in the performance of the model.

The importance of Group Whitening
Recently, Su et al. (2021) directly applied whitening to the output of BERT and achieved remarkable performance at the time. This leads us to ask whether whitening can be directly applied to the output of a contrastive learning model to further improve the uniformity of its representation space.
We consider two different whitening methods: PCA whitening and ZCA whitening. The difference between them is that ZCA whitening uses an additional rotation matrix to rotate the PCA-whitened data back to the original feature space, which makes the transformed data closer to the original input data.
We use the in-batch sentence representations to calculate the batch mean x̄ and covariance matrix σ, and use momentum updates to estimate the overall mean µ and covariance matrix Σ.
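A small sketch of such a momentum (moving-average) estimate of the whitening statistics is shown below; the momentum coefficient of 0.9 and the function name are assumptions rather than values stated in the paper.

```python
import torch

def update_whitening_stats(Z: torch.Tensor, mu: torch.Tensor, Sigma: torch.Tensor,
                           momentum: float = 0.9):
    """Update running mean mu and covariance Sigma from a batch Z of shape (N, d)."""
    batch_mean = Z.mean(dim=0)
    centered = Z - batch_mean
    batch_cov = centered.t() @ centered / (Z.size(0) - 1)
    mu = momentum * mu + (1 - momentum) * batch_mean
    Sigma = momentum * Sigma + (1 - momentum) * batch_cov
    return mu, Sigma
```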
As the results in Tab. 3 show, directly applying the whitening transformation to contrastive learning models is detrimental to performance. We attribute this to two reasons: (1) a small batch size may not provide enough samples to obtain a suitable estimate of the full covariance matrix; (2) the covariance matrix obtained from high-dimensional features is not necessarily positive definite (it may only be positive semi-definite), which may lead to errors in the matrix decomposition. To alleviate this problem, we use group whitening to control the extent of whitening. From Tab. 3 we can see that group whitening significantly improves the performance.

Hyperparameter Analysis
In the hyperparameter analysis, we explore the sensitivity of WhitenedCSE to its parameters. Concretely, we study the impact of the group size and the number of positive samples. We evaluate our model with varying values and report the performance on the seven STS tasks.
The influence of group size. In WhitenedCSE, we divide the representation into k groups. The group size controls the degree of whitening and has a great effect on the effectiveness of WhitenedCSE, so we carry out an experiment with k varying from 32 to 384. As shown in Tab. 4, the best performance is achieved when k = 384, and the second best when k = 128. For other values of k, the performance drops slightly.
The influence of the number of positive samples. Sampling multiple positive samples can significantly enrich semantic diversity. To explore how the number of positive samples affects the performance of our model, we conduct an experiment with the positive number m varying from 2 to 5. From Tab. 5, we can see that our model achieves the best performance when m = 3. Due to memory limitations, we cannot exhaust all possibilities, but we find that when m > 2, the performance is always better than when m = 2, which confirms that multiple positive samples bring richer semantics and allow the model to learn better sentence representations.
The influence of different modules. In WhitenedCSE, using the proposed shuffled group whitening method is beneficial, and further using multiple positive samples brings additional benefit. We investigate their respective contributions to WhitenedCSE in Tab. 6. In the ablation, we replace the whitening method with the ordinary dropout technique while retaining the multi-positive loss; conversely, we use the proposed whitening method alone while keeping the number of positive samples at 2. From Tab. 6, we find that multiple positive samples based on dropout alone bring only a +0.12% improvement. This is reasonable because when the data augmentation is subtle (i.e., dropout), using extra positive samples barely increases the diversity. In contrast, the proposed SGW generates informative data augmentation and thus accommodates the multiple positive samples well.

Conclusion
In this paper, we proposed WhitenedCSE, a whitening-based contrastive learning framework for unsupervised sentence representation learning. We proposed a novel shuffled group whitening, which reinforces the contrastive learning effect with respect to both uniformity and alignment. Specifically, it retains the role of whitening in dispersing data and can further improve uniformity on top of contrastive learning. Additionally, it shuffles and groups features along the channel axis and performs whitening independently within each group. This operation can be regarded as a disturbance on the feature dimension; we obtain multiple positive samples through it and learn invariance to this disturbance to obtain better alignment. Experimental results on seven semantic textual similarity tasks show that our approach achieves consistent improvement over the contrastive learning baseline.

Limitations
In this paper, we limit the proposed WhitenedCSE to sentence embedding learning. Conceptually, WhitenedCSE has the potential to benefit contrastive learning on other tasks, e.g., self-supervised image representation learning and self-supervised vision-language contrastive learning. However, we did not investigate self-supervised image representation learning because this domain is currently dominated by masked image modeling. We will consider extending WhitenedCSE to vision-language contrastive learning when we have sufficient training resources for the extraordinarily large-scale text-image pairs.

Ethics Statement
This paper is dedicated to deep sentence embedding learning and proposes a new unsupervised sentence representation model; it does not involve any ethical problems. In addition, we are willing to open-source our code and data to promote the better development of this research direction.