A Sentence is Worth 128 Pseudo Tokens: A Semantic-Aware Contrastive Learning Framework for Sentence Embeddings

Contrastive learning has shown great potential in unsupervised sentence embedding tasks, e.g., SimCSE (Gao et al., 2021). However, existing solutions are heavily affected by superficial features such as sentence length or syntactic structure. In this paper, we propose a semantic-aware contrastive learning framework for sentence embeddings, termed Pseudo-Token BERT (PT-BERT), which explores the pseudo-token space (i.e., latent semantic space) representation of a sentence while eliminating the impact of superficial features such as sentence length and syntax. Specifically, we introduce an additional pseudo-token embedding layer, independent of the BERT encoder, to map each sentence into a sequence of pseudo tokens of fixed length. Leveraging these pseudo sequences, we construct same-length positive and negative pairs based on the attention mechanism to perform contrastive learning. In addition, we utilize both a gradient-updating and a momentum-updating encoder to encode instances while dynamically maintaining an additional queue to store sentence representations, enhancing the encoder's learning from negative examples. Experiments show that our model outperforms the state-of-the-art baselines on six standard semantic textual similarity (STS) tasks. Furthermore, experiments on alignment and uniformity losses, as well as on hard examples with different sentence lengths and syntax, consistently verify the effectiveness of our method.


Introduction
Sentence embedding serves as an essential technique in a wide range of applications, including semantic search, text clustering, and text classification (Kiros et al., 2015; Logeswaran and Lee, 2018; Conneau et al., 2017; Cer et al., 2018; Reimers and Gurevych, 2019; Gao et al., 2021).

Figure 1: In comparison, discrete augmentation obtains positive instances with word deletion or reordering (Wu et al., 2018; Meng et al., 2021), which may misinterpret the meaning. The continuous method treats embeddings of the same original sentence as positive examples and augments sentences with different encoding functions (Carlsson et al., 2021; Gao et al., 2021).

Contrastive learning learns representations such that similar examples stay close while dissimilar ones are far apart, and is thus well suited to sentence embeddings because similar examples are naturally available. Incorporating contrastive learning into sentence embeddings improves the efficiency of semantic information learning in an unsupervised manner and has been shown to be effective on a variety of tasks (Reimers and Gurevych, 2019; Gao et al., 2021; Zhang et al., 2020). In contrastive learning for sentence embeddings, a key challenge is constructing positive instances. Both discrete and continuous augmentation methods have been studied recently. The methods of Wu et al. (2018) and Meng et al. (2021) perform discrete operations directly on the original sentences, such as word deletion and sentence shuffling, to obtain positive samples. However, these operations may cause unacceptable semantic distortions or even complete misinterpretations of the original statement. In contrast, SimCSE (Gao et al., 2021) obtains two different embeddings of one sentence in the continuous embedding space as a positive pair by applying different dropout masks (Srivastava et al., 2014) in the representation network. Nonetheless, this method relies heavily on superficial features of the dataset, such as sentence length and syntactic structure, and may capture less of the meaningful semantic information. As an illustrative example, the sentence pair in Fig. 1, "A caterpillar was caught by me." and "I caught a caterpillar.", is organized differently in expression but conveys exactly the same semantics.
To overcome these drawbacks, in this paper we propose a semantic-aware contrastive learning framework for sentence embeddings, termed Pseudo-Token BERT (PT-BERT), which captures the pseudo-token space (i.e., latent semantic space) representation while ignoring the effects of superficial features such as sentence length and syntactic structure. Inspired by previous work on prompt learning and sentence selection (Li and Liang, 2021; Liu et al., 2021; Humeau et al., 2020), which creates a pseudo sequence and has it serve downstream tasks, we present PT-BERT to train pseudo-token representations and then map sentences into the pseudo-token space based on an attention mechanism.
In particular, we train an additional 128 pseudo-token embeddings together with sentence embeddings extracted from a BERT model (i.e., the gradient-encoder), and then use the attention mechanism (Vaswani et al., 2017) to map the sentence embedding into the pseudo-token space (i.e., the semantic space). We use another BERT model (i.e., the momentum-encoder) to encode the original sentence, adopt a similar attention mechanism with the pseudo-token embeddings, and finally output a continuously augmented version of the sentence embedding. We treat the representations of the original sentence encoded by the gradient-encoder and the momentum-encoder as a positive pair. In addition, the momentum-encoder also generates negative examples, dynamically maintains a queue to store them, and updates the queue over time. By projecting all sentences onto the same pseudo sentence, the model greatly reduces its dependence on sentence length and syntax when making judgments and focuses more on semantic-level information.
In our experiments, we compare our results with previous state-of-the-art work. We train PT-BERT on one million sentences randomly sampled from English Wikipedia and evaluate on seven standard semantic textual similarity (STS) tasks (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017; Marelli et al., 2014). We also compare our approach with a framework based on an advanced discrete augmentation that we propose. With PT-BERT we obtain a new state-of-the-art on standard semantic textual similarity tasks, achieving 77.74% Spearman's correlation. To show the effectiveness of pseudo tokens, we calculate the alignment and uniformity losses (Wang and Isola, 2020) and verify our approach on a sub-dataset of hard examples sampled from STS 2012-2016. We have released our source code to facilitate future work.

Related Work
In this section, we discuss related studies with respect to the contrastive learning framework and sentence embedding.

Contrastive Learning for Sentence Embedding
Contrastive learning. Contrastive learning (Hadsell et al., 2006) has been used with much success in both natural language processing and computer vision (Yang et al., 2019; Klein and Nabi, 2020; Gao et al., 2021). In contrast to generative learning, contrastive learning only needs to learn to distinguish and match data at the abstract semantic level of the feature space. MoCo (He et al., 2020) uses a queue to maintain a dictionary of samples, which allows the model to compare the query with more keys at each step while preserving the consistency of the framework; the parameters of the dictionary encoder are updated in a momentum fashion.
Discrete and continuous augmentation. By combining contrastive learning with discrete augmentation that modifies sentences directly at the token level, significant success has been achieved in obtaining sentence embeddings. Such methods include word omission (Yang et al., 2019), entity replacement (Xiong et al., 2020), trigger words (Klein and Nabi, 2020), and traditional augmentations such as deletion, reordering, and substitution (Meng et al., 2021). Examples with diverse expressions can be seen during training, making the model more robust to different sentence lengths and styles. However, these approaches are limited because augmenting sentences precisely is very difficult: a few changes can make the meaning completely different or even opposite. Researchers have also explored building positive examples continuously, applying operations in the embedding space instead. CT-BERT (Carlsson et al., 2021) encodes the same sentence with two different encoders. Unsup-SimCSE (Gao et al., 2021) compares representations of the same sentence under different dropout masks within the mini-batch. These approaches augment sentences continuously while retaining the original meaning. However, positive pairs seen by SimCSE always have the same length and structure, whereas negative samples tend to act oppositely. As a result, sentence length and structure become highly correlated with the similarity score of examples. During training, the model never sees positive samples with diverse expressions, so in real test scenarios it is more inclined to classify synonymous pairs with different expressions as negatives, while sentences with the same length and structure are more likely to be grouped as positives. This may yield a biased encoder.

Pseudo Tokens
In the domain of prompt learning (Liu et al., 2021; Jiang et al., 2020; Li and Liang, 2021; Gao et al., 2020), the ways to create prompts can be divided into two types: discrete and continuous. Discrete methods usually search for a natural language template to serve as the prompt (Davison et al., 2019; Petroni et al., 2019), while continuous methods work directly in the embedding space with "pseudo tokens" (Liu et al., 2021; Li and Liang, 2021). In retrieval and dialogue tasks, a current approach adopts "pseudo tokens", namely "poly codes" (Humeau et al., 2020), to jointly encode the query and response precisely while keeping inference time competitive with Cross-Encoders and Bi-Encoders (Wolf et al., 2019; Mazaré et al., 2018; Dinan et al., 2019). The essence of these methods is to create a pseudo sequence and have it serve the downstream task without humans needing to understand its exact meaning. The parameters of these pseudo tokens are independent of the natural language embeddings and can be tuned for a specific downstream task. In the following sections, we show how to weaken the model's reliance on sentence length and structure by introducing additional pseudo-token embeddings on top of the BERT encoder.

Methods
In this section, we introduce PT-BERT, which combines the advantages of discrete and continuous augmentations to advance the state of the art in sentence embeddings. We first present the problem setup, together with a theoretical and experimental analysis of the bias introduced by textual similarity. We then detail the pseudo-token representation and our model's architecture.

Preliminary
Consider a sentence s. We say that the augmentation is continuous if s is augmented by different encoding functions f(·) and f′(·). Sentence embeddings h = f(s) and h′ = f′(s) are obtained from these two functions. With a slight change of the encoding function (e.g., encoders with different dropout masks), h′ can be seen as a more precisely augmented version of h compared with discrete augmentation. The semantic information of h′ should be the same as that of h. Therefore, h and h′ form a positive pair, and we can randomly sample a sentence to construct negative pairs.

Figure 2: The model is divided into two parts: the upper part (Encoder) updates the learnable parameters with gradients, while the bottom part (Momentum Encoder) inherits parameters from the upper part with momentum updating. We repeatedly input the same sequence of pseudo tokens while processing the original sentences. An additional BERT attention maps the pooler output of BERT to the pseudo-sequence representation, extending the sentence embedding to a fixed length and mapping the syntactic structure to the style of the pseudo sentence. The two attentions in the figure are the same and share identical parameters.
Previous state-of-the-art models (Gao et al., 2021) adopt the continuous strategy that augments sentences with dropout (Srivastava et al., 2014). Clearly, all positive examples in SimCSE have the same length and structure while negative examples tend to act oppositely, so SimCSE will inevitably take these two factors as hints at test time. To verify this conjecture, we select from STS 2012-2016 the positive pairs with a length difference of more than five words and the negative pairs with a difference of less than two words. Table 1 shows that the performance of SimCSE plummets on this dataset. We also find that SimCSE truncates all training corpora to 32 tokens, which shrinks the discrepancy in sentence length. After we scale the maximum length that SimCSE can accept from 32 to 64 and 128, test performance degrades significantly, even though the model is supposed to learn more from the complete sentences (see Table 2). The reason may be that, without truncation, all positive pairs still have the same length, whereas the length difference between negative and positive pairs is enlarged. The encoder therefore relies even more on sentence length and makes wrong decisions.
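The hard-example selection above can be sketched as follows; the pair format and the 4.0 positive/negative score cutoff are our assumptions for illustration, not the paper's exact protocol.

```python
def filter_hard_pairs(pairs, pos_threshold=4.0,
                      pos_min_diff=5, neg_max_diff=2):
    """pairs: iterable of (sentence_a, sentence_b, gold_score), with STS
    gold scores in [0, 5]. Keeps positive pairs whose word counts differ
    by more than pos_min_diff and negative pairs that differ by fewer
    than neg_max_diff words."""
    hard = []
    for a, b, score in pairs:
        diff = abs(len(a.split()) - len(b.split()))
        if score >= pos_threshold and diff > pos_min_diff:
            hard.append((a, b, score))   # positive pair, very different lengths
        elif score < pos_threshold and diff < neg_max_diff:
            hard.append((a, b, score))   # negative pair, near-equal lengths
    return hard
```

A model that leans on length as a similarity hint should do badly on exactly this subset.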

Pseudo-Token BERT
We realize it is vital to train an unbiased encoder that captures the semantics and also would not introduce intermediate errors. This motivates us to propose the PT-BERT, as evidence shows that the encoder may fail to make predictions when trained on a biased dataset with same-length positive pairs, by learning the spurious correlations that work only well on the training dataset (Arjovsky et al., 2019;Nam et al., 2020).
Pseudo-Token representations. The idea of PT-BERT is to reduce the model's excessive dependence on textual similarity when making predictions. Discrete augmentation achieves this goal by providing both positive and negative examples with diverse expressions, so the model does not jump to conclusions based on sentence length and syntactic structure at test time.
Note that we achieve this same purpose in a seemingly opposite way: mapping the representations of both positive and negative examples to a pseudo sentence with the same length and structure. We use an additional embedding layer outside the BERT encoder to represent a pseudo sentence {0, 1, ..., m} with fixed length m and fixed syntax. This embedding layer is fully independent of the BERT encoder, including its parameters and vocabulary. The layer is randomly initialized, and each parameter is updated during training. Its size depends on the vocabulary of pseudo tokens (the length of pseudo sentences). Adopting the attention mechanism (Vaswani et al., 2017; Bahdanau et al., 2015; Gehring et al., 2017), we take the pseudo sentence embeddings as the query states of a cross attention whose key and value states are the sentence embeddings obtained from the BERT encoder. This allows the pseudo sentence to attend to the core part and ignore the redundant part of the original sentence while keeping a fixed length and structure. Fig. 2 illustrates the framework of PT-BERT. Denoting the pseudo sentence embedding as P and the sentence embedding encoded by BERT as Y_i, we obtain the weighted pseudo sentence embedding P_i of each sentence by mapping the sentence embedding to the pseudo tokens with attention:

P_i = softmax((P W_Q)(Y_i W_K)^T / sqrt(d_k)) (Y_i W_V),   (1)

where d_k is the dimension of the model, W_Q, W_K, W_V in R^{d_k x d_k} are the learnable parameters, and i denotes the i-th sentence in the dataset. Then we obtain the final embedding h_i with the same attention layer by mapping the pseudo sentence back to the original sentence embedding:

h_i = softmax((Y_i W_Q)(P_i W_K)^T / sqrt(d_k)) (P_i W_V).   (2)

Finally, we compare the cosine similarities

sim(h_i, h'_j) = (h_i · h'_j) / (||h_i|| ||h'_j||)   (3)

between the obtained embeddings h and h' using Eq. 4, where h' are the samples encoded by the momentum-encoder and stored in a queue.
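A framework-agnostic numpy sketch of this two-step attention mapping (single sentence, single head; the function names, mean pooling, and random initialization below are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, d_k):
    """Scaled dot-product attention for one sentence."""
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

def pseudo_token_forward(y, p, w_q, w_k, w_v):
    """y: (seq_len, d) BERT token embeddings of one sentence;
    p: (m, d) learned pseudo-token embeddings.
    Returns a fixed-size sentence vector regardless of seq_len."""
    d = y.shape[1]
    # Pseudo tokens query the real tokens -> fixed-length pseudo sequence.
    p_w = attend(p @ w_q, y @ w_k, y @ w_v, d)
    # The same projections map the pseudo sequence back to the sentence side.
    h = attend(y @ w_q, p_w @ w_k, p_w @ w_v, d)
    return h.mean(axis=0)
```

Because every sentence is routed through the same m pseudo tokens, two sentences of very different lengths end up compared in the same fixed-length space.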
Model architecture. Instead of feeding the same sentence twice to one encoder, we follow the architecture proposed in Momentum Contrast (MoCo) (He et al., 2020) so that PT-BERT can efficiently learn from more negative examples. Samples in PT-BERT are encoded into vectors by two encoders: the gradient-update encoder (the upper encoder in Fig. 2) and the momentum-update encoder (the momentum encoder in Fig. 2). We dynamically maintain a queue to store the sentence representations from the momentum-update encoder.
This mechanism allows us to store as many negative samples as possible without re-computation. Once the queue is full, we replace the "oldest" negative sample with a "fresh" one encoded by the momentum-encoder.
Similar to works based on continuous augmentation, at the beginning of the framework PT-BERT takes an input sentence s_i and obtains h_i and h'_i with two different encoder functions. We measure the loss with

l_i = -log( exp(sim(h_i, h'_i) / tau) / (exp(sim(h_i, h'_i) / tau) + sum_{j=1}^{M} exp(sim(h_i, h'_j) / tau)) ),   (4)

where h_i denotes the representation extracted from the gradient-update encoder, h'_j represents a sentence embedding in the queue, M is the queue size, and tau is a temperature hyperparameter. Our gradient-update and momentum-update encoders are based on a pre-trained language model with the same structure and dimensions as BERT-base-uncased (Devlin et al., 2019). The momentum encoder updates its parameters as in MoCo:

theta_k <- lambda * theta_k + (1 - lambda) * theta_q,   (5)

where theta_k is the parameter of the momentum-contrast encoder that maintains the dictionary, theta_q is that of the query encoder updated by gradients, and lambda is a hyperparameter controlling the updating process.
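The queue maintenance and the two update rules above can be sketched in numpy as follows (single-vector form; the temperature value, the flat parameter dictionaries, and the use of `deque` are our assumptions):

```python
import numpy as np
from collections import deque

def info_nce_loss(h, h_pos, queue, tau=0.05):
    """Contrastive loss for one anchor h: pull h toward its positive
    h_pos and push it away from the queued negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(h, h_pos)] + [cos(h, n) for n in queue]) / tau
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

def momentum_update(theta_k, theta_q, lam=0.885):
    """MoCo-style update: the momentum encoder's parameters slowly
    track the gradient-updated encoder's parameters."""
    return {name: lam * theta_k[name] + (1 - lam) * theta_q[name]
            for name in theta_k}

# A deque with maxlen drops the "oldest" negative automatically
# whenever a "fresh" momentum-encoded representation is appended.
queue = deque(maxlen=256)
```

The bounded `deque` mirrors the paper's description of the queue: no re-computation of old negatives, first-in-first-out replacement once full.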
Relationship with prompt learning. Rather than directly performing soft prompting in the embedding space of the model (Li and Liang, 2021; Qin and Eisner, 2021; Liu et al., 2021), our method follows a "plug and play" fashion that projects the representations onto pseudo sentences only during training. At inference time, PT-BERT predicts results with its BERT backbone alone. The intention behind this design is to let the model predict sentence embeddings precisely without adding extra computation. In some tasks, fixed-LM tuning (Li and Liang, 2021) in soft prompting becomes competitive only when the language model is scaled large enough (Lester et al., 2021), while prompt+LM tuning (Ben-David et al., 2021; Liu et al., 2021) adds burdens to both training and inference. Both prompt+LM and fixed-LM prompt tuning require storing separate copies of soft prompts for different tasks, whereas our approach only saves the trained BERT model; it draws on ideas from prompt learning while preserving computational and memory efficiency as well as generality.

Experiments
In this section, we use the standard semantic textual similarity (STS) tasks (Agirre et al., 2012, 2013, 2014, 2015, 2016) to test our model. For all tasks, we measure Spearman's correlation to compare our performance with the previous state-of-the-art SimCSE (Gao et al., 2021). In the following, we describe the training procedure in detail.

Training Data and Settings
Datasets. Following SimCSE, we train our model on one million sentences randomly sampled from English Wikipedia and evaluate the model every 125 steps to find the best checkpoints. Note that we do not fine-tune our model on any labeled dataset, so our method is completely unsupervised.
Hardware and schedule. We train our model on a machine with one NVIDIA V100S GPU. Under the settings of SimCSE (Gao et al., 2021), one epoch takes 50 minutes.

Implementations
We implement PT-BERT based on Huggingface Transformers (Wolf et al., 2020) and initialize it with the released BERT base (Devlin et al., 2019). We initialize a new 128×768 embedding matrix for pseudo tokens. During training, we create a pseudo sentence {0, 1, 2, ..., 127} for every input and map the original sentence to this pseudo sentence via attention. With batches of 64 sentences and an additional dynamically maintained queue of 256 sentences, each sentence has one positive sample and 255 negative samples. The Adam (Kingma and Ba, 2014) optimizer is used to update the model parameters. We also keep BERT's original dropout strategy with rate p = 0.1. We set the momentum of the momentum-encoder to λ = 0.885.
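For reference, the settings listed above can be collected into one configuration dictionary; the key names are our own, while the values come from the description above.

```python
# Consolidated training configuration (key names are our own naming).
PT_BERT_CONFIG = {
    "backbone": "bert-base-uncased",        # released BERT base
    "pseudo_sentence": list(range(128)),    # pseudo sentence {0, 1, ..., 127}
    "pseudo_embedding_shape": (128, 768),   # new embedding matrix
    "batch_size": 64,
    "queue_size": 256,
    "optimizer": "adam",
    "dropout": 0.1,                         # BERT's original dropout rate
    "momentum_lambda": 0.885,               # momentum-encoder update
}
```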

Evaluation Setup
We evaluate the fine-tuned BERT encoder on the STS-B development set every 125 steps to select the best checkpoints; all checkpoints are reported based on the evaluation results in Table 4. The training process is fully unsupervised since no training corpus from STS is used. During evaluation, we also track the trends of alignment loss and uniformity loss, compared with SimCSE (Gao et al., 2021) under the same experimental settings. After training and evaluation, we test models on 7 STS tasks: STS 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), STS Benchmark (Cer et al., 2017), and SICK-Relatedness (Marelli et al., 2014). We report Spearman's correlation for all experiments.

Main Results and Analysis
We first compare PT-BERT with our baseline: MoCo framework + BERT encoder (MoCo-BERT).
MoCo-BERT can be seen as a version of PT-BERT without pseudo-token embeddings. We then apply traditional discrete augmentations such as reordering, duplication, and deletion on this framework. We also compare our work with CLEAR (Wu et al., 2020), which substitutes and deletes token spans. Because we find the performance of these methods weak, we additionally propose an advanced discrete augmentation approach that produces positive examples under the guidance of Semantic Role Labeling (SRL) (Gildea and Jurafsky, 2002; Palmer et al., 2010) instead of random deletion and reordering. SRL-guided augmentation can compensate for the errors caused by these factors, acting as a combination of deletion, duplication, and reordering with better accuracy. SRL is broadly used to identify the predicate-argument structures of a sentence; it detects the arguments associated with the predicate or verb and indicates the main semantic information of who did what to whom. For sentences with multiple predicates, we keep all the sets in the order [ARG0, PRED, ARGM-NEG, ARG1] and concatenate them into a new sequence. For sentences without recognized predicate-argument sets, we keep the original sentence as the positive example. Beyond these discrete approaches, we also compare with SimCSE (Gao et al., 2021), which continuously augments sentences with dropout. In Table 3, PT-BERT with 128 pseudo tokens further pushes the state-of-the-art result to 77.74% and significantly outperforms SimCSE over six datasets.
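The SRL-guided positive construction described above can be sketched as follows, operating on precomputed predicate-argument frames (the dict-based frame format is our assumption; the SRL tagger itself is an external component):

```python
# Role order taken from the paper; one frame per recognized predicate.
ROLE_ORDER = ["ARG0", "PRED", "ARGM-NEG", "ARG1"]

def srl_augment(sentence, frames):
    """frames: list of dicts mapping role name -> text span, one dict
    per predicate. Concatenates the kept roles of every frame in the
    fixed order; falls back to the original sentence when no
    predicate-argument set was recognized."""
    if not frames:
        return sentence
    parts = []
    for frame in frames:
        parts.extend(frame[role] for role in ROLE_ORDER if role in frame)
    return " ".join(parts)
```

For the running example of Fig. 1, a frame like {"ARG0": "I", "PRED": "caught", "ARG1": "a caterpillar"} turns the passive sentence into its active-voice positive.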
In Fig. 3, we observe that PT-BERT also achieves better alignment and uniformity than SimCSE, which indicates that pseudo tokens genuinely help the learning of sentence representations. In detail, alignment and uniformity were proposed by Wang and Isola (2020) to evaluate the quality of representations in contrastive learning. The two metrics are calculated as

l_align = E_{(x, x+) ~ p_pos} ||f(x) - f(x+)||^2,

l_uniform = log E_{x, y ~ p_data} exp(-2 ||f(x) - f(y)||^2),

where (x, x+) is a positive pair, (x, y) is any pair of different sentences in the whole sentence set, and f(x) is the normalized representation of x. We use the final embedding h to calculate these scores.
According to the above formulas, a lower alignment loss means a shorter distance between positive samples, and a lower uniformity loss implies greater diversity among the embeddings of all sentences; both are what we expect from representations learned by contrastive learning. To evaluate our model's alignment and uniformity, we compare it with SimCSE on the STS-Benchmark dataset (Cer et al., 2017); the result is shown in Figure 3. PT-BERT has lower alignment and uniformity losses than SimCSE at almost all training steps, which indicates that the representations produced by our model are more in line with the goals of contrastive learning.
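These two metrics can be computed directly from a batch of embeddings; a numpy sketch (the exponent -2 follows Wang and Isola (2020); batching over all distinct pairs is simplified here):

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def align_loss(x, x_pos):
    """Mean squared distance between normalized positive pairs;
    x and x_pos are (n, d) arrays of paired embeddings."""
    return (np.linalg.norm(l2_normalize(x) - l2_normalize(x_pos),
                           axis=-1) ** 2).mean()

def uniform_loss(x):
    """Log of the mean Gaussian potential over all distinct pairs
    in a (n, d) batch; lower means embeddings spread more evenly."""
    z = l2_normalize(x)
    sq = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1) ** 2
    i, j = np.triu_indices(len(z), k=1)   # each distinct pair once
    return np.log(np.exp(-2 * sq[i, j]).mean())
```

Identical positive pairs give an alignment loss of exactly zero, and antipodal points on the unit circle give the minimal pairwise uniformity term of -8.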

Ablation Studies
In this section, we first investigate the impact of different sizes of pseudo token embeddings. Then we would like to report the performance difference caused by queue size under the MoCo framework.
Pseudo sentence length. Different numbers of pseudo tokens can affect the model's ability to express sentence representations: mapping the original sentences to pseudo sequences of various lengths changes the performance of PT-BERT. In this experiment, we keep everything except the pseudo tokens and their embeddings unchanged and scale the pseudo-sequence length from 64 to 360. Table 5(a) compares different pseudo-sequence lengths in PT-BERT. We find that PT-BERT performs best when attending to pseudo sequences of 128 tokens. Too few pseudo tokens cannot fully express the semantics of the original sentence, while too many increase the number of parameters and over-express the sentence.
Queue size. Introducing more negative samples makes the model's training more reliable. We therefore train PT-BERT with different queue sizes and report the resulting performance. In Table 5(b), queue size q = 4 performs best. However, the differences among the three settings are small, suggesting that the model learns well as long as it sees enough negative samples.

Exploration on Hard Examples with Different Length
To demonstrate that PT-BERT weakens the hints caused by textual similarity, we further test it on the sub-dataset introduced in Sec. 3.1: the positive pairs with a length difference of more than five words and the negative pairs with a difference of less than two words, selected from STS 2012-2016. PT-BERT significantly outperforms SimCSE by 3.36 points of Spearman's correlation, indicating that PT-BERT handles these hard examples better. This further shows that PT-BERT can debias the spurious correlations introduced by sentence length and syntax and focus more on semantics.

Conclusion
In this paper, we propose a semantic-aware contrastive learning framework for sentence embeddings, termed PT-BERT. PT-BERT weakens textual-similarity information, such as sentence length and syntactic structure, by mapping the original sentence to a fixed pseudo sentence embedding. We analyze the influence of these factors on methods based on continuous and discrete augmentation, showing that PT-BERT augments sentences more accurately than discrete methods while attending more to semantics than continuous approaches do. Lower uniformity and alignment losses demonstrate the effectiveness of PT-BERT, and further experiments show that it handles hard examples better than existing approaches. Providing a new perspective on continuous data augmentation for sentence embeddings, we believe PT-BERT has great potential in broader downstream applications, such as text classification, text clustering, and sentiment analysis.