HiCL: Hierarchical Contrastive Learning of Unsupervised Sentence Embeddings

In this paper, we propose a hierarchical contrastive learning framework, HiCL, which considers both local segment-level and global sequence-level relationships to improve training efficiency and effectiveness. Traditional methods typically encode a sequence in its entirety for contrast with others, often neglecting local representation learning and leading to difficulty generalizing to shorter texts. In contrast, HiCL improves effectiveness by dividing the sequence into several segments and employing both local and global contrastive learning to model segment-level and sequence-level relationships. Further, given the quadratic time complexity of Transformers over input tokens, HiCL boosts training efficiency by first encoding short segments and then aggregating them to obtain the sequence representation. Extensive experiments show that HiCL enhances the prior top-performing SNCSE model across seven extensively evaluated STS tasks, with an average increase of +0.2% on BERT-large and +0.44% on RoBERTa-large.


Introduction
Current machine learning systems benefit greatly from large amounts of labeled data. However, obtaining such labeled data through annotation in supervised learning is expensive. To address this issue, self-supervised learning, where supervisory labels are derived from the data itself, has been proposed. Among these approaches, contrastive learning (Chen et al., 2020a,b,c, 2021; He et al., 2020; Grill et al., 2020; Chen and He, 2021) has become one of the most popular self-supervised learning methods due to its impressive performance across various domains. The training target of contrastive learning is to learn a representation of the data that maximizes the similarity between positive examples and minimizes the similarity between negative examples.
Despite this success, existing methods augment data at the level of the full sequence (Gao et al., 2021b; Wang et al., 2022). Such methods require calculating the entire sequence representation, leading to a high computational cost. They also make the task of distinguishing positive examples from negative ones too easy, which does not lead to learning meaningful representations. Similarly, methods like CLEAR (Wu et al., 2020) demonstrated that pre-training with sequence-level naïve augmentation can cause the model to converge too quickly, resulting in poor generalization.
In contrast, Zhang et al. (2019) considered modeling in-sequence (or local) relationships for language understanding. They divide the sequence into smaller segments to learn intrinsic and underlying relationships within the sequence. This method achieves promising results on long sequences because it avoids truncating the input and thus loses no information. Given this success, a natural question arises: is it possible to design an effective and efficient contrastive learning framework by considering both local segment-level and global sequence-level relationships?
To answer this question, in this paper we propose a hierarchical contrastive learning framework, HiCL, which not only considers global relationships but also values local relationships, as illustrated in Figure 1. Specifically, given a sequence (i.e., a sentence), HiCL first divides it into smaller segments and encodes each segment to calculate a local segment representation. It then aggregates the local segment representations belonging to the same sequence to obtain the global sequence representation. Having obtained local and global representations, HiCL deploys a hierarchical contrastive learning strategy involving both segment-level and sequence-level contrastive learning to derive an enhanced representation. For local contrastive learning, each segment is fed into the model twice to form the positive pair, with segments from differing sequences serving as negative examples. For global contrastive learning, HiCL aligns with mainstream baselines to construct positive/negative pairs.
We have carried out extensive experiments on seven STS tasks using the well-representative models BERT and RoBERTa as our backbones. We assess the method's generalization capability against three baselines: SimCSE, ESimCSE, and SNCSE. As a result, we improve the current state-of-the-art model SNCSE over seven STS tasks and achieve new state-of-the-art results. Multiple initializations and varied training corpora confirm the robustness of our HiCL method.
Our contributions are summarized below:
• To the best of our knowledge, we are the first to explore the relationship between local and global representations for contrastive learning in NLP.
• We theoretically demonstrate that the encoding efficiency of our proposed method is much higher than that of prior contrastive training paradigms.
• We empirically verify that the proposed training paradigm enhances the performance of current state-of-the-art methods for sentence embeddings on seven STS tasks.

Preliminaries: Contrastive Learning
In this paper, we primarily follow the SimCLR framework (Chen et al., 2020a) as our basic contrastive framework and describe it below. Given the representation h_i of an instance x_i and the representation h_i^+ of its positive example, the contrastive (InfoNCE) loss over a batch of N instances is

L = − log ( e^{sim(h_i, h_i^+)/τ} / Σ_{j=1}^{N} e^{sim(h_i, h_j^+)/τ} ),  (1)

where sim(·, ·) denotes cosine similarity and τ is a temperature hyperparameter. Benefiting from human-defined data augmentations, contrastive learning can generate numerous positive and negative examples for training without the need for explicit supervision, which is arguably the key reason why self-supervised learning can be effective.
Positive instance Designing effective data augmentations to generate positive examples is a key challenge in contrastive learning. Various methods such as back-translation, span sampling, word deletion, reordering, and synonym substitution have been explored for language understanding tasks in prior works such as CERT (Fang et al., 2020), DeCLUTR (Giorgi et al., 2021), and CLEAR (Wu et al., 2020). Different from previous approaches that augment data at the discrete text level, SimCSE (Gao et al., 2021b) first applied dropout (Srivastava et al., 2014) twice to obtain the two intermediate representations of a positive pair. Specifically, given a Transformer encoder E_θ (Vaswani et al., 2017), parameterized by θ, and a training instance x_i, h_i = E_{θ,p}(x_i) and h_i^+ = E_{θ,p^+}(x_i) form the positive pair used in Eq. 1, where p and p^+ are different dropout masks. This method has been shown to significantly improve sentence embedding performance on seven STS tasks, making it a standard comparison method in this field.
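As an illustration, the dropout-twice trick can be sketched in a few lines of plain Python. This is a toy mean-pooling "encoder" standing in for the real pretrained Transformer; the function name `encode` and the toy embeddings are our own assumptions, not part of SimCSE's implementation:

```python
import random

# Toy sketch of SimCSE's dropout-as-augmentation trick (assumption: a toy
# mean-pooling "encoder" stands in for the actual pretrained Transformer).
def encode(embeddings, p=0.3, rng=None):
    """Apply inverted dropout to a token-embedding matrix, then mean-pool."""
    rng = rng or random.Random()
    dropped = [[v / (1 - p) if rng.random() > p else 0.0 for v in tok]
               for tok in embeddings]
    dim = len(embeddings[0])
    return [sum(tok[d] for tok in dropped) / len(dropped) for d in range(dim)]

x = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]  # one sentence: 2 tokens, 3 dims
h = encode(x, rng=random.Random(0))      # first pass  -> h_i
h_pos = encode(x, rng=random.Random(1))  # second pass -> h_i^+
# Different dropout masks yield two distinct views of the same sentence,
# which serve as the positive pair in Eq. 1.
```

The key point is that no discrete text edit is needed: re-running the same forward pass with a fresh dropout mask already produces a second, slightly perturbed view.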
Negative instance Negative instance selection is another important aspect of contrastive learning. SimCLR simply uses all other examples in the same batch as negatives. However, DCLR (Zhou et al., 2022) observed that around half of the in-batch negatives drawn from SimCSE's training corpus are highly similar to the anchor (with a cosine similarity above 0.7). To address this issue, SNCSE (Wang et al., 2022) introduced negated sentences as "soft" negative samples (e.g., by adding "not" to the original sentence). Additionally, instead of using the [CLS] token's vector representation, SNCSE incorporates a manually designed prompt, "The sentence of x_i means [MASK]", and takes the [MASK] token's vector to represent the full sentence x_i. This approach has been shown to improve performance compared to using the [CLS] token's vector. In this paper, we compare against SNCSE as another key baseline, not only because it is the current state of the art on the evaluation tasks, but also because it effectively combines contrastive learning with techniques like prompt tuning (Gao et al., 2021a; Wu et al., 2022c).

Momentum Contrast
The momentum contrast framework differs from SimCLR by expanding the negative pool through the inclusion of recent instances, effectively increasing the batch size without causing out-of-memory issues. ESimCSE (Wu et al., 2022b) proposes a repetition operation to generate positive instances and utilizes momentum contrast to update the model. We include it as a baseline to assess the ability of our model to adapt to momentum contrast.

Overview
Figure 1 shows an overview of HiCL. Our primary goal is to incorporate additional underlying (local) information into traditional, unsupervised text contrastive learning. Two objectives are combined to achieve this goal.
Given a set of sequences {seq_1, seq_2, . . ., seq_n} in a batch B, we slice each seq_i into segments {seg_{i,1}, seg_{i,2}, . . ., seg_{i,l_i}} of slicing length L, where n is the batch size and l_i = 1 + ⌊(|seq_i| − 1)/L⌋ is the number of segments that can be sliced from seq_i. The slicing follows a queue rule: every consecutive L tokens (with no overlap) form one segment, and the remaining tokens, of length no greater than L, form a separate segment. Unlike traditional contrastive learning, which encodes the input sequence seq_i directly, we encode each sub-sequence seg_{i,j} using the same encoder and obtain its representation h_{i,j} = E_θ(seg_{i,j}), where E_θ is a Transformer (Vaswani et al., 2017), parameterized by θ. We aggregate the h_{i,j} representations to obtain the whole-sequence representation h_i by weighted average pooling, where the weight of each segment seg_{i,j} is proportional to its length |seg_{i,j}|: h_i = Σ_j h_{i,j} × w_{i,j}, where w_{i,j} = |seg_{i,j}| / Σ_k |seg_{i,k}|. In Section 5.1, we explore other pooling methods, such as unweighted average pooling, and find that weighted pooling is the most effective. According to Table 2, most (99.696%) input instances can be divided into three or fewer segments. Therefore, we do not add an extra Transformer layer to obtain the sequence representation from these segments, as they are relatively short.
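A minimal sketch of the queue-rule slicing and the length-weighted pooling described above (function names are our own; the real implementation operates on token IDs and tensors rather than Python lists):

```python
def slice_segments(tokens, L=32):
    """Queue-rule slicing: consecutive, non-overlapping chunks of L tokens;
    the leftover tokens (at most L of them) form the final segment."""
    return [tokens[i:i + L] for i in range(0, len(tokens), L)]

def weighted_pool(seg_reprs, seg_lens):
    """Length-weighted average pooling: h_i = sum_j h_{i,j} * w_{i,j}, with
    w_{i,j} proportional to the segment's token length."""
    total = sum(seg_lens)
    dim = len(seg_reprs[0])
    return [sum(r[d] * n / total for r, n in zip(seg_reprs, seg_lens))
            for d in range(dim)]

segs = slice_segments(list(range(70)), L=32)
assert [len(s) for s in segs] == [32, 32, 6]   # l_i = 1 + (70 - 1) // 32 = 3
h_i = weighted_pool([[1.0], [2.0], [3.0]], [len(s) for s in segs])
# The short 6-token tail contributes proportionally less to h_i.
```

Note how the 70-token example yields exactly l_i = 3 segments, matching the formula above, and the final 6-token segment receives weight 6/70 in the pooled representation.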
To use HiCL with SNCSE, we slice the input sequence in the same way, but add the prompt to each segment instead of the entire sequence. We also apply the same method to the negated sentences.

Training Objectives
Local contrastive Previous studies have highlighted the benefits of local contrastive learning for unsupervised models (Wu et al., 2020; Giorgi et al., 2021). By enabling the model to focus on short sentences, local contrastive learning allows the model to better match the sentence length distribution, as longer sentences are less common. Building on the work of Gao et al. (2021b), we use dropout as a minimal data augmentation technique. We feed each segment seg_{i,j} twice to the encoder using different dropout masks p and p^+. This yields the positive pair h_{i,j} = E_{θ,p}(seg_{i,j}) and h_{i,j}^+ = E_{θ,p^+}(seg_{i,j}) for loss computation. As mentioned in Section 1, defining negatives for segments can be challenging. Using segments from the same sequence as negatives carries the risk of introducing correlations, but treating them as positive pairs is not ideal either. We choose not to use segments from the same sequence as either positive or negative pairs, and we show that this approach is better than the alternatives in Section 5.2. Hence, for segment seg_{i,j}, we only consider as negatives the segments from other sequences, {seg_{k,*}, k ∈ B \ {i}}. The local contrastive loss L_l is formalized as:

L_l = − Σ_{i,j} log ( e^{sim(h_{i,j}, h_{i,j}^+)/τ} / ( e^{sim(h_{i,j}, h_{i,j}^+)/τ} + Σ_{k∈B\{i}} Σ_m e^{sim(h_{i,j}, h_{k,m}^+)/τ} ) ).  (2)
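The exclusion rule above can be sketched as follows. This is a plain-Python illustration under our own assumptions: a toy dot-product similarity stands in for the cosine similarity of the real model, and all function names are ours:

```python
import math

def dot(a, b):  # toy similarity; the actual loss uses cosine similarity
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, tau=0.05):
    """Standard InfoNCE term: pull the anchor toward its positive, push it
    away from the negatives."""
    pos = math.exp(dot(anchor, positive) / tau)
    neg = sum(math.exp(dot(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

def local_loss(h, h_pos, seq_ids, tau=0.05):
    """Local contrastive loss sketch: for each segment, the positive is its
    second dropout view; negatives come only from *other* sequences.
    Same-sequence segments are excluded (neither positive nor negative)."""
    terms = []
    for j in range(len(h)):
        negs = [h_pos[k] for k in range(len(h)) if seq_ids[k] != seq_ids[j]]
        terms.append(info_nce(h[j], h_pos[j], negs, tau))
    return sum(terms) / len(terms)

# Three segments: the first two come from sequence 0, the third from sequence 1.
h     = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
h_pos = [[0.8, 0.2], [0.1, 0.9], [0.6, 0.4]]
loss = local_loss(h, h_pos, seq_ids=[0, 0, 1])
```

The `seq_ids[k] != seq_ids[j]` filter is the crucial line: segments sharing a sequence id never appear in each other's numerator or denominator.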

Global contrastive
The global contrastive objective is the same as that used by most baselines: it pulls a sequence's representation h_i closer to its positive h_i^+ while pushing it away from the in-batch negatives h_j, j ∈ B \ {i}, as defined by the global contrastive loss L_g in Eq. 1.
Overall objective The overall objective is a combination of the local and global contrastive losses, L = αL_l + (1 − α)L_g, where the weight α ∈ {0.01, 0.05, 0.15} is tuned per backbone model. Our adoption of a hybrid loss, with a lower weight assigned to the local contrastive objective, is motivated by the potential influence of the hard truncation process applied to the sequences. This process can cause information loss and atypical sentence beginnings that may undermine the effectiveness of the local contrastive loss. Meanwhile, a standalone global contrastive loss is equally inadequate, as it omits local observation. We conduct an analysis in Section 5.4 to discuss the intricate relationship between the two objectives.

Encoding Time Complexity Analysis
According to our slicing rule, all front segments seg_{i,j<l_i} in sequence seq_i have length L, and the last segment seg_{i,l_i} has length |seg_{i,l_i}| ≤ L. Hence, the encoding time complexity for HiCL is O(L² × (l_i − 1) + |seg_{i,l_i}|²), while conventional methods take O((L × (l_i − 1) + |seg_{i,l_i}|)²), which is about (l_i − 1) times more than that of HiCL. The longer the training corpus, the greater the benefit of using HiCL. This suggests that our approach has a variety of use cases, particularly in pre-training, due to its efficient encoding process.
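A quick back-of-the-envelope check of this ratio, assuming self-attention cost grows with the square of the tokens per encoder call (constants and linear terms ignored; the function names and numbers are illustrative):

```python
def hicl_cost(L, l_i, last_len):
    # HiCL encodes l_i - 1 full segments of length L plus one short tail.
    return (l_i - 1) * L ** 2 + last_len ** 2

def conventional_cost(L, l_i, last_len):
    # Conventional contrastive learning encodes the whole sequence at once.
    return ((l_i - 1) * L + last_len) ** 2

# A 512-token sequence sliced into l_i = 16 segments of length L = 32:
ratio = conventional_cost(32, 16, 32) / hicl_cost(32, 16, 32)
assert ratio == 16.0  # i.e. (l_i - 1) = 15 times more than HiCL's cost
```

With a full-length last segment the ratio is exactly l_i, so the gap widens linearly as sequences grow longer.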
The practical training time is detailed in Appendix A.1. In short, we are faster than the baselines when maintaining the same sequence truncation size of 512. For example, SimCSE-RoBERTa-large takes 354.5 minutes to train, while our method takes only 152 minutes.

Experimental Setup
Evaluation tasks Following prior works (Gao et al., 2021b; Wang et al., 2022), we mainly evaluate on seven standard semantic textual similarity datasets: STS12-16 (Agirre et al., 2012, 2013, 2014, 2015, 2016), STS Benchmark (Cer et al., 2017), and SICK-Relatedness (Marelli et al., 2014). When evaluating with the SentEval toolkit, we adopt SimCSE's evaluation setting without any additional regressor and use Spearman's correlation as the evaluation metric. Each method predicts the similarity, ranging from 0 to 5, between sentence pairs in the datasets. In our ablation study, we extend our evaluation to two lengthy datasets (i.e., Yelp (Zhang et al., 2015) and IMDB (Maas et al., 2011)) and seven transfer tasks (Conneau and Kiela, 2018). Due to space considerations, we provide detailed results for the seven transfer learning tasks in Appendix A.2.

Implementing competing methods We re-train three previous state-of-the-art unsupervised sentence embedding methods on STS tasks, namely SimCSE (Gao et al., 2021b), ESimCSE (Wu et al., 2022b), and SNCSE (Wang et al., 2022), with our novel training paradigm, HiCL. We employ the official implementations of these models and adhere to the unsupervised setting used by SimCSE. This involves training the models on a dataset comprising 1 million English Wikipedia sentences for a single epoch. More detailed information can be found in Appendix A.3. We also carefully tune the weight α of the local contrastive loss along with the learning rate in Appendix A.4.

Main Results
Table 1 presents the results of adding HiCL to various baselines on the seven STS tasks. All prior studies in the field have followed a common practice of employing the same random seed for a single run to ensure fair comparisons. We rigorously adhere to this convention when presenting our findings in Table 1. However, to further assess the robustness of our method, we extend our investigation by conducting multiple runs with varying random seeds. As shown in Table 15, we generally observe consistency between the multi-run results and the one-run results. The nine backbone models on which HiCL demonstrated superior performance under the default random seed continue to be outperformed by HiCL.

Relationships between Segments from Same Sequence
When optimizing the local contrastive objective, an alternative approach is to follow the traditional training paradigm, which treats all other segments as negatives. However, since different parts of a sequence might describe the same thing, segments from the same sequence can be more similar to each other than to a random segment. As a result, establishing the relationship between segments from the same sequence presents a challenge. We explore three different versions: 1) considering them as positive pairs, 2) treating them as negative pairs, and 3) categorizing them as neither positive nor negative pairs. The results for SimCSE and SNCSE in Table 4 indicate that the optimal approach is to treat them as neither positive nor negative, a conclusion that aligns with our expectations. An outlier to this trend is observed in the results for ESimCSE. We postulate that this anomaly arises from ESimCSE's use of token duplication to create positive examples, which increases the similarity of segments within the same sequence. Consequently, this outcome is not entirely unexpected. Given that ESimCSE's case is special, we believe that, in general, it is most appropriate to label them as neither positive nor negative.

Optimal Slicing Length
To verify the impact of slicing length on performance, we vary the slicing length from 16 to 40 for HiCL in the SimCSE-RoBERTa-large and SNCSE-RoBERTa-large settings (not counting prompts). We were unable to process longer lengths due to memory limitations. From the results in Figure 2, we find that using a short slicing length can negatively impact model performance.
As we showed in Section 3.3, the longer the slicing length, the slower it is to encode the whole input sequence, because longer slicing lengths mean fewer segments. Therefore, we recommend a default slicing length of 32, as it provides a good balance between performance and efficiency. We acknowledge that a non-fixed slicing strategy, such as truncation by punctuation, could potentially enhance performance. Nevertheless, in pursuit of maximum encoding efficiency, we opt for a fixed-length truncation approach, leaving the investigation of alternative strategies to future work.

How Hierarchical Training Helps?
One might argue that our proposed method benefits from the truncated portions of the training corpus, i.e., the parts exceeding the length limitations of the baselines and thus unprocessable by them. To address this concern, we reconstruct the training corpus such that any sequence longer than the optimal slicing length (32) is divided and stored as several independent sequences, each of which fits within the length limit. This allows the baseline models to retain the same information as HiCL. The resulting models are effectively trained with a local-only loss, given that they apply the contrastive loss to the segmented data. Table 6 shows that HiCL continues to exceed the performance of these segment-level-only contrastive learning models, indicating that its superior performance is not solely reliant on the recovered data. The lower performance exhibited by these models emphasizes the significance of incorporating a hybrid loss. The intriguing insight from the above observations is that the omitted data does not improve the performance of the baseline models; in fact, it hinders performance. We hypothesize that this is because a hard cut by length leaves some segments beginning with unusual tokens, making it more difficult for the encoder to accurately represent their meaning.
We further verify this hypothesis with an ablation study on HiCL using various values of α in the overall objective. Recall that our training objective is a combination of the local and global contrastive losses. The HiCL model with α = 0 is identical to the baselines, except that it incorporates the omitted information. As shown in Table 7, training with just the global contrastive loss on the complete information yields poorer results. Similarly, when α = 1, the HiCL model focuses solely on the local contrastive loss and also performs poorly, which indicates that the global contrastive loss is an essential component in learning sentence representations.

Table 7: Ablation study over α on SimCSE-RoBERTa-large on the SICK-R dev set (Spearman's correlation).
It is crucial to clarify that our approach is not just a variant of the baseline with a larger batch. As Table 2 indicates, in 99.9% of instances, the training data can be divided into four or fewer segments. Comparing SimCSE-BERT-base with a batch size quadruple that of our method (as shown in Table 10), it is evident that SimCSE at a batch size of 256 trails our model's performance at a batch size of 64 (74.00 vs. 76.35). Simply amplifying the batch size for the baselines also leads to computational issues. For example, SimCSE encounters an out-of-memory error at a batch size of 1024, a problem our model, at a batch size of 256, avoids. Therefore, our approach is distinct and offers benefits beyond merely adjusting batch sizes.

HiCL on Longer Training Corpus
To further verify the effectiveness of HiCL, we add a longer corpus, WikiText-103, to the original 1 million training sentences. WikiText-103 is a dataset that contains 103 million tokens from 28,475 articles. We adopt a smaller batch size of 64 to avoid out-of-memory issues. Other training details follow Section 4.1. As shown in Table 5, HiCL yields a larger improvement (+0.48%) than the version trained only on the short corpus (+0.25%). This indicates that HiCL is especially suitable for pre-training scenarios, particularly when the training corpus is relatively long.

HiCL on Longer Test Datasets
The widely evaluated datasets, such as STS12-16, STS-B, and SICK-R, primarily consist of short examples. However, given our interest in understanding how HiCL performs on longer and more complex tasks, we further conduct evaluations on the Yelp (Zhang et al., 2015) and IMDB (Maas et al., 2011) datasets.

A Variant of HiCL
As discussed in Section 5.2, the best approach for treating relationships between in-sequence segments is to consider them as neither positive nor negative. However, this loses the information that they belong to the same sequence.
To overcome this, we consider modeling the relationship between sequences and segments. Since each segment originates from a sequence, they inherently form an entailment relationship: the sequence entails the segment. We refer to this variant as HiCLv2. Additional details are provided in Appendix A.6. As shown in Table 9, explicitly modeling this sequence-segment relationship does not help the model. We believe this is probably because the objective forces the representation of each sequence to be closer to segments from the same sequence. When the representation of each sequence is a weighted average pooling of its segments, this pulls segments from the same sequence closer, which is another way of regarding them as positive. As seen in the results in Section 5.2, treating segments from the same sequence as positive negatively impacts the performance of the SimCSE and SNCSE backbones. Thus, it is not surprising that HiCLv2 fails to show as much improvement as HiCL.

Table 9: Average performance of HiCLv2 over the seven STS datasets.

Baseline Reproduction
One might wonder why there are notable discrepancies between our reproduced baselines and the numbers reported in the original papers. For instance, our SimCSE-BERT-base achieved a score of 74.00, while the original paper reported 76.25. This difference comes from the different hyperparameters we adopt. Different baselines adopt various configurations, such as batch size, training precision (fp16 or fp32), and other factors. Recognizing that these factors significantly influence the final results, our aim is to assess different baselines under consistent conditions. To clarify, it would be misleading to evaluate, say, SimCSE-BERT-base with a batch size of 64 while assessing SNCSE-BERT-base with a batch size of 256. Such discrepancies could obscure the true reasons behind performance gaps. Therefore, we use a unified batch size of 256 for base models and 128 for large models.
To address concerns about whether the proposed method still works under a baseline's optimal hyperparameters, we reassess the SimCSE-BERT-base model in Table 10. Whether we use SimCSE's optimal settings or our uniform configuration, our method consistently outperforms the baseline.
Lastly, we want to mention that some baselines actually benefit from our standardized setup.
For example, our reproduction of SimCSE-RoBERTa-base saw an increase, going from the originally reported 76.57 to 77.09.

Related Work

Contrastive learning The recent ideas on contrastive learning originate from computer vision, where data augmentation techniques such as AUGMIX (Hendrycks et al., 2020) and mechanisms like end-to-end training (Chen et al., 2020a), memory banks (Wu et al., 2018), and momentum (He et al., 2020) have been tested for computing and memory efficiency. In NLP, since the success of SimCSE (Gao et al., 2021b), considerable progress has been made on unsupervised sentence representation. This includes exploring text data augmentation techniques such as word repetition in ESimCSE (Wu et al., 2022b) for positive pair generation, randomly generated Gaussian noise in DCLR (Zhou et al., 2022), and negation of the input in SNCSE (Wang et al., 2022) for generating negatives. Other approaches design auxiliary networks to assist the contrastive objective (Chuang et al., 2022; Wu et al., 2022a). Recently, combinations of prompt tuning with contrastive learning have been developed in PromptBERT (Jiang et al., 2022) and SNCSE (Wang et al., 2022). With its successful design of negative sample generation and utilization of prompts, SNCSE achieved state-of-the-art performance on the broadly evaluated seven STS tasks. Our novel training paradigm, HiCL, is compatible with these previous works and can be easily integrated with them. In our experiments, we re-train SimCSE, ESimCSE, and SNCSE with HiCL, showing a consistent improvement in all models.
Hierarchical training The concept of hierarchical training has been proposed for long-text processing. Self-attention models like the Transformer (Vaswani et al., 2017) are limited by their input length, and truncation is their default method of processing text longer than the limit, which leads to information loss. To combat this issue, researchers have either designed hierarchical Transformers (Liu and Lapata, 2019; Nawrot et al., 2022) or adapted long inputs to fit existing Transformers (Zhang et al., 2019; Yang et al., 2020). Both solutions divide the long sequence into smaller parts, making full use of the whole input for increased robustness compared to methods using partial information. Additionally, hierarchical training is usually more time-efficient. SBERT (Reimers and Gurevych, 2019) employs a similar idea of hierarchical training. Instead of following traditional fine-tuning methods that concatenate two sentences into one for encoding in sentence-pair downstream tasks, SBERT found that separating the sentences and encoding them independently can drastically improve sentence embeddings. To our knowledge, we are the first to apply this hierarchical training technique to textual contrastive learning.

Conclusion
We introduce HiCL, the first hierarchical contrastive learning framework, highlighting the importance of incorporating a local contrastive loss into the prior training paradigm. We investigate the optimal way to handle the relationship between segments from the same sequence in the computation of the local contrastive loss. Despite the extra time required for slicing sequences into segments, HiCL significantly accelerates the encoding of traditional contrastive learning models, especially for long input sequences. Moreover, HiCL integrates seamlessly with various existing contrastive learning frameworks, enhancing their performance irrespective of their distinctive data augmentation techniques or foundational architectures. We employ SimCSE, ESimCSE, and SNCSE as case studies across seven STS tasks to demonstrate its scalability. Notably, our implementation with the SNCSE backbone model achieves new state-of-the-art performance. This makes our hierarchical contrastive learning method a promising approach for further research in this area.

A.1 Training Time
The practical training time can be complex to analyze, as the actual encoding time does not strictly follow a quadratic rule. However, our method demonstrates efficiency advantages when maintaining the same sequence truncation size. For example, while SimCSE-RoBERTa-large takes approximately 354.5 minutes to train, our method achieves the same task in just 152 minutes. The acceleration stems from two primary factors: 1) savings in encoding time, as discussed in Section 3.
A comparison between SimCSE-BERT-base (using its reported optimized hyperparameters and without adding the MLM loss) and our model can be found in Table 11. While our approach does not demonstrate enhancements across all tasks, it registers improvements in 5 of the 7 tasks, with each exceeding a 0.4% increase. The aggregate improvement is 0.39%.

A.3 Experimental setup
Different baselines utilize varying training setups, including batch size, training precision (fp16 or fp32), and other factors. To maximally ensure a fair comparison, we unify the training setup across all competing methods and strictly follow each baseline's hyperparameter tuning process to re-tune the optimal hyperparameters accordingly.
Specifically, we employ a batch size of 256 for base models and 128 for large models. All base models are trained using full precision (fp32), while large models are trained using half precision (fp16) to mitigate potential memory issues with certain models. It is worth noting that disparities in performance between our re-run models and the original baselines may arise from our intentional parameter adjustments to facilitate a direct comparison across baselines.
Following the procedures used by SimCSE and SNCSE, we evaluate the model every 125 training steps on the development set of STS-B and select the best checkpoint for the final evaluation on the test sets.

A.5 Multiple Runs
In addition to the conventional practice of comparing models under the same random seed, we test the generalizability of our proposed method using different random seeds, as presented in Table 15. An inherent challenge is deciding whether to re-tune the hyperparameters, considering that the values optimized under one initialization may vary under another. To investigate whether the optimized hyperparameters transfer effectively across random seeds, we opt not to re-tune them. HiCL improves performance across the identical nine backbone models shown in Table 1, thereby demonstrating its robust generalization capabilities. The average scores in the multi-run setting are uniformly lower than those in the one-run setting, possibly due to the lack of sufficient hyperparameter tuning. This lack of tuning may also result in a reduced performance gap between HiCL and the baseline models.

A.6 A Variant of HiCL
Sequence-segment entailment The sequence-segment entailment objective is designed as an alternative way to model the relationship between segments from the same sequence. Intuitively, segments from the same sequence are likely to be more similar to each other than to a random segment. However, this is not always the case, as segments from the same sequence can have contrary meanings. Modeling this relationship is difficult because the segments are highly correlated yet have no clear-cut relationship. To tackle this problem, we instead focus on modeling the relationship between a sequence and its segments. This is more straightforward: since we know that seg_{i,j} comes from seq_i, the two naturally form an entailment relationship. In doing so, we also retain the information about whether two segments come from the same sequence. Figure 3 provides an overview of this variant framework.
We employ a third contrastive objective to model the entailment relationship. Specifically, a segment seg_{i,j} is entailed by sequence seq_i but should not be entailed by seq_k, ∀k ≠ i. Therefore, we treat seg_{i,j} and seq_i as a positive pair, and all other sequences in the batch as negative pairs with seg_{i,j}. We optimize the following InfoNCE loss function:

L_e = − log ( e^{sim(h_{i,j}, h_i)/τ} / ( e^{sim(h_{i,j}, h_i)/τ} + Σ_{k≠i} e^{sim(h_{i,j}, h_k)/τ} ) ).  (3)

Overall objective of HiCLv2 The overall objective of HiCLv2 is a combination of the local contrastive, global contrastive, and entailment losses, given by L = αL_l + βL_e + (1 − α − β)L_g, where α and β are the weights.
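The entailment term of Eq. 3 can be sketched in plain Python (a toy dot-product similarity stands in for cosine similarity, and the function name is our own):

```python
import math

def entailment_loss(h_seg, seq_reprs, own, tau=0.05):
    """InfoNCE over sequence representations (sketch of Eq. 3): the segment's
    own sequence is the positive; every other in-batch sequence is a negative."""
    def dot(a, b):  # toy similarity in place of cosine
        return sum(x * y for x, y in zip(a, b))
    logits = [math.exp(dot(h_seg, h) / tau) for h in seq_reprs]
    return -math.log(logits[own] / sum(logits))

seqs = [[0.9, 0.1], [0.1, 0.9]]  # two in-batch sequence representations
loss_own   = entailment_loss([0.8, 0.2], seqs, own=0)  # segment from seq 0
loss_other = entailment_loss([0.8, 0.2], seqs, own=1)  # deliberately mismatched
assert loss_own < loss_other  # the entailed pairing yields the lower loss
```

The denominator runs over all in-batch sequences including the segment's own, matching the structure of Eqs. 1 and 2.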

A.7 Computing Infrastructure
All models were trained on a single NVIDIA A40 GPU with 48GB of memory for one epoch, using the same initialization, i.e., the same random seed as used by all baselines, for one run. The server has the following configuration: Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz (x86-64) running the CentOS 7 Linux operating system. PyTorch 1.7.1 is used as the programming framework.
Table 15: Results of various contrastive learning methods on seven semantic textual similarity (STS) datasets. We report average results across 3 runs with different initializations. Each method is evaluated on the full test sets by Spearman's correlation, "all" setting. Bold marks the best result among all competing methods under the same backbone model.

Figure 1 :
Figure 1: The overview of the HiCL framework with local contrastive and global contrastive objective.

Figure 2 :
Figure 2: Comparison between different slicing lengths over three backbone models by Spearman's correlation.

Table 1 :
Main results of various contrastive learning methods on seven semantic textual similarity (STS) datasets. Each method is evaluated on the full test sets by Spearman's correlation, "all" setting. Bold marks the best result among all competing methods under the same backbone model.

Table 2 :
The input length distribution of the training corpus. Special tokens [CLS] and [SEP] are counted.

Forming the representation of the entire sequence from segment representations is crucial to our task. We experiment with both weighted average pooling (weighted by the number of tokens in each segment) and unweighted average pooling. Since most sequences are divided into three or fewer segments (as shown in Table 2), we did not include an additional layer (either DNN or Transformer) to model relationships among ≤ 3 inputs. Therefore, we opted not to aggregate through a deep neural layer, even though this approach might work in scenarios where the training sequences are longer. The results in Table 3 indicate that, under three different backbones, weighted pooling is the better strategy for extracting the global representation from local segments.

Table 3 :
Development set results on SICK-R (Spearman's correlation) for different pooling strategies.

Table 4 :
Development set results on SICK-R (Spearman's correlation) for different treatments of the relationship between segments from the same sequence.

Table 5 :
Performance on the seven STS tasks for methods trained on WikiText-103 plus 1 million English Wikipedia sentences. Each method is evaluated on the full test sets by Spearman's correlation, "all" setting.
Table 8 provides an overview of these two datasets. Specifically, we test with the SimCSE-BERT-base backbone model and follow the evaluation settings outlined in SimCSE, refraining from any further fine-tuning on Yelp and IMDB. The results are compelling: our proposed method consistently outperforms SimCSE, achieving a performance gain of +1.97% on Yelp and +2.27% on IMDB.

Table 8 :
Statistics and evaluation results on Yelp and IMDB datasets.

Table 10 :
A comparison between the original SimCSE and our reproduced SimCSE on seven semantic textual similarity (STS) datasets. Each method is evaluated on the full test sets by Spearman's correlation, "all" setting.

Table 12 :
Training time in minutes for models with a sequence truncation size of 512.
b for base models, l for large models.

Table 11 :
Transfer task results of sentence embedding performance (evaluated as accuracy). Bold marks the best result among competing methods under the same backbone model. The optimized β is shown in Table 14.

Table 14 :
Optimized hyperparameters of HiCLv2 for the four models in Section 5.7. lr: learning rate.