Virtual Augmentation Supported Contrastive Learning of Sentence Representations

Despite profound successes, contrastive representation learning relies on carefully designed data augmentations using domain-specific knowledge. This challenge is magnified in natural language processing, where no general rules for data augmentation exist due to the discrete nature of language. We tackle this challenge by presenting Virtual augmentation Supported Contrastive Learning of sentence representations (VaSCL). Starting from the interpretation that data augmentation essentially constructs a neighborhood around each training instance, we in turn utilize the neighborhood to generate effective data augmentations. Leveraging the large batch sizes of contrastive learning, we approximate the neighborhood of an instance via its K-nearest in-batch neighbors in the representation space. We then define an instance discrimination task within this neighborhood and generate the virtual augmentation in an adversarial training manner. We assess the performance of VaSCL on a wide range of downstream tasks and set a new state-of-the-art for unsupervised sentence representation learning.


Introduction
Universal sentence representation learning has been a long-standing problem in Natural Language Processing (NLP). In the early stage, a common strategy was to leverage distributed word representations (Bengio et al., 2003; Mikolov et al., 2013; Collobert et al., 2011; Pennington et al., 2014) as the base features for producing sentence representations. However, these approaches are tailored to different target tasks, thereby yielding less generic sentence representations (Yessenalina and Cardie, 2011; Socher et al., 2013; Kalchbrenner et al., 2014; Cho et al., 2014). This issue has motivated more research effort on designing generic sentence-level learning objectives or tasks. Among them, supervised learning on the Natural Language Inference (NLI) datasets (Bowman et al., 2015a; Williams et al., 2017; Wang et al., 2018) has established benchmark transfer learning performance on various downstream tasks (Conneau et al., 2017; Cer et al., 2018; Reimers and Gurevych, 2019a). Despite promising progress, the high cost of collecting annotations precludes its wide applicability, especially when the target domain has scarce annotations and differs significantly from the NLI datasets. On the other hand, unsupervised learning of sentence representations has seen a resurgence of interest with the recent successes of self-supervised contrastive learning. These approaches rely on two main components: data augmentation and an instance-level contrastive loss. The popular contrastive learning objectives (Chen et al., 2020) and their variants have empirically shown their effectiveness in NLP. However, the discrete nature of text makes it challenging to establish universal rules for generating effective text augmentations.
Various contrastive learning based approaches have been proposed for sentence representation learning, where the main difference lies in how the augmentations are generated (Fang and Xie, 2020; Giorgi et al., 2020; Meng et al., 2021; Yan et al., 2021; Kim et al., 2021). Somewhat surprisingly, a recent work shows that Dropout (Srivastava et al., 2014), i.e., augmentations obtained by feeding the same instance to the encoder twice, outperforms common data augmentations that operate on the text directly, including cropping, word deletion, and synonym replacement. Again, this observation underlines the inherent difficulty of attaining effective data augmentations in NLP. This paper tackles the challenge by presenting a neighborhood-guided virtual augmentation strategy to support contrastive learning. In a nutshell, data augmentation essentially constructs a neighborhood around each instance in which the semantic content is preserved. We take this interpretation in the opposite direction by leveraging the neighborhood of an instance to guide augmentation generation. Benefiting from the large training batches of contrastive learning, we approximate the neighborhood of an instance via its K-nearest in-batch neighbors. We then define an instance discrimination task within this neighborhood and generate the virtual augmentation in an adversarial training manner. We run in-depth analyses and show that VaSCL leads to a more dispersed representation space in which the data semantics at different granularities are better captured. We evaluate our model on a wide range of downstream tasks and show that it consistently outperforms the previous state of the art by a large margin.

Related Work
Universal Sentence Representation Learning Arguably, the simplest and most common approaches for attaining sentence representations are bag-of-words (Harris, 1954) and variants thereof. However, bag-of-words suffers from data sparsity and insensitivity to word semantics. In the past two decades, distributed word representations (Bengio et al., 2003; Mikolov et al., 2013; Collobert et al., 2011; Pennington et al., 2014) became more effective base features for producing sentence representations. The downside is that these approaches are tailored to the target tasks (Yessenalina and Cardie, 2011; Socher et al., 2013; Kalchbrenner et al., 2014; Cho et al., 2014), and the resulting sentence representations thereby attain limited transfer learning performance.
More recent efforts focus on directly designing sentence-level learning objectives or tasks. In the supervised learning regime, Conneau et al. (2017); Cer et al. (2018) empirically show the effectiveness of leveraging the NLI task (Bowman et al., 2015a; Williams et al., 2017) to promote generic sentence representations. The task involves classifying each sentence pair into one of three categories: entailment, contradiction, or neutral. Reimers and Gurevych (2019b) further bolster the performance by using a pre-trained transformer (Devlin et al., 2018; Liu et al., 2019) as the backbone.
On the other end of the spectrum, Hill et al. (2016); Bowman et al. (2015b) propose using denoising or variational autoencoders for sentence representation learning. Kiros et al. (2015); Hill et al. (2016) extend the distributional hypothesis to the sentence level and train an encoder-decoder to reconstruct the surrounding context of each sentence. Alternatively, Logeswaran and Lee (2018) present a model that learns to discriminate the target context sentences from all contrastive ones.
Contrastive Learning Contrastive learning has been the pinnacle of recent successes in sentence representation learning. Recent work substantially advances the previous state-of-the-art results by leveraging the entailment sentences in NLI as positive pairs for optimizing properly designed contrastive loss functions. Nevertheless, we focus on unsupervised contrastive learning and form the positive pairs via data augmentation, since such methods are more cost-effective and applicable across different domains and languages. Along this line, several approaches have been proposed recently, where the augmentations are obtained via dropout, back-translation (Fang and Xie, 2020), surrounding context sampling (Logeswaran and Lee, 2018; Giorgi et al., 2020), or perturbations conducted at different semantic levels (Yan et al., 2021; Meng et al., 2021).

Consistency Regularization
Our work is also closely related to consistency regularization, which is often used to promote better performance by regularizing the model output to remain unchanged under plausible input variations that are often induced via data augmentations. Bachman et al. (2014); Sajjadi et al. (2016); Samuli and Timo (2017); Tarvainen and Valpola (2017) show that randomized data augmentations such as dropout, cropping, rotation, and flipping yield effective regularization. Berthelot et al. (2019, 2020); Verma et al. (2019) improve the performance by applying Mixup and its variants on top of stochastic data augmentations. However, data augmentation has long been a challenge in NLP, as there are no general rules for effective text transformations. An alternative comes to light when one considers that violations of consistency regularization can in turn be used to find the perturbation to which a model is most sensitive. Therefore, we utilize consistency regularization to promote an informative virtual augmentation for each training instance in the representation space, while leveraging its approximated neighborhood to regularize the augmentation to share the semantic content of its original instance.

Figure 1: Illustration of VaSCL. For each instance x_i in a randomly sampled batch, we optimize (i) an instance-wise contrastive loss with the dropout-induced augmentation obtained by forwarding the same instance twice, i.e., x_i and x_i' denote the same text example; and (ii) a neighborhood-constrained instance discrimination loss with virtual augmentation (see Section 3.2). The two panels depict contrastive learning w/ dropout and neighborhood-constrained contrastive learning w/ virtual augmentation, respectively.

Preliminaries
Self-supervised contrastive learning often aims to solve an instance discrimination task. In our scenario, let f denote the transformer encoder that maps the i-th input sentence x_i to its representation vector e_i = f(x_i). Further, let h be the contrastive learning head and z_i = h(f(x_i)) denote the final output for x_i. Let B = {i, i'}_{i=1}^{M} denote the indices of a randomly sampled batch of paired examples, where x_i, x_i' are two independent variations of the i-th instance. A popular loss function (Chen et al., 2020) for contrastive learning is defined as follows:

\ell_B(z_i, z_{i'}) = -\log \frac{\exp(\mathrm{sim}(z_i, z_{i'})/\tau)}{\sum_{j \in B, j \neq i} \exp(\mathrm{sim}(z_i, z_j)/\tau)} ,    (1)

where τ is the temperature hyper-parameter and sim(·,·) denotes the cosine similarity, i.e., sim(z_i, z_{i'}) = z_i^T z_{i'} / (‖z_i‖_2 ‖z_{i'}‖_2). Similarly, \ell_B(z_{i'}, z_i) is defined by exchanging the roles of z_i and z_{i'} in the above equation. Intuitively, Equation (1) defines the log-likelihood of classifying the i-th instance as its positive i' among all 2M − 1 candidates within the same batch B. Therefore, minimizing the above log-loss guides the encoder to map each positive pair close together in the representation space, and negative pairs further apart.
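The loss of Equation (1) can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the authors' code; the batch layout (the two views stacked as row-aligned matrices) and the temperature value are assumptions for the sketch.

```python
import numpy as np

def _logsumexp(a, axis):
    # numerically stable log-sum-exp; -inf entries contribute zero mass
    amax = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(amax, axis) + np.log(np.sum(np.exp(a - amax), axis=axis))

def nt_xent(z, z_prime, tau=0.05):
    """Mean of the Eq. (1) loss over both directions for M positive pairs.

    z, z_prime: (M, d) outputs of the contrastive head; row i of each
    array is one of the two views of the i-th instance.
    """
    m = z.shape[0]
    reps = np.concatenate([z, z_prime], axis=0)                # (2M, d)
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)  # unit-normalize
    sim = reps @ reps.T / tau                                  # cosine sims / temperature
    np.fill_diagonal(sim, -np.inf)                             # exclude self-similarity
    pos = np.concatenate([np.arange(m) + m, np.arange(m)])     # index of each row's positive
    log_prob = sim[np.arange(2 * m), pos] - _logsumexp(sim, axis=1)
    return float(-log_prob.mean())
```

As a sanity check, the loss should be much smaller when the two views of each instance are nearly identical than when they are unrelated, since the softmax then concentrates on the positive.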
Dropout based contrastive learning As Equation (1) implies, the success of contrastive learning relies on the construction of effective positive pairs. However, it is challenging to generate strong and effective data transformations in NLP due to the discrete nature of natural language. This challenge is further demonstrated by a recent work, which shows that augmentations obtained through Dropout (Srivastava et al., 2014), i.e., z_i, z_i' obtained by forwarding the same instance x_i twice, outperform common text augmentation strategies such as cropping, word deletion, and synonym replacement. Dropout provides a natural data augmentation by randomly masking its inputs or the hidden-layer nodes. The effectiveness of using Dropout as a pseudo data augmentation can be traced back to Bachman et al. (2014); Samuli and Timo (2017); Tarvainen and Valpola (2017). Nevertheless, the augmentation strength is weak with Dropout alone. There is room for improvement, which we investigate in the following section.

Neighborhood Constrained Contrastive Learning with Virtual Augmentation

In essence, data augmentation can be interpreted as constructing the neighborhood of a training instance, with the semantic content being preserved.
In this section, we take this interpretation in the opposite direction and leverage the neighborhood of each instance to generate its augmentation. To be more specific, let B̃ = {i}_{i=1}^{M} denote the indices of a randomly sampled batch with M examples. We first approximate the neighborhood N(i) of the i-th instance as its K-nearest neighbors in the representation space,

N(i) = {k : e_k has top-K similarity with e_i among all other M − 1 instances in B̃}.

We then define an instance-level contrastive loss regarding the i-th instance and its neighborhood as follows:

\ell_{N(i)}(z_i^{\delta}) = -\log \frac{\exp(\mathrm{sim}(z_i, z_i^{\delta})/\tau)}{\exp(\mathrm{sim}(z_i, z_i^{\delta})/\tau) + \sum_{k \in N(i)} \exp(\mathrm{sim}(z_k, z_i^{\delta})/\tau)} .    (2)

In the above equation, z_i^δ = h(e_i^δ) denotes the output of the contrastive learning head with the perturbed representation e_i^δ = e_i + δ_i as input. Here, the initial perturbation δ_i is chosen as isotropic Gaussian noise. Equation (2) gives the negative log-likelihood of classifying the perturbed i-th instance as itself rather than as one of its neighbors. The augmentation of the i-th instance is then retained by identifying the perturbation that maximally disturbs its instance-level identity within the neighborhood. That is,

\delta_i^{*} = \arg\max_{\|\delta_i\|_2 \le \epsilon} \ell_{N(i)}(z_i^{\delta}), \qquad e_i^{*} = e_i + \delta_i^{*}.    (3)

For the i-th instance, denote by N_A(i) the augmented neighborhood that consists of its K nearest neighbors and their associated augmentations. That is, N_A(i) = {k, k^*}_{k=1}^{K}, with e_k and e_{k^*} denoting the original and augmented representations of the k-th nearest neighbor of instance i, respectively. Here, each augmentation e_{k^*} is obtained by solving Equation (3) with respect to the neighborhood N(k) of e_k.
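The in-batch neighborhood approximation can be sketched as follows; a minimal NumPy sketch, assuming cosine similarity in the representation space and a batch matrix E of row vectors (the function name is ours, not the paper's).

```python
import numpy as np

def in_batch_neighborhoods(E, K):
    """Approximate N(i) for every instance in the batch: the indices of the
    K most cosine-similar rows of E, excluding the instance itself.

    E: (M, d) batch of representation vectors e_i.
    Returns an (M, K) array of neighbor indices.
    """
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sim = En @ En.T                                    # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # an instance is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :K]             # top-K per row
```

With a large contrastive batch, this O(M^2 d) in-batch search stands in for a full nearest-neighbor index over the corpus.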
We then discriminate the i-th instance and its augmentation from the augmented neighborhood N_A(i):

L_{N_A}(i) = \ell_{N_A(i)}(z_i) + \ell_{N_A(i)}(z_i^{*}).    (4)

Both terms on the right-hand side are defined in the same way as Equation (2), with respect to the augmentation e_i^* and the augmented neighborhood N_A(i).

Putting it all together Therefore, for each randomly sampled minibatch B̃ with M samples, we minimize the following:

L = \frac{1}{M} \sum_{i=1}^{M} \left[ \ell_{\tilde{B}}(z_i, z_{i'}) + \ell_{N_A(i)}(z_i) + \ell_{N_A(i)}(z_i^{*}) \right].    (5)

The last two terms on the right-hand side are defined in Equation (4). Notice that \ell_{\tilde{B}}(z_i, z_{i'}) is defined in the same way as Equation (1), except that z_i, z_i' are obtained by feeding the i-th instance in B̃ to the encoder twice. In summary, two instance discrimination tasks are posed for each training example: (i) discriminating each instance and its dropout-induced variation from the other in-batch instances; and (ii) separating each instance and its virtual augmentation from its K nearest neighbors and their associated virtual augmentations.
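Equations (2)-(3) can be illustrated with a toy sketch. In practice the gradient of Equation (2) would come from autograd (e.g., PyTorch) and the perturbation would pass through the contrastive head h; here, purely for illustration, we work directly on the representations, use a finite-difference gradient, and take a single FGSM-style ascent step. The step size `eps`, noise scale `sigma`, and function names are our assumptions, not the paper's hyper-parameters.

```python
import numpy as np

def neighborhood_loss(e_i, delta, nbrs, tau=0.05):
    """Eq. (2): negative log-likelihood of classifying the perturbed
    representation e_i + delta as instance i rather than as a neighbor."""
    z = e_i + delta
    cands = np.vstack([e_i[None, :], nbrs])  # row 0 is the instance itself
    sims = cands @ z / (np.linalg.norm(cands, axis=1) * np.linalg.norm(z)) / tau
    m = sims.max()
    return float(-(sims[0] - (m + np.log(np.exp(sims - m).sum()))))

def virtual_augmentation(e_i, nbrs, eps=0.1, sigma=1e-3, h=1e-5, seed=0):
    """Eq. (3) with one FGSM-style step: start from isotropic Gaussian noise,
    ascend the loss, and project the perturbation onto the eps-ball."""
    rng = np.random.default_rng(seed)
    delta = sigma * rng.normal(size=e_i.shape)  # initial isotropic Gaussian noise
    grad = np.zeros_like(delta)
    for d in range(delta.size):  # finite differences; use autograd in practice
        step = np.zeros_like(delta)
        step[d] = h
        grad[d] = (neighborhood_loss(e_i, delta + step, nbrs)
                   - neighborhood_loss(e_i, delta - step, nbrs)) / (2 * h)
    delta_star = eps * grad / (np.linalg.norm(grad) + 1e-12)
    return e_i + delta_star  # virtual augmentation e_i^*
```

The returned point stays within the eps-ball around e_i while incurring a higher neighborhood loss than the unperturbed representation, i.e., it maximally disturbs the instance's identity among its neighbors to first order.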

Experiment
In this section, we mainly evaluate VaSCL against SimCSE, which leverages dropout-induced (Srivastava et al., 2014) noise as data augmentation. We show that VaSCL consistently outperforms SimCSE on various downstream tasks that involve semantic understanding at different granularities. We carefully study the regularization effects of VaSCL and empirically demonstrate that VaSCL leads to a more dispersed representation space with the semantic structure better encoded. Please refer to Appendix A for details of our implementation and the datasets used.

Evaluation Datasets
In addition to the popular semantic textual similarity (a.k.a. STS) tasks, we evaluate two additional downstream tasks: short text clustering and few-shot intent classification. Our motivation is twofold. First, these two tasks provide a new evaluation aspect that complements the pairwise similarity-oriented STS evaluation by assessing the high-level categorical semantics encoded in the representations. Second, two desired challenges are posed: short text clustering requires more effective representations due to the weak signal each text example manifests, and intent classification often suffers from data scarcity, since intents can vary significantly over different dialogue systems and intent examples are costly to collect.
Semantic Textual Similarity We evaluate the STS 2012-2016 tasks (Agirre et al., 2012, 2013, 2014, 2015, 2016), the STS Benchmark (Cer et al., 2017), and SICK-Relatedness (Marelli et al., 2014). For each sentence pair in these datasets, a fine-grained similarity score ranging from 0 to 5 is provided.
Short Text Clustering Compared with general text clustering, short text clustering poses its own challenge due to the weak signal contained in each instance. Nevertheless, texts containing only a few words are growing at unprecedented rates on a wide range of popular platforms, including Reddit, StackOverflow, Twitter, and Instagram. Clustering those texts into groups of similar texts plays a crucial role in many real-world applications, such as topic discovery (Kim et al., 2013), trend detection (Mathioudakis and Koudas, 2010), and recommendation (Bouras and Tsogkas, 2017). We evaluate six benchmark datasets for short text clustering. As shown in Table 4, the datasets present the desired diversity regarding both the cluster sizes and the number of clusters contained in each dataset.
Intent Classification Intent classification aims to identify the intents of user utterances, which is a critical component of goal-oriented dialog systems. Attaining high intent classification accuracy is an important step towards solving many downstream tasks, such as dialogue state tracking (Zhang et al., 2019) and dialogue management (Gao et al., 2018; Ham et al., 2020). A practical challenge is data scarcity, because different systems define different sets of intents, and it is costly to obtain enough utterance samples for each intent. Therefore, few-shot learning has attracted much attention in this scenario, and it is also our main focus. We evaluate four intent classification datasets originating from different domains. We summarize the data statistics in Appendix B.1.

Evaluation Setup
Semantic Textual Similarity. Same as Reimers and Gurevych (2019b), in Table 1 we report the Spearman correlation between the cosine similarity of the sentence representation pairs and the ground-truth similarity scores. Short Text Clustering. We evaluate the sentence representations using K-Means (MacQueen et al., 1967; Lloyd, 1982), given its simplicity, and report the clustering accuracy averaged over ten independent runs in Table 2. Intent Classification. We freeze the transformer and fine-tune a linear classification layer with the softmax-based cross-entropy loss. We merge the training and validation sets, from which we sample K training and K validation samples per class. We report the mean and standard deviation of the test classification accuracy evaluated over five different splits in Table 3. We set the learning rate to 1e-04 and the batch size to 32, and train the model for 1000 iterations on each task.

Table 3: Few-shot learning evaluation of Intent Classification. Each result is aggregated over 5 independent splits. We choose RoBERTa-base as the backbone.
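The STS protocol reduces to ranking sentence-pair cosine similarities against the gold scores. A minimal sketch of this evaluation is shown below; tie handling is omitted for brevity (libraries such as `scipy.stats.spearmanr` handle ties properly), and the function names are ours.

```python
import numpy as np

def spearman(pred, gold):
    """Spearman correlation between two score lists (no tie correction)."""
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(len(x))  # rank of each element
        return r
    a, b = ranks(np.asarray(pred)), ranks(np.asarray(gold))
    a -= a.mean()
    b -= b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def sts_eval(emb_a, emb_b, gold):
    """Score sentence pairs by cosine similarity, then rank-correlate with gold."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return spearman((a * b).sum(axis=1), gold)
```

Because only ranks matter, Spearman correlation rewards representations whose similarity ordering matches human judgments, regardless of the absolute similarity scale.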

Evaluation Results
We report the evaluation results in Tables 1, 2, and 3. As we can see, both SimCSE and VaSCL largely improve the performance of the pre-trained language models, while VaSCL consistently outperforms SimCSE on most tasks. To be more specific, we attain 0.6%-2.1% average absolute improvement over SimCSE on the seven STS tasks and 1.8%-6.9% average absolute improvement on the six short text clustering tasks. We also achieve considerable improvement over SimCSE on intent classification under different few-shot learning scenarios. We do not include the evaluation on ATIS in Table 3, as this dataset is highly imbalanced, with a single class accounting for more than 73% of the data. Please refer to Appendix C for details.

Analysis
To better understand what enables the good performance of VaSCL, we carefully analyze the representations at different semantic granularities.

Neighborhood Evaluation on Categorical Data
We first evaluate the neighborhood statistics on StackOverflow (Xu et al., 2017), which contains 20 balanced categories, each with 1000 text instances. For each instance, we retrieve its K nearest (top-K) neighbors in the representation space, among which those from the same class as the instance itself are treated as positives. In Figure 2a, we report both the percentage of true positives and the average distance of an instance to its top-K neighbors. For each top-K value, the evaluation is averaged over all 20,000 instances. As indicated by the small distance values reported in Figure 2a, the representation space of the original RoBERTa model is tight and incapable of uncovering the categorical structure of the data. In contrast, both VaSCL and SimCSE are capable of scattering representations apart while better capturing the semantic structures. Compared with SimCSE, VaSCL leads to an even more dispersed representation space with categorical structures better encoded. This is also demonstrated by the better performance attained on both clustering and few-shot learning reported in Tables 2 & 3.
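The two neighborhood statistics of Figure 2a (true-positive percentage and average neighbor distance) can be computed as follows; a sketch under the assumption of cosine distance, with hypothetical function names.

```python
import numpy as np

def topk_neighborhood_stats(E, labels, K):
    """For each instance: the fraction of its K nearest neighbors (by cosine
    similarity) sharing its label, and its mean cosine distance to them.
    Returns both quantities averaged over all instances."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = En @ En.T
    np.fill_diagonal(sim, -np.inf)                  # exclude the instance itself
    nbrs = np.argsort(-sim, axis=1)[:, :K]          # (N, K) neighbor indices
    tp_rate = (labels[nbrs] == labels[:, None]).mean()
    mean_dist = (1.0 - sim[np.arange(len(E))[:, None], nbrs]).mean()
    return float(tp_rate), float(mean_dist)
```

On well-separated class clusters this yields a true-positive rate near 1; a high rate combined with large neighbor distances is what characterizes the dispersed-yet-structured space the paper attributes to VaSCL.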

Fine-grained Semantic Understanding
We then compare VaSCL against SimCSE and RoBERTa on encoding more fine-grained semantic concepts. We randomly sample 20,000 premises from the combined set of SNLI (Bowman et al., 2015a) and MNLI (Williams et al., 2017), along with the associated entailment and contradiction hypotheses for each premise. In Figure 2b (left), we report the distributions of the pairwise distances of the entailment and contradiction pairs; on the right-hand side, we plot the distance of each premise to its entailment hypothesis against that to its contradiction hypothesis. We observe the same trend: both SimCSE and VaSCL separate different instances well apart in the representation space while better discriminating each premise's entailment hypothesis from its contradiction hypothesis. Figure 2b also demonstrates that VaSCL outperforms SimCSE at capturing fine-grained semantics while separating different instances apart. This advantage of VaSCL is further validated by Table 1, where VaSCL consistently outperforms SimCSE on the STS tasks, which require pairwise semantic inference at an even more fine-grained scale.

Explicit Data Augmentation
To better evaluate our virtual augmentation-oriented VaSCL model, we compare it against different explicit data augmentation strategies that operate directly on the discrete text. Specifically, we consider the following approaches: WDel (random word deletion) removes words from the input text at random; WNet (WordNet synonym substitution) transforms a text instance by replacing its words with their WordNet synonyms (Morris et al., 2020; Ren et al., 2019); and CTxt (contextual synonym substitution) leverages pre-trained transformers to find the top-n most suitable words of the input text for substitution (Kobayashi, 2018). For each strategy, we evaluate three augmentation strengths by changing 5%, 10%, and 20% of the words of each text instance. For a positive pair (x_i, x_i'), x_i denotes the original text and x_i' the associated augmentation. We also explore the case where both x_i and x_i' are transformations of the original text, which we find yields worse performance.
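Of the three strategies, random word deletion (WDel) is the simplest to sketch; the helper below is our illustration, not the paper's implementation (the WNet and CTxt variants are typically built on libraries such as TextAttack or nlpaug).

```python
import random

def word_delete(text, p=0.1, seed=0):
    """Randomly drop each whitespace token with probability p,
    keeping at least one word so the augmentation is never empty."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() >= p]
    return " ".join(kept) if kept else rng.choice(words)
```

The augmentation strength in the experiments corresponds to the deletion probability p (e.g., p = 0.05, 0.1, 0.2 to change roughly 5%, 10%, and 20% of the words).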
Virtual Augmentation Performs Better The performance of explicit text augmentation is evaluated using the standard dropout for training, i.e., "SimCSE w/ {WDel/WNet/CTxt}" in Figure 3. As Figure 3a shows, contrastive learning with moderate explicit text augmentations, i.e., augmentation strength less than 20%, does yield better sentence representations when compared with the original RoBERTa model. Nevertheless, both virtual augmentation strategies, i.e., SimCSE and VaSCL, substantially outperform all three explicit text augmentation strategies on almost all downstream tasks. Although a bit surprising, especially considering the performance gap between SimCSE and the explicit augmentations, this comparison provides a new perspective on the underlying challenge of designing effective transformations that operate directly on discrete text. Figure 3a also empirically demonstrates that VaSCL outperforms SimCSE both with and without explicit text augmentations. The only exception occurs when the explicit augmentation strength is too large, i.e., when 20% of the words of each text are perturbed. One possible explanation is that large perturbations on the discrete text directly generate undesired noise, which can violate the coherent semantics maintained by a neighborhood and hence make it hard for VaSCL to generate effective virtual augmentations.

New Linguistic Patterns Are Required Another observation drawn from Figure 3a is that explicit text augmentations also degrade the performance of VaSCL in most cases. This is undesired, as we expect a win-win outcome in which moderate explicit augmentations further enhance VaSCL. We hypothesize that the new and informative linguistic patterns required for the expected performance gain are missing.
To validate our hypothesis, in Figure 3b we report the cosine similarity between each original training example and its augmentation, evaluated in the representation spaces of different models. Our observation is twofold. First, the representations induced by RoBERTa and by the model trained with contextual synonym substitution ("SimCSE w/ CTxt") are very similar in all three settings, which also explains why "SimCSE w/ CTxt" attains performance similar to RoBERTa on the downstream tasks. We attribute this to the fact that CTxt leverages the transformer itself to generate augmentations, which hence carry limited unseen and effective linguistic patterns. Second, as indicated by the comparatively smaller similarity values in Figure 3b, the incorporation of explicit augmentations tightens the representation spaces of both SimCSE and VaSCL, which also results in worse downstream performance. One possible explanation is that all three explicit augmentations are weak and noisy, which harms both the instance discrimination force and the semantic relevance of each neighborhood.

Conclusion
In this paper, we present a virtual augmentation-oriented contrastive learning framework for unsupervised sentence representation learning. Our key insight is that data augmentation can be interpreted as constructing the neighborhood of each training instance, which can, in turn, be leveraged to generate effective data augmentations. We evaluate VaSCL on a wide range of downstream tasks and substantially advance the state-of-the-art results. Moreover, we conduct in-depth analyses and show that VaSCL leads to a more dispersed representation space with the data semantics at different granularities being better encoded.
On the other hand, we observe a performance drop of both SimCSE and VaSCL when combined with the explicit text augmentations. We suspect this is caused by the linguistic patterns generated by explicit augmentations being less informative yet noisy. We hypothesize that effective data augmentation operations on discrete text could complement our virtual augmentation approach if new and informative linguistic patterns are generated.

A Implementation Details

models with SimCSE. We found 3e-05 yields better performance. For both SimCSE and VaSCL, we set the batch size to 1024, train all models for five epochs, and evaluate on the development set of STS-B every 500 iterations. We report all our evaluations on the downstream tasks with the checkpoints attaining the best performance on the validation set of STS-B.

B.1 Intent Classification Datasets

As we can see here, SNIPS is limited to only a small number of classes, which oversimplifies the intent detection task and does not emulate the true environment of commercial systems. The remaining three datasets contain much more diversity and are more challenging.

B.2 Short Text Clustering Datasets

• SearchSnippets is extracted from web search snippets and contains 12,340 snippets associated with 8 groups (Phan et al., 2008).
• StackOverflow is a subset of the challenge data published by Kaggle, where 20,000 question titles associated with 20 different categories were selected by Xu et al. (2017).
• Biomedical is a subset of PubMed data distributed by BioASQ, where 20,000 paper titles from 20 groups were randomly selected by Xu et al. (2017).
• GoogleNews contains titles and snippets of 11,109 news articles related to 152 events (Yin and Wang, 2016).

ATIS (Hemphill et al., 1990) is a benchmark for the air travel domain. This dataset is highly imbalanced, with the largest class containing 73% of all training and validation examples. Moreover, more than 60% of the classes have fewer than 20 examples. We thereby exclude this task from our evaluation.