CLINE: Contrastive Learning with Semantic Negative Examples for Natural Language Understanding

Although pre-trained language models have proven useful for learning high-quality semantic representations, these models are still vulnerable to simple perturbations. Recent work on improving the robustness of pre-trained models mainly focuses on adversarial training from perturbed examples with similar semantics, neglecting the utilization of different or even opposite semantics. Unlike the image processing field, text is discrete, and a few word substitutions can cause significant semantic changes. To study the semantic impact of small perturbations, we conduct a series of pilot experiments and, surprisingly, find that adversarial training is useless or even harmful for the model when it comes to detecting these semantic changes. To address this problem, we propose Contrastive Learning with semantIc Negative Examples (CLINE), which constructs semantic negative examples in an unsupervised manner to improve robustness under semantically adversarial attack. By comparing with similar and opposite semantic examples, the model can effectively perceive the semantic changes caused by small perturbations. Empirical results show that our approach yields substantial improvements on a range of sentiment analysis, reasoning, and reading comprehension tasks. CLINE also ensures compactness within the same semantics and separability across different semantics at the sentence level.


Introduction
Pre-trained language models (PLMs) such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have proven to be an effective way to improve various natural language processing tasks. However, recent works show that PLMs suffer from poor robustness when encountering adversarial examples (Jin et al., 2020; Garg and Ramakrishnan, 2020; Zang et al., 2020; Lin et al., 2020a). As shown in Table 1, the BERT model can be fooled easily just by replacing "ultimately" with a similar word, "lastly".

* Equal contribution. This work was mainly done when Dong Wang was an intern at Tencent AI Lab. † Corresponding authors.
To improve the robustness of PLMs, recent studies attempt to adopt adversarial training on PLMs, which applies gradient-based perturbations to the word embeddings during training (Miyato et al., 2017; Zhu et al., 2020; Jiang et al., 2020) or adds high-quality adversarial textual examples to the training phase (Wang and Bansal, 2018; Michel et al., 2019). The primary goal of these adversarial methods is to keep the label unchanged when the input undergoes small changes. These models yield promising performance by constructing high-quality perturbed examples and adopting adversarial mechanisms. However, due to the discrete nature of natural language, in many cases small perturbations can cause significant changes in the semantics of sentences. As shown in Table 1, a negative sentiment can be turned into a positive one by changing only one word, yet the model cannot recognize the change. Some recent works create contrast sets (Kaushik et al., 2020; Gardner et al., 2020), which manually perturb the test instances in small but meaningful ways that change the gold label. In this paper, we denote the perturbed examples with unchanged semantics as adversarial examples and the ones with changed semantics as contrastive examples. Most methods for improving the robustness of PLMs focus on the former; little work pays attention to semantic negative examples.
This phenomenon makes us wonder: can we train a BERT that is both defensive against adversarial attacks and sensitive to semantic changes by using both adversarial and contrastive examples? To answer this, we need to assess whether current robust models are also semantically sensitive. We conduct a set of pilot experiments (Section 2) to compare the performance of vanilla PLMs and adversarially trained PLMs on contrastive examples. We observe that while adversarial training improves the robustness of PLMs against adversarial attacks, their performance on contrastive examples drops.
To train a robust, semantic-aware PLM, we propose Contrastive Learning with semantIc Negative Examples (CLINE). CLINE is a simple and effective method to generate adversarial and contrastive examples and to learn contrastively from both of them. The contrastive manner has shown effectiveness in learning sentence representations (Luo et al., 2020; Gao et al., 2021), yet these studies neglect the generation of negative instances. In CLINE, we use external semantic knowledge, i.e., WordNet (Miller, 1995), to generate adversarial and contrastive examples by replacing a few specific representative tokens in an unsupervised manner. Equipped with replaced token detection and contrastive objectives, our method gathers sentences with semblable semantics and disperses ones with different or even opposite semantics, simultaneously improving the robustness and semantic sensitivity of PLMs. We conduct extensive experiments on several widely used text classification benchmarks to verify the effectiveness of CLINE. To be more specific, our model achieves a +1.6% absolute improvement on 4 contrastive test sets and a +0.5% absolute improvement on 4 adversarial test sets compared to the RoBERTa model (Liu et al., 2019). That is, by training on the proposed objectives, CLINE simultaneously gains robustness to adversarial attacks and sensitivity to semantic changes. (The source code of CLINE will be publicly available at https://github.com/kandorm/CLINE.)

Pilot Experiment and Analysis
To study how the adversarial training methods perform on the adversarial set and contrastive set, we first conduct pilot experiments and detailed analyses in this section.

Model and Datasets
There are a considerable number of studies constructing adversarial examples to attack large-scale pre-trained language models; we select a popular method, TextFooler (Jin et al., 2020), as the word-level adversarial attack model to construct adversarial examples. Recently, many researchers have created contrast sets to more accurately evaluate a model's true linguistic capabilities (Kaushik et al., 2020; Gardner et al., 2020). Based on these methods, the following datasets are selected to construct adversarial and contrastive examples in our pilot experiments and analyses: IMDB (Maas et al., 2011) is a sentiment analysis dataset; the task is to predict the sentiment (positive or negative) of a movie review.
SNLI (Bowman et al., 2015) is a natural language inference dataset to judge the relationship between two sentences: whether the second sentence can be derived from entailment, contradiction, or neutral relationship with the first sentence.
To improve the generalization and robustness of language models, many adversarial training methods that minimize the maximal risk for label-preserving input perturbations have been proposed; we select the adversarial training method FreeLB (Zhu et al., 2020) for our pilot experiment. We evaluate the vanilla BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), and their FreeLB versions, on the adversarial set and the contrastive set. Table 2 shows a detailed comparison of the different models on the adversarial test set and the contrastive test set. From the results, we can observe that, compared to the vanilla version, the adversarial training method FreeLB achieves higher accuracy on the adversarial sets but suffers a considerable performance drop on the contrastive sets, especially for BERT. The results are consistent with the intuition in Section 1 and also demonstrate that adversarial training is not suitable for the contrastive set and even brings negative effects. Intuitively, adversarial training tends to keep labels unchanged, while the contrastive set tends to make small but label-changing modifications. Adversarial training and contrastive examples seem to constitute a natural contradiction, revealing that additional strategies need to be applied in the training phase to detect fine-grained changes of semantics. We provide a case study in Section 2.3, which further shows this difference.

Table 2: Accuracy (%) on the adversarial set (Adv) compared to the contrastive set (Rev) of vanilla models and adversarially trained models.

Table 3: An example from the IMDB contrastive set. "Jim Henson's Muppets were a favorite of mine since childhood. This film on the other hand makes me feel dizziness in my head. You will see cameos by the then New York City Mayor Ed Koch. Anyway, the film turns 25 this year and I hope the kids of today will learn to appreciate the lightheartedness of the early Muppets Gang over this. It might be worth watching for kids but definitely not for knowledgeable adults like myself." Label: Negative. Prediction: Positive.

Case Study
To further understand why the adversarial training method fails on the contrastive sets, we carry out a thorough case study on IMDB. The examples we choose here are predicted correctly by the vanilla version of BERT but incorrectly by the FreeLB version. For the example in Table 3, we can observe that many parts of the sentence express positive sentiment (red parts), and a few parts express negative sentiment (blue parts). Overall, this case expresses negative sentiment, and the vanilla BERT can accurately capture the negative sentiment of the whole document. However, the FreeLB version of BERT may take the features of negative sentiment as noise and predict the whole document as expressing positive sentiment. This result indicates that the adversarially trained BERT can be fooled in a way that is the reverse of traditional adversarial training. From this case study, we can observe that adversarial training methods may not be suitable for these semantically changed adversarial examples, and to the best of our knowledge, there is no defense method for this kind of adversarial attack. Thus, it is crucial to explore appropriate methods to learn changed semantics from semantic negative examples.

Method
Motivated by the observations in Section 2, we explore strategies that could improve the sensitivity of PLMs. In this section, we present CLINE, a simple and effective method to generate adversarial and contrastive examples and learn from both of them. We start with the generation of adversarial and contrastive examples in Section 3.1 and then introduce the learning objectives of CLINE in Section 3.2.

Generation of Examples
We expect that by contrasting sentences with the same and different semantics, our model can become more sensitive to semantic changes. To do so, we adopt the idea of contrastive learning, which aims to learn representations by concentrating positive pairs and pushing negative pairs apart. It is therefore essential to define appropriate positive and negative pairs. In this paper, we regard sentences with the same semantics as positive pairs and sentences with opposite semantics as negative pairs. Some works (Alzantot et al., 2018; Tan et al., 2020) attempt to utilize data augmentation (such as synonym replacement, back translation, etc.) to generate positive instances, but few works pay attention to negative instances, and it is difficult to obtain opposite-semantics instances for textual examples.

Figure 1: An illustration of example generation. For the original sentence "Batman is a fictional super-hero written by", the synonym-substituted version is "Batman is an imaginary super-hero created by" and the antonym-substituted version is "Batman is a real-life super-hero written by"; each sequence is fed through the BERT encoder with a token-level classifier.

Intuitively, when we replace the representative words in a sentence with their antonyms, the semantics of the sentence easily become irrelevant or even opposite to the original sentence. As shown in Figure 1, given the sentence "Batman is a fictional super-hero written by", we can replace "fictional" with its antonym "real-life", obtaining the counterfactual sentence "Batman is a real-life super-hero written by". The latter contradicts the former and forms a negative pair with it.
We generate two sentences from the original input sequence x_ori, each differing from x_ori in only a few words. One of the sentences is semantically close to x_ori (denoted x_syn), while the other is far from or even opposite to x_ori (denoted x_ant). Specifically, we utilize spaCy to conduct segmentation and POS tagging for the original sentences, extracting verbs, nouns, adjectives, and adverbs. x_syn is generated by replacing the extracted words with synonyms, hypernyms, and morphological changes, and x_ant is generated by replacing them with antonyms and random words. For x_syn, about 40% of tokens are replaced; for x_ant, about 20% of tokens are replaced.
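To make the substitution step concrete, here is a minimal Python sketch. The toy lexicon below stands in for the spaCy POS extraction and WordNet lookup described above; its words and mappings are illustrative assumptions, not the actual pipeline.

```python
# Illustrative sketch of x_syn / x_ant generation. The toy LEXICON
# stands in for spaCy POS tagging plus WordNet synonym/antonym lookup
# (an assumption for illustration only).
LEXICON = {
    "fictional": {"syn": "imaginary", "ant": "real-life"},
    "written":   {"syn": "created",   "ant": None},
}

def generate_pair(tokens, lexicon):
    """Return (x_syn, x_ant): copies of the input where representative
    words are swapped for a synonym (positive pair) or an antonym
    (negative pair)."""
    x_syn, x_ant = list(tokens), list(tokens)
    for i, tok in enumerate(tokens):
        entry = lexicon.get(tok)
        if entry is None:
            continue
        if entry["syn"]:
            x_syn[i] = entry["syn"]   # semantics preserved
        if entry["ant"]:
            x_ant[i] = entry["ant"]   # semantics flipped
    return x_syn, x_ant

x_ori = "Batman is a fictional super-hero written by".split()
x_syn, x_ant = generate_pair(x_ori, LEXICON)
```

In the real method the replacement is sampled over the extracted content words at the stated rates (about 40% for x_syn, 20% for x_ant), rather than applied to every lexicon hit as here.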

Training Objectives
CLINE trains a neural text encoder (i.e., a deep Transformer) E_φ, parameterized by φ, that maps a sequence of input tokens x = (x_1, ..., x_T) to a sequence of contextual representations h = (h_1, ..., h_T).

Masked Language Modeling Objective. With random tokens masked by the special symbol [MASK], the input sequence is partially corrupted. Following BERT (Devlin et al., 2019), we adopt the masked language model objective (denoted L_MLM), which reconstructs the sequence by predicting the masked tokens.
Replaced Token Detection Objective. On the basis of x_syn and x_ant, we adopt an additional classifier C over the two generated sequences and detect which tokens have been replaced by conducting two-way classification with a sigmoid output layer:

p(x_t | h_t) = sigmoid(w^T h_t)

The loss, denoted L_RTD, is computed by:

L_RTD = - Σ_t [ δ_t log p(x_t | h_t) + (1 − δ_t) log(1 − p(x_t | h_t)) ]

where δ_t = 1 when the token x_t is corrupted, and δ_t = 0 otherwise.

Contrastive Objective. The intuition of CLINE is to accurately predict whether the semantics have changed when the original sentence is modified. In other words, in feature space, the metric between h_ori and h_syn should be close, and the metric between h_ori and h_ant should be far. Thus, we develop a contrastive objective, where (x_ori, x_syn) is considered a positive pair and (x_ori, x_ant) a negative pair. We use h_c to denote the embedding of the special symbol [CLS]. In the training of CLINE, we follow the setting of RoBERTa (Liu et al., 2019) and omit the next sentence prediction (NSP) objective, since previous works have shown that the NSP objective can hurt performance on downstream tasks (Liu et al., 2019; Joshi et al., 2020). Instead, we adopt the embedding of [CLS] as the sentence representation for the contrastive objective. The metric between sentence representations is calculated as the dot product between [CLS] embeddings:

f(x, x') = h_c^T h_c'

Inspired by InfoNCE, we define an objective L_cts in the contrastive manner:

L_cts = - log [ exp(f(x_ori, x_syn)) / (exp(f(x_ori, x_syn)) + exp(f(x_ori, x_ant))) ]
Note that, different from some contrastive strategies that randomly sample multiple negative examples, we utilize only one x_ant as the negative example for training. This is because the primary goal of our pre-training objectives is to improve robustness under semantically adversarial attack, and we thus focus only on the negative sample (i.e., x_ant) generated for this goal, instead of arbitrarily sampling other sentences from the pre-training corpus as negative samples.
Finally, we have the following training loss:

L = λ_1 L_MLM + λ_2 L_RTD + λ_3 L_cts

where λ_i is the task weighting learned by training.
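The two CLINE-specific losses can be sketched numerically. The following is a hedged toy implementation with numpy: the encoder is replaced by fixed vectors, the RTD classifier is a single weight vector w, and all shapes and values are illustrative assumptions rather than the paper's actual architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rtd_loss(h, w, delta):
    """Replaced token detection: per-token binary cross-entropy.
    h: (T, d) token representations, w: (d,) classifier weights,
    delta: (T,) with delta_t = 1 iff token t was replaced."""
    p = sigmoid(h @ w)
    return -np.mean(delta * np.log(p) + (1 - delta) * np.log(1 - p))

def cts_loss(h_ori, h_syn, h_ant):
    """InfoNCE with a single negative: f(x, x') is the dot product of
    [CLS] embeddings; (ori, syn) is the positive pair and (ori, ant)
    the only negative pair."""
    f_pos, f_neg = h_ori @ h_syn, h_ori @ h_ant
    return -np.log(np.exp(f_pos) / (np.exp(f_pos) + np.exp(f_neg)))

# Toy [CLS] embeddings: syn aligned with ori, ant pointing away,
# so the contrastive loss is already small.
h_ori = np.array([1.0, 0.0])
h_syn = np.array([0.9, 0.1])
h_ant = np.array([-1.0, 0.2])
loss = cts_loss(h_ori, h_syn, h_ant)
```

Minimizing cts_loss pushes f(x_ori, x_syn) up and f(x_ori, x_ant) down, which is exactly the "concentrate positives, disperse negatives" behavior described above.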

Experiments
We conduct extensive experiments and analyses to evaluate the effectiveness of CLINE. In this section, we firstly introduce the implementation (Section 4.1) and the datasets (Section 4.2) we used, then we introduce the experiments on contrastive sets (Section 4.3) and adversarial sets (Section 4.4), respectively. Finally, we conduct the ablation study (Section 4.5) and analysis about sentence representation (Section 4.6).

Implementation
To better acquire the knowledge from an existing pre-trained model, we do not train from scratch but instead continue pre-training from the official RoBERTa-base model. We train for 30K steps with a batch size of 256 sequences of maximum length 512 tokens. We use Adam with a learning rate of 1e-4, β_1 = 0.9, β_2 = 0.999, ε = 1e-8, L2 weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate thereafter. We use a dropout of 0.1 on all layers and in attention. The model is pre-trained on 32 NVIDIA Tesla V100 32GB GPUs, on a combination of the BookCorpus (Zhu et al., 2015) and English Wikipedia datasets, the same data BERT used for pre-training.
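The warmup-then-linear-decay schedule above can be written out directly; a small sketch with the values listed in this section (peak 1e-4, 500 warmup steps, 30K total steps):

```python
def learning_rate(step, peak=1e-4, warmup=500, total=30_000):
    """Linear warmup to `peak` over the first `warmup` steps, then
    linear decay to zero at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)
```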

Datasets
We evaluate our model on six text classification tasks:
• IMDB (Maas et al., 2011) is a sentiment analysis dataset; the task is to predict the sentiment (positive or negative) of a movie review.
• SNLI (Bowman et al., 2015) is a natural language inference dataset to judge the relationship between two sentences: whether the second sentence can be derived from entailment, contradiction, or neutral relationship with the first sentence.
• PERSPECTRUM (Chen et al., 2019) is a natural language inference dataset to predict whether a relevant perspective is for/against the given claim.
• BoolQ (Clark et al., 2019) is a dataset of reading comprehension instances with boolean (yes or no) answers.
• AG (Zhang et al., 2015) is a sentencelevel classification with regard to four news topics: World, Sports, Business, and Science/Technology.
• MR (Pang and Lee, 2005) is a sentence-level sentiment classification on positive and negative movie reviews.

Experiments on Contrastive Sets
We evaluate our model on four contrastive sets: IMDB, PERSPECTRUM, BoolQ, and SNLI, as provided by Contrast Sets (Gardner et al., 2020). We compare our approach with BERT and RoBERTa across the original test set (Ori) and the contrastive test set (Rev). Contrast consistency (Con) is a metric defined by Gardner et al. (2020) that evaluates whether a model's predictions are correct for the same example in both the original test set and the contrastive test set. We fine-tune each model multiple times using different learning rates (1e-5, 2e-5, 3e-5, 4e-5, 5e-5) and select the best result on the contrastive test set. From the results shown in Table 4, we can observe that our model outperforms the baselines. Especially on the contrast consistency metric, our method significantly outperforms the other methods, which means our model is sensitive to small changes of semantics rather than simply capturing the characteristics of the dataset. On the other hand, our model also shows some improvement on the original test set, which means our method can boost the performance of PLMs on common examples.
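For clarity, the contrast consistency metric can be sketched as follows; the prediction lists here are made-up values for illustration, not results from the paper.

```python
def contrast_consistency(orig_correct, rev_correct):
    """Fraction of examples answered correctly on BOTH the original
    test instance and its contrastive counterpart (Gardner et al., 2020)."""
    pairs = list(zip(orig_correct, rev_correct))
    return sum(o and r for o, r in pairs) / len(pairs)

# Hypothetical outcomes over four examples: 3/4 correct on originals,
# 2/4 on contrastive versions, but only 2 examples correct on both.
con = contrast_consistency([True, True, True, False],
                           [True, True, False, False])  # -> 0.5
```

This is why Con is a stricter measure than accuracy on either set alone: a model can score well on Ori and Rev separately while being right on different examples in each.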

Experiments on Adversarial Sets
To evaluate the robustness of the model, we compare our model with BERT and RoBERTa, in both the vanilla and FreeLB versions, across several adversarial test sets. Instead of using an adversarial attacker to attack the model, we use the adversarial examples generated by TextFooler (Jin et al., 2020) as a benchmark to evaluate performance against adversarial examples. TextFooler identifies the important words in the text and then prioritizes replacing them with the most semantically similar and grammatically correct words.
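The word-importance step of a TextFooler-style attack can be illustrated with a toy scorer. The deletion-based importance score below follows the general idea (rank words by the confidence drop they cause when removed); the stand-in classifier and sentence are assumptions for illustration, not TextFooler's actual scoring model.

```python
# Toy sketch of the first step of a TextFooler-style attack: rank words
# by how much the model's confidence in the gold label drops when each
# word is deleted. `score` is a stand-in for a real classifier.
def rank_words_by_importance(tokens, score):
    base = score(tokens)
    drops = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        drops.append((base - score(reduced), tok))
    return sorted(drops, reverse=True)  # biggest confidence drop first

score = lambda toks: 0.9 if "great" in toks else 0.4  # stand-in classifier
ranking = rank_words_by_importance("a great movie".split(), score)
# "great" causes the largest drop, so it would be attacked first
```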
From the experimental results in Table 5, we can observe that our vanilla model achieves higher accuracy on all four benchmark datasets compared to the vanilla BERT and RoBERTa. By constructing semantically similar adversarial examples and using the contrastive training objective, our model can draw together the representations of the original example and the adversarial example, and thereby achieve better robustness. Furthermore, since our method operates in the pre-training stage, it can also be combined with existing adversarial training methods. Compared with the FreeLB versions of BERT and RoBERTa, our model achieves state-of-the-art (SOTA) performance on the adversarial sets. The experimental results on contrastive sets and adversarial sets show that our model is sensitive to semantic changes while remaining robust.

Ablation Study
To further analyze the effectiveness of the different factors of CLINE, we choose PERSPECTRUM and BoolQ (Clark et al., 2019) as benchmark datasets and report an ablation test in terms of: 1) w/o RTD: we remove the replaced token detection objective (L_RTD) from our model to verify whether our model mainly benefits from the contrastive objective; 2) w/o Hard Negative: we replace the constructed negative examples with randomly sampled examples to verify whether the negative examples constructed by unsupervised word substitution are better. We also add 1% and 10% settings, meaning only 1% / 10% of the training set is used, to simulate a low-resource scenario and observe model performance across different datasets and settings. From Table 6, we can observe that: 1) Our CLINE outperforms RoBERTa in all settings, which indicates that our method is universal and robust; especially in the low-resource scenario (1% and 10% supervised training data), our method shows a prominent improvement. 2) Compared to CLINE, w/o RTD shows only a slight performance degradation. This proves that the improvement in performance mainly comes from the contrastive objective, while the replaced token detection objective further makes the model sensitive to changes of words. 3) Compared to CLINE, w/o Hard Negative shows a significant performance degradation in most settings, proving the effectiveness of constructing hard negative instances.

Table 7: The max Hits (%) on all layers of the Transformer-based encoder. We compute cosine similarity between sentence representations using the [CLS] token (CLS) and the mean-pooling of the sentence embedding (MEAN); BS is short for BertScore. CLINE-B is our model trained from the BERT-base model and CLINE-R is our model trained from the RoBERTa-base model.

Sentence Semantic Representation
To evaluate the semantic sensitivity of the models, we generate 9,626 sentence triplets from a sentence-level sentiment analysis dataset, MR (Pang and Lee, 2005). Each triplet contains an original sentence x_ori from MR, a sentence with similar semantics x_syn, and a sentence with opposite semantics x_ant. We generate x_syn / x_ant by replacing a word in x_ori with its synonym/antonym from WordNet (Miller, 1995). We then compute the cosine similarity between sentence pairs using the [CLS] token and the mean-pooling of all tokens. We also use a SOTA algorithm, BertScore, to compute similarity scores of sentence pairs. We count cases in which the model correctly identifies the semantic relationship (e.g., BertScore(x_ori, x_syn) > BertScore(x_ori, x_ant)) as Hits. Higher Hits means the model can better distinguish sentences that express substantially different semantics but have few differing words.
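The Hits criterion reduces to a similarity comparison over triplets, as in the following sketch; the embeddings here are toy vectors standing in for [CLS] or mean-pooled encoder outputs.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hits(triplets):
    """Fraction of (h_ori, h_syn, h_ant) embedding triplets where the
    model rates the synonym variant as more similar to the original
    than the antonym variant: sim(ori, syn) > sim(ori, ant)."""
    correct = sum(cosine(o, s) > cosine(o, a) for o, s, a in triplets)
    return correct / len(triplets)

# Two toy triplets with the expected geometry (syn near ori, ant far):
triplets = [
    (np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([-1.0, 0.0])),
    (np.array([0.0, 1.0]), np.array([0.1, 0.9]), np.array([0.0, -1.0])),
]
rate = hits(triplets)  # 1.0: both triplets are ordered correctly
```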
We show the max Hits over all layers (from 1 to 12) of the Transformer-based encoder in Table 7. We can observe: 1) In the BERT model, using the [CLS] token as the sentence representation achieves worse results than mean-pooling, which matches the conclusion of Sentence-BERT (Reimers and Gurevych, 2019); because RoBERTa omits the NSP objective, its CLS result is not meaningful. 2) BertScore computes semantic similarity better than the other methods, and our method CLINE-B further improves the Hits. 3) By constructing positive and negative examples for contrastive learning in the pre-training stage, our methods CLINE-B and CLINE-R learn better sentence representations and detect small semantic changes. 4) We can observe that RoBERTa has fewer Hits than BERT, and our CLINE-B shows a significant improvement over BERT. We speculate there may be two reasons: first, BERT can better identify sentence-level semantic changes because it has been trained with the next sentence prediction (NSP) objective in the pre-training stage; second, BERT is not trained enough, so it cannot represent sentence semantics well, and our method improves the semantic representation ability of the model.

Pre-trained Language Models
PLMs have proven their advantages in capturing implicit language features. Two main research directions of PLMs are autoregressive (AR) pre-training (such as GPT (Radford et al., 2018)) and denoising autoencoding (DAE) pre-training (such as BERT (Devlin et al., 2019)). AR pre-training aims to predict the next word based on previous tokens but lacks modeling of the bidirectional context. DAE pre-training aims to reconstruct the input sequence using both left and right context. However, previous works mainly focus on token-level pre-training tasks and ignore modeling the global semantics of sentences.

Adversarial Training
To make neural networks more robust to adversarial examples, many defense strategies have been proposed, among which adversarial training is widely considered the most effective. Different from the image domain, it is more challenging to deal with text data due to its discrete nature, which is hard to optimize. Previous works focus on heuristics for creating adversarial examples in the black-box setting. Belinkov and Bisk (2018) manipulate every word in a sentence with synthetic or natural noise in machine translation systems. Iyyer et al. (2018) leverage back-translation to produce paraphrases that have different sentence structures. Recently, Miyato et al. (2017) extended adversarial and virtual adversarial training (Miyato et al., 2019) to text classification tasks by applying perturbations to word embeddings rather than discrete input symbols. Following this, many adversarial training methods in the text domain have been proposed and applied to state-of-the-art PLMs. One line of work introduces token-level perturbations to improve the robustness of PLMs. Zhu et al. (2020) use the gradients obtained in adversarial training to boost the performance of PLMs. Although many studies seem to achieve robust representations, our pilot experiments (Section 2) show that there is still a long way to go.

Contrastive Learning
Contrastive learning is an unsupervised representation learning method that has been widely used in learning graph representations (Velickovic et al., 2019), visual representations (van den Oord et al., 2018; Chen et al., 2020), response representations (Lin et al., 2020b; Su et al., 2020), text representations (Iter et al., 2020; Ding et al., 2021), and structured world models (Kipf et al., 2020). The main idea is to learn a representation by contrasting positive pairs and negative pairs, aiming to concentrate positive samples and push apart negative samples. In natural language processing (NLP), contrastive self-supervised learning has been widely used for learning better sentence representations. Logeswaran and Lee (2018) sample two contiguous sentences as positive pairs and sentences from other documents as negative pairs. Luo et al. (2020) present contrastive pre-training for learning denoised sequence representations in a self-supervised manner. Other works present multiple sentence-level augmentation strategies for contrastive sentence representation learning. The main difference between these works lies in their various definitions of positive examples; recent works pay little attention to the construction of negative examples, relying only on simple random sampling of sentences. In this paper, we propose a negative example construction strategy with opposite semantics to improve sentence representation learning and the robustness of the pre-trained language model.

Conclusion
In this paper, we focus on one specific problem: how to train a pre-trained language model that is robust against adversarial attacks and sensitive to small semantic changes. We propose CLINE, a simple and effective method to tackle this challenge. In the training phase, CLINE automatically generates an adversarial example and a semantic negative example for each original sentence, and the model is then trained with three objectives to make full use of both kinds of examples. Empirical results demonstrate that our method considerably improves the sensitivity of pre-trained language models while also gaining robustness.