Improving Gradient-based Adversarial Training for Text Classification by Contrastive Learning and Auto-Encoder

Recent work has proposed several efficient approaches for generating gradient-based adversarial perturbations on embeddings and has shown that training with these contaminated embeddings improves a model's performance and robustness. However, little attention has been paid to helping the model learn these adversarial samples more efficiently. In this work, we focus on enhancing the model's ability to defend against gradient-based adversarial attacks during training and propose two novel adversarial training approaches: (1) CARL narrows the distance between an original sample and its adversarial sample in the representation space while enlarging their distance from samples with different labels. (2) RAR forces the model to reconstruct the original sample from its adversarial representation. Experiments show that the two proposed approaches outperform strong baselines on various text classification datasets. Further analysis finds that with our approaches, the semantic representation of the input sentence is not significantly affected by adversarial perturbations, and the model's performance drops less under adversarial attack; that is, our approaches effectively improve the robustness of the model. Besides, RAR can also be used to generate text-form adversarial samples.


Introduction
Text classification is a fundamental research topic in natural language processing (Pang et al., 2002; Lai et al., 2015; Neekhara et al., 2019). Neural networks have obtained state-of-the-art performance on many text classification datasets (Kim, 2014; Wang et al., 2018; Devlin et al., 2019). Despite these models' success, recent work has shown that they can be easily fooled by intentionally designed adversarial examples. These adversarial examples, generated by adding small perturbations to original examples, do not affect human judgment but can fail models (Ren et al., 2019a). Adversarial training approaches are proposed to tackle this problem; they aim to enhance the model's generalization and robustness by generating adversarial samples and letting the model learn from them (Ren et al., 2019b). The approaches for generating adversarial samples can be roughly classified into two categories: text-based and gradient-based. The former can be further divided into three levels: character-level, word-level, and sentence-level. Compared to gradient-based approaches, text-based approaches are explainable, but they may suffer from low attack diversity and rely more on human knowledge, which limits the kinds of adversarial patterns. In contrast, during gradient-based adversarial training, small perturbations calculated from the gradient are added to mini-batches of original training samples' embeddings, and the model's parameters are then optimized to correctly classify the original embeddings together with the adversarial embeddings (Miyato et al., 2017). This kind of approach consists of two major steps: constructing the adversarial perturbation and learning the adversarial sample. Recent approaches mainly focus on the first step; for the second step, only the classification loss is used by the model to learn the adversarial samples.
In this work, we investigate gradient-based adversarial training and focus on the second step. To further improve the model's robustness against adversarial perturbations, we propose two approaches for text classification models: CARL (Contrastive Adversarial Representation Learning) and RAR (Reconstruction from Adversarial Representations). We first generate adversarial samples by adding perturbations to the input sentence's word embeddings; then CARL and RAR are used to learn these adversarial samples. CARL leverages the family of contrastive objectives (Gutmann and Hyvärinen, 2010; Hjelm et al., 2019; Tian et al., 2020) and aims to prevent the semantic representation of the input sentence from being affected by adversarial attacks, by narrowing the distance between the adversarial sample and its corresponding original sample in the representation space while pushing them apart from samples that belong to different classes. If the representations of the adversarial sample and the original sample are identical, the model will not be fragile to the adversarial attack. While CARL's goal is to learn a robust sentence-level representation, RAR acts like an auto-encoder and is designed to improve the robustness of the representation of each word by forcing the model to reconstruct the original words from their adversarial embeddings. It will be much easier for the model to understand the adversarial sample when it can recognize every adversarial word embedding correctly. We summarize our contributions in the following:
• We design a contrastive adversarial representation learning approach to learn adversarial examples in the representation space, which can directly improve the encoder's robustness.
• We propose a novel adversarial training task, RAR (Reconstruction from Adversarial Representations), to help the model learn a more robust representation at the word level.
• We conduct experiments on four text classification datasets, and the results show that our proposed approaches outperform strong baselines on accuracy and robustness. We release the source code at a GitHub repo. 1

Related Work
Gradient-based Adversarial Training. Adversarial examples were first explored in the computer vision area and have recently received more attention in natural language processing. Different from the CV domain, in NLP we can improve models' robustness and performance at the same time (Miyato et al., 2017). Miyato et al. (2017) proposed adding perturbations calculated from the gradient to word embeddings to obtain adversarial samples in the embedding space. Madry et al. (2018) proposed the k-PGD method, which calculates adversarial perturbations through multiple forward-backward iterations to avoid the obfuscated gradient problem. It is widely accepted as the most effective approach, but the multiple iterations lead to a high computational cost. To mitigate this cost, Zhang et al. (2019) restricted most perturbation updates to the first layer. Shafahi et al. (2019) designed a "free" algorithm that simultaneously updates both model parameters and adversarial perturbations in a single backward pass. Zhu et al. (2020) proposed FreeLB, which accumulates the "free" parameter gradients in each iteration and updates the model parameters all at once after all iterations.

1 https://github.com/FFYYang/CARL RAR
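The k-PGD idea above, repeated gradient ascent on the perturbation followed by projection back onto the norm ball, can be sketched on a toy differentiable loss. The following is a minimal numpy illustration with an analytic gradient standing in for autodiff; `pgd_perturbation` and the logistic toy loss are our own illustrative names, not from any of the cited papers.

```python
import numpy as np

def pgd_perturbation(x, y, w, eps=0.5, alpha=0.1, steps=5):
    """k-step PGD inside an L2 eps-ball for the toy logistic loss
    L = log(1 + exp(-y * w.(x + delta))); the gradient is analytic."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        margin = y * np.dot(w, x + delta)
        grad = -y * w / (1.0 + np.exp(margin))   # dL/d(delta), ascent direction
        delta = delta + alpha * grad / (np.linalg.norm(grad) + 1e-12)
        norm = np.linalg.norm(delta)
        if norm > eps:                            # project back onto the ball
            delta = delta * eps / norm
    return delta
```

Each iteration requires a forward-backward pass, which is why k-PGD is roughly k times more expensive than standard training; that is the cost the "free" and FreeLB strategies amortize.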
Contrastive Learning. Contrastive learning has recently become a dominant component of self-supervised learning methods in computer vision and natural language processing (NLP). Its goal is to learn a representation that is close in a certain metric space for pairs with the same label, while pushing apart the representations of pairs with different labels (Tian et al., 2020). This method has been successfully used in recent years for representation learning and knowledge distillation. In this work, we apply it to adversarial training by narrowing the distance between the representations of an adversarial sample and its corresponding original sample, while enlarging their distance from samples that belong to different classes.
Auto-Encoder. The auto-encoder (Rumelhart, 1986) consists of two modules: an encoder and a decoder. The encoder maps the input sample x to a feature z in the latent space, i.e. the encoding process. The abstract feature z is then mapped back to the original token space through the decoder to obtain the reconstructed sample x', i.e. the decoding process. The training objective is to optimize both the encoder and the decoder by minimizing the reconstruction error, thereby learning an abstract feature representation z for the input x.
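As a concrete picture of this encode-decode-reconstruct loop, here is a minimal linear auto-encoder trained by gradient descent in numpy; it is a generic sketch of the idea, not a component of our model.

```python
import numpy as np

def train_linear_autoencoder(X, k=2, lr=0.05, epochs=300, seed=0):
    """Encode x -> z = We x, decode z -> x_hat = Wd z, and minimise the
    mean squared reconstruction error ||x_hat - x||^2 by gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    We = rng.normal(scale=0.1, size=(k, d))   # encoder weights
    Wd = rng.normal(scale=0.1, size=(d, k))   # decoder weights
    losses = []
    for _ in range(epochs):
        Z = X @ We.T                  # encoding step
        Xh = Z @ Wd.T                 # decoding step
        err = Xh - X                  # reconstruction residual
        losses.append(float((err ** 2).mean()))
        Wd -= lr * err.T @ Z / n      # gradient step on the decoder
        We -= lr * (err @ Wd).T @ X / n  # gradient step on the encoder
    return We, Wd, losses
```

With k < d the bottleneck z forces the model to keep only the most useful structure of x, which is the same pressure RAR later exploits at the token level.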
In our work, we focus on gradient-based adversarial training for text classification, where the model receives a sentence and outputs a single label. Though some neural networks have achieved promising results, they are vulnerable to simple adversarial perturbations (Huang et al., 2017; Yuan et al., 2019). Gradient-based adversarial training approaches have been proposed to solve this problem (Zhu et al., 2020; Shafahi et al., 2019; Madry et al., 2018; Miyato et al., 2017). Most of them focus on the generation of adversarial examples; we instead focus on how to use these examples to train the model more efficiently by combining the ideas of contrastive learning and the auto-encoder.

Approach
We aim to learn a robust text classification model by helping the model learn adversarial samples more efficiently during training.

Overview
The overview of our approaches is depicted in Figure 1. Given an input training sentence, we first use FreeLB (Zhu et al., 2020) to get its adversarial embeddings E_a, which are likely to fool the current model. In addition to minimizing these adversarial examples' classification errors, we propose two novel approaches to learn from them: 1) CARL (Contrastive Adversarial Representation Learning). Its goal is to narrow the distance between the sentence-level semantic representations of the original sample and its adversarial sample while pushing them away from samples that belong to different classes. We achieve this with the CL (Contrastive Learning) module shown in Figure 1(a). 2) RAR (Reconstruction from Adversarial Representations). It is designed to reconstruct every word in the original input sentence from its adversarial representation with the reconstructor shown in Figure 1(b). In the subsequent sections, we describe how to use CARL and RAR to learn adversarial samples more effectively. In section 3.2, we describe how to use the contrastive learning approach to learn a robust semantic representation for the input sentence. In section 3.3, a reconstruction module is designed to prompt the model to learn more robust lexical knowledge from the input sentence's adversarial embeddings.

Contrastive Adversarial Representation Learning
Intuition. Recent gradient-based adversarial training approaches only use the classification loss to optimize the model on adversarial examples. Although they obtain promising results, the potential value of adversarial examples is not fully exploited. When only the classification loss is used, the model tends to learn a robust classifier, but the robustness of the feature encoder is not greatly improved. After all, the classification loss does not explicitly force the model to learn a representation that is robust to adversarial perturbations. Representation knowledge is highly structured, because its dimensions contain complex interdependencies (Tian et al., 2020). If the model learns the adversarial samples from this perspective, there is a huge learning space. In addition, this view is well suited to adversarial training, for the representation reflects the model's understanding and the knowledge extracted from the input sentence, which should be consistent no matter whether the input is the original sentence or the adversarial sentence. We expect the model to directly learn an encoder that outputs a robust semantic representation for the input sentence; even if the input sentence is contaminated by adversarial perturbations, the representation will not be significantly affected.

Figure 2: The intuition of CARL. The blue and green circles are the adversarial and original representations of the input example; the triangles are representations of examples that belong to different classes. CARL aims to bring the two circles close together and keep the circles away from the triangles.
The intuition of CARL is shown in Figure 2. The big ellipse refers to the representation space, which corresponds to the output of the ALBERT Encoding Block. R_o, shown as the small green circle, is the representation of the original training example. R_a, shown as the small blue circle, is the representation of the adversarial example. R_d, shown as small triangles, is a group of representations of examples whose gold labels differ from the input sentence's. CARL's goal is to bring the two small circles closer and keep the circles far away from the triangles, so as to prevent adversarial attacks from leading the model to misunderstand the input sentence.
Implementation. We are inspired by the contrastive representation distillation approach proposed by Tian et al. (2020), and we adapt it to adversarial training in the text domain. Concretely, we design CARL's objective to maximize a lower bound on the mutual information between the adversarial and original representations of the input sentence.
Specifically, given a dataset $V$ consisting of a collection of samples $\{v_i\}_{i=1}^{N}$, for each sample $v_i$ there are many other samples that share its label; we call these samples positives, and accordingly we call samples whose labels differ from $v_i$'s negatives. In addition, the adversarial sample of $v_i$ is also a positive.
During the model's training process, for each input sample $v_i$ with embedding $E_i$, we sample $K$ negatives $\{v_{i,j}^{n}\}_{j=1}^{K}$ for it. The FreeLB algorithm is first used to obtain a perturbation $\delta$ that approximately maximizes the classification loss inside the $\epsilon$-ball around $E_i$:

$$\delta = \arg\max_{\|\delta\| \le \epsilon} L_C\big(f_\theta(E_i + \delta), y_i\big), \qquad (1)$$

where $y_i$ is the gold label of $v_i$, $L_C$ is the classification loss function, $\theta$ denotes the model's parameters, and $f$ is the model's forward function. Adding $\delta$ to $E_i$ yields the adversarial embedding $E_i^{adv}$. The model's encoding block then maps $E_i$ and $E_i^{adv}$ to the representation space to get $R_i$ and $R_i^{adv}$. Similarly, we can also get the original and adversarial representations of the negatives $\{v_{i,j}^{n}\}_{j=1}^{K}$, which we denote $\{R_{i,j}^{n}\}_{j=1}^{K}$ and $\{R_{i,j}^{n,adv}\}_{j=1}^{K}$. We expect the distance between $R_i$ and $R_i^{adv}$ to be as small as possible, while the representations of the negatives are pushed away from them. To achieve this, we adapt the contrastive objective proposed by Tian et al. (2020) to our optimization problem:

$$L_D^{a} = -\log \frac{h_\theta(R_i^{adv}, R_i)}{h_\theta(R_i^{adv}, R_i) + \sum_{j=1}^{K} h_\theta(R_i^{adv}, R_{i,j}^{n}) + \sum_{j=1}^{K} h_\theta(R_i^{adv}, R_{i,j}^{n,adv})}, \qquad (2)$$

where $L_D^{a}$ is the contrastive loss anchored on the adversarial representation $R_i^{adv}$ of $v_i$; it forces the input sentence's original and adversarial representations close while pushing the adversarial representation of the input sentence apart from its negatives' representations, and it is optimized over the set $\{R_i, R_{i,1}^{n}, \ldots, R_{i,K}^{n}, R_{i,1}^{n,adv}, \ldots, R_{i,K}^{n,adv}\}$. $h_\theta$ is a discriminating function that outputs a large value for positive pairs and a small value for negative pairs; we use the vector dot product as the score and adjust its dynamic range with a temperature hyperparameter $\tau$:

$$h_\theta(R_1, R_2) = \exp\left(\frac{R_1 \cdot R_2}{\tau}\right). \qquad (3)$$

In practice, $K$ can be extremely large. To make the computation of Eq. 2 tractable, we randomly select $m$ ($m < K$) negatives from the dataset. Besides, Noise-Contrastive Estimation (Gutmann and Hyvärinen, 2010; Wu et al., 2018) is used to approximate the softmax distribution and reduce the computational cost.
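The discriminating score and the contrastive loss anchored on the adversarial representation can be sketched in a few lines of numpy. This is an illustrative stand-alone version (`carl_loss` is our name), with negatives passed in as plain arrays rather than retrieved from a memory bank.

```python
import numpy as np

def h(r1, r2, tau=0.07):
    """Discriminating score: exponentiated dot product with temperature tau,
    large for positive pairs and small for negative pairs."""
    return np.exp(np.dot(r1, r2) / tau)

def carl_loss(r_adv, r_orig, negatives, tau=0.07):
    """Contrastive loss anchored on the adversarial representation r_adv:
    pull (r_adv, r_orig) together, push r_adv away from negative reps."""
    pos = h(r_adv, r_orig, tau)
    neg = sum(h(r_adv, n, tau) for n in negatives)
    return -np.log(pos / (pos + neg))
```

With unit-length representations, the loss approaches zero when r_orig is aligned with r_adv and grows as they drift apart or as negatives move closer to the anchor.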
During the model's training process, for every training sample we need $m$ negatives' original and adversarial representations. Since $m$ is usually large in practice, it is infeasible to compute all of these representations during each mini-batch iteration. Following Wu et al. (2018), we maintain two memory banks, $B^{orig}$ and $B^{adv}$, to store the original and adversarial representations of every training sample. Therefore, when we calculate the contrastive loss, we do not have to recompute the negatives' representations; we simply retrieve them from the memory banks. Besides, the memory banks are dynamically updated with newly computed representations at each mini-batch iteration:

$$B^{orig}_i \leftarrow M \cdot B^{orig}_i + (1 - M) \cdot R_i, \qquad B^{adv}_i \leftarrow M \cdot B^{adv}_i + (1 - M) \cdot R_i^{adv}, \qquad (4)$$

where $M$ is a hyperparameter and $i$ is the index of a training sample. Note that CARL cannot be used at the beginning of training, because the model is unstable and both the original and adversarial representations are noisy; optimizing the contrastive loss at this stage can make it difficult for the model to converge. The proper way is to wait until the model is about to be stable, use an entire epoch to forward every training sample through the model to initialize the whole memory bank, and only then use the contrastive loss to optimize the model. In conclusion, we optimize the following problem:

$$\min_\theta \; \mathbb{E}_{(v,y)\sim D}\left[\max_{\|\delta\| \le \epsilon} L_C\big(f_\theta(E + \delta), y\big) + L_D^{a}\right], \qquad (5)$$

where $v$ is a training sample, $y$ is its gold label, $D$ is the data distribution, and $L_C$ is the classification loss.
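The memory bank bookkeeping can be sketched as follows. This is a minimal numpy version; the momentum-style row update follows Wu et al. (2018), and the re-normalisation to unit length is our assumption so that stored vectors stay comparable under dot-product scoring.

```python
import numpy as np

def update_bank(bank, idx, new_repr, M=0.5):
    """Blend the stored representation for sample idx with its newly computed
    one (weight M on the old value), then re-normalise to unit length."""
    mixed = M * bank[idx] + (1.0 - M) * new_repr
    bank[idx] = mixed / (np.linalg.norm(mixed) + 1e-12)
    return bank
```

Two such banks, one for original and one for adversarial representations, hold one row per training sample, so negatives are looked up instead of recomputed at every mini-batch.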

Reconstruction from Adversarial Representations
Intuition. The gradient-based adversarial attack adds perturbations to every word's embedding, so we cannot tell which real word a contaminated embedding corresponds to. If the model cannot recognize a contaminated word embedding, or identifies it as a wrong word, its understanding of the whole sentence's semantics may be wrong, especially when the keyword of the sentence is misunderstood. Such cases occur easily, because we find that the norm of the adversarial perturbation added to the keyword of a sentence is usually larger than that added to other words, which makes the keyword harder to recognize.
To solve this problem, inspired by the Masked Language Model proposed in BERT (Devlin et al., 2019), we design RAR to reconstruct every token from its adversarial representation. To reconstruct tokens correctly, the model should not only learn more robust lexical knowledge for every word but also accurately understand the semantics of the whole sentence.
Implementation. Inspired by the pre-training task used in BERT (Devlin et al., 2019), we map the adversarial representation of each word to a vector whose length is the vocabulary size.
Specifically, the reconstructor receives the input sentence's token-level adversarial representation $R_i^{adv,tok} \in \mathbb{R}^{\text{sequence length} \times \text{hidden size}}$ from the ALBERT Encoding Block as input; $R_i^{adv,tok}$ is then forwarded through a layer normalization, a GeLU activation function, and two feed-forward layers. The first feed-forward layer maps the hidden size to the embedding size, and the second feed-forward layer's parameters are shared with the ALBERT Embedding Layer to project the embedding size to the vocabulary size. We thus obtain the predicted probability distribution over the vocabulary for every token position in the sentence. Finally, we use the cross-entropy function to calculate the reconstruction loss $L_R$.
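A shape-level numpy sketch of this reconstructor follows. The exact composition of the layers inside ALBERT's head may differ; the weight matrices here are illustrative placeholders, with the output projection tied to the embedding matrix as described.

```python
import numpy as np

def layer_norm(x, eps=1e-12):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def reconstructor(R_adv, W1, E_emb):
    """R_adv: [seq_len, hidden]. W1 maps hidden -> embedding size; the second
    projection shares E_emb: [vocab, emb] with the embedding layer, yielding
    logits over the vocabulary at every token position."""
    h = gelu(layer_norm(R_adv) @ W1)   # hidden size -> embedding size
    return h @ E_emb.T                 # embedding size -> vocabulary size
```

Tying the output projection to the embedding matrix adds no new parameters for the vocabulary projection and keeps the reconstruction targets in the same space the attack perturbed.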
In the training process, FreeLB and RAR are combined to optimize the model. After we use FreeLB to get the adversarial representations of every word and of the whole sentence, we feed them to the reconstructor and the classifier respectively. That is, the model is asked not only to predict the correct class of the adversarial sample but also to reconstruct the sample's original words from their adversarial representations. In conclusion, we optimize the following problem:

$$\min_\theta \; \mathbb{E}_{(v,y)\sim D}\left[\max_{\|\delta\| \le \epsilon} L_C\big(f_\theta(E + \delta), y\big) + w_r \cdot L_R\right], \qquad (6)$$

where $w_r = 0.1$ is the weight of the reconstruction loss.
Experiments
We evaluate our approaches on four datasets. We first introduce the datasets, the baselines, and the experiment settings. Then, we present the experimental results and provide further analysis.

Datasets
We use four text classification datasets: SST-2, Yelp-P, AG's News, and Yahoo! Answers.
SST-2. The Stanford Sentiment Treebank (Socher et al., 2013) consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentence-level sentiment (positive/negative) of a given input text.
Yahoo! Answers. This dataset is composed of ten topic categories: Society & Culture, Science & Mathematics, Health, Education & Reference, etc. In this work, we use five categories. For every category, we use 12,000 training samples, 400 validation, and 400 test samples.
Yelp-P. The original Yelp dataset is built using reviews from the website Yelp 2 . Each review has a rating label varying from 1 to 5. We use it as a binary classification task and randomly choose 30,000 training samples, 1,000 validation samples, and 1,000 test samples for every class.
AG's News. This is a dataset of more than one million news articles and they are categorized into four classes: World, Sports, Business, and Sci/Tech. Each class contains 30,000 training samples and 1,900 testing samples. In our work, for each class, we use 15,000 training samples, 500 validation and testing samples.

Baselines
We compare our proposed approach with the following approaches.
ALBERT for Text Classification. For ALBERT, the first token of the sequence is [CLS]; when doing text classification, ALBERT takes the final hidden state $h$ of the [CLS] token as the representation of the whole sentence. The classifier consists of a feed-forward layer and a softmax function:

$$p(c \mid h) = \mathrm{softmax}(W h),$$

where $W$ is a learnable parameter matrix and $c$ is the class. ALBERT is fine-tuned with all parameters, together with $W$, jointly by maximizing the log-probability of the gold label.
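The classification head is just a linear projection of the [CLS] state followed by a softmax; a minimal numpy sketch:

```python
import numpy as np

def classify(h, W):
    """p(c | h) = softmax(W h): a feed-forward layer over the [CLS]
    representation h followed by a softmax over the classes."""
    logits = W @ h
    logits = logits - logits.max()   # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

The max-shift leaves the softmax output unchanged while preventing overflow for large logits.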
FreeLB. FreeLB, proposed by Zhu et al. (2020), adds adversarial perturbations to the ALBERT embedding layer's output and minimizes the resulting adversarial loss around the input samples. It leverages the "free" training strategy (Shafahi et al., 2019) to improve the efficiency of adversarial training, which makes it possible to apply PGD-based adversarial training (Madry et al., 2018) to large-scale pre-trained language models. In this work, we apply FreeLB to the ALBERT model.
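FreeLB's trick, reusing the backward pass of each ascent step to also accumulate parameter gradients and then updating the model once, can be sketched on a toy logistic loss. This numpy version with analytic gradients illustrates the schedule only; it is not the paper's implementation, and `freelb_step` is our illustrative name.

```python
import numpy as np

def freelb_step(w, x, y, eps=0.5, adv_lr=0.1, lr=0.05, K=3):
    """One FreeLB update for the toy loss L = log(1 + exp(-y * w.(x+delta))).
    Each of the K iterations ascends on delta AND accumulates dL/dw "for free";
    the parameters are updated once with the averaged gradient at the end."""
    delta = np.zeros_like(x)
    grad_w_acc = np.zeros_like(w)
    for _ in range(K):
        z = x + delta
        s = 1.0 / (1.0 + np.exp(y * np.dot(w, z)))   # sigma(-y * w.z)
        grad_w_acc += -y * s * z                      # dL/dw at current delta
        g_delta = -y * s * w                          # dL/ddelta
        delta += adv_lr * g_delta / (np.linalg.norm(g_delta) + 1e-12)
        norm = np.linalg.norm(delta)
        if norm > eps:
            delta *= eps / norm                       # stay in the eps-ball
    return w - lr * grad_w_acc / K
```

In effect the model is trained on the average loss over the K points the ascent trajectory visits, at roughly the cost of a single k-PGD run.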

Experiment settings
We implement our two approaches on albert-base-v2 (from huggingface's PyTorch implementation 3); the parameters of the ALBERT Embedding Block and ALBERT Encoding Block are loaded from the pretrained model, and we run experiments in the fine-tuning stage. We use the Adam optimizer with a learning rate of 1e-5 and a batch size of 16 for AG's News and 32 for the other three datasets. Since FreeLB's hyperparameters depend heavily on the characteristics of the dataset, we apply a hyperparameter search to every dataset; the search results are shown in Table 1. These hyperparameters stay unchanged in CARL and RAR. We train our models on two Tesla P40s. CARL and RAR are both implemented on top of FreeLB. In RAR, L_R is used to update the model's parameters from the beginning of training, while in CARL, L_D is used only after the model is about to be stable (specific settings can be found in Table 3). Besides, m is set to 20,000 for Yelp-P and 16,000 for the other three datasets. τ and M are set to 0.07 and 0.5, respectively.

Table 3: Steps after which L_D starts to be used in CARL; before that, only L_C is used to optimize the model's parameters.

          SST-2   Yahoo!   Yelp-P   AG's News
Steps     6315    7200     5625     7750

For SST-2, we use the development set for evaluation. To make the results reliable, we run each experiment three times with the same hyperparameters but different random seeds and report the average scores. For the other three datasets, we use a development set to choose the best training checkpoint and evaluate it on the test set.

Results and Discussion
The results of the proposed approaches and baselines are shown in Table 2. FreeLB, CARL, and RAR let adversarial samples participate in the model's training process, so it is not surprising that all of them perform better than ALBERT. These improvements can be mainly attributed to the effect of data augmentation.
The experimental results also show that the performance of CARL and RAR on the four datasets is higher than that of FreeLB. These results demonstrate that the approaches we propose to defend against gradient-based adversarial attacks during training are effective and apply well to various text classification datasets. We conjecture that this is because the contrastive objective encourages the model to discover, from the adversarial and original representations, the true underlying knowledge that determines the classification label. This underlying knowledge is robust against adversarial perturbations added to the original sample and is not changed by rephrasing the sentence. When the model can learn this knowledge, its generalization and robustness improve. When comparing CARL and RAR, CARL performs better in most cases. This is because CARL's training objective is to narrow the distance between the adversarial sample and the original sample in the representation space, and the model's classifier is also based on the representation of the sentence, so CARL's objective makes a more straightforward contribution to the classification task than RAR's.

Analysis
The difference between adversarial and original samples' representations. Table 4 compares the Euclidean distance and cosine similarity between adversarial and original samples' sentence-level representations for the four approaches. We use the AG's News test set for this experiment. With the models trained by the four approaches above, for every sample v_i we first calculate its original representation R_i and obtain its adversarial representation R_i^adv using the k-PGD approach with the same hyperparameter settings, then measure their distance by cosine similarity and Euclidean distance. We also compare results with different maximum perturbation norms α in k-PGD. The final result is the average over all samples. The results show that FreeLB, CARL, and RAR perform much better than ALBERT on both cosine similarity and Euclidean distance, indicating that the robustness of the model in the representation space can be effectively improved by optimizing the classification error of adversarial samples. In addition, among FreeLB, CARL, and RAR, RAR performs best, followed by CARL. This shows that our approaches are effective at further improving the robustness of the model's representation space, and that RAR is more effective. The reason RAR is better than CARL here may be that RAR's objective is more difficult: RAR's optimization objective is at the token level, while CARL's is at the sentence level, so RAR encourages the model to learn additional lexical knowledge that also benefits the semantic representation of the whole sentence.
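The distance measurement itself is straightforward; a numpy sketch of the per-sample cosine similarity and Euclidean distance, averaged over a batch of representation rows (the function name is ours):

```python
import numpy as np

def representation_shift(R_orig, R_adv):
    """Mean cosine similarity and mean Euclidean distance between matching
    rows of the original and adversarial representation matrices."""
    dists = np.linalg.norm(R_orig - R_adv, axis=1)
    cos = np.sum(R_orig * R_adv, axis=1) / (
        np.linalg.norm(R_orig, axis=1) * np.linalg.norm(R_adv, axis=1) + 1e-12)
    return float(cos.mean()), float(dists.mean())
```

A robust encoder should keep the mean cosine similarity near 1 and the mean Euclidean distance near 0 as the attack's perturbation budget grows.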
The robustness of performance. We use the k-PGD method to attack models trained on AG's News by the four approaches. The experimental results show that the performance of FreeLB, CARL, and RAR is significantly better than that of ALBERT, because they allow adversarial samples to participate in the model's training process. Among FreeLB, RAR, and CARL, CARL is the best, followed by RAR. The reason can be explained from the perspective of multi-task learning: if we regard CARL and RAR as two multi-task learning frameworks, then compared to the reconstruction task used in RAR, the contrastive learning task used in CARL is clearly more similar to the classification task, because both objectives operate on sentence-level representations. In addition, RAR performs better on representation robustness while CARL performs better on performance robustness. This indicates that although narrowing the representation distance between original and adversarial samples can improve the model's performance and robustness, it is not the case that the shorter the distance, the more robust the performance.

Table 6: Reconstructed adversarial samples. The first line of each pair is the original sentence and the second line is the reconstructed sentence; N and P refer to the negative and positive labels the model predicted. The model can correctly classify the original sentences, but not the reconstructed ones.

Outer-space buffs might love this film, but others will find its pleasures intermittent.   N
Outer-space buffs would love this film, but others will find its pleasures occasional.     P
The film will play equally well on both the standard and giant screens.                    P
The film would play more well on all the standard and giant screens.
Reconstructed adversarial samples. We forward SST-2's dev set through the model trained with RAR and use the k-PGD method to attack it. We then take the output logits of the RAR module to obtain the reconstructed sentences. We find that we can obtain text-form adversarial samples in this way: the semantics of these reconstructed samples are almost identical to those of the original samples, but they can successfully fool the model trained by ALBERT. Table 6 shows some examples of the reconstructed sentences, which can be used as text-form adversarial samples and can further be used as augmented data.

Conclusion
In this work, we propose two gradient-based adversarial training approaches, CARL and RAR, to improve the performance and robustness of text classification models. The key idea of CARL is to narrow the distance between the original sample and its adversarial sample in the representation space, while RAR forces the model to reconstruct the original tokens from their adversarial representations. Experiments demonstrate that our approaches outperform the baselines; the sentence representation and the model's performance are more robust, which proves the effectiveness of the proposed approaches. Besides, RAR can be used to generate text-form adversarial examples.

A Appendices
We provide some details of experiment settings.

A.1 Additional Experimental Details
There is no significant difference in the training time between our proposed two approaches. For SST-2 and AG's News, it takes about two hours to train the model. For Yelp-P and Yahoo, it takes about ten hours.
The number of parameters in each model is shown in Table 7. The number of parameters for ALBERT, FreeLB, and CARL is the same, while RAR has more parameters because there is an additional reconstructor module.

A.2 Hyperparameter Search Details
Because the hyperparameters of FreeLB differ greatly across datasets, we search for the best hyperparameter configuration for each dataset. We first set the search bounds of each hyperparameter as shown in Table 8; the searched hyperparameters are the step size α, the maximum perturbation norm (if it is set to zero, the perturbation's norm is not limited), the number of iteration steps n, and the magnitude of the initial random perturbation γ. We then combine grid search and manual tuning: grid search is first used at a relatively large granularity, and manual tuning is then used at a small granularity. The criterion for the hyperparameter search is accuracy on the validation set. The search results are also used in CARL and RAR.

A.3 Datasets Details
The statistics of the four datasets are shown in Table 9. Except for SST-2, we only use a portion of the data, randomly selected from the original datasets, because of limited computing resources. Since our goal is not to reach the state of the art but to obtain relative improvements in performance and robustness compared to FreeLB, dropping some training data does not affect it. The data pre-processing follows huggingface's implementation 4. In addition, we randomly sample m negatives for each training example in CARL.