Weight Perturbation as Defense against Adversarial Word Substitutions



Introduction
Deep neural networks (DNNs) have achieved impressive results in a wide range of domains, but they have been found to be vulnerable to adversarial examples maliciously crafted by adding a small perturbation to original examples (Szegedy et al., 2014). Many studies have demonstrated the vulnerability of DNNs on various natural language processing (NLP) tasks, including machine translation (Zhao et al., 2018; Cheng et al., 2020), dialogue systems (Cheng et al., 2019) and text classification (Liang et al., 2018; Zhao et al., 2018; Gao et al., 2018; Ren et al., 2019; Jin et al., 2020). These methods attack an NLP model by replacing, scrambling, or erasing characters or words under certain semantic and syntactic constraints.
The existence and pervasiveness of textual adversarial examples have raised serious concerns, especially when NLP models are deployed to security-sensitive applications. Many methods have been proposed to defend neural NLP models against adversarial attacks, including adversarial data augmentation (Zheng et al., 2020; Si et al., 2021), adversarial training (Madry et al., 2018; Zhu et al., 2020) and certified defense (Jia et al., 2019; Huang et al., 2019; Ye et al., 2020). Most of them improve adversarial robustness by applying perturbations to the input data and making the models robust to these perturbations in the input space. For example, one of the most effective methods is adversarial training, which introduces a min-max optimization into the training process by adding (usually gradient-guided) perturbations to the input embeddings (Miyato et al., 2017; Sato et al., 2018; Zhu et al., 2020). By augmenting the original training data with these perturbed examples, the models become robust to such perturbations. However, it is infeasible to enumerate and explore all possible inputs that adversaries might feed to the models. In this study, we explore the feasibility of enhancing the robustness of neural NLP models by performing weight perturbations in the parameter space. Weight perturbation helps find a solution in the parameter space (i.e., the weights) that minimizes the adversarial loss better than other feasible solutions.
Adversarial weight perturbation (Wu et al., 2020; Foret et al., 2021) has been investigated in the image domain, but our preliminary experiments show that their implementations of weight perturbation cannot be trivially applied to NLP models: due to the discrete nature of texts, they yield inferior robustness and require a long training time. We found that weight perturbation works better for NLP models when it is combined with perturbation in the input feature space. Based on this finding, we propose a mixed adversarial training method with accumulated weight perturbation, named MAWP. The mixed adversarial training is designed to boost the model's robustness by combining weight perturbation with traditional adversarial training (i.e., perturbation in the input embedding space, as in FreeLB (Zhu et al., 2020)). In this way, the resulting models benefit more from the weight perturbation because they are exposed to the input perturbation during training. The accumulated weight perturbation is introduced mainly to accelerate training while further improving robustness. It takes the smoothed form of a weighted sum of the gradients computed in previously performed weight perturbations, which carries global gradient information and gives a clear signal about the direction in which the parameters should move aggressively when successive gradients point in a similar direction. Through extensive experiments, we demonstrate that our method can substantially boost the robustness of NLP models while suffering little or no performance drop on clean data across three different datasets.

Textual Adversarial Defense
The goal of adversarial defenses is to learn a model capable of achieving high accuracy on both clean and adversarial examples. Recently, many defense methods have been developed to defend against textual adversarial attacks, which can roughly be divided into two categories: empirical (Miyato et al., 2017; Sato et al., 2018; Zhou et al., 2021; Dong et al., 2021) and certified (Jia et al., 2019; Huang et al., 2019; Ye et al., 2020) methods.
Adversarial data augmentation is one of the most effective empirical defenses for NLP models (Ren et al., 2019; Jin et al., 2020; Li et al., 2020). At training time, these methods replace a word with one of its synonyms to create adversarial examples, and the models are trained on the dataset augmented with these examples, making them robust to such perturbations. Zhou et al. (2021) and Dong et al. (2021) relax a set of discrete points (a word and its synonyms) to the convex hull spanned by the word embeddings of all these points, and use this convex hull to capture word substitutions. Adversarial training (Miyato et al., 2017; Zhu et al., 2020) is another highly successful empirical defense, which adds norm-bounded adversarial perturbations to word embeddings and minimizes the resulting adversarial loss.
The downside of empirical methods is that failure to discover an adversarial example does not mean that a more sophisticated attack could not find one. To address this problem, certified defenses (Jia et al., 2019; Huang et al., 2019; Ye et al., 2020) have been introduced to guarantee robustness against certain specific types of attacks. However, existing certified defense methods make the unrealistic assumption that the defender has access to the synonyms used by the adversary. They can be easily broken by more sophisticated attacks that use larger synonym sets (Jin et al., 2020) or generate synonyms dynamically with BERT (Li et al., 2020).
Most existing defense methods improve robustness by adapting the models to a training set augmented with adversarial examples, which are crafted by adding adversarial perturbations to discrete tokens or distributed embeddings. However, it is infeasible to enumerate all possible inputs that adversaries might feed to the models. In contrast, we have full control over the values of the model's parameters. Therefore, we propose to improve the robustness of neural NLP models by performing weight perturbations in the parameter space rather than in the input space.

Weight Perturbation
Weight perturbation has been explored to improve the generalization of models in the image domain. Graves (2011) first investigated applying perturbations to the weights of neural networks and introduced a stochastic variational method to improve generalization. Following this direction, Foret et al. (2021) proposed an optimization method, named Sharpness-Aware Minimization (SAM), to seek parameter values that yield a uniformly low loss in their neighborhood.
Recently, researchers in the computer vision community have also tried to improve model robustness via weight perturbation. He et al. (2019) presented a Parametric Noise Injection method that intentionally injects trainable noise into the activations and weights of neural networks. Wu et al. (2020) showed how a model's performance correlates with the direction and scale of weight perturbation by investigating the weight loss landscapes of multiple adversarial training techniques. They also proposed an Adversarial Weight Perturbation (AWP) method, which can be incorporated into existing adversarial training methods to narrow the robustness gap between training and test sets.
However, existing weight perturbation methods cannot be trivially applied to NLP models due to the discrete nature of texts. To address the problems we encountered when implementing weight perturbation for NLP models, we propose a mixed adversarial training method that further improves the adversarial robustness of neural NLP models and introduce an accumulated weight perturbation that speeds up the training process. This study is among the first to investigate how to apply adversarial weight perturbations in the text domain.

Preliminary
In the following, we first introduce traditional adversarial training, which performs perturbation in the input feature space, and then give a brief review of adversarial weight perturbation. Before diving into the details, we set up some notation. A training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ with $n$ instances consists of pairs of the feature vector representation $x \in \mathbb{R}^d$ of an input text and its corresponding label $y \in \{1, \dots, C\}$, where $d$ is the size of the feature vectors and $C$ is the number of classes. Given a neural text classifier with a set of trainable weights $w$ and a loss function $\mathcal{L}(w, x, y)$, regular training aims to find the values of the weights $w$ that minimize the empirical risk $\mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathcal{L}(w, x, y)]$. In adversarial training, we denote the adversarial perturbation to the input feature vectors $x$ by $\delta$, the weight perturbation to the model's weights $w$ by $\epsilon_w$, and the ascent steps by $k \in \{1, \dots, K\}$.

Traditional Adversarial Training
Traditional adversarial training can be formulated as a min-max optimization problem (Madry et al., 2018) as follows:
$$\min_{w}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\|\delta\|_F \le \epsilon} \mathcal{L}(w, x + \delta, y)\right],$$
where $\delta$ is constrained in a Frobenius-norm ball with radius $\epsilon$. As pointed out by Zhu et al. (2020), the outer minimization can be achieved by the Stochastic Gradient Descent (SGD) method, and the inner maximization can be accomplished by a Projected Gradient Descent (PGD)-based attack algorithm. Specifically, PGD-based algorithms take the following step (with step size $\alpha$) at the $k$-th iteration under the Frobenius-norm constraint:
$$\delta_{k+1} = \Pi_{\|\delta\|_F \le \epsilon}\left(\delta_k + \alpha\, \frac{g(\delta_k)}{\|g(\delta_k)\|_F}\right),$$
where $g(\delta_k) = \nabla_{\delta_k} \mathcal{L}(w, x + \delta_k, y)$ denotes the gradient of the loss with respect to $\delta_k$, and $\Pi_{\|\delta\|_F \le \epsilon}$ projects the input perturbation $\delta$ back into the Frobenius-norm ball with radius $\epsilon$.
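As a concrete illustration, the following is a minimal PyTorch sketch of this PGD-based inner maximization, assuming a classifier that accepts precomputed embeddings through an `inputs_embeds` argument (as Huggingface BERT models do) and returns logits; the function and hyperparameter names are ours, not from the paper's code.

```python
import torch

def pgd_input_perturbation(model, embeds, labels, loss_fn,
                           eps=1.0, alpha=0.1, steps=3):
    """PGD-style inner maximization over the input embeddings.

    Assumes `model(inputs_embeds=...)` returns logits; all names and
    default values here are illustrative.
    """
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(inputs_embeds=embeds + delta), labels)
        grad, = torch.autograd.grad(loss, delta)
        # Ascent step along the normalized gradient (Frobenius norm).
        delta = delta + alpha * grad / (grad.norm() + 1e-12)
        # Project back into the norm ball of radius eps.
        norm = delta.norm()
        if norm > eps:
            delta = delta * (eps / norm)
        delta = delta.detach().requires_grad_(True)
    return delta.detach()
```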

Adversarial Weight Perturbation
We here give a brief introduction to adversarial weight perturbation (Wu et al., 2020; Foret et al., 2021) in the image domain. Given an adversarial example $x'$ of a clean example $x$, adversarial weight perturbation seeks the values of the parameters $w$ that have the lowest training loss within the surrounding neighborhood, which can be formulated as follows:
$$\min_{w}\ \max_{\|\epsilon_w\|_2 \le \rho} \mathcal{L}(w + \epsilon_w, x', y),$$
where $\rho$ is the radius of the weight perturbation under the $\ell_2$-norm. Since it is hard to directly maximize over $\epsilon_w$, the optimal value can be approximated via the first-order Taylor expansion of $\mathcal{L}(w + \epsilon_w, x', y)$ as follows:
$$\mathcal{L}(w + \epsilon_w, x', y) \approx \mathcal{L}(w, x', y) + \epsilon_w^{\top} \nabla_w \mathcal{L}(w, x', y).$$
With this approximation, the value of $\epsilon_w$ can be estimated as $\rho\, \nabla_w \mathcal{L}(w, x', y) / \|\nabla_w \mathcal{L}(w, x', y)\|_2$. Specifically, Wu et al. (2020) perform the adversarial weight perturbation at the layer level, scaling the perturbation of each layer by the norm of that layer's weights. Once the values of $\epsilon_w$ are obtained, they update $w$ based on the gradient $\nabla_w \mathcal{L}(w, x', y)|_{w+\epsilon_w}$ so that the model generalizes well on the adversarial sample $x'$.
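The PyTorch sketch below illustrates this perturb-and-restore procedure under our reading of AWP; the layer-wise scaling by each parameter's own norm follows Wu et al. (2020), but the helper names and control flow are illustrative, not the authors' code.

```python
import torch

@torch.no_grad()
def awp_perturb(model, rho=1e-3):
    """Apply a first-order adversarial weight perturbation.

    Assumes gradients w.r.t. the adversarial loss have already been
    populated by loss.backward(). Each parameter is moved along its
    normalized gradient, scaled by its own norm (layer-level variant).
    """
    perturbations = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        eps = rho * p.grad / (p.grad.norm() + 1e-12) * p.norm()
        p.add_(eps)                 # move to the worst case w + eps_w
        perturbations[name] = eps   # remember so we can restore later
    return perturbations

@torch.no_grad()
def awp_restore(model, perturbations):
    """Undo the perturbation after taking the outer gradient at w + eps_w."""
    for name, p in model.named_parameters():
        if name in perturbations:
            p.sub_(perturbations[name])
```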

Method
In the following, we first introduce our accumulated weight perturbation method, which is designed to accelerate the training process when fine-tuning a pre-trained language model. We then discuss how to combine it with adversarial training to further improve the robustness of NLP models.

Accumulated Weight Perturbation
Adversarial weight perturbation works by searching for the worst case within the neighborhood of the current weights with respect to the training loss, and finding a better solution in that neighborhood by minimizing the adversarial loss. Note that the gradient calculated to find the worst case during the weight perturbation can be reused to optimize the current weights, which are updated in the direction opposite to the gradient already computed. For example, starting from a point in weight space, we can calculate its gradient with respect to the loss function to locate the worst case around this point and perform the weight perturbation; the normalized version of this same gradient can also be used to update the model's parameters by gradient descent. By reusing the gradient calculated at each step of weight perturbation, we can improve both generalization and robustness at a lower computational cost.
We also found experimentally that the robustness of models can be further improved by introducing a global term that takes the form of the accumulated gradients obtained at previous perturbation steps.The accumulated gradient carries the global information and gives a clear signal in which direction the parameters should move to aggressively if the gradients obtained at different steps point in a similar direction.
At each step $k$, we perform the weight perturbation $\epsilon_{w_k}$ that finds the worst case in the neighborhood of the weights $w_k$. Meanwhile, an accumulated weight perturbation is also calculated, which takes the smoothed form of a weighted sum of the gradients computed in previously performed weight perturbations. Specifically, for each minibatch $\mathcal{B}$, the value of the summed perturbation $\epsilon_s$ is obtained by adding up all the per-step perturbations as $\epsilon_s = \sum_{k=0}^{K-1} \epsilon_{w_k}$. Then, we calculate the accumulated weight perturbation $\epsilon_g$ as follows:
$$\epsilon_g \leftarrow \Pi_{\|\epsilon_g\|_2 \le \eta_2}\left(\epsilon_g + \frac{\epsilon_s}{\|\epsilon_s\|_2 + \epsilon_0}\right),$$
where $\eta_2$ is the radius of the accumulated weight perturbation, $\epsilon_0$ is a constant introduced for numerical stability, and $\Pi_{\|\epsilon_g\|_2 \le \eta_2}$ is the projection function.
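A minimal sketch of this update for a single parameter tensor is given below, assuming the per-step perturbations have been collected during the ascent steps; the exact combination rule is our reconstruction from the description above, not the authors' released code.

```python
import torch

@torch.no_grad()
def update_accumulated_perturbation(eps_g, step_perturbations,
                                    eta2=2e-5, eps0=1e-12):
    """Update the accumulated perturbation eps_g for one parameter tensor.

    `step_perturbations` holds the K per-step perturbations eps_{w_k}
    of the current minibatch. We add their normalized sum eps_s, then
    project eps_g onto the l2 ball of radius eta2.
    """
    eps_s = torch.stack(step_perturbations).sum(dim=0)
    eps_g = eps_g + eps_s / (eps_s.norm() + eps0)
    norm = eps_g.norm()
    if norm > eta2:
        eps_g = eps_g * (eta2 / norm)
    return eps_g
```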

Mixed Adversarial Training Method
We here describe our adversarial weight perturbation and how to combine it with FreeLB (Free Large-Batch) (Zhu et al., 2020), a popular adversarial training algorithm. FreeLB adds adversarial perturbations to word embeddings and minimizes the resulting adversarial loss inside different regions around input samples. It adds norm-bounded adversarial perturbations to the embeddings of input sentences using a gradient-based method, effectively enlarging the batch size with diversified adversarial samples under such norm constraints. Through this mixed adversarial training, NLP models benefit more from the adversarial weight perturbation because the models are exposed to the input perturbation during training. Specifically, at the $k$-th ascent step, we calculate the weight perturbation $\epsilon_{w_k}$ based on the input perturbation $\delta_k$ and the weights $w_k$ as follows:
$$\epsilon_{w_k} = \eta\, \frac{\nabla_{w_k} \mathcal{L}(w_k, x + \delta_k, y)}{\|\nabla_{w_k} \mathcal{L}(w_k, x + \delta_k, y)\|_2}.$$
The input perturbation $\delta_0$ is initialized from a uniform distribution, $\delta_0 \sim \frac{1}{\sqrt{H}} U(-\sigma, \sigma)$. After calculating the weight perturbation $\epsilon_{w_k}$, the perturbed weights $w_{k+1}$ are obtained as $w_k + \epsilon_{w_k}$. Following the gradient accumulation in FreeLB, we compute the gradient $g_{k+1}$ from the perturbed weights $w_{k+1}$, the perturbed inputs $x + \delta_k$, and the gradient $g_k$ accumulated up to the previous step:
$$g_{k+1} = g_k + \frac{1}{K}\, \nabla_{w_{k+1}} \mathcal{L}(w_{k+1}, x + \delta_k, y),$$
where $g_0$ is initialized to $0$. When we calculate the accumulated gradient $g_{k+1}$, the input perturbation comes for free, at no additional cost. Therefore, we can calculate the input perturbation $\delta_{k+1}$ based on the perturbed weights $w_{k+1}$ and the previous input perturbation $\delta_k$ as follows:
$$\delta_{k+1} = \Pi_{\|\delta\|_F \le \epsilon}\left(\delta_k + \alpha\, \frac{\nabla_{\delta_k} \mathcal{L}(w_{k+1}, x + \delta_k, y)}{\|\nabla_{\delta_k} \mathcal{L}(w_{k+1}, x + \delta_k, y)\|_F}\right).$$
We list the proposed MAWP in Algorithm 1. After the weights $w_0$ and the input perturbation $\delta_0$ are initialized, we calculate the weight perturbations $\epsilon_{w_k}$ and add them to the model's weights $w_k$ at each ascent step. We then calculate the gradient $g_k$ (Line 14) and the next input perturbation $\delta_{k+1}$ (Line 16). Finally, the accumulated weight perturbation $\epsilon_g$ is calculated to update the model's weights with the adversarial loss.
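To make the interplay of these updates concrete, here is a simplified PyTorch sketch of one MAWP minibatch step. It reuses a single backward pass per ascent step (in the spirit of Section 4.1) instead of recomputing gradients at the perturbed weights, and it omits the accumulated-perturbation update covered above; all names and the control flow are illustrative, not the authors' code.

```python
import torch

def mawp_step(model, embeds, labels, loss_fn, optimizer,
              K=10, alpha=0.1, input_eps=1.0, eta=1e-5, sigma=1e-2):
    """One MAWP minibatch step under our reading of the update rules."""
    params = [p for p in model.parameters() if p.requires_grad]
    grad_accum = [torch.zeros_like(p) for p in params]
    # delta_0 drawn from a small uniform distribution, as in FreeLB.
    delta = torch.zeros_like(embeds).uniform_(-sigma, sigma).requires_grad_(True)
    eps_steps = [[] for _ in params]
    for k in range(K):
        loss = loss_fn(model(inputs_embeds=embeds + delta), labels)
        grads = torch.autograd.grad(loss, [delta] + params)
        g_delta, g_params = grads[0], grads[1:]
        with torch.no_grad():
            for p, g, acc, steps in zip(params, g_params, grad_accum, eps_steps):
                acc.add_(g / K)                      # g_{k+1} = g_k + grad / K
                eps = eta * g / (g.norm() + 1e-12)   # eps_{w_k}
                p.add_(eps)                          # w_{k+1} = w_k + eps_{w_k}
                steps.append(eps)
            # delta_{k+1}: normalized ascent on the input perturbation.
            delta.add_(alpha * g_delta / (g_delta.norm() + 1e-12))
            if delta.norm() > input_eps:
                delta.mul_(input_eps / delta.norm())
    optimizer.zero_grad()
    for p, acc in zip(params, grad_accum):
        p.grad = acc
    optimizer.step()   # descend on the accumulated adversarial gradient
    return eps_steps   # per-parameter eps_{w_k} lists, for building eps_g
```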

Experiments
We conducted three sets of experiments. The first evaluates MAWP in both clean accuracy and adversarial robustness on several datasets under three representative attack algorithms, compared against seven baseline methods. The second investigates the impact of the number of ascent steps and training epochs on performance. The third aims to better understand adversarial training through visualizations.
Three widely-used text classification datasets were used for evaluation: Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), AG-News corpus (AGNEWS) (Zhang et al., 2015), and Internet Movie Database (IMDB) (Maas et al., 2011). SST-2 contains about 67,000 sentences labeled with binary sentiment; IMDB consists of about 50,000 movie reviews for positive and negative sentiment classification; and AGNEWS is a four-category text classification dataset containing about 30,000 news articles. For fair comparison, we used a BERT-based model (Devlin et al., 2019) as the base model for all the defense methods.
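For illustration, the three corpora can be loaded with the Huggingface `datasets` library roughly as follows; the dataset IDs are those hosted on the Hub, and the split layouts may differ from the paper's exact setup.

```python
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")   # binary sentiment, ~67k sentences
agnews = load_dataset("ag_news")      # 4-way news topic classification
imdb = load_dataset("imdb")           # binary movie-review sentiment
```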

Implementation Details
We implemented MAWP based on Huggingface Transformers. We chose PGD (Madry et al., 2018), FreeLB++ (Li et al., 2021), TA-VAT (Li and Qiu, 2020), and InfoBERT (Wang et al., 2020) as baselines; they are widely-used defense methods against textual adversarial attacks or have been proposed recently. We also developed two strong baselines. One is the combination of AWP (Wu et al., 2020) (a weight perturbation method proposed for computer vision tasks) with PGD, denoted as PGD-AWP. The other is the combination of AWP with FreeLB, denoted as FreeLB-AWP. These two baselines were also designed for the ablation study and used to compare against our mixed adversarial training method. For fair comparison, PGD-AWP and FreeLB-AWP were implemented as described in Algorithm 1, except that AWP is used instead of the proposed weight perturbation.
The size of the weight perturbation $\eta$ was set to $1 \times 10^{-5}$ for all the training methods compared, and the size of the accumulated perturbation $\eta_2$ was set to $2 \times 10^{-5}$. All experimental results were obtained over three runs with different random initializations.

Attack Algorithms and Evaluation Metrics
Three representative attack algorithms were used to evaluate the adversarial robustness of models: TextFooler (Jin et al., 2020), BERT-Attack (Li et al., 2020), and TextBugger (Li et al., 2019), as reimplemented by the TextAttack toolkit (Morris et al., 2020). TextFooler and BERT-Attack adversarially perturb text inputs through synonym-based substitutions, whereas TextBugger performs adversarial perturbations at both the character and word level. TextFooler generates synonyms using the 50 nearest neighbors in the word embedding space, while BERT-Attack uses BERT to generate synonyms dynamically, so no defender can know in advance the synonyms BERT-Attack will use to replace original words. Following Li et al. (2021), we used three metrics to evaluate our method and its competitors: clean accuracy (the accuracy of models on clean examples), denoted Clean%; accuracy under attack (the accuracy of models on adversarial examples under a certain attack), denoted Aua%; and number of queries (the average number of queries an attacker needs to perform successful attacks), denoted #Query.
The clean accuracy is calculated on the entire test set, while the other two metrics are evaluated on 1,000 examples randomly sampled from the test set. In Table 1, we also report the experimental results under the attack algorithms with constraints suggested by Li et al. (2021) to ensure the quality of the generated adversarial examples. For all attack algorithms, the maximum percentage of words allowed to be modified is set to 0.2 on SST-2, 0.3 on AGNEWS, and 0.1 on IMDB. We set the minimum semantic similarity between an original sample and its adversary to 0.84, the maximum number of synonyms per word to 50, and the maximum number of queries to the victim model to 50L per sample text, where L is the length of the sample text. For a fair comparison, the number of ascent steps K was set to 10 for all adversarial training methods considered.
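A hedged sketch of how such a constrained evaluation could be wired up with TextAttack is shown below; the class paths and argument names follow our recollection of the toolkit and may differ across versions, so treat this as a starting point rather than the authors' exact setup.

```python
import textattack
from textattack.attack_recipes import TextFoolerJin2019
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a fine-tuned victim model (the path is a placeholder).
model = AutoModelForSequenceClassification.from_pretrained("path/to/finetuned-bert")
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-bert")
wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

# Build the TextFooler recipe, then tighten its constraints.
attack = TextFoolerJin2019.build(wrapper)
attack.constraints.append(
    textattack.constraints.overlap.MaxWordsPerturbed(max_percent=0.2))  # SST-2

dataset = textattack.datasets.HuggingFaceDataset("glue", "sst2", split="validation")
attacker = textattack.Attacker(
    attack, dataset, textattack.AttackArgs(num_examples=1000))
attacker.attack_dataset()
```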

Experimental Results
From the numbers reported in Table 1, several trends are readily apparent: (1) MAWP consistently outperforms all competitors by a significant margin on adversarial data across three different attacks on the SST-2 and IMDB datasets, and it also achieves comparable prediction accuracy on clean examples; (2) although the gap in accuracy under attack between MAWP and the baselines narrows slightly under the attack algorithms constrained as recommended by Li et al. (2021), adversaries require many more queries to find adversarial examples for the models trained with MAWP. Note that the greater the average number of queries required by the adversaries, the harder the defended model is to compromise; (3) MAWP outperforms most competitors and achieves a relatively high clean accuracy of 95.23 on the AGNEWS test set.

Impact of Different Training Epochs
The accumulated weight perturbation was introduced to accelerate the training process and further improve the adversarial robustness of models. To evaluate its effectiveness, we implemented a variant of MAWP, denoted "MAWP w/o Accum", in which the accumulation operation is not used (i.e., $\epsilon_g$ is reset to 0 for every minibatch).
As shown in Figure 1, we found that: (1) both MAWP and its variant outperform FreeLB++ and FreeLB-AWP in accuracy under attack, especially after 10 epochs; (2) MAWP requires fewer training epochs than its variant "MAWP w/o Accum" to achieve the same or similar Aua%; (3) the robustness of the model trained with FreeLB-AWP can still be improved even after 35 epochs, while the Aua% of the model trained with FreeLB++ drops markedly at the end of the training process.

Impact of the Number of Ascent Steps
We would like to understand how the choice of the number of ascent steps K affects the adversarial robustness of models. MAWP was compared to FreeLB++ with different numbers of ascent steps, and the results on the SST-2 dataset are reported in Table 2.

MAWP outperforms FreeLB++ in all settings of K under the three attack algorithms. As the number of ascent steps grows, the clean accuracy of the two models drops to 91.14 and 91.21 respectively, and their adversarial robustness also drops slightly. This shows that too much perturbation harms both clean accuracy and adversarial robustness. Therefore, we set K = 10 in all experiments except those reported in this subsection.

Input Loss Landscape
To give a reasonable explanation of the effect of an enlarged number of ascent steps, we visualized the input loss landscapes produced by the models trained with FreeLB++ and MAWP with steps $K \in \{5, 10, 15\}$. The visualization is based on models trained on the SST-2 dataset. Specifically, we perturb the original input embedding $x$ to $x + \alpha \delta_1 + \beta \delta_2$, where $\delta_1$ and $\delta_2$ denote two normalized random Gaussian direction vectors, and $\alpha$ and $\beta$ are two scalar parameters. The corresponding input loss landscapes are shown in Figure 2. As we can see from Figures 2(a), 2(c) and 2(e), the input loss landscape gradually becomes flatter as the number of ascent steps K grows when the BERT-based model is trained with FreeLB++ (an enhanced variant of FreeLB). A similar trend can be observed in Figures 2(b), 2(d) and 2(f) when MAWP is used to train the models. As the number of ascent steps K grows, the region with lower loss becomes larger, leading to more robust models. However, overly strong perturbations without norm-bounded constraints distort the decision boundary of the models, which reduces their generalization and robustness.
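As a hedged illustration of this visualization procedure, the sketch below evaluates the loss on such a 2-D slice of the input embedding space for a model that accepts `inputs_embeds` and returns logits; the grid resolution and scale are arbitrary choices, and plotting is omitted.

```python
import numpy as np
import torch

@torch.no_grad()
def input_loss_surface(model, embeds, labels, loss_fn, grid=11, scale=1.0):
    """Loss over x + alpha*d1 + beta*d2 for two random normalized
    Gaussian directions d1, d2, as described for Figure 2."""
    d1 = torch.randn_like(embeds)
    d2 = torch.randn_like(embeds)
    d1, d2 = d1 / d1.norm(), d2 / d2.norm()
    alphas = np.linspace(-scale, scale, grid)
    betas = np.linspace(-scale, scale, grid)
    surface = np.zeros((grid, grid))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            logits = model(inputs_embeds=embeds + a * d1 + b * d2)
            surface[i, j] = loss_fn(logits, labels).item()
    return surface
```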
To sum up, a larger number of ascent steps makes the input loss landscape flatter and renders the model less affected by adversarial perturbations. However, a model with a too-flat landscape (i.e., an over-smoothed model) will suffer poor generalization and robustness.

The Features Captured with and without Weight Perturbation

We want to gain a deeper understanding of why weight perturbation leads to more robust models. We examined the differences between the feature vectors that the resulting models produce for original and perturbed examples. Given a neural model $f$, for an original example $x$ we obtain a perturbed example $\hat{x}$ that stays very close to the decision boundary of the model $f$ while $f(x) = f(\hat{x})$.
Note that the smaller this difference, the harder it is for adversaries to compromise the model, and the more robust the model generally is. We chose TextFooler* as the attack algorithm for generating such perturbed examples, where TextFooler* denotes the variant with the constraints recommended by Li et al. (2021). We used this variant because we want the results yielded by different defense methods to be comparable in a more controlled manner. We define the following distance function to measure the difference between the feature vectors of $x$ and $\hat{x}$ at the $j$-th layer:
$$G_j(x, \hat{x}) = \|h_j(x) - h_j(\hat{x})\|_2,$$
where $h_j(x)$ denotes the averaged feature vector extracted from the $j$-th layer of the BERT-based model for an input $x$. We chose the $\ell_2$-norm for calculating the distances since the size of a perturbation in the embedding space is measured with the $\ell_2$-norm by almost all textual attack algorithms.

We first investigated pairs of an input $x$ and its quasi-adversarial example $\hat{x}$ (one that stays very close to the decision boundary but does not yet make the model change its prediction) produced by different defense methods, where $\hat{x}$ was generated against the BERT-based model. The reported differences are averaged over all examples in the SST-2 test set. As shown in Figure 3-(a), we found that: (1) the distances between the features of $x$ and $\hat{x}$ at the 0-th layer (i.e., the embeddings) are quite similar for almost all the models except InfoBERT, which aggressively compresses the embeddings; (2) at deeper layers, the differences among the models gradually increase, and more adversarially robust models exhibit a lower distance between the features of $x$ and $\hat{x}$ at every layer, especially the last few layers.

We also examined the differences between the feature vectors of original examples and perturbed ones generated against the respective models (i.e., under test-time attacks). As shown in Figure 3-(b), the model trained with MAWP achieved a relatively low distance in the feature space compared to other adversarial training methods. MAWP keeps this distance small in the feature space produced at every network layer, which makes the model more resistant to perturbations imposed on the inputs.
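A small sketch of this per-layer measurement is given below, assuming a Huggingface BERT-style classifier that exposes hidden states; `batch_x` and `batch_x_adv` are tokenized original and perturbed inputs, and the helper name is ours.

```python
import torch

@torch.no_grad()
def layer_feature_distances(model, batch_x, batch_x_adv):
    """Per-layer l2 distance between averaged hidden states of the
    original and perturbed inputs, our reading of G_j(x, x-hat)."""
    h_x = model(**batch_x, output_hidden_states=True).hidden_states
    h_adv = model(**batch_x_adv, output_hidden_states=True).hidden_states
    dists = []
    for hx, ha in zip(h_x, h_adv):   # embedding layer + each encoder block
        # Average over the token dimension to get one vector per example.
        diff = hx.mean(dim=1) - ha.mean(dim=1)
        dists.append(diff.norm(dim=-1).mean().item())
    return dists
```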

Conclusion
This study is among the first to explore the feasibility of improving the adversarial robustness of neural NLP models by performing perturbations in the parameter space (i.e., the weights) rather than the input feature space (i.e., word embeddings). We experimentally demonstrated that weight perturbation can be used to find a better solution in the parameter space by minimizing the adversarial loss with a multi-step, gradient-guided optimization method. We also showed that the proposed method is complementary to existing adversarial training methods, and that the combination of our method and FreeLB achieves state-of-the-art accuracy on both clean and adversarial examples on multiple benchmark datasets.

Limitations
In this study, we evaluated the proposed mixed adversarial training method with accumulated weight perturbation (MAWP) only under word substitution-based attacks. We are aware that there is a wide range of textual adversarial attacks, including adding, deleting, or modifying characters, words, or other language units under certain semantics-preserving constraints. In the future, we would like to investigate how well MAWP can defend against other types of adversarial attacks. In the current implementation of MAWP, the weight perturbation needs to be calculated and applied at each adversarial training step, which requires a relatively long training time. We also plan to implement the weight perturbations for training NLP models in a more efficient way.

Figure 1 :
Figure 1: Accuracy under attack versus the number of training epochs. The experiments were conducted on SST-2 under the TextFooler attack algorithm.

Figure 3 :
Figure 3: The differences in the feature vector representations between original and perturbed examples produced at each BERT layer on SST-2, where the number "0" denotes the word embedding layer, "13" the output layer, and the numbers in between the hidden layers. (a) The perturbed examples were generated against the BERT-based model trained without any adversarial training method; (b) the perturbed examples were generated against the respective models under test-time attack.

Table 1 :
The experimental results of different defense methods on the SST-2, AGNEWS, and IMDB datasets. The best performance is highlighted in bold. The symbol * indicates attack algorithms on which we imposed constraints for fair comparison, ensuring the quality of the generated adversarial examples (see Section 5.2 for details).

Table 2 :
The impact of different numbers of ascent steps on clean accuracy and adversarial robustness on the SST-2 validation set.