Searching for an Effective Defender: Benchmarking Defense against Adversarial Word Substitution

Recent studies have shown that deep neural network-based models are vulnerable to intentionally crafted adversarial examples, and various methods have been proposed to defend against adversarial word-substitution attacks on neural NLP models. However, there is a lack of systematic study comparing different defense approaches under the same attack setting. In this paper, we seek to fill this gap through a comprehensive study of the behavior of neural text classifiers trained with various defense methods under representative adversarial attacks. In addition, we propose an effective method to further improve the robustness of neural text classifiers against such attacks, which achieves the highest accuracy on both clean and adversarial examples on the AGNEWS and IMDB datasets by a significant margin. We hope this study provides useful clues for future research on text adversarial defense. Code is available at https://github.com/RockyLzy/TextDefender.


Introduction
Deep neural networks have achieved impressive results on NLP tasks. However, they are vulnerable to intentionally crafted textual adversarial examples, which do not change human understanding of sentences but can easily fool deep neural networks. As a result, studies on adversarial attacks and defenses in the text domain have drawn significant attention, especially in recent years (Ebrahimi et al., 2017; Gao et al., 2018; Li et al., 2018; Ren et al., 2019a; Jia et al., 2019; Zhu et al., 2020; Garg and Ramakrishnan, 2020; Zeng et al., 2021). The goal of adversarial defense is to learn a model capable of achieving high test accuracy on both clean (i.e., original) and adversarial examples. We are eager to find out which adversarial defense method can improve the robustness of NLP models to the greatest extent while suffering no or little performance drop on clean input data.
To the best of our knowledge, existing adversarial defense methods for NLP models have yet to be evaluated or compared in a fair and controlled manner. This lack of evaluative and comparative research impedes understanding of the strengths and limitations of different defense methods, making it difficult to choose the best one for practical use. There are several reasons why previous studies are insufficient for a comprehensive understanding of adversarial defense methods. First, the settings of attack algorithms in previous defense works are far from "standardized": they vary greatly in synonym-generating methods, the number of queries to victim models, the maximum percentage of words that can be perturbed, and so on. Moreover, most defense methods have been tested against only a few attack algorithms, so we cannot determine from the results reported in the literature whether one method consistently outperforms the others; a single method might be quite robust to one specific attack while being much less robust to another. Second, some defense methods work well only when a certain condition is satisfied. For example, all existing certified defense methods except RanMASK (Zeng et al., 2021) assume that the defenders are informed of how the adversaries generate synonyms (Jia et al., 2019; Dong et al., 2021). This is not a realistic scenario, since we cannot impose a limitation on the synonym set used by the attackers. We therefore want to know which defense method is most effective against existing adversarial attacks when such limitations are removed for fair comparison among different methods.
In this study, we establish a reproducible and reliable benchmark to evaluate existing textual defense methods, providing detailed insights into the effectiveness of defense algorithms in the hope of facilitating future studies. In particular, we focus on defenses against adversarial word substitution, one of the most widely studied attack approaches and a major threat in adversarial defense. In order to rigorously evaluate the performance of defense methods, we propose four evaluation metrics: clean accuracy, accuracy under attack, attack success rate, and number of queries. The clean accuracy metric measures the generalization ability of NLP models, while the latter three measure model robustness against adversarial attacks. To systematically evaluate the defense performance of different textual defenders, we first define a comprehensive benchmark of textual attack methods that ensures the generation of high-quality textual adversarial examples, i.e., examples that change the output of models with human-imperceptible perturbations of the input. We then impose constraints on the defense algorithms to ensure fairness of comparison; for example, no defense method is allowed to access the synonym set used by the adversaries. Finally, we carry out extensive experiments using typical attack and defense methods for robustness evaluation, including five different attack algorithms and eleven defense methods, on both text classification and sentiment analysis tasks.
Through extensive experiments, we found that the gradient-guided adversarial training methods exemplified by PGD (Madry et al., 2018) and FreeLB (Zhu et al., 2020) can be further improved. Furthermore, a variant of the FreeLB method outperforms other adversarial defense methods, including those proposed years after it. In FreeLB, gradient-guided perturbations are applied to find the most vulnerable ("worst-case") points, and the model is trained by optimizing the loss at these vulnerable points. However, the magnitudes of these perturbations are constrained by a relatively small constant. We find that by extending the search region to a larger ℓ2-norm through increasing the number of search steps, much better accuracy can be achieved on both clean and adversarial data across various datasets. This improved variant of FreeLB, denoted FreeLB++, improves clean accuracy by 0.6% on AGNEWS. FreeLB++ also demonstrates strong robustness under the TextFooler attack (Jin et al., 2020b), achieving a 13.6% accuracy improvement compared to the current state-of-the-art performance (Zeng et al., 2021). Similar results are confirmed on the IMDB dataset.
We believe our findings invite researchers to reconsider the role of adversarial training and to re-examine the trade-off between accuracy and robustness (Zhang et al., 2019). We also hope to draw attention to designing adversarial attack and defense algorithms based on fair comparisons.

Textual Adversarial Attacks
Textual adversarial attack aims to construct adversarial examples for the purpose of 'fooling' neural network-based NLP models. For example, in text classification tasks, a text classifier f(x) maps an input text x ∈ X to a label c ∈ Y, where x = w_1, ..., w_L is a text consisting of L words and Y is a set of discrete categories. Given an original input x, a valid adversarial example x' = w'_1, ..., w'_L is crafted to conform to the following requirements:

f(x') ≠ y  and  Sim(x, x') ≥ ε_min,  (1)

where y is the ground truth for x, Sim : X × X → [0, 1] is a similarity function between the original x and its adversarial example x', and ε_min is the minimum similarity. In NLP, Sim is often a semantic similarity function that uses Universal Sentence Encoder (USE) (Cer et al., 2018) to encode two texts into high-dimensional vectors and takes their cosine similarity score as an approximation of semantic similarity (Li et al., 2018).

Adversarial Word Substitution
Adversarial word substitution is one of the most widely used textual attack methods, in which an adversary replaces words in the original text x with their synonyms, drawn from a synonym set, to alter the model's prediction. Specifically, for each word w, w' ∈ S_w is any of w's synonyms (including w itself), where the synonym sets S_w are chosen by the adversaries, e.g., built on well-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Su et al., 2018). The process of adversarial word substitution usually involves two steps: determine an important position to change; then modify the words at the selected positions to maximize the model's prediction error. To find a word w' ∈ S_w that maximizes the model's prediction error, two kinds of search strategies have been introduced: greedy algorithms (Kuleshov et al., 2018; Li et al., 2018; Ren et al., 2019b; Hsieh et al., 2019; Jin et al., 2020b) and combinatorial optimization algorithms (Alzantot et al., 2018; Zang et al., 2020). Although the latter can usually fool a model more successfully, they are time-consuming and require a massive number of queries. This is especially unfair to defenders, because almost no model can guarantee high prediction accuracy under large-scale queries. Therefore, we must impose constraints on the attack algorithms before we systematically evaluate the performance of the defense algorithms, which will be discussed in Section 3.

Table 1: The comparison of different defense algorithms. We use "norm-bounded perturbations" to denote whether the perturbations to word embeddings are norm-bounded, "synonyms-agnostic" whether the defense algorithms rely on pre-defined synonym sets, "structure-free" whether the defense methods can only be applied to a specific network architecture, and "ensemble-based" whether an ensemble method is required to produce results.
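The two-step greedy search described above (rank positions by importance, then substitute synonyms greedily) can be sketched as follows. The scoring function `predict_proba` and the synonym table `SYNONYMS` are hypothetical stand-ins for a real classifier and an embedding-derived synonym set, used only to make the control flow concrete.

```python
# A minimal sketch of a greedy word-substitution attack. `predict_proba` and
# SYNONYMS are toy stand-ins, not any real model or synonym resource.

SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film", "picture"],
}

def predict_proba(words, true_label):
    # Toy "model": confidence in the true label drops when rarer words appear.
    rare = {"fine", "picture"}
    score = 0.9 - 0.3 * sum(w in rare for w in words)
    return max(0.0, min(1.0, score))

def greedy_attack(words, true_label):
    words = list(words)
    # Step 1: rank positions by how much deleting each word hurts confidence.
    base = predict_proba(words, true_label)
    importance = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        importance.append((base - predict_proba(reduced, true_label), i))
    # Step 2: visit positions in importance order, greedily picking the
    # synonym that most lowers the true-label confidence.
    for _, i in sorted(importance, reverse=True):
        best_w, best_score = words[i], predict_proba(words, true_label)
        for syn in SYNONYMS.get(words[i], []):
            cand = words[:i] + [syn] + words[i + 1:]
            s = predict_proba(cand, true_label)
            if s < best_score:
                best_w, best_score = syn, s
        words[i] = best_w
        if best_score < 0.5:  # prediction flipped: attack succeeded
            return words
    return words
```

Each word is visited at most once, which is what bounds the query count of greedy attackers at O(K × L), as discussed in Section 3.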

Textual Adversarial Defenses
Many defense methods have been proposed to improve the robustness of models against textual adversarial attacks. Most of these methods focus on defending against adversarial word substitution attacks (Ye et al., 2020). According to whether they possess provably guaranteed adversarial robustness, these methods can roughly be divided into two categories: empirical (Zhu et al., 2020; Si et al., 2020; Dong et al., 2021) and certified (Jia et al., 2019; Ye et al., 2020; Zeng et al., 2021) defense methods. Table 1 summarizes the detailed categorization of these defense methods.
Adversarial Data Augmentation (ADA) is one of the most effective empirical defenses (Ren et al., 2019a) for NLP models. However, ADA is severely limited by the enormous perturbation search space, which scales exponentially with the length of the input text. To cover a much larger proportion of the perturbation search space, Si et al. (2020) proposed Mix-ADA, a mixup-based (Zhang et al., 2017) augmentation method. Region-based adversarial training (Dong et al., 2021) improves a model's robustness by optimizing its performance within the convex hull (region) formed by the embeddings of a word and its synonyms. Adversarial training (Madry et al., 2018; Zhu et al., 2020) incorporates a min-max optimization between adversarial perturbations and the models by adding norm-bounded perturbations to word embeddings. Previous research on norm-bounded adversarial training focused on improving the generalization of NLP models; however, our experimental results show that these methods can also effectively improve model robustness while suffering no performance drop on clean inputs. It has been experimentally shown that the above empirical methods can defend against attack algorithms, but they cannot provably guarantee correct predictions under more sophisticated attacks. Recently, a set of certified defense methods has been introduced for NLP models, which can be divided into two categories: Interval Bound Propagation (IBP) (Jia et al., 2019; Huang et al., 2019; Shi et al., 2020; Xu et al., 2020) and randomized smoothing (Ye et al., 2020; Zeng et al., 2021) methods. IBP-based methods depend on knowledge of the model structure because they compute the range of the model output by propagating interval constraints over the inputs layer by layer.
Randomized smoothing-based methods, on the other hand, are structure-free: they construct stochastic ensembles over input texts and leverage the statistical properties of the ensemble to provably certify robustness. All certified defense methods except RanMASK (Zeng et al., 2021) are based on the assumption that the defender can access the synonym set used by the attacker. Experimental results show that under the same setting, i.e., without access to the attacker's synonym set, RanMASK achieves the best defense performance among these certified defenders.

Constraints on Adversarial Example Generation
In this section, we first introduce the constraints that should be imposed on textual adversarial attacks to ensure the quality of the generated adversarial examples, which helps us benchmark textual defenses. We then introduce the datasets used in our experiments and pick the optimal hyper-parameters for each constraint.

The Constraints on Adversaries
To ensure the quality of the generated adversarial examples, we impose constraints on textual attack algorithms in the following four aspects:
• The minimum semantic similarity ε_min between the original input x and the adversarial example x'.
• The maximum number of one word's synonyms, K_max.
• The maximum percentage of modified words, ρ_max.
• The maximum number of queries to the victim model, Q_max.
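The four constraints above can be collected into a single configuration object. A minimal sketch is shown below; the class and attribute names are our own, not from any attack framework, and the defaults follow the values chosen later in Section 3.2.

```python
# Hypothetical container for the four attack constraints used in this
# benchmark; defaults follow the AGNEWS settings from Section 3.2.
from dataclasses import dataclass

@dataclass(frozen=True)
class AttackConstraints:
    eps_min: float = 0.84   # minimum USE semantic similarity
    k_max: int = 50         # maximum synonyms per word
    rho_max: float = 0.3    # maximum fraction of modified words (0.1 for IMDB)

    def q_max(self, sentence_len: int) -> int:
        # Query budget Q_max = K_max * L for a greedy attacker.
        return self.k_max * sentence_len

agnews = AttackConstraints()
imdb = AttackConstraints(rho_max=0.1)
```

Keeping the constraints in one frozen object makes it easy to guarantee that every attacker in the benchmark is evaluated under identical limits.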
Semantic Similarity In order for the generated adversarial examples to be undetectable by humans, we need to ensure that the perturbed sentence is semantically consistent with the original sentence. This is usually achieved by imposing a semantic similarity constraint; see Eq. (1). Most adversarial attack methods (Li et al., 2018; Jin et al., 2020b) use Universal Sentence Encoder (USE) (Cer et al., 2018) to evaluate semantic similarity. USE first encodes sentences into vectors and then uses the cosine similarity score between the vectors as an approximation of the semantic similarity between the corresponding sentences. Following prior settings (Morris et al., 2020a), we set the default value of the minimum semantic similarity ε_min to 0.84.
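The similarity constraint amounts to the following check: encode both sentences and compare the cosine similarity of the vectors against ε_min. In the sketch below, `encode` is a trivial bag-of-words stand-in for USE (which returns dense sentence vectors); only the cosine computation and thresholding mirror the actual constraint.

```python
# Sketch of the USE-style similarity filter. `encode` is a hypothetical
# stand-in: a bag-of-words count vector instead of a real sentence encoder.
import math
from collections import Counter

def encode(sentence: str) -> Counter:
    return Counter(sentence.lower().split())

def cosine_similarity(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def is_valid_perturbation(orig: str, adv: str, eps_min: float = 0.84) -> bool:
    # Reject candidates whose similarity to the original falls below eps_min.
    return cosine_similarity(encode(orig), encode(adv)) >= eps_min
```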
Size of Synonym Set For a word w and its synonym set S_w, we denote the number of elements in S_w as K = |S_w|. The value of K influences the search space of attack methods: a larger K increases the attacker's success rate (Morris et al., 2020b).
However, a larger K also results in lower-quality adversarial examples, since there is no guarantee that these K words are all synonyms of the same word, especially when GloVe vectors are used to construct a word's synonym set. When setting the maximum value of K for attack algorithms, we hold the other variables fixed and select the optimal value K_max that keeps the attack success rate from decreasing too much; see Section 3.2 for more details.

Percentage of Modified Words
For an input text x = w_1, ..., w_L of length L and its adversarial example x' = w'_1, ..., w'_L, the percentage of modified words is defined as:

ρ = (1/L) Σ_{i=1}^{L} I{w_i ≠ w'_i},  (2)

where Σ_{i=1}^{L} I{w_i ≠ w'_i} is the Hamming distance and I{·} is the indicator function. An attacker is not allowed to perturb too many words, since excessive perturbation results in lower similarity between the perturbed and original sentences. However, most existing attack algorithms do not limit the modification ratio ρ, and sometimes even perturb all words in a sentence to ensure attack success. Since it is too difficult for defense algorithms to resist such attacks, we impose a maximum value of ρ. Similar to the procedure for setting K_max, we use the control-variable method to select the optimal value ρ_max, which will be discussed in Section 3.2.

Number of Queries Some existing attack algorithms achieve a high attack success rate through massive queries to the model (Yoo et al., 2020). In order to model a practical attack, and considering the difficulty of defense and the time cost of benchmarking, we restrict the number of queries an attacker may issue to the victim model. At present, the most representative attack algorithms are based on greedy search strategies (see Section 2.1), and experiments have shown that these greedy algorithms are sufficient to achieve a high attack success rate. For a greedy attack algorithm with synonym sets of size K = |S_w|, the search complexity is O(K × L), where L is the length of the input text x, since the greedy algorithm guarantees that each word in the sentence is replaced at most once. Thus, by default we set the maximum number of queries to the product of K_max and the sentence length L: Q_max = K_max × L.
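The modification-rate constraint of Eq. (2) is just a normalized Hamming distance over the word sequences; a minimal sketch (function names are ours) is:

```python
# rho from Eq. (2): normalized Hamming distance between the original and the
# perturbed word sequences. Word substitution preserves sentence length.

def modification_rate(orig_words, adv_words):
    assert len(orig_words) == len(adv_words), "substitution preserves length"
    changed = sum(o != a for o, a in zip(orig_words, adv_words))
    return changed / len(orig_words)

def within_budget(orig_words, adv_words, rho_max=0.3):
    # Reject adversarial candidates exceeding the modification budget
    # (rho_max = 0.3 for AGNEWS, 0.1 for IMDB).
    return modification_rate(orig_words, adv_words) <= rho_max
```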

Datasets and Hyper-parameters
We conducted experiments on two widely used datasets: the AG-News corpus (AGNEWS) (Zhang et al., 2015) for text classification task and the Internet Movie Database (IMDB) (Maas et al., 2011) for sentiment analysis task.
In order to pick the optimal values K_max and ρ_max for each dataset, we choose five representative adversarial word substitution algorithms: PWWS (Ren et al., 2019a), TextBugger (Li et al., 2018), TextFooler (Jin et al., 2020b), DeepWordBug (Gao et al., 2018), and BERT-Attack. All of them are greedy-search-based attack algorithms 1 . All attackers use the K nearest neighbor words of GloVe vectors (Pennington et al., 2014) to generate a word's synonyms, except DeepWordBug, which performs character-level perturbations by generating K typos for each word, and BERT-Attack, which dynamically generates synonyms with BERT (Devlin et al., 2018). We use BERT as the baseline model, and our implementations are based on the TextAttack framework (Morris et al., 2020a).
When selecting the optimal K_max value for AGNEWS, we first hold the other variables fixed, e.g., the maximum percentage of modified words ρ_max = 0.3, and conduct experiments on AGNEWS with different values of K. As we can see from Figure 1(a), the accuracy under attack decreases as K increases, but the marginal decline shrinks. For K ≥ 50, the decline in accuracy under attack becomes minimal, so we pick K_max = 50. Through the same process, we determine the optimal values ρ_max = 0.1 and K_max = 50 for the IMDB dataset, as shown in Figures 1(c) and 1(d), and ρ_max = 0.3 for the AGNEWS dataset, as shown in Figure 1(b).
In conclusion, we impose four constraints on attack algorithms to better support the evaluation of different textual defenders. We set ρ_max = 0.3 for AGNEWS and ρ_max = 0.1 for IMDB. This setting is reasonable because the average sentence length of IMDB (208 words) is much longer than that of AGNEWS (44 words). For the other constraints, we set K_max = 50, ε_min = 0.84, and Q_max = K_max × L. We choose three base attackers to benchmark the defense performance of textual defenders: TextFooler, BERT-Attack, and TextBugger. Our choice of attackers is based on their outstanding attack performance, as shown in Figure 1.

Evaluation Metrics
Under the unified setting of the above-mentioned adversarial attacks, we conducted experiments on the existing defense algorithms on AGNEWS and IMDB. We use four metrics to measure defense performance: clean accuracy (Clean%), accuracy under attack (Aua%), attack success rate, and the average number of queries needed by the attacker. A good defense method should have higher clean accuracy, higher accuracy under attack, a lower attack success rate, and require a larger number of queries to attack.
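Computed from per-example attack records, the four metrics can be sketched as below. Each record is a hypothetical (clean_correct, attacked_correct, n_queries) triple; note that the attack success rate is measured only over examples the model classified correctly before the attack.

```python
# Sketch of the four evaluation metrics. `records` is a hypothetical list of
# (clean_correct, attacked_correct, n_queries) triples, one per test example.

def evaluate(records):
    n = len(records)
    clean_correct = [r for r in records if r[0]]
    clean_acc = len(clean_correct) / n                   # Clean%
    aua = sum(r[1] for r in records) / n                 # accuracy under attack
    flipped = sum(1 for r in clean_correct if not r[1])
    suc = flipped / len(clean_correct)                   # attack success rate
    avg_queries = sum(r[2] for r in records) / n         # query cost of attack
    return {"Clean%": clean_acc, "Aua%": aua,
            "Suc%": suc, "#Query": avg_queries}
```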

Implementation Details
Our reproductions of all defense methods, along with their hyper-parameter settings, are based entirely on the original papers, except for the following two constraints: (1) for methods that are not synonyms-agnostic, we establish different synonym sets for the attackers and the defender; (2) for methods that are ensemble-based, we use the "logit-summed" ensemble method introduced in Devvrit et al. (2020) to make final predictions. Specifically, we use the counter-fitted vectors (Mrkšić et al., 2016) to generate the synonym set for attackers, and the vanilla GloVe embeddings (Pennington et al., 2014) to generate the synonym set for defenders. (According to our statistics, 69.70% of the words in the defender's synonym set appear in the attacker's synonym-set vocabulary; among them, 73% of the synonyms in the defender's synonym set are covered by the attacker's.) Following Devvrit et al. (2020) and Zeng et al. (2021), we take the average of the logits produced by the base classifier over all randomly perturbed input sentences, whose number is denoted as C, as the final prediction. For AGNEWS, we set C to 100, while for IMDB, C defaults to 16. In the implementation of FreeLB++, we remove the norm-bounded projection constraint and set the number of search steps to 30 and 10 on the AGNEWS and IMDB datasets, respectively. More details are given in Section 4. All hyper-parameters are tuned on a randomly chosen development set.
We use BERT (Devlin et al., 2018) as our base model. Clean accuracy (Clean%) is evaluated on the whole test set, while the other three metrics, i.e., accuracy under attack (Aua%), attack success rate, and number of queries, are evaluated on 1000 randomly chosen samples from the test set.

Results
As we can see from Table 2: (1) the ADA-based methods show a small decrease in clean accuracy but excellent accuracy under attack. However, compared with the remaining methods, ADA-based methods need to know the specific attack algorithm in order to generate adversarial examples before defending.
(2) The adversarial training methods, e.g., FreeLB, achieve higher clean accuracy than the baseline, but their improvement in robustness is insignificant (Zeng et al., 2021). Interestingly, once we remove the ℓ2-norm bound for FreeLB, we find that the defense performance is significantly improved (see FreeLB++ in the tables). FreeLB++ surpasses all existing defense methods by a large margin under the TextFooler and TextBugger attacks. We discuss adversarial training methods further in Section 5.1. (3) The region-based adversarial training methods, e.g., DNE, perform poorly on both clean accuracy and accuracy under attack, mainly because the synonym set used by the attack method differs from the one used by DNE; this is further discussed in Section 5.3. (4) The certified defense methods achieve high defense performance. It is worth noting that these methods require a larger average number of queries to the model; we believe the improvement in robustness comes from the ensemble method, as discussed further in Section 5.2.
Results of defense performance on IMDB are reported in Table 3. The defense methods show the same trends as on AGNEWS. However, the models are generally less robust on IMDB than on AGNEWS, probably because the average sentence length in IMDB (208 words) is far longer than that in AGNEWS (44 words). Longer sentences imply a larger search space for attackers, making it more difficult for defenders to resist attacks.

Effectiveness of Adversarial Training
The objective of standard adversarial training methods, e.g., PGD-K (Madry et al., 2018) and FreeLB (Zhu et al., 2020), is to minimize the maximum risk over perturbations δ within a small ε-norm ball:

min_θ E_{(X,y)∼D} [ max_{||δ||_F ≤ ε} L(f_θ(X + δ), y) ],  (3)

where D is the data distribution, X is the embedding representation of the input sentence x, y is the gold label, and L is the loss function for training the neural network, whose parameters are denoted as θ. To solve the inner maximization, the projected gradient descent (PGD) algorithm is applied, as described in Madry et al. (2018) and Zhu et al. (2020):

δ_{t+1} = Π_{||δ||_F ≤ ε} ( δ_t + α · g(δ_t) / ||g(δ_t)||_F ),  (4)

where g(δ_t) = ∇_δ L(f_θ(X + δ_t), y) is the gradient of the loss with respect to δ, Π_{||δ||_F ≤ ε} performs a projection onto the ε-Frobenius-norm ball, and t is the number of ascent steps used to find the "worst-case" perturbation δ with step size α.

In this section, we first study the influence of the norm bound ε on the model's robustness, which has also been discussed by Gowal et al. (2020) in the computer vision field. As can be seen from Table 4, both Clean% and Aua% increase as ε increases. Note that ε is usually set to a very small value, e.g., ε = 0.01 (Zhu et al., 2020). A large value (e.g., ε = 1 in Table 4) is equivalent to removing the norm bound (see "w/o" in Table 4), because when ε is large enough and the step size α is fixed, the magnitude of each update to δ is also fixed; see Eq. (4). In this case, since ||g(δ_t) / ||g(δ_t)||_F||_F ≤ 1, Eq. (4) gives:

||δ_{t+1} − δ_t||_F ≤ α.

Thus, after t ascent steps of updating δ, we have:

||δ_t||_F ≤ ||δ_0||_F + t · α,

which shows that the upper bound on the norm of the perturbation δ is determined by the number of ascent steps t when α is fixed. In other words, the number of ascent steps t controls the search region of the perturbation δ: the larger t is, the larger the search region.
However, in the original FreeLB, the same ε-norm bound is applied to all perturbations to restrict the search region around every word embedding. We denote our versions of PGD-K and FreeLB that remove this norm bound as PGD-K++ and FreeLB++, respectively.

Figure 2: The accuracy of FreeLB++ and PGD-K++, respectively, under three different attack algorithms (TextFooler, TextBugger and BERT-Attack). As the value of t grows, Clean% and Aua% increase until reaching their peak values, after which they decrease as t continues to grow.

We conducted experiments on PGD-K++ and FreeLB++ with different values of t to study its impact. As shown in Figure 2, the Clean% of both PGD-K++ and FreeLB++ peaks at t = 5, while Aua% peaks at t = 30 for FreeLB++ and t = 10 for PGD-K++. We offer a possible explanation for this improvement. We regard standard adversarial training as an exploration of the embedding space. When t is small, the adversarial example space explored by the model is relatively small, resulting in poor defense performance when a high-intensity attack arrives. This problem is alleviated as t grows, which explains why both Clean% and Aua% improve as t increases. When t exceeds its optimal value, however, the adversarial examples generated by the algorithm may become dissimilar to the original examples, and excessive learning from examples whose distribution differs from the original data leads to a decline in the model's modeling ability.
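The norm argument above, that unprojected ascent with a fixed step size α and normalized gradients keeps ||δ_t|| within t · α, so increasing t directly widens the search region, can be checked numerically on a toy problem. The quadratic loss below is a hypothetical stand-in for the network loss; nothing here is tied to the actual FreeLB implementation.

```python
# Toy check of the bound ||delta_t|| <= t * alpha for unprojected ascent
# (FreeLB++-style, i.e. Eq. (4) without the projection step). The quadratic
# loss L = ||x + delta - target||^2 stands in for the network loss.
import math

def ascend(x, target, alpha, steps):
    delta = [0.0, 0.0]
    for _ in range(steps):
        # gradient of L with respect to delta
        g = [2 * (x[i] + delta[i] - target[i]) for i in range(2)]
        norm = math.sqrt(sum(v * v for v in g)) or 1.0
        # normalized ascent step of magnitude exactly alpha
        delta = [delta[i] + alpha * g[i] / norm for i in range(2)]
    return delta

x, target, alpha = [1.0, 0.0], [0.0, 0.0], 0.01
for t in (5, 10, 30):
    d = ascend(x, target, alpha, t)
    assert math.sqrt(sum(v * v for v in d)) <= t * alpha + 1e-9
```

On this toy loss the gradient direction never changes, so the bound is attained exactly: after t steps, ||δ_t|| = t · α, illustrating how t (not ε) governs the explored region once the projection is removed.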

Impact of Ensemble Strategies
There are two ensemble strategies (Devvrit et al., 2020): logits-summed (logit) and majority-vote (voting) ensembles. As mentioned above, in the logit method the logits produced by the base classifier are averaged, whereas in the voting strategy the predictions of the classifiers for each class label are counted, and the vote counts are taken as the output probabilities for classification. Compared to the logit method, we found that the majority-vote strategy can effectively improve the model's robustness, as can be inferred from the results in Table 5. However, upon further investigation, the reason the voting strategy achieves better defense performance is that it increases the difficulty for score-based attackers, as also discussed in Zeng et al. (2021).
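The two strategies can be sketched side by side. Both aggregate the base classifier's outputs over C randomly perturbed copies of the input; `logits` below is a hypothetical list of per-copy logit vectors.

```python
# Sketch of the two ensemble strategies over C perturbed copies of an input.
# `logits` is a hypothetical C x num_classes list of raw logit vectors.

def logit_ensemble(logits):
    # logits-summed: average the raw logits, then take the argmax
    n, k = len(logits), len(logits[0])
    avg = [sum(row[j] for row in logits) / n for j in range(k)]
    return max(range(k), key=avg.__getitem__)

def voting_ensemble(logits):
    # majority-vote: each copy casts one vote for its argmax class
    k = len(logits[0])
    votes = [0] * k
    for row in logits:
        votes[max(range(k), key=row.__getitem__)] += 1
    return max(range(k), key=votes.__getitem__)
```

The two strategies can disagree: one highly confident copy can dominate the averaged logits while being outvoted by the majority, and the vote counts also hide the continuous scores that score-based attackers rely on.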
A typical score-based attacker usually involves two key steps: searching for weak spots in a text, and replacing the words at these weak spots to maximize the model's prediction error. In the second stage, if no word in the synonym set can lower the logits, the adversary gives up perturbing that word. However, for those voting-based methods that create ensembles by introducing small noise into the original text x, e.g., SAFER and RanMASK-5%, the models tend to output very sharp distributions, even close to one-hot categorical distributions. This forces the attackers to launch decision-based attacks instead of score-based ones, which dramatically increases their attack difficulty. Therefore, it may be unfair to compare voting-based ensemble defense methods with others, due to the lack of effective ways to attack voting-based ensembles in the literature. In short, voting-based ensembles achieve better performance than logit-based ensembles, but this is potentially due to the non-differentiability introduced by voting. We believe voting-based methods can greatly improve a model's defense performance, but we recommend using the logit-summed algorithm when one needs to demonstrate the effectiveness of a proposed algorithm against adversarial attacks in future research.

Impact of Synonym Sets

Table 6: The ablation experiment on the synonym set. "w" and "w/o" denote whether or not the corresponding defense method uses the synonym set of the attack method.

Table 6 shows the results of the ablation study on the impact of the external synonym set on the performance of the defense methods. Some previous studies (Ye et al., 2020; Dong et al., 2021) use the same synonym set as the attacker during adversarial defense training, leading to significantly better defense performance. As we can see from Table 6, all methods improve Aua% by a large margin after sharing the synonym set with the attacker.
However, having access to the attacker's synonym set is not a realistic scenario, since we cannot impose a limitation on the synonym set used by the attackers. Thus, for the sake of fair comparison, we suggest that future work assume the attacker's synonym set cannot be accessed, and report defense performance under this condition.

Conclusion
In this paper, we established a comprehensive and coherent benchmark to evaluate the defense performance of textual defenders. We imposed constraints on existing attack algorithms to ensure the quality of the generated adversarial examples. Using these attackers, we systematically studied the advantages and disadvantages of different textual defenders. We find that adversarial training methods remain the most effective defense. Our FreeLB++ not only achieves state-of-the-art defense performance under various attack algorithms, but also improves performance on clean examples. We hope this study provides useful clues for future research on text adversarial defense.