Defending Pre-trained Language Models from Adversarial Word Substitution Without Performance Sacrifice

Pre-trained contextualized language models (PrLMs) have led to strong performance gains in downstream natural language understanding tasks. However, PrLMs can still be easily fooled by adversarial word substitution, which is one of the most challenging textual adversarial attack methods. Existing defense approaches suffer from notable performance loss and complexities. Thus, this paper presents a compact and performance-preserved framework, Anomaly Detection with Frequency-Aware Randomization (ADFAR). In detail, we design an auxiliary anomaly detection classifier and adopt a multi-task learning procedure, by which PrLMs are able to distinguish adversarial input samples. Then, in order to defend against adversarial word substitution, a frequency-aware randomization process is applied to those recognized adversarial input samples. Empirical results show that ADFAR significantly outperforms recently proposed defense methods over various tasks with much higher inference speed. Remarkably, ADFAR does not impair the overall performance of PrLMs. The code is available at https://github.com/LilyNLP/ADFAR


Introduction
Deep neural networks (DNNs) have achieved remarkable success in various areas. However, previous works show that DNNs are vulnerable to adversarial samples (Goodfellow et al., 2015; Kurakin et al., 2017), which are inputs with small, intentional modifications that cause the model to make false predictions. Pre-trained language models (PrLMs) (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020; Zhang et al., 2020, 2019) are widely adopted as an essential component of various NLP systems. However, as DNN-based models, PrLMs can still be easily fooled by textual adversarial samples (Wallace et al., 2019; Jin et al., 2019; Nie et al., 2020; Zang et al., 2020). Such vulnerability of PrLMs keeps raising potential security concerns; research on defense techniques that help PrLMs resist textual adversarial samples is therefore imperative.

* Corresponding author. This paper was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of National Natural Science Foundation of China (U1836222 and 61733011). This work was supported by Huawei Noah's Ark Lab.
Different kinds of textual attack methods have been proposed, ranging from character-level word misspelling (Gao et al., 2018), word-level substitution (Alzantot et al., 2018; Ebrahimi et al., 2018; Ren et al., 2019; Jin et al., 2019; Zang et al., 2020; Li et al., 2020; Garg and Ramakrishnan, 2020), and phrase-level insertion and removal (Liang et al., 2018), to sentence-level paraphrasing (Ribeiro et al., 2018; Iyyer et al., 2018). Due to the discrete nature of natural language, attack approaches that result in illegal or unnatural sentences can be easily detected and restored by spelling correction and grammar error correction (Islam and Inkpen, 2009; Sakaguchi et al., 2017; Pruthi et al., 2019). However, attacks based on adversarial word substitution can efficiently produce high-quality adversarial samples that remain hard to detect with existing methods. Adversarial word substitution thus poses a larger and more profound challenge to the robustness of PrLMs, and this paper is devoted to overcoming it.
Several approaches have already been proposed to mitigate the issues posed by adversarial word substitution (Jia et al., 2019; Huang et al., 2019; Cohen et al., 2019; Ye et al., 2020; Si et al., 2021). Although these defense methods manage to alleviate its negative impact, they sometimes reduce the prediction accuracy for non-adversarial samples to a notable extent. Given the uncertainty of whether an attack occurs in real applications, it is impractical to sacrifice the original prediction accuracy for the purpose of defense. Moreover, previous defense methods either impose strong limitations on the attack space in order to certify robustness, or require enormous computational resources during training and inference. It is therefore important to find an efficient, performance-preserving defense method.
For this purpose, we present a compact and performance-preserved framework, Anomaly Detection with Frequency-Aware Randomization (ADFAR), to help PrLMs defend against adversarial word substitution without performance sacrifice. Xie et al. (2018) show that introducing randomization at inference can effectively defend against adversarial attacks. Moreover, Mozes et al. (2020) indicate that adversarial samples usually replace words with their less frequent synonyms, while PrLMs are more robust to frequent words. Therefore, we propose a frequency-aware randomization process to help PrLMs defend against adversarial word substitution.
However, simply applying a randomization process to all input sentences would reduce the prediction accuracy for non-adversarial samples. In order to preserve the overall performance, we add an auxiliary anomaly detector on top of the PrLM and adopt a multi-task learning procedure, by which the PrLM learns to determine whether each input sentence is adversarial without introducing any extra model. Only those input sentences recognized as adversarial then undergo the randomization procedure, while the prediction process for non-adversarial input sentences remains unchanged.
Empirical results show that, as a more efficient method, ADFAR significantly outperforms previous defense methods (Ye et al., 2020) over various tasks, and preserves the prediction accuracy for non-adversarial sentences. Comprehensive ablation studies and analysis further prove the efficiency of our proposed method, and indicate that the adversarial samples generated by current heuristic word substitution strategies can be easily detected by the proposed auxiliary anomaly detector.

Adversarial Word Substitution
Adversarial word substitution (AWS) is one of the most efficient approaches to attack advanced neural models like PrLMs. In AWS, an attacker deliberately replaces certain words by their synonyms to mislead the prediction of the target model. At the same time, a high-quality adversarial sample should maintain grammatical correctness and semantic consistency. In order to craft efficient and high-quality adversarial samples, an attacker should first determine the vulnerable tokens to be perturbed, and then choose suitable synonyms to replace them.
Current AWS models (Alzantot et al., 2018; Ebrahimi et al., 2018; Ren et al., 2019; Jin et al., 2019; Li et al., 2020; Garg and Ramakrishnan, 2020) adopt heuristic algorithms to locate vulnerable tokens in sentences. To illustrate, for a given sample and a target model, the attacker iteratively masks the tokens and checks the output of the model. Tokens that have a significant influence on the final output logits are regarded as vulnerable.
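The masking heuristic above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: `model` is a hypothetical callable that maps a token list to class probabilities, and the score for each position is the drop in the target-class probability when that token is masked.

```python
def rank_vulnerable_tokens(tokens, model, target_class, mask_token="[MASK]"):
    """Return token indices sorted by importance (largest probability drop first)."""
    base_prob = model(tokens)[target_class]
    scores = []
    for i in range(len(tokens)):
        # Mask one token at a time and re-query the target model.
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        drop = base_prob - model(masked)[target_class]
        scores.append((drop, i))
    # Tokens whose removal hurts the prediction most are the attack targets.
    return [i for _, i in sorted(scores, reverse=True)]
```

An attacker would then try synonym substitutions at the top-ranked positions first, stopping as soon as the model's prediction flips.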
Previous works leverage word embeddings such as GloVe (Pennington et al., 2014) and counter-fitted vectors (Mrkšić et al., 2016) to search for the suitable synonym set of a given token. Li et al. (2020) and Garg and Ramakrishnan (2020) use BERT (Devlin et al., 2019) to generate perturbations with better semantic consistency and language fluency.

Defense against AWS
For general attack approaches, adversarial training (Goodfellow et al., 2015; Jiang et al., 2020) is widely adopted to mitigate the adversarial effect, but Alzantot et al. (2018) and Jin et al. (2019) show that this method is still vulnerable to AWS. This is because AWS models leverage dynamic algorithms to attack the target model, while adversarial training only involves a static training set.
Methods proposed by Jia et al. (2019) and Huang et al. (2019) are proved effective for defense against AWS, but they still have several limitations. These methods leverage Interval Bound Propagation (IBP) (Dvijotham et al., 2018), an approach that theoretically considers the worst-case perturbation, to certify the robustness of models. However, IBP-based methods can only achieve certified robustness under a strong limitation on the attack space, and they are difficult to adapt to PrLMs because of their strong reliance on model-specific assumptions.

Two effective and actionable methods, DISP and SAFER (Ye et al., 2020), have been proposed to overcome the challenge posed by AWS, and are therefore adopted as the baselines for this paper. DISP is a framework based on perturbation discrimination that blocks adversarial attacks. In detail, when facing adversarial inputs, DISP leverages two auxiliary PrLMs: one to detect perturbed tokens in the sentence, and another to restore the abnormal tokens to the original ones. Inspired by randomized smoothing (Cohen et al., 2019), Ye et al. (2020) propose SAFER, a novel framework that guarantees robustness by smoothing the classifier with synonym word substitution. To illustrate, based on random word substitution, SAFER smooths the classifier by averaging its outputs over a set of randomly perturbed versions of the input. SAFER outperforms IBP-based approaches and can be easily applied to PrLMs.
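The SAFER-style smoothing idea can be sketched as follows. This is a minimal illustration, not SAFER's actual implementation: `classify` and `random_synonym_substitute` are hypothetical helpers, and the ensemble is aggregated by majority vote over predicted labels rather than by averaging probability vectors.

```python
import random
from collections import Counter

def smoothed_predict(tokens, classify, random_synonym_substitute,
                     n_samples=16, seed=0):
    """Predict via majority vote over randomly perturbed copies of the input."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        # Each copy replaces some words with random synonyms.
        perturbed = random_synonym_substitute(tokens, rng)
        votes[classify(perturbed)] += 1
    # The vote approximates the expectation over the randomized ensemble.
    return votes.most_common(1)[0][0]
```

The intuition: an adversarial substitution that fools the classifier on one exact wording is unlikely to survive when the wording itself is randomized at prediction time.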

Randomization
In recent years, randomization has been used as a defense measure for deep learning in computer vision (Xie et al., 2018). Nevertheless, direct extensions of these measures to defend against textual adversarial samples are not achievable, since text inputs are discrete rather than continuous. Ye et al. (2020) indicate the possibility of extending the randomization approach to NLP by randomly replacing the words in sentences with their synonyms.

Frequency-aware Randomization
Since heuristic attack methods attack a model by substituting words iteratively until they successfully alter the model's output, it is normally difficult for static strategies to defend against such a dynamic process. Rather, dynamic strategies, such as randomization, can better cope with the problem. It is also observed that replacing words with their more frequent alternatives can better mitigate the adversarial effect and preserve the original performance. Therefore, a frequency-aware randomization strategy is designed to perplex the AWS strategy. Figure 1 shows several examples of the frequency-aware randomization.

The proposed approach for frequency-aware randomization is shown in Algorithm 1 and consists of three steps. Firstly, rare words with lower frequencies and a number of random words are selected as substitution candidates. Secondly, we choose synonyms with the closest meanings and the highest frequencies to form a synonym set for each candidate word. Thirdly, each candidate word is replaced with a random synonym from its own synonym set. To quantify the semantic similarity between two words, we represent words with embeddings from Mrkšić et al. (2016), which are specially designed for synonym identification; the semantic similarity of two words is evaluated by the cosine similarity of their embeddings. To determine the frequency of a word, we use a frequency dictionary provided by the FrequencyWords repository *.

Anomaly Detection
Applying the frequency-aware randomization process to every input can still reduce the prediction accuracy for normal samples. In order to overcome this issue, we add an auxiliary anomaly detection head to PrLMs and adopt a multi-task learning procedure, by which PrLMs are able to classify the input text and distinguish adversarial samples at the same time, without introducing any extra model. In inference, the frequency-aware randomization is only applied to the samples that are detected as adversarial. In this way, the reduction of accuracy is largely avoided, since non-adversarial samples are not affected.

Algorithm 1 Frequency-aware Randomization
Input: sentence X = {w_1, w_2, ..., w_n}; word embeddings Emb over the vocabulary Vocab
Output: randomized sentence X_rand
1: Initialization: X_rand ← X
2: Create a set W_rare of all rare words with frequencies less than f_thres; denote n_rare = |W_rare|.
3: Create a set W_rand by randomly selecting n * r − n_rare words w_j ∉ W_rare, where r is the pre-defined ratio of substitution.
4: Create the substitution candidate set W_sub ← W_rare + W_rand, with |W_sub| = n * r.
5: Filter out the stop words in W_sub.
6: for each word w_i in W_sub do
7:   Create a set S by extracting the top n_s synonyms of w_i from Vocab using CosSim(Emb_{w_i}, Emb_w).
8:   Create a set S_freq by selecting the top n_f frequent synonyms from S.
9:   Randomly choose one word w_s from S_freq.
10:  Replace w_i with w_s in X_rand.

DISP also elaborates the idea of perturbation discrimination to block attacks. However, their method detects anomalies at the token level and requires two resource-consuming PrLMs for detection and correction, whereas ours detects anomalies at the sentence level and requires no extra models. Compared with DISP, our method is two times faster in inference speed and achieves better accuracy for sentence-level anomaly detection.
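Algorithm 1 can be made concrete with a short sketch. This is a simplified, runnable illustration under stated assumptions: word frequencies, pre-sorted synonym lists, and stop words come from small toy dictionaries passed in by the caller, standing in for the counter-fitted embeddings and FrequencyWords data used in the paper.

```python
import random

def frequency_aware_randomize(sentence, freq, synonyms, stop_words,
                              f_thres=200, r=0.3, n_s=20, n_f=10, seed=0):
    """Toy version of Algorithm 1. `synonyms[w]` is assumed to be w's
    synonyms sorted by embedding cosine similarity (most similar first)."""
    rng = random.Random(seed)
    budget = max(1, int(len(sentence) * r))
    # Steps 2-5: rare words are always candidates, topped up with random
    # non-rare words until the budget n*r is met; stop words are excluded.
    rare = [w for w in sentence if freq.get(w, 0) < f_thres]
    others = [w for w in sentence if w not in rare]
    rng.shuffle(others)
    candidates = set(rare + others[: max(0, budget - len(rare))]) - stop_words
    out = []
    for w in sentence:
        if w in candidates and synonyms.get(w):
            # Steps 7-8: keep the n_f most frequent of the n_s closest synonyms.
            close = synonyms[w][:n_s]
            frequent = sorted(close, key=lambda s: -freq.get(s, 0))[:n_f]
            # Steps 9-10: replace with a random member of the filtered set.
            out.append(rng.choice(frequent))
        else:
            out.append(w)
    return out
```

Note how a rare word like a low-frequency synonym planted by an attacker is always a candidate, and can only be replaced by a frequent alternative, which is exactly the bias the defense relies on.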

Framework
In this section, we elaborate the framework of ADFAR in both training and inference. Figure 2 shows the framework of ADFAR in training. We extend the baseline PrLMs with three major modifications: 1) the construction of the training data, 2) the auxiliary anomaly detector and 3) the training objective, which will be introduced in this section.

Construction of Training Data As shown in Figure 2, we combine the ideas of both adversarial training and data augmentation (Wei and Zou, 2019) to construct our randomization-augmented adversarial training data. Firstly, we use a heuristic AWS model (e.g. TextFooler) to generate adversarial samples based on the original training set. Following the common practice of adversarial training, we then combine the adversarial samples with the original ones to form an adversarial training set. Secondly, in order to let PrLMs better cope with randomized samples in inference, we apply the frequency-aware randomization to the adversarial training set to generate a randomized adversarial training set. Lastly, the adversarial training set and the randomized adversarial training set are combined to form a randomization-augmented adversarial training set.

Auxiliary Anomaly Detector In addition to the original text classifier, we add an auxiliary anomaly detector to the PrLM to distinguish adversarial samples. For an input sentence, the PrLM captures the contextual information for each token by self-attention and generates a sequence of contextualized embeddings {h_0, ..., h_m}. For the text classification task, h_0 ∈ R^H is used as the aggregate sequence representation. The original text classifier leverages h_0 to predict the probability that X is labeled as class ŷ_c by a logistic regression with softmax: ŷ_c = softmax(W_c h_0 + b_c). For the anomaly detector, the probability that X is labeled as class ŷ_d (if X is attacked, ŷ_d = 1; if X is normal, ŷ_d = 0) is predicted by a logistic regression with softmax: ŷ_d = softmax(W_d h_0 + b_d). As shown in Figure 2, the original text classifier is trained on the randomization-augmented adversarial training set, whereas the anomaly detector is only trained on the adversarial training set.

Training Objective We adopt a multi-task learning framework, by which the PrLM is trained to classify the input text and distinguish the adversarial samples at the same time. We design two parallel training objectives in the form of minimizing cross-entropy loss: loss_c for text classification and loss_d for anomaly detection. The total loss function is defined as their sum: Loss = loss_c + loss_d.

Inference Figure 3 shows the framework of ADFAR in inference. Firstly, the anomaly detector predicts whether an input sample is adversarial. If the input sample is determined as non-adversarial, the output of the text classifier (Label A) is directly used as its final prediction. If the input sample is determined as adversarial, the frequency-aware randomization process is applied to the original input sample. Then, the randomized sample is sent to the PrLM again, and the second output of the text classifier (Label B) is used as its final prediction.
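The two classification heads and the joint loss can be sketched schematically in NumPy. This is an illustration only: `h0` stands for the PrLM's aggregate [CLS] representation, and the weight matrices are hypothetical stand-ins for the learned head parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def adfar_heads(h0, Wc, bc, Wd, bd):
    """Class probabilities (text classifier) and anomaly probabilities (detector),
    both computed from the same sequence representation h0."""
    p_class = softmax(Wc @ h0 + bc)   # y_hat_c over task labels
    p_anom = softmax(Wd @ h0 + bd)    # y_hat_d: index 0 = normal, 1 = attacked
    return p_class, p_anom

def joint_loss(p_class, y_c, p_anom, y_d):
    # Loss = loss_c + loss_d, each a cross-entropy term.
    return -np.log(p_class[y_c]) - np.log(p_anom[y_d])
```

Because both heads read the same h_0, the detector adds only a tiny linear layer on top of the existing PrLM, which is why ADFAR needs no extra model.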

Tasks and Datasets
Experiments are conducted on two major NLP tasks: text classification and natural language inference. The dataset statistics are displayed in Table 1. We evaluate the performance of models on the non-adversarial test samples as the original accuracy. Then we measure the after-attack accuracy of models when facing AWS. By comparing these two accuracy scores, we can evaluate how robust the model is.

Text Classification We use three text classification datasets with average text lengths from 20 to 215 words, ranging from phrase-level to document-level tasks. SST2 (Socher et al., 2013): phrase-level binary sentiment classification using fine-grained sentiment labels on movie reviews. MR (Pang and Lee, 2005): sentence-level binary sentiment classification on movie reviews; we take 90% of the data as the training set and 10% as the test set, following Jin et al. (2019). IMDB (Maas et al., 2011): document-level binary sentiment classification on movie reviews.
Natural Language Inference NLI aims at determining the relationship between a pair of sentences based on semantic meanings. We use Multi-Genre Natural Language Inference (MNLI) (Nangia et al., 2017), a widely adopted NLI benchmark with coverage of transcribed speech, popular fiction, and government reports.

Attack Model and Baselines
We use TextFooler † (Jin et al., 2019) as the attack model, and DISP and SAFER (Ye et al., 2020) as defense baselines. The implementation of DISP is based on the repository offered by its authors. For SAFER, we also leverage the code released by Ye et al. (2020). Necessary modifications are made to evaluate these methods' performance under heuristic attack models.

Experimental Setup
The implementation of PrLMs is based on PyTorch ‡. We leverage BERT-BASE (Devlin et al., 2019), RoBERTa-BASE (Liu et al., 2019) and ELECTRA-BASE (Clark et al., 2020) as baseline PrLMs. We use AdamW (Loshchilov and Hutter, 2018) as our optimizer with a learning rate of 3e-5 and a batch size of 16. The number of epochs is set to 5. For the frequency-aware randomization process, we set f_thres = 200, n_s = 20 and n_f = 10. In the adopted frequency dictionary, 5.5k out of 50k words have a frequency lower than f_thres = 200 and are therefore regarded as rare words. r is set to different values for training (25%) and inference (30%) due to their different aims. In training, to avoid introducing excessive noise and reducing the prediction accuracy for non-adversarial samples, r is set relatively low. On the contrary, in inference, our aim is to perplex the heuristic attack mechanism: the more randomization we add, the more perplexity the attack mechanism receives, so we set a relatively higher value for r. More details on the choice of these hyperparameters are discussed in the analysis section.

Main results
Following Jin et al. (2019), we leverage BERT-BASE (Devlin et al., 2019) as the baseline PrLM and TextFooler as the attack model. Table 2 shows the performance of ADFAR and other defense frameworks. Since randomization may lead to variance in the results, we report results based on the average of five runs. Experimental results indicate that ADFAR can effectively help the PrLM against AWS. Compared with DISP and SAFER (Ye et al., 2020), ADFAR achieves the best performance on adversarial samples. Meanwhile, ADFAR does not hurt the performance on non-adversarial samples in general; on tasks such as MR and IMDB, ADFAR can even enhance the baseline PrLM.
DISP leverages two extra PrLMs to discriminate and recover the perturbed tokens, which introduces extra complexity. SAFER makes the prediction for an input sentence by averaging the prediction results of its perturbed alternatives, which multiplies the inference time. As shown in Table 3, compared with previous methods, ADFAR achieves a significantly higher inference speed.

¶ https://github.com/huggingface

Results with Different Attack Strategy
Since ADFAR leverages the adversarial samples generated by TextFooler (Jin et al., 2019) in training, it is important to see whether ADFAR also performs well when facing adversarial samples generated by other AWS models. We leverage PWWS (Ren et al., 2019) and GENETIC (Alzantot et al., 2018) to further study the performance of ADFAR. As shown in Table 4, the performance of ADFAR is not affected by different AWS models, which further proves the efficacy of our method.

Table 5 shows the performance of ADFAR leveraging RoBERTa-BASE (Liu et al., 2019) and ELECTRA-BASE (Clark et al., 2020) as PrLMs. In order to enhance the robustness and performance of the PrLM, RoBERTa extends BERT with a larger corpus and more efficient parameters, while ELECTRA applies a GAN-style architecture for pre-training. Empirical results indicate that ADFAR can further improve the robustness of RoBERTa and ELECTRA while preserving their original performance.

As shown in Table 6, the frequency-aware randomization is the key factor that helps the PrLM defend against adversarial samples, while anomaly detection plays an important role in preserving the PrLM's prediction accuracy for non-adversarial samples.

Table 6: Ablation study on MR and SST2 using BERT-BASE as PrLM, and TextFooler as attack model. Adv represents adversarial training, FR indicates frequency-aware randomization and AD means anomaly detection. The results are based on the average of five runs.

Anomaly Detection
In this section, we compare the anomaly detection capability of ADFAR and DISP. ADFAR leverages an auxiliary anomaly detector, which shares the same PrLM with the original text classifier, to discriminate adversarial samples. DISP uses a discriminator based on an extra PrLM to identify perturbed adversarial inputs, but at the token level. For DISP, in order to detect anomalies at the sentence level, input sentences with at least one adversarial token identified by DISP are regarded as adversarial samples. We respectively sample 500 normal and 500 adversarial samples from the test sets of MR and SST2 to evaluate the performance of ADFAR and DISP for anomaly detection. Table 7 shows the performance of ADFAR and DISP for anomaly detection. Empirical results show that ADFAR predicts more precisely, as it achieves a significantly higher F1 score than DISP. Moreover, ADFAR has a simpler framework, since its anomaly detector shares the same PrLM with the classifier, while DISP requires an extra PrLM. The results also indicate that the current heuristic AWS strategy is vulnerable to our anomaly detector, which disproves the claimed undetectability of this adversarial strategy.

Table 7: Performance for anomaly detection.
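The sentence-level comparison above can be sketched as follows. This is a toy illustration of the evaluation protocol, not the paper's code: token-level flags (as DISP produces) are aggregated into a sentence-level decision by flagging a sentence if any of its tokens is flagged, and predictions are scored with F1.

```python
def sentence_level_labels(token_flags_per_sentence):
    """A sentence is adversarial (1) if at least one token is flagged."""
    return [int(any(flags)) for flags in token_flags_per_sentence]

def f1_score(pred, gold):
    """Binary F1 over sentence-level anomaly predictions."""
    tp = sum(p == g == 1 for p, g in zip(pred, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gold))
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

The any-token aggregation makes DISP's sentence-level decision sensitive to token-level false positives, which is one plausible reason a direct sentence-level detector can score higher.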

Effect of Randomization Strategy
As the ablation study reveals, the frequency-aware randomization contributes the most to the defense. In this section, we analyze the impact of different hyperparameters and strategies adopted by the frequency-aware randomization approach, in inference and training respectively.

Inference
The frequency-aware randomization process is applied in inference to mitigate the adversarial effects. Substitution candidate selection and synonym set construction are two critical steps in this process, in which two hyperparameters (r and n_s) and the frequency-aware strategy are examined.

Selection of Substitution Candidates
The influence of different strategies for substitution candidate selection in inference is studied in this section. The impact of two major factors is measured: 1) the substitution ratio r and 2) whether to apply the frequency-aware strategy. In order to exclude disturbance from other factors, we train BERT on the original training set and fix n_s to 20. Firstly, we alter the value of r from 5% to 50%, without applying the frequency-aware strategy. As illustrated by the blue lines in Figure 4, as r increases, the original accuracy decreases, while the adversarial accuracy increases and peaks when r reaches 30%. Secondly, the frequency-aware strategy is added, with f_thres = 200. As depicted by the yellow lines in Figure 4, for both original and adversarial accuracy the general trends coincide with the non-frequency-aware scenario, but the overall accuracy is improved to a higher level. The highest adversarial accuracy is obtained when r is set to 30% with the frequency-aware strategy.

Figure 4: Effect of the substitution ratio r and the frequency-aware strategy in substitution candidate selection during inference.

Construction of Synonym Set
The influence of different strategies for synonym set construction in inference is evaluated in this section. The impact of two major factors is measured: 1) the size of a single synonym set n_s and 2) whether to apply the frequency-aware strategy. In order to exclude disturbance from other factors, we train BERT on the original training set and fix r to 30%. Firstly, we alter the value of n_s from 5 to 50, without applying the frequency-aware strategy. The resulting original and adversarial accuracy are illustrated by the blue lines in Figure 5. Secondly, the frequency-aware strategy is added, with n_f = 50% * n_s. As depicted by the yellow lines in Figure 5, the original accuracy and the adversarial accuracy both peak when n_s = 20, and the overall accuracy is improved to a higher level compared to the non-frequency-aware scenario.

Training
The frequency-aware randomization process is applied in training to augment the training data, and thereby enables the PrLM to better cope with randomized samples in inference. For this purpose, the frequency-aware randomization process in training should resemble the one in inference as much as possible. Therefore, we use an identical process for synonym set construction, i.e. n_s = 20 and n_f = 50% * n_s. However, for the substitution candidate selection process, to avoid introducing excessive noise and to maintain the accuracy of the PrLM, the most suitable substitution ratio r might differ from the one used in inference. Experiments are conducted to evaluate the influence of r in training by altering its value from 5% to 50%. In Figure 6, we observe that r = 25% results in the highest original and adversarial accuracy.

Conclusion
This paper proposes ADFAR, a novel framework which leverages frequency-aware randomization and anomaly detection to help PrLMs defend against adversarial word substitution. Empirical results show that ADFAR significantly outperforms recently proposed defense methods over various tasks. Meanwhile, ADFAR achieves a remarkably higher inference speed and does not reduce the prediction accuracy for non-adversarial sentences, thereby fulfilling the goal of defense without performance sacrifice.
Comprehensive ablation studies and analysis indicate that: 1) randomization is an effective method to defend against heuristic attack strategies; 2) replacing rare words with their more common alternatives can help enhance the robustness of PrLMs; and 3) adversarial samples generated by current heuristic adversarial word substitution models can be easily distinguished by the proposed auxiliary anomaly detector. We hope this work can shed light on future studies on the robustness of PrLMs.