Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples

Recent efforts have shown that neural text processing models are vulnerable to adversarial examples, but the nature of these examples is poorly understood. In this work, we show that adversarial attacks against CNN, LSTM and Transformer-based classification models perform word substitutions that are identifiable through frequency differences between replaced words and their corresponding substitutions. Based on these findings, we propose frequency-guided word substitutions (FGWS), a simple algorithm exploiting the frequency properties of adversarial word substitutions for the detection of adversarial examples. FGWS achieves strong performance by accurately detecting adversarial examples on the SST-2 and IMDb sentiment datasets, with F1 detection scores of up to 91.4% against RoBERTa-based classification models. We compare our approach against a recently proposed perturbation discrimination framework and show that we outperform it by up to 13.0% F1.


Introduction
Artificial neural networks are vulnerable to adversarial examples: carefully crafted perturbations of input data that lead a learning model into making false predictions (Szegedy et al., 2014).
While initially discovered for computer vision tasks, natural language processing (NLP) models have also been shown to be oversensitive to adversarial input perturbations for a variety of tasks (Papernot et al., 2016; Jia and Liang, 2017; Belinkov and Bisk, 2018; Glockner et al., 2018; Iyyer et al., 2018). Here we focus on highly successful synonym substitution attacks (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020), in which individual words are replaced with semantically similar ones. Existing defense methods against these attacks mainly focus on adversarial training (Jia and Liang, 2017; Ebrahimi et al., 2018; Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2019) and hence typically require a priori attack knowledge and models to be retrained from scratch to increase their robustness. Recent work by Zhou et al. (2019) instead proposes DISP (learning to discriminate perturbations), a perturbation discrimination framework that exploits pre-trained contextualized word representations to detect and correct word-level adversarial substitutions without having to retrain the attacked model. In this paper, we show that we can achieve an improved performance for the detection and correction of adversarial examples based on the finding that various word-level adversarial attacks have a tendency to replace input words with less frequent ones. Figure 1 illustrates this tendency for two state-of-the-art attacks. We provide statistical evidence to support this observation and propose a rule-based and model-agnostic algorithm, frequency-guided word substitutions (FGWS), to detect adversarial sequences and recover model performances for perturbed test set sequences.

Figure 1: Corpus log_e frequencies of the replaced words (bold, italic, red) and their corresponding adversarial substitutions (bold, black) using the GENETIC (Alzantot et al., 2018) and PWWS (Ren et al., 2019) attacks on SST-2 (Socher et al., 2013).
FGWS effectively detects adversarial perturbations, achieving F1 scores of up to 91.4% against RoBERTa-based models (Liu et al., 2019) on the IMDb sentiment dataset (Maas et al., 2011). Furthermore, our results show that FGWS outperforms DISP by up to 13.0% F1 when differentiating between unperturbed and perturbed sequences, despite representing a conceptually simpler approach to this task.

Generating adversarial examples
In our experiments, we investigate two baseline attacks introduced by Ren et al. (2019) as well as two state-of-the-art attacks.
RANDOM. Our first baseline attack is a simple word substitution model that randomly selects words in an input sequence and replaces them with synonyms randomly sampled from a set of synonyms related to the specific word. We follow Ren et al. (2019) by using WORDNET (Fellbaum, 1998) to identify synonym substitutions for each selected word.
PRIORITIZED. Our second baseline builds upon RANDOM by selecting the replacement word from the synonym set that maximizes the change in prediction confidence for the true label of an input.
GENETIC. We additionally analyze an attack suggested by Alzantot et al. (2018), consisting of a population-based black-box mechanism based on genetic search that iteratively performs individual word-level perturbations to an input sequence to cause a misclassification.
PWWS. Lastly, we analyze the probability weighted word saliency (PWWS) algorithm (Ren et al., 2019). For each word in an input sequence, PWWS selects a set of synonym replacements from WORDNET and chooses the synonym yielding the highest difference in prediction confidence for the true class label after replacement. The algorithm furthermore computes the word saliency (Li et al., 2016a,b) for each input word and ranks word replacements based on these two indicators.
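The scoring step shared by PRIORITIZED and PWWS (picking the synonym that most reduces the model's confidence in the true class) can be sketched as follows. The function name and the `prob_true` callable are hypothetical stand-ins, and the sketch deliberately omits PWWS's word-saliency ranking:

```python
def best_synonym(seq, i, synonyms, prob_true):
    """For the word at position i, pick the synonym that most reduces
    the model's confidence in the true class (simplified scoring sketch).

    seq       : list of tokens
    synonyms  : iterable of candidate replacements for seq[i]
    prob_true : callable mapping a token list to P(true class)
    """
    base = prob_true(seq)                      # confidence before replacement
    best, best_drop = None, 0.0
    for syn in synonyms:
        perturbed = seq[:i] + [syn] + seq[i + 1:]
        drop = base - prob_true(perturbed)     # confidence change for this synonym
        if drop > best_drop:
            best, best_drop = syn, drop
    return best, best_drop
```

In the full PWWS algorithm this confidence drop is combined with a word-saliency score to decide the order in which positions are attacked.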
Datasets and models. We perform experiments on two binary sentiment classification datasets, the Stanford Sentiment Treebank (SST-2, Socher et al., 2013) and the IMDb reviews dataset (Maas et al., 2011), both of which are widely used in related work focusing on adversarial examples in NLP (Jia et al., 2019;Ren et al., 2019;Zhou et al., 2019). Dataset details can be found in Appendix A.
Adhering to Zhou et al. (2019), we attack a pretrained model based on the Transformer architecture (Vaswani et al., 2017). Zhou et al. (2019) use BERT (Devlin et al., 2019) in their experiments, but we found that RoBERTa (Liu et al., 2019) represents a stronger model for the specified tasks.
The fine-tuned RoBERTa model achieves 93.4% and 94.9% accuracy on the IMDb and SST-2 test sets, which is comparable to existing work (Beltagy et al., 2020;Liu et al., 2019). On the IMDb test set, the CNN achieves an accuracy of 86.0% and the LSTM achieves 83.1%. These performances are close to existing work using comparable settings (Zhang et al., 2019;Ren et al., 2019). On the SST-2 test set, the CNN achieves 84.0% and the LSTM 85.2% accuracy, which are also close to comparable experiments (Huang et al., 2019).
Following Ren et al. (2019), we apply all four attacks to a random subset of 2,000 sequences from the IMDb test set as well as the entire test set of SST-2 (1,821 samples). Implementation details for the models and attacks can be found in Appendix B. We report the after-attack accuracies for the RoBERTa model in Table 2 and for the CNN/LSTM models in Table 3 (column Adv.). We observe that all four attacks cause notable decreases in model accuracy on the test sets, and that GENETIC and PWWS are more successful than the baseline attacks in most comparisons.

Analyzing frequencies of adversarial word substitutions
Next, we conduct an analysis of the word frequencies of the individual words replaced by the attacks and their substitutions. We compute the log_e training set frequencies φ(x) of all words x that have been replaced by the respective attacks and of all their corresponding substitutions. Then, we conduct Bayesian hypothesis testing (Rouder et al., 2009) to statistically compare the two samples. This is achieved by computing the Bayes factor BF10, which quantifies the evidence for the alternative hypothesis H1 (that the mean frequencies of the two samples differ) relative to the null hypothesis H0.

Table 1 shows the log_e frequencies (mean µ_φ and standard deviation σ_φ) and Cohen's d for the specified samples generated by the attacks against the RoBERTa model (the results for the CNN and LSTM models can be found in Appendix C). We report the mean frequencies of all adversarial substitutions (Subst.) and only of those that occur in the training set (non-OOV), to demonstrate that the frequency differences are not solely caused by OOV substitutions. Across datasets and attacks, the substitutions are consistently less frequent than the words selected for replacement. We observe large Cohen's d effect sizes for the majority of comparisons, statistically supporting the observation of mean frequency differences between replaced words and their corresponding substitutions. We furthermore observe that BF10 > 10^55 holds for all comparisons, both when considering all substitutions and only non-OOV ones (the BF10 scores can be found in Appendix D). This provides strong empirical evidence that H1 is more likely to be supported by the measured word frequencies (see Appendix E for additional illustrations).
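As a rough illustration of the statistics used above, a minimal sketch of the log_e frequency function φ and of Cohen's d with a pooled standard deviation might look as follows. The function names and the plain-dict corpus counts are assumptions for illustration, not the authors' implementation:

```python
import math

def cohens_d(a, b):
    """Cohen's d effect size between two samples, using the pooled
    standard deviation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb)
                       / (len(a) + len(b) - 2))
    return (ma - mb) / pooled

def log_freq(word, counts):
    """phi(x): log_e training-set frequency; OOV words map to 0
    (assumed convention, matching the OOV bars in Figure 1)."""
    c = counts.get(word, 0)
    return math.log(c) if c > 0 else 0.0
```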

Frequency-guided word substitutions
Based on the observation of consistent frequency differences between replaced words and adversarial substitutions, we argue that the effects of such substitutions can be mitigated through simple frequency-based transformations. To this end, we propose frequency-guided word substitutions (FGWS), a detection method that estimates whether a given input sequence is an adversarial example. We denote a classification model by a function f(X) that maps a sequence X to a C-dimensional vector representing the probabilities for predicting each of the C possible classes. We represent a sequence as X = {x_1, ..., x_n}, where x_i denotes the i-th word in the sequence. We furthermore introduce the notation f*(X) ∈ {1, ..., C} for the class label predicted by f given input X. FGWS transforms a given sequence X into a sequence X′ by replacing infrequent words with more frequent, semantically similar substitutions. We initially define the subset X_E := {x ∈ X | φ(x) < δ} of words that are eligible for substitution, where δ ∈ ℝ_{>0} is a frequency threshold. FGWS then generates a sequence X′ from X by replacing all eligible words with words that are semantically similar but have higher occurrence frequencies in the model's training corpus. For each eligible word x ∈ X_E we consider the set of replacement candidates S(x) and find a replacement x′ by selecting x′ = argmax_{w ∈ S(x)} φ(w). We then generate X′ by replacing each eligible word x with x′ if φ(x′) > φ(x). Given the prediction label y = f*(X) for X and a threshold γ ∈ [0, 1], the sequence X is considered adversarial if f(X)_y − f(X′)_y > γ, i.e., if the difference in prediction confidence on class y before and after transformation exceeds the threshold γ. The threshold allows control of the rate of false positives (i.e., unperturbed sequences that are erroneously identified as adversarial) flagged by our method.
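The detection procedure just described can be sketched in a few lines. All helper names (`model_probs`, `candidates`, `freq`) are hypothetical stand-ins for the model f, the candidate set S(x) and the frequency function φ:

```python
def fgws_is_adversarial(seq, model_probs, freq, candidates, delta, gamma):
    """Minimal FGWS sketch.

    seq         : list of tokens X = [x_1, ..., x_n]
    model_probs : callable, token list -> list of class probabilities f(X)
    freq        : dict, word -> log_e training-set frequency phi(x)
    candidates  : callable, word -> iterable of replacement candidates S(x)
    delta       : frequency threshold (words with phi(x) < delta are eligible)
    gamma       : confidence-difference threshold
    """
    phi = lambda w: freq.get(w, 0.0)            # OOV words get frequency 0
    transformed = list(seq)
    for i, x in enumerate(seq):
        if phi(x) < delta:                      # x is in X_E, eligible
            cands = list(candidates(x))
            if cands:
                x_prime = max(cands, key=phi)   # x' = argmax_{w in S(x)} phi(w)
                if phi(x_prime) > phi(x):
                    transformed[i] = x_prime
    probs = model_probs(seq)
    y = max(range(len(probs)), key=probs.__getitem__)   # y = f*(X)
    drop = probs[y] - model_probs(transformed)[y]
    return drop > gamma, transformed            # adversarial if drop exceeds gamma
```

Unperturbed inputs mostly contain frequent words, so the transformation changes little and the confidence drop stays below γ; adversarial inputs lose their rare substitutions and the confidence in the (wrong) predicted class collapses.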

Comparisons
DISP. We compare FGWS to the DISP framework (Zhou et al., 2019), which is, to the best of our knowledge, the best existing approach for the detection of word-level adversarial examples. DISP uses two independent BERT-based components, a perturbation discriminator and an embedding estimator for token recovery, to identify perturbed tokens and to reconstruct the replaced ones.
NWS. For the CNN and LSTM models, we compare FGWS with the naive word substitutions (NWS) baseline. For a given input sequence, NWS selects all OOV words in that sequence and replaces each with a random choice from a set of semantically related words. We restrict NWS to allow only substitutions for which the replacement word occurs in the model's training vocabulary. NWS can be interpreted as a variant of FGWS that is not explicitly guided by word frequencies.

Experiments
We apply both methods to the adversarial examples crafted by the four attacks on the subsets of both the IMDb and SST-2 datasets as described in Section 2.
To account for an imbalance between unperturbed and perturbed sequences, we repeatedly bootstrap a balanced set of unperturbed sequences for each set of perturbed sequences 10,000 times and compute the average detection scores. For FGWS, we tune the frequency threshold δ for each model-dataset combination on the validation set. To do this, we utilize the PRIORITIZED attack to craft adversarial examples from all sequences of the validation set and compare FGWS detection performances with different values for δ. Specifically, we set δ equal to the log_e frequency representing the q-th percentile of all log_e frequencies observed for the words eligible for replacement in the training set, and experiment with q ∈ {0, 10, ..., 100}. We select γ so that no more than 10% of the unperturbed sequences in the validation set are labeled as adversarial. For FGWS, we define the set of replacement candidates for each word x ∈ X_E as the union of the word's K nearest neighbors in a pre-trained GLOVE (Pennington et al., 2014) word embedding space and its synonyms in WORDNET. We set K equal to the average number of WORDNET synonyms for each word in the validation set (yielding K = 6 for IMDb and K = 8 for SST-2).
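The percentile-based choice of δ might be sketched as follows, assuming a nearest-rank percentile definition (the text does not specify which percentile convention is used):

```python
def percentile_threshold(log_freqs, q):
    """Set delta to the q-th percentile of the observed log_e frequencies,
    using the nearest-rank definition; q is swept over {0, 10, ..., 100}."""
    s = sorted(log_freqs)
    if q <= 0:
        return s[0]
    k = -(-q * len(s) // 100)        # ceil(q * n / 100), the nearest rank
    return s[min(k, len(s)) - 1]
```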

Results
We report the results comparing FGWS to DISP on attacks against RoBERTa in Table 2.

Figure 2: The detection methods applied to an adversarial example from the PWWS attack against RoBERTa on SST-2. The words highlighted in bold, italic and red were selected for replacement by the attack (A) and the detection methods (D); the ones in bold and black denote the substitutions. The values above the words denote their log_e frequencies.

The true positive rate (TPR) denotes the percentage of perturbed sequences that were correctly identified as such, and the false positive rate (FPR) denotes the percentage of unperturbed sequences that were erroneously identified as adversarial. The column Adv. gives the classification accuracy on the perturbed sequences, and Restored acc. the model's accuracy on the adversarial sequences after transformation. We observe that FGWS best restores the model's classification accuracy across all comparisons, showing it to be effective in mitigating the effects of the individual attacks. Furthermore, FGWS outperforms DISP in terms of true positive rates and F1 across the majority of experiments. These results show that, although contextualized word representations (DISP) serve as a competitive method to detect adversarial examples, relying solely on frequency-guided substitutions (FGWS) proves more effective. Figure 2 provides an example adversarial sequence generated with the PWWS attack and the two corresponding transformed sequences using DISP and FGWS (see Appendix G for additional examples).

The results of NWS and FGWS against the CNN and LSTM models are shown in Table 3. We observe that FGWS outperforms NWS across all comparisons in terms of restored model accuracy and in the majority of comparisons in terms of F1. This direct comparison again underlines the importance of utilizing word frequencies as guidance for the substitutions: since NWS is not guided by word frequency characteristics, FGWS outperforms it by a large margin in most comparisons, demonstrating the effectiveness of mapping infrequent words to their most frequent semantically similar counterparts to detect adversarial examples.

FGWS on unperturbed data
We furthermore investigate the effect of FGWS on model performance on unperturbed sequences after transformation. To do this, we transform the sampled test sets using FGWS and evaluate classification accuracies after sequence transformation. The differences in accuracy for the CNN, LSTM and RoBERTa models before and after transformation are 0.0%, +1.0% and −0.2% for IMDb and −1.8%, −2.9% and −1.8% for SST-2. This indicates that FGWS applied to unperturbed data has only small effects on classification accuracy, and in some cases even slightly increases prediction accuracy.

Limitations
It is worth mentioning that, compared to FGWS, DISP represents a more general perturbation discrimination approach, since it is trained to detect both character- and word-level adversarial perturbations, whereas FGWS solely focuses on word-level attacks.
Furthermore, it remains open whether FGWS would be effective against attacks for which the frequency difference is less evident. To investigate this, we conducted preliminary experiments by restricting the investigated attacks to only allow equifrequent substitutions. However, we observed that introducing this constraint has a substantial effect on attack performance, since the attacks are supplied with fewer candidate replacements. We will further investigate this in future work.

Conclusion
We have shown that the word frequency characteristics of adversarial word substitutions can be leveraged effectively to detect adversarial sequences for neural text classification. Our proposed approach outperforms existing detection methods despite representing a conceptually simpler approach to this task.

We applied dropout (Srivastava et al., 2014) during training with a rate of 0.1 before applying the output layer. We trained both models for 20 epochs using the Adam optimizer (Kingma and Ba, 2014). We evaluated model performance after each epoch on the validation set and selected the best-performing checkpoints for testing. The CNN and LSTM models were trained with batch size 100 and a learning rate of 1 · 10^-3.

We use the GPT-2 language model (Radford et al., 2019) and compute the perplexity scores for each perturbed sequence only around the respective replacement words, by only considering a subsequence ranging from five words before to five words after an inserted replacement. The motivation for using a different language model compared to the original implementation is computational efficiency, since we observed a notable decrease in attack runtime with our modification. This does not have an impact on attack performance, since our implementation of the GENETIC attack has an attack success rate of 98.6% against the LSTM on IMDb, whereas Alzantot et al. (2018) report an attack success rate of 97%.

B.3 GENETIC
For attacks against SST-2, we furthermore increase the δ threshold for the maximum distance between replaced words and substitutions to δ = 1.0, since we observed poor attack performance with δ = 0.5 (which was used by Alzantot et al. (2018) and in our experiments on IMDb). All other parameters of the attack (e.g., the number of generations and the population size) are directly adapted from Alzantot et al. (2018). We restrict the words eligible for replacement by the GENETIC attack to non-stopwords, in accordance with Alzantot et al. (2018). Since the attack computes nearest neighbors for a selected word from a pre-trained embedding space, we can furthermore only select words for which an embedding representation exists in this pre-trained space. On the SST-2 test set, we found three input sequences consisting of only one word, which we excluded from our evaluation, since the GPT-2 language model implementation we use requires an input sequence consisting of more than one word.

B.5 RANDOM, PRIORITIZED, PWWS, GENETIC
For the GENETIC attack, we follow Alzantot et al. (2018) by limiting the maximum number of word replacements to 20% of the input sequence length. We apply the same threshold to the RANDOM and PRIORITIZED attacks, but not to PWWS, since we observed low replacement rates despite the attack's effectiveness. This is in agreement with the results reported by Ren et al. (2019).

C Frequency differences for CNN and LSTM models
The log_e frequencies for the four attacks against the CNN and LSTM models can be found in Table 4. In accordance with the experiments with RoBERTa (see Section 3), we observe large Cohen's d effect sizes for the majority of the comparisons, which shows that the statistical frequency differences between replaced words and their substitutions are present for adversarial attacks against these two models as well.

D Bayes factors
The Bayes factors for the mean frequency comparisons between replaced words and their adversarial substitutions can be found in Table 5. We observe high values for BF10 across all comparisons, providing strong evidence for the hypothesis that the log_e frequency means of replaced words and their substitutions differ. Figure 3 illustrates the frequency differences for attacks against the RoBERTa model using histograms. We observe that for the majority of the attacks, OOV substitutions occur most often among the perturbed sequences.

F Varying false positive thresholds
The rate of false positives predicted by a detection system is crucial for its practicability, and a low false positive rate is hence highly desirable. Figure 4 illustrates the true positive rates achieved by FGWS for all attacks against RoBERTa under different quasi-fixed false positive thresholds (as in Section 4.2, δ was tuned on the validation set for each value of γ corresponding to the specific false positive threshold). As expected, we observe a trade-off between true and false positive rates for varying values of γ, such that lower false positive rates imply lower true positive rates. However, even for false positive rates of 1% and 5%, FGWS detects between 33.6% and 90.0% of adversarial examples on IMDb and between 31.7% and 67.2% on SST-2. This indicates that FGWS can detect a useful fraction of adversarial examples without creating an excessive burden of false positives.

G Additional FGWS examples
Additional examples of FGWS can be found in Table 6 (true positives), Table 7 (false positives), Table 8 (true negatives) and Table 9 (false negatives).

Table 8: Illustration of true negatives generated with FGWS against RoBERTa on SST-2 (top) and IMDb (bottom). The substitutions did not cause the model to change the predicted label for the given unperturbed sequences.

Table 9: Illustration of false negatives generated with FGWS against RoBERTa on SST-2 (top) and IMDb (bottom). The substitutions did not cause the model to change the predicted label back to its ground-truth label for the given adversarial examples.