CHECKHARD: Checking Hard Labels for Adversarial Text Detection, Prediction Correction, and Perturbed Word Suggestion

An adversarial attack generates harmful text that fools a target model. More dangerously, this text is unrecognizable as adversarial by humans. Existing work detects adversarial text and corrects a target's prediction by identifying perturbed words and changing them into their synonyms, but many benign words are also changed. In this paper, we directly detect adversarial text, correct the prediction, and suggest perturbed words by checking the change in the hard labels from the target's predictions after replacing a word with its transformations, using a model that we call CHECKHARD. The experiments demonstrate that CHECKHARD outperforms existing work on various attacks, models, and datasets.


Introduction
Currently, deep-learning-based models achieve high performance on many NLP tasks. However, these models are still sensitive to adversarial attacks. Such attacks need only perturb a small portion of an input text to fool the models. More dangerously, the modified text still preserves its original meaning, and humans cannot recognize the modification. We set three objectives for this paper. First, we detect adversarial text to recognize an adversarial attack. Second, we correct the prediction to protect models against adversarial attacks. Last, we suggest perturbed words in the adversarial text. These suggestions can be used to reduce the effect of perturbed words in other tasks (e.g., text summarization or opinion mining).
Previous works suggested perturbed words for downstream tasks including adversarial text detection and prediction correction. These perturbed words can be identified by using the BERT model (Zhou et al., 2019) or word frequency (Mozes et al., 2021). However, many benign words are identified instead, which remarkably affects the downstream tasks. Motivation: Adversarial text must satisfy two criteria: (1) it fools a target model and (2) it preserves the original meaning. Little text satisfies both criteria. Figure 1 presents an example from SST-2 in which a CNN model is attacked by probability weighted word saliency (PWWS). The CNN model classifies the input text into two classes, i.e., positive and negative. PWWS perturbs words in the text until the CNN model is fooled. During the attack, only the last text becomes adversarial, as it renders the CNN prediction incorrect, while the other perturbed texts are still correctly predicted by the CNN model. Here, we observe another phenomenon: if we continue transforming the adversarial text by replacing the perturbed word (e.g., "charter") with similar words, many conflicts appear between the predictions made for the transformed texts and the prediction for the adversarial text. In contrast, the transformed text obtained by replacing a benign word (such as "film") presents no conflict.
Contribution: We propose a simple method, namely, CHECKHARD, which detects adversarial text, corrects predictions, and suggests perturbed words. (1) For adversarial text detection, we first transform the input text by replacing each individual word with similar words. We then check the conflicts between the prediction for the input text and those for its transformations. (2) For prediction correction, we identify misclassified text and correct its prediction. Misclassified text belongs to two cases: adversarial text (which fools a target model) and original text (which the target model predicts incorrectly). We observe that both kinds of misclassified text share the same characteristics. We thus identify the two kinds of misclassified text in a similar way as in (1). Next, we correct the prediction for both kinds of misclassified text using the hard labels of the input transformations. (3) For perturbed word suggestion, we suggest the top words that produce the largest conflicts during the checking performed in (1). Our main contributions are summarized as follows: • We propose CHECKHARD for detecting adversarial text, correcting predictions, and suggesting perturbed words. To the best of our knowledge, CHECKHARD is the first method that addresses all three objectives when defending against ten state-of-the-art attacks.
Other existing methods report these tasks on fewer than five attacks, counting both baselines and earlier attacks.
• Since CHECKHARD only uses hard labels from a target model via a black-box setting, it is compatible with common pre-trained target models.
• The evaluation shows that CHECKHARD outperforms existing work across various attacks, datasets, and models.
• CHECKHARD is directly compatible with all current and future attacks from the TextAttack framework (Morris et al., 2020) without changing its source code, and with other attacks up to the word level without changing its architecture.

Related Work
Adversarial attack: Most of the major attacks are implemented in the TextAttack framework. This framework also provides a general architecture to which many strong attacks are added (e.g., BAE (Garg and Ramakrishnan, 2020) and IGA (Wang et al., 2021)). Table 1 summarizes representatives of the current sixteen attacks related to text classification from TextAttack in terms of three major aspects: level, transformations, and constraints. Other similar attacks reach performance equivalent to the corresponding representatives: Alzantot (Alzantot et al., 2018) and Fast-Alzantot (Jia et al., 2019) share the same core, A2T (Yoo and Qi, 2021) and TextFooler (Jin et al., 2020) transform a word using a word embedding, and both BERT-Attack (Li et al., 2020) and CLARE (Li et al., 2021) extract synonyms from the same masked language model as BAE (Garg and Ramakrishnan, 2020). The two remaining attacks are restricted to certain models or datasets: HotFlip (Ebrahimi et al., 2018) only supports LSTM, and CheckList (Ribeiro et al., 2020) attacks short text, as in SST-2, with a mere 2.3% success rate.
Adversarial text detection: Although adversarial text resembles original text, some abstract features from transformer-based models can distinguish them, such as attention input (Biju et al., 2022), PCA eigenvectors (Raina and Gales, 2022), and density (Yoo et al., 2022). Mosca et al. (2022) estimated the change in prediction before and after deleting important words. Wang et al. (2022) voted over a fixed number k of texts obtained by replacing some words with their synonyms. Zhou et al. (2019) used BERT to detect adversarial text by identifying perturbed words. Mozes et al. (2021) claimed that adversarial text contains many low-frequency words.
Most recent works (Raina and Gales, 2022; Biju et al., 2022; Yoo et al., 2022) are limited to target models derived from transformers. Mosca et al. (2022) restrictively detect parallel pairs of adversarial and original text and ignore original text that a target model incorrectly classifies. Wang et al. (2022) …

Prediction correction: Several previous works correct predictions after a text is modified by perturbed words. In one approach, the perturbed words are disabled by replacing them with similar words in various ways. Zhou et al. (2019) chose the nearest synonyms as replacement words using a kNN search. Mozes et al. (2021) selected high-frequency synonyms for such words. In another approach, Rusert and Srinivasan (2022) randomly replaced some words, both perturbed and non-perturbed, and aggregated the predictions from several instances of the replaced text.
Although they correct the prediction on adversarial text, most previous works degrade the prediction on clean text. Other works keep the clean predictions or improve only a few of them. In contrast, CHECKHARD efficiently improves the prediction on all adversarial text and most of the clean text.
Perturbed word suggestion: Zhou et al. (2019) fine-tuned a BERT model to suggest perturbed words. Mozes et al. (2021) suggested low-frequency words as perturbed words. However, many benign words are also suggested in addition to the perturbed words. This redundant suggestion affects adversarial text detection and prediction correction in downstream tasks.

CHECKHARD
A target model F : X → Y maps the input space X to the label space Y. Following TextFooler (Jin et al., 2020), a valid adversarial text X_adv generated from original text X_org must satisfy two criteria:

F(X_adv) ≠ F(X_org) and Sim(X_adv, X_org) ≥ ε,   (1)

where Sim(X_adv, X_org) is the similarity between X_adv and X_org, and ε is the minimum similarity between them. ε is a threshold that forces X_adv and X_org to be close in meaning (e.g., via semantic and syntactic criteria).
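As a concrete illustration, Equation 1 can be expressed as a small check in Python. This is a minimal sketch rather than part of our method: predict and sim are assumed placeholders for the target model's hard-label prediction and for any text similarity measure.

def is_valid_adversarial(x_adv, x_org, predict, sim, eps):
    """Equation (1): the text must fool the target model (different hard
    label) while staying at least eps-similar to the original text."""
    return predict(x_adv) != predict(x_org) and sim(x_adv, x_org) >= eps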
We set three objectives for processing input text X_input: (1) we determine whether X_input is adversarial or original text, (2) we correct the labels distorted by adversarial attacks while maintaining the accuracy on benign text, and (3) we suggest the top k perturbed words in X_input that are likely modified by the attack. Figure 2 and Algorithm 1 summarize our proposed method.
Model details: To process input text X_input for one of the three objectives, "adversarial detection," "prediction correction," or "perturbation suggestion," CHECKHARD first predicts a hard label Y_input for X_input using a target model F (e.g., a CNN). Then, it transforms the text using an auxiliary attack A (e.g., PWWS). An adversarial threshold λ_adv, a misclassification threshold λ_mis, and a suggestion number k are used for "adversarial detection," "prediction correction," and "perturbation suggestion," respectively. CHECKHARD allows the use of an optional word proportion τ < 100% and support models F_sup, which accelerate the processing and improve the performance, respectively. Support models should solve the same task as the target model, such as sentiment analysis.
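To make the data flow concrete, the following is a minimal Python sketch of the whole procedure. It is an illustration rather than the released implementation: target_predict, get_transforms, and support_predicts are assumed stand-ins for the target model F, the word transformations of the auxiliary attack A, and the optional support models F_sup, and the similarity constraints of Equation 1 are omitted for brevity.

from collections import Counter
import random

def check_hard(x_input, target_predict, get_transforms, support_predicts=(),
               lambda_adv=0.5, lambda_mis=0.5, k=1, tau=1.0, seed=0):
    """Sketch of CHECKHARD's main loop (illustrative, simplified).

    target_predict(text) -> hard label from the target model F.
    get_transforms(word) -> similar words from the auxiliary attack A.
    support_predicts     -> optional tuple of hard-label support models F_sup.
    tau                  -> proportion of input words to check.
    """
    words = x_input.split()
    y_input = target_predict(x_input)

    random.seed(seed)
    n_checked = max(1, int(tau * len(words)))
    checked_idx = random.sample(range(len(words)), n_checked)

    rates = {}        # difference rate R per checked word index
    y_correct = []    # labels that conflict with y_input (used for correction)
    for i in checked_idx:
        labels = []
        for w in get_transforms(words[i]):
            x_trans = " ".join(words[:i] + [w] + words[i + 1:])
            for predict in (target_predict, *support_predicts):
                labels.append(predict(x_trans))
        if labels:
            conflicts = [y for y in labels if y != y_input]
            rates[i] = len(conflicts) / len(labels)
            y_correct.extend(conflicts)

    max_rate = max(rates.values(), default=0.0)
    is_adversarial = max_rate >= lambda_adv
    if max_rate >= lambda_mis and y_correct:
        y_final = Counter(y_correct).most_common(1)[0][0]   # majority vote
    else:
        y_final = y_input
    top_k = [words[i] for i, _ in sorted(rates.items(), key=lambda p: -p[1])[:k]]
    return is_adversarial, y_final, top_k

A single call thus returns the detection decision, the (possibly corrected) label, and the top k suspected words, mirroring the three objectives above. Note that Algorithm 1 uses only the target model's labels for the perturbation-suggestion rates; the sketch simplifies this detail.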

Figure 2: Given the input text, we generate a transformation set for each word (e.g., "warm" and "charter"). We then use a target model and optional support models to predict hard labels for each transformation set. Next, we calculate the rate of hard labels that differ from the label the target model assigns to the input text. The obtained rates are used for three tasks: (1) adversarial text detection, achieved by comparing the rates with a threshold λ_adv; (2) misclassified text detection, achieved by comparing the rates with a threshold λ_mis, which is used for prediction correction; and (3) perturbed word suggestion, in which the top k input words are output in decreasing order of the rates.
First (lines 4-5), we select a set of random words W_rand from X_input with proportion τ. A small τ speeds up CHECKHARD while maintaining reasonable performance, as shown in Figure 3, which presents the experimental results.
Second (line 7), we create a transformation set W_trans for each word in W_rand by using the auxiliary attack A. For example, PWWS builds W_trans from WordNet synonyms. The main transformations of other attacks are listed in Table 1.
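For illustration, a PWWS-style transformation set can be sketched with NLTK's WordNet interface. This is a simplified stand-in for the transformation shipped with the attack and assumes the wordnet corpus has been downloaded (nltk.download('wordnet')).

from nltk.corpus import wordnet

def wordnet_transforms(word):
    """Collect WordNet lemmas of `word` as candidate replacements."""
    candidates = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower():
                candidates.add(name)
    return sorted(candidates)

print(wordnet_transforms("film"))  # e.g., ['cinema', 'flick', 'movie', ...]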
Third (line 9), we generate a transformed text X_trans by replacing the corresponding input word with each word in W_trans. To ensure that X_trans satisfies Equation 1, we then use the constraints of the auxiliary attack A to check the similarity between X_trans and X_input against the threshold ε. For example, PWWS prohibits X_trans from modifying stop words. The main constraints of other attacks are summarized in Table 1.
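A minimal sketch of this step follows, with a toy stop-word list standing in for the attack's real constraints (which also include the similarity threshold ε):

STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in"}

def make_transformed_text(words, index, replacement):
    """Replace words[index] and apply a toy PWWS-like stop-word constraint;
    returns None when the constraint rejects the transformation."""
    if words[index].lower() in STOP_WORDS:
        return None   # PWWS-style constraint: do not modify stop words
    return " ".join(words[:index] + [replacement] + words[index + 1:])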
Fourth (lines 11-22), each valid X_trans is input into the target model F and the support models F_sup to produce hard labels. These labels are added to a local list Y_trans for each transformation set. The labels that differ from Y_input are added to a global list Y_correct, which is used to correct the prediction later. Since adversarial text fools only the target model, the support models F_sup do not need to be used for "perturbation suggestion."
Fifth (lines 25-32), we calculate the difference rate R, which is the proportion of labels in Y_trans that conflict with Y_input. R is compared with λ_adv or λ_mis for "adversarial detection" and "prediction correction," respectively. In "adversarial detection," if R is large enough, we determine the input text to be adversarial. In "prediction correction," if the text is determined to be misclassified, we correct its prediction by voting over Y_correct. R is also added to a global list R_diff for "perturbation suggestion."

Finally (lines 34-41), if the above process does not determine X_input to be adversarial, we label X_input as original text for "adversarial detection." Similarly, if X_input is not determined to be misclassified, we keep Y_input as the final prediction for "prediction correction." For "perturbation suggestion," we sort the input words by R_diff in descending order and return the top k words.
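As a tiny worked instance of these last two steps (the rates, labels, and thresholds below are made up for the Figure 1 example; 0 = negative and 1 = positive):

from collections import Counter

y_input = 1                                              # target label for the input text
rates = {"charter": 0.85, "warm": 0.10, "film": 0.00}    # illustrative R per checked word
y_correct = [0, 0, 0, 0, 0]                              # conflicting hard labels collected
lambda_adv, lambda_mis, k = 0.5, 0.5, 1                  # illustrative thresholds

is_adversarial = max(rates.values()) >= lambda_adv                 # True
y_final = (Counter(y_correct).most_common(1)[0][0]                 # 0, by majority vote
           if max(rates.values()) >= lambda_mis else y_input)
top_k = sorted(rates, key=rates.get, reverse=True)[:k]             # ['charter']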
Here, λ_adv and λ_mis are optimized on validation sets; k is set to 1 by default because most adversarial text changes only one word. For example, 49.7% of such text from SST-2 has only one perturbation, as shown in Figure 4 in Appendix A.

Adversarial Text Detection and Prediction Correction
We follow the same experimental settings as in the frequency-guided word substitutions (FGWS) paper (e.g., the number of train/development/test samples and the evaluation metrics). In particular, we conducted experiments on adversarial text targeting a CNN model on SST-2 (8.7 words/text), as shown in Table 2. Experiments with other models and the IMDB are reported later. Adversarial text was generated with the ten representative attacks listed in Table 1. The ten attacks are clustered into three groups based on the perturbation level. Character-based attacks include DeepWordBug and Pruthi.
TextBugger is a hybrid attack at the character and word levels. The remaining attacks are word-based. CHECKHARD uses RoBERTa as a support model and, as the auxiliary attack, the same attack that generated the adversarial text. We used five metrics to evaluate the first two objectives: the true positive rate (TPR), false positive rate (FPR), and F-score (F1) for adversarial text detection; and the original accuracy under attack (Adv) and the corrected accuracy after attack (Adv correction) for prediction correction.
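For reference, the three detection metrics can be computed from binary detector decisions as in the generic helper below (label 1 = "adversarial", 0 = "original"); this is a sketch, not our evaluation code.

def detection_metrics(y_true, y_pred):
    """TPR, FPR, and F1 for binary adversarial-text detection."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0          # recall on adversarial text
    fpr = fp / (fp + tn) if fp + tn else 0.0          # false alarms on original text
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return tpr, fpr, f1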
For character-based and hybrid attacks, since FGWS was originally designed for word-based attacks, it reaches only up to 50.6% F1 and 59.2% adversarial correction. For word-based attacks, while FGWS processes original text nearly identically (FPR = 11.0%∼11.1%), its detection of adversarial text varies across attacks. Since FGWS detects adversarial text based on low-frequency words, it works best against PWWS, which replaces words using WordNet without context checking. The detection also affects the prediction correction of FGWS. CHECKHARD outperforms FGWS in terms of both adversarial text detection and prediction correction. In particular, CHECKHARD achieves a noteworthy improvement when detecting adversarial text, with F1 ranging from 66.1% to 88.9%. The lowest prediction correction of CHECKHARD is 78.6%, exceeding the highest prediction correction of FGWS, which is 65.5%.
Ablation studies: We conducted experiments with two scenarios, as shown in Table 3. In the first scenario, CHECKHARD's auxiliary attack (indicated in brackets, e.g., CHECKHARD(DeepWordBug)) differs from the attack. In the second scenario, the auxiliary and the attack are the same. Adversarial text in both scenarios was generated by PWWS and targeted a CNN on SST-2.
In the first scenario, we report DeepWordBug and TextFooler as auxiliaries; other auxiliaries reach similar results, as presented in Appendix B. While CHECKHARD without support attains approximately 70% on the F1 and correction scores, RoBERTa support remarkably boosts both scores to up to 93.3%.
In the second scenario, CHECKHARD without support outperforms FGWS, especially on the correction score. CHECKHARD is further improved by using a support model, and a strong support such as RoBERTa improves the results more than a conventional support such as LSTM. Their combination also achieves reasonable results. Other supports produce similar results, as shown in Appendix C.
Evaluation on other target models and datasets: We conducted similar experiments on other models and datasets. In particular, we evaluated CHECKHARD and FGWS on adversarial text generated by PWWS targeting four common models (CNN, LSTM, BERT, and RoBERTa) on SST-2 and the IMDB (235.72 words/text), as shown in Table 4. Following the suggestion in the FGWS paper (Mozes et al., 2021), we chose 1000 training and 2000 testing samples from the IMDB for validation and testing, respectively. These numbers follow a ratio similar to the 872 and 1821 samples from SST-2. We added the correction accuracy on clean text to demonstrate the influence of CHECKHARD and FGWS on unattacked text. CHECKHARD used RoBERTa as a support for the CNN, LSTM, and BERT target models; XLNet supported the RoBERTa target.
CHECKHARD outperforms FGWS when detecting adversarial text on both SST-2 and the IMDB, as well as when correcting its prediction on the IMDB. On SST-2, while FGWS decreases the clean accuracy, CHECKHARD with RoBERTa as support increases the accuracy for the CNN and LSTM. CHECKHARD balances the correction on clean and adversarial text for the transformer-based models, i.e., BERT and RoBERTa.

Perturbed Word Suggestion
We evaluated perturbed word suggestion on adversarial text generated by PWWS, as shown in Table 5. In particular, CHECKHARD, FGWS, and a random approach (RD) suggested k words (k ∈ {1, 3, 5}). We then checked whether any real perturbed word belongs to the suggested words. While RD and FGWS are affected by text length, CHECKHARD outperforms both and maintains stable results across all experiments, with large margins from 8.1% (SST-2, CNN, k=1) to 83.2% (IMDB, LSTM, k=3).

Run Time
We compared the run time of the PWWS attack, FGWS, and CHECKHARD when detecting adversarial text generated by the corresponding attack targeting a CNN model, as shown in Table 6. Other attacks, target models, and other objectives (prediction correction and perturbed word suggestion) reach similar ratios. We separated the detection time of CHECKHARD between adversarial and original text, while FGWS consumes the same time for both.
FGWS runs in less than 0.1 s. CHECKHARD without support runs in at most 0.054 s and 3.146 s for SST-2 and the IMDB, respectively, which is faster than the 0.095 s and 4.298 s attack times. CHECKHARD with RoBERTa as support can be accelerated by reducing the word proportion τ in Algorithm 1, as shown in Figure 3. With a τ of 30% and 3% for SST-2 and the IMDB, respectively, CHECKHARD speeds up by 3.9x and 39.2x over a full τ of 100%. CHECKHARD maintains F1 scores of 80.0% and 85.7% with these τ, which are higher than the 76.3% and 84.4% produced by FGWS. While the F1 score on SST-2 increases steadily for τ greater than 30%, on the IMDB it improves only slightly relative to the run time for τ greater than 10%. These results demonstrate the impact of τ in accelerating CHECKHARD, especially on long text, as in the IMDB.
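The τ trade-off can be explored on one's own data with a small timing loop such as the sketch below; detect(text, tau) is an assumed interface for any CHECKHARD-style detector.

import time

def timed_tau_sweep(detect, texts, taus):
    """Average per-text detection time for several word proportions tau."""
    for tau in taus:
        start = time.perf_counter()
        for text in texts:
            detect(text, tau)
        avg = (time.perf_counter() - start) / len(texts)
        print(f"tau={tau:.0%}: {avg:.3f} s/text")

# Example: timed_tau_sweep(my_detector, validation_texts, [0.03, 0.1, 0.3, 1.0])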

Discussion
Direct attack: We evaluated CHECKHARD and FGWS under a direct attack. In particular, we used PWWS to attack SST-2 text targeting the CNN while it is protected by CHECKHARD or FGWS. CHECKHARD achieves 37.0% accuracy under the PWWS attack, which is higher than the 15.2% of FGWS. These results demonstrate that CHECKHARD defends against adversarial text better than FGWS.

Parallel processing: Adversarial attacks optimize each step sequentially until a target model is fooled. Conversely, CHECKHARD can generate all transformed texts at once and predict them in parallel. CHECKHARD can thus be accelerated with parallel or distributed computing.
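A minimal sketch of this idea follows; batched_predict stands in for any model API that accepts a list of texts and returns one hard label per text.

def predict_transformations_in_parallel(words, transforms_per_word, batched_predict):
    """Build every transformed text first, then score them in one batched call
    instead of querying the model one text at a time."""
    texts, owners = [], []
    for i, candidates in enumerate(transforms_per_word):
        for w in candidates:
            texts.append(" ".join(words[:i] + [w] + words[i + 1:]))
            owners.append(i)                 # remember the originating input word
    labels = batched_predict(texts)          # single batched forward pass
    per_word = {}
    for i, y in zip(owners, labels):
        per_word.setdefault(i, []).append(y)
    return per_word                          # hard labels grouped by input word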

Limitations
Beyond word-based attacks: CHECKHARD is currently suitable for all current attacks from the TextAttack framework at the character, word, and hybrid levels. To the best of our knowledge, no existing work detects adversarial text beyond the word level, such as the phrase level in (Lei et al., 2022) or the sentence level in (Iyyer et al., 2018), so this remains an open problem.
Beyond text classification: CHECKHARD can be directly applied to text classification tasks, for which it is easy to estimate the change in prediction. To apply CHECKHARD to other tasks (such as question answering and translation), we would need to define a similar metric to measure the change in prediction.
CHECKHARD with a mismatched auxiliary and without support: CHECKHARD without support is still unstable when the auxiliary attack differs from the attack used to generate the adversarial text, as shown in Table 7 in Appendix B. This limitation can be remedied with a support model, but it requires a trade-off in run time.

Conclusion
In this paper, we propose CHECKHARD, which checks the change in the hard label before and after replacing a word with its transformations. This checking is used to detect adversarial text, correct predictions, and suggest perturbed words. Experiments on various attacks, models, and datasets demonstrate that CHECKHARD outperforms existing work.

C Other Support Models
In addition to LSTM and RoBERTa, as reported in Table 3, we conducted experiments to evaluate other support models, as shown in Table 8. Similar …

Figure 1: Generation of adversarial text and conflicting predictions after changing an individual word in the adversarial text.

Figure 3: Correlation between the detection time and F1 scores of CHECKHARD with RoBERTa as support when detecting adversarial text generated by PWWS targeting the CNN model while varying the word proportion τ. The time is averaged over all original and adversarial detections.

Figure 4: Ratio of the number of perturbed words in adversarial text generated by PWWS targeting the CNN model.
Algorithm 1 (fragment, lines 4-9):
4: W_rand ← random words from X_input with proportion τ
5: for each word w_i in W_rand do
6:   Transformation label list Y_trans ← {}
7:   Create transformation set W_trans of w_i by using A
8:   for each word w_j in W_trans do
9:     ...

Table 2: Detection of adversarial text and prediction correction on adversarial text targeting a CNN model on SST-2.

Table 3: Ablation studies on adversarial text generated by PWWS targeting a CNN model on SST-2.

Table 4: Detection of adversarial text generated by PWWS and prediction correction.

Table 5: Perturbed word suggestion on adversarial text generated by PWWS.

Table 6: Run time for attacking the original text with PWWS and detecting adversarial text generated by PWWS targeting the CNN model.

Table 7: Other auxiliary attacks with which CHECKHARD detected adversarial text and corrected the predictions for adversarial text generated by PWWS targeting the CNN model on SST-2.

Table 8: Other support models for CHECKHARD used to detect adversarial text and correct the predictions for adversarial text generated by PWWS targeting the CNN model on SST-2.