"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.


Introduction
Despite recent advancements in Natural Language Processing (NLP), adversarial text attacks continue to be highly effective at fooling models into making incorrect predictions (Ren et al., 2019;Wang et al., 2019;Garg and Ramakrishnan, 2020). In particular, syntactically and grammatically consistent attacks are a major challenge for current research as they do not alter the semantic information and are not detectable via spell checkers (Wang et al., 2019). While some defense techniques addressing this issue can be found in the literature (Mozes et al., 2021;Zhou et al., 2019;Wang et al., 2019), results are still limited in performance and text attacks keep evolving. This naturally raises concerns around the safe and ethical deployment of NLP systems in real-world processes.
Previous research showed that analyzing the model's logits leads to promising results in discriminating manipulated inputs (Wang et al., 2021;Aigrain and Detyniecki, 2019;Hendrycks and Gimpel, 2016). However, logits-based adversarial detectors have so far been studied only in computer vision applications. Our work transfers this type of methodology to the NLP domain and its contribution can be summarized as follows: (1) We introduce a logits-based metric called Word-level Differential Reaction (WDR) capturing words with a suspiciously high impact on the classifier. The metric is model-agnostic and also independent from the number of output classes.
(2) Based on WDR scores, we train an adversarial detector that is able to distinguish original from adversarial input texts preserving syntactical correctness. The approach substantially outperforms the current state of the art in NLP.
(3) We show our detector to have full transferability capabilities and to generalize across multiple datasets, attacks, and target models without needing to retrain. Our test configurations include transformers and both contextual and genetic attacks.
(4) By applying a post-hoc explainability method, we further validate our initial hypothesis-i.e. the detector identifies patterns in the WDR scores. Furthermore, only a few of such scores carry strong signals for adversarial detection.


Background and Related Work

Adversarial Text Attacks
Given an input sample x and a target model f, an adversarial example x' = x + ∆x is generated by adding a perturbation ∆x to x such that arg max f(x) = y ≠ y' = arg max f(x'). Although this is not required by definition, in practice the perturbation ∆x is often imperceptible to humans and x' is misclassified with high confidence. In the NLP field, ∆x consists of adding, removing, or replacing a set of words or characters in the original text. Unlike image attacks-vastly studied in the literature (Zhang et al., 2020) and operating in high-dimensional continuous input spaces-text perturbations need to be applied in a discrete input space. Therefore, gradient methods used for images such as FGSM (Goodfellow et al., 2014) or BIM (Kurakin et al., 2017) are not useful, since they require a continuous space in which to perturb x. Based on the text perturbation introduced, text attacks can be distinguished into two broad categories.
Visual similarity: These NLP attacks generate adversarial samples x' that look similar to their corresponding original x. These attacks usually introduce perturbations at the character level, creating typos. DeepWordBug (Gao et al., 2018), HotFlip (Ebrahimi et al., 2018), and VIPER (Eger et al., 2019) are well-known techniques belonging to this category.
Semantic similarity: Attacks within this category create adversarial samples by designing sentences that are semantically coherent with the original input and also preserve syntactical correctness. Typical word-level perturbations are deletion, insertion, and replacement by synonyms (Ren et al., 2019) or paraphrases (Iyyer et al., 2018). Two main types of adversarial search have been proposed. Greedy algorithms try each potential replacement until there is a change in the prediction (Ren et al., 2019). On the other hand, genetic algorithms such as Alzantot et al. (2018) and Wang et al. (2019) attempt to find the best replacements inspired by natural selection principles.

Defense against Adversarial Attacks in NLP
Defenses based on spell and syntax checkers are successful against character-level text attacks (Pruthi et al., 2019;Wang et al., 2019;Alshemali and Kalita, 2019). In contrast, these solutions are not effective against word-level attacks preserving language correctness (Wang et al., 2019). We identify methods against word-level attacks as belonging to two broad categories.
Robustness enhancement: The targeted model is equipped with further processing steps so that it is not fooled by adversarial samples, without explicitly identifying which samples are adversarial. For instance, Adversarial Training (AT) (Goodfellow et al., 2014) consists of additionally training the target model on manipulated inputs. The Synonym Encoding Method (SEM) (Wang et al., 2019) introduces an encoder step before the target model's input layer and trains it to eliminate potential perturbations. Instead, Dirichlet Neighborhood Ensemble (DNE) and Adversarial Sparse Convex Combination (ASCC) (Dong et al., 2021) augment the training data by leveraging the convex hull spanned by a word and its synonyms.
Adversarial detection: Attacks are explicitly recognized to alert the model and its developers. Adversarial detectors were first explored on image inputs via identifying patterns in their corresponding Shapley values (Fidel et al., 2020), activation of specific neurons (Tao et al., 2018), and saliency maps (Ye et al., 2020). For text data, popular examples are Frequency-Guided Word Substitution (FGWS) (Mozes et al., 2021) and learning to DIScriminate Perturbation (DISP) (Zhou et al., 2019). The former exploits frequency properties of replaced words, while the latter uses a discriminator to find suspicious tokens and uses a contextual embedding estimator to restore the original word.

Logits-Based Adversarial Detectors
Inspecting output logits has already led to promising results in discriminating between original and adversarial images (Hendrycks and Gimpel, 2016;Pang et al., 2018;Kannan et al., 2018;Roth et al., 2019). For instance, Wang et al. (2021) trains a recurrent neural network that captures the difference in the logits distribution of manipulated samples. Aigrain and Detyniecki (2019), instead, achieves good detection performance by feeding a simple three-layer neural network directly with the logit activations.
Our work adopts a similar methodology but focuses instead on the NLP domain and thus text attacks. In this case (1) logits-based metrics to identify adversarial samples should be tailored to the new type of input and (2) detectors should be tested on currently used NLP models such as transformers (Devlin et al., 2019).

Methodology
The defense approach proposed in this work belongs to the category of adversarial detection. It defends the target model from attacks generated via word-level perturbations belonging to the semantic similarity category. The intuition behind the method is that the model's reaction to original and adversarial samples is going to differ even if the inputs are similar. Hence, it relies on feature attribution explanations coupled with a machine learning model to learn such differences and thus identify artificially crafted inputs. Figure 1 shows the overall pipeline of the approach. Given a text classifier f trained on the task at hand, the pipeline's goal is to detect whether the currently fed input x is adversarial. In 3.1, we explain in greater detail how we measure the model f's reaction to a given input x. This quantity-later indicated as WDR(x, f)-is then passed to the adversarial detector, whose training procedure is described in 3.2. Finally, in 3.3, we provide detailed information about the setup of our experiments, such as target models, datasets, and attacks.

Interpreting the Target Model and Measuring its Reaction: Word-Level Differential Reaction

Adversarial attacks based on semantic similarity replace the smallest number of words possible to change the target model's prediction (Alzantot et al., 2018). Thus, we expect the replacements transforming x into x' to play a big role for the output. If not, we would not have f(x') substantially different from f(x). To assess the reaction of the target model f to a given input x, we measure the impact of a word via the Word-level Differential Reaction (WDR) metric. Specifically, the effect of replacing a word x_i on the prediction y is measured as

WDR(x_i, f) = f(x\x_i)_y − max_{y' ≠ y} f(x\x_i)_{y'}

where f(x\x_i)_y indicates the output logit for class y for the input sample x without the word x_i. Specifically, x_i is replaced by an unknown-word token. If x is adversarial, we expect perturbed words to have a negative WDR(x_i, f), as without them the input text should recover its original prediction. Table 1 shows an example pair of original and adversarial text together with their corresponding WDR(x_i, f) scores. The original class is recovered after removing a perturbed word in the adversarial sentence. This switch results in a negative WDR. However, even if the most important word ('worst') is removed from the original sentence, the predicted class does not change and thus WDR(x_i, f) remains positive. Our adversarial detector takes as input WDR(x, f), i.e. the sorted list of WDR scores WDR(x_i, f) for all words x_i in the input sentence. As sentences vary in length, we pad the list with zeros to ensure a consistent input length for the detector.
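As a concrete illustration, the WDR computation can be sketched in a few lines of Python. This is a minimal sketch, not the paper's implementation: toy_model is a hypothetical stand-in for the target classifier f, and the function name wdr_scores, the [UNK] token, and the sort order (by absolute magnitude, descending, one plausible reading of the paper's sorting) are our own illustrative choices.

```python
from typing import Callable, List

UNK = "[UNK]"

def wdr_scores(text: str, f: Callable[[str], List[float]], pad_len: int = 10) -> List[float]:
    """For each word x_i, replace it with an unknown-word token and measure how far
    the logit of the originally predicted class y stays above the best other class."""
    words = text.split()
    logits = f(text)
    y = max(range(len(logits)), key=lambda c: logits[c])  # predicted class on full input
    scores = []
    for i in range(len(words)):
        masked = " ".join(words[:i] + [UNK] + words[i + 1:])
        out = f(masked)
        other = max(v for c, v in enumerate(out) if c != y)
        scores.append(out[y] - other)  # negative => prediction flips without word i
    scores.sort(key=abs, reverse=True)         # order by magnitude
    scores += [0.0] * (pad_len - len(scores))  # zero-pad to a fixed detector input length
    return scores[:pad_len]                    # truncate very long sentences

# Toy two-class "sentiment" model standing in for the target classifier f.
def toy_model(text: str) -> List[float]:
    words = text.lower().split()
    neg = sum(w in {"worst", "bad"} for w in words)
    pos = sum(w in {"great", "good"} for w in words)
    return [1.0 + pos - neg, 1.0 + neg - pos]  # [positive, negative] logits
```

The resulting fixed-length vector is exactly what the detector described in 3.2 consumes.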

Adversarial Detector Training
The adversarial detector is a machine learning classifier that takes the model's reaction WDR(x, f) as input and outputs whether the input x is adversarial or not. To train the model, we adopt the following multi-step procedure:
(S1) Generation of adversarial samples: Given a target classifier f, for each available original sample x, we generate one adversarial example x'. This leads to a balanced dataset containing both normal and perturbed samples. The labels used are original and adversarial respectively.
(S2) WDR computation: For each element of the mixed dataset, we compute the WDR(x, f ) scores as defined in Section 3.1. Once more, this step creates a balanced dataset containing the WDR scores for both normal and adversarial samples.
(S3) Detector training: The output of the second step (S2) is split into training and test data. Then, the training data is fed to the detector for training along with the labels defined in step (S1).
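Steps (S1)-(S3) can be sketched as a small data-assembly helper. This is a hedged illustration: build_detector_data and wdr_fn are hypothetical names, wdr_fn stands for any function mapping a text to its padded WDR feature vector, and the 80/20 split ratio is our own choice, not stated by the paper.

```python
import random

def build_detector_data(originals, adversarials, wdr_fn, seed=0):
    """(S1) pair each original sample with one adversarial counterpart,
    (S2) turn every text into its WDR feature vector,
    (S3) shuffle and split into train/test sets for the detector."""
    X, y = [], []
    for orig, adv in zip(originals, adversarials):
        X.append(wdr_fn(orig)); y.append(0)  # label 0: original
        X.append(wdr_fn(adv)); y.append(1)   # label 1: adversarial
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))                # illustrative 80/20 split
    train = [(X[i], y[i]) for i in idx[:cut]]
    test = [(X[i], y[i]) for i in idx[cut:]]
    return train, test
```

Because each original is paired with exactly one adversarial example, the resulting dataset is balanced by construction.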
Please note that no assumption on f is made. At the same time, the input of the adversarial detector-i.e. the WDR scores-does not depend on the number of output classes of the task at hand. Hence, the adversarial detector is model-agnostic w.r.t. the classification task and the classifier targeted by the attacks.
In our case, we do not pick any particular architecture for the adversarial detector. Instead, we experiment with a variety of models to test their suitability for the task. In the same spirit, we test our setting on different target classifiers, types of attacks, and datasets.

Experimental Setup
To test our pipeline, four popular classification benchmarks were used: IMDb (Maas et al., 2011), Rotten Tomatoes Movie Reviews (RTMR) (Pang and Lee, 2005), Yelp Polarity (YELP) (Zhang et al., 2015), and AG News (Zhang et al., 2015). The first three are binary sentiment analysis tasks in which reviews are classified as having either positive or negative sentiment. The last one, instead, is a classification task where news articles should be identified as one of four possible topics: World, Sports, Business, and Sci/Tech. As the main target model for the various tasks we use DistilBERT (Sanh et al., 2020) fine-tuned on IMDb. We choose DistilBERT-a transformer language model (Vaswani et al., 2017)-as transformer architectures are widely used in NLP applications, established as state of the art in several tasks, and generally quite resilient to adversarial attacks (Morris et al., 2020). Furthermore, we employ a Convolutional Neural Network (CNN) (Zhang et al., 2015), a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), and a full BERT model (Devlin et al., 2019) to test transferability to different target architectures. All models are provided by the TextAttack library (Morris et al., 2020) and are already trained on the datasets used in the experiments.
We generate adversarial text attacks via four well-established word-substitution-based techniques: Probability Weighted Word Saliency (PWWS) (Ren et al., 2019), Improved Genetic Algorithm (IGA) (Jia et al., 2019), TextFooler (Jin et al., 2020), and BERT-based Adversarial Examples (BAE) (Garg and Ramakrishnan, 2020). The first is a greedy algorithm that uses word saliency and prediction probability to determine the word replacement order (Ren et al., 2019). IGA, instead, crafts attacks via mutating sentences and promoting the new ones that are more likely to cause a change in the output. TextFooler ranks words by importance and then replaces the ones with the highest ranks. Finally, BAE leverages a BERT language model to replace tokens based on their context (Garg and Ramakrishnan, 2020). All attacks are generated using the TextAttack library (Morris et al., 2020).
We investigate several combinations of datasets, target models, and attacks to test our detector in a variety of configurations. Because of its robustness and well-balanced behavior, we pick the average F1-score as our main metric for detection. However, as in adversarial detection false negatives can have major consequences, we also report the recall on adversarial sentences. Later on, in 4.3, we also compare performance with other metrics such as precision and original recall and observe how they are influenced by the chosen decision threshold.

Experimental Results
In this section, we report the experimental results of our work. In 4.1, we study various detector architectures to choose the best performing one for the remaining experiments. In 4.2, we measure our pipeline's performance in several configurations (target model, dataset, attack) and we compare it to the current state-of-the-art adversarial detectors. While doing so, we also assess transferability via observing the variation in performance when changing the dataset, the target model, and the attack source without retraining our detector. Finally, in 4.3, we look at how different decision boundaries affect performance metrics.

Choosing a Detector Model
The proposed method does not impose any constraint on which detector architecture should be used. For this reason, no particular model has been specified in this work so far. We study six different detector architectures in one common setting in order to pick one to be utilized in the rest of the experiments. Specifically, we compare XGBoost (Chen and Guestrin, 2016), AdaBoost (Schapire, 1999), LightGBM (Ke et al., 2017), SVM (Hearst et al., 1998), Random Forest (Breiman, 2001), and a Perceptron NN (Singh and Banerjee, 2019). All models are compared on adversarial attacks generated with PWWS from IMDb samples and targeting a DistilBERT model fine-tuned on IMDb. A balanced set of 3,000 instances-1,500 normal and 1,500 adversarial-was used for training the detectors, while the test set contains a total of 1,360 samples following the same proportions. As shown in Table 2, all architectures achieve competitive performance and none of them clearly appears superior to the others. We pick XGBoost (Chen and Guestrin, 2016) as it exhibits the best F1-score. The main hyperparameters utilized are 29 gradient-boosted trees with a maximum depth of 3 and a learning rate of 0.34. We utilize this detector architecture for all experiments in the following sections.
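A gradient-boosted detector with these hyperparameters (29 trees, depth 3, learning rate 0.34) can be sketched as follows. Note the hedges: the paper uses XGBoost, but here scikit-learn's GradientBoostingClassifier stands in so the snippet runs without the xgboost package, and the WDR feature vectors are synthetic placeholders (fake_wdr is our own construction), not real detector inputs.

```python
import random
from sklearn.ensemble import GradientBoostingClassifier

rng = random.Random(0)

def fake_wdr(adversarial):
    """Synthetic stand-in for a 10-dimensional sorted WDR vector: adversarial
    samples get a negative leading score, originals a positive one."""
    lead = rng.uniform(-3.0, -1.0) if adversarial else rng.uniform(1.0, 3.0)
    return [lead] + [rng.uniform(-0.2, 0.2) for _ in range(9)]

labels = [0] * 100 + [1] * 100                 # balanced: original vs adversarial
X = [fake_wdr(a) for a in labels]

# Hyperparameters mirroring the ones reported for the paper's XGBoost detector.
clf = GradientBoostingClassifier(n_estimators=29, max_depth=3,
                                 learning_rate=0.34, random_state=0)
clf.fit(X, labels)
acc = clf.score(X, labels)                     # training accuracy on separable toy data
```

Swapping in xgboost.XGBClassifier with the same three hyperparameters would match the paper's setup more closely.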

Detection Performance
Tables 3a and 3b report the detection performance of our method in a variety of configurations. In each table, the first row represents the setting-i.e. combination of target model, dataset, and attack type-in which the detector was trained. The remaining rows, instead, are w.r.t. settings in which we tested the already trained detector without performing any kind of fine-tuning or retraining.
We utilize balanced training sets of 3,000 and 2,400 samples respectively for the detectors trained on IMDb adversarial attacks (Table 3a) and on AG News attacks (Table 3b). All results are obtained using balanced test sets containing 500 samples. The only exceptions are the configurations (DistilBERT, RTMR, IGA) and (DistilBERT, AG News, IGA), which used test sets of size 480 and 446 respectively due to data availability.
To the best of our knowledge, the FGWS method from Mozes et al. (2021) is the best detector available and was already proven to be better than DISP (Zhou et al., 2019) by its authors. Hence, we utilize FGWS as the baseline for comparison in all configurations. Analogously to our method, FGWS is trained on the configuration in the first row of each table and then applied to all others. In more detail, we fine-tune its frequency substitution threshold parameter δ (Mozes et al., 2021) until achieving a best-fit value of δ = 0.9 in both training settings. As can be seen in both tables, the proposed method consistently shows very competitive results in terms of F1-score: it outperforms the baseline in 22 configurations out of 28 (worse in 5) and is on average better by 8.96 percentage points. At the same time, our method exhibits a very high adversarial recall, showing a strong capability at identifying attacks and thus producing a small number of false negatives.
Generalization to different target models: Starting from the training configurations, we vary the target model while keeping the other components fixed (rows 2-4 of each table). Here, the detector achieves state-of-the-art results in all test settings, occasionally dropping below a 90% F1-score on a few simpler models like LSTM and CNN while not exhibiting any decay on more complex models like BERT.
Generalization to different datasets: Analogous to the previous point, we systematically substitute the dataset component for evaluation (rows 5-6 of each table). We notice a substantial decay in F1-score when testing with RTMR (74.1-75.8%), since samples are short and, therefore, may contain few words which are very relevant for the prediction, just like adversarial replacements. Nevertheless, removing adversarial words still results in a change of prediction to the original class, thereby preserving high adversarial recall.
Generalization to different attacks: Results highlight a good reaction to all other text attacks (rows 7-9 of each table), with the detector even experiencing a considerable boost in performance against TextFooler. In contrast, the baseline FGWS significantly suffers against more complex attacks such as BAE, which generates context-aware perturbations.
Besides testing generalization properties via systematically varying one configuration component at a time, we also test on a few settings presenting changes in multiple ones (rows 10-14 of each table). Also in these settings, the proposed method maintains a very competitive performance, with noticeable drops only on the RTMR dataset.

Tuning the Decision Boundary
Depending on the application in which the detector is used to monitor the model and detect malicious input manipulations, different performance metrics can be taken into account to determine whether it is safe to deploy the model. For instance, in a very safety-critical application where successful attacks lead to harmful consequences, adversarial recall becomes considerably more relevant as a metric than the F1-score. We examine how relevant metrics change in response to different choices for the discrimination threshold. Please note that a lower value corresponds to more caution, i.e. we are more likely to flag a given input as adversarial.

Figure 2 and Table 4 show performance results w.r.t. different threshold choices. We notice that decreasing its value from 0.5 to 0.15 can increase the adversarial recall to over 98% at a small cost in terms of precision and F1-score (< 2 percentage points). Applications where missing attacks-i.e. false negatives-have disastrous consequences could take advantage of this property and consider lowering the decision boundary. This is particularly true if attacks are expected with low frequency and an increase in false positives incurs only minor costs.
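The threshold trade-off can be computed directly from the detector's predicted probabilities. A minimal sketch (metrics_at_threshold is a hypothetical helper name; the toy probabilities are illustrative, not the paper's data):

```python
def metrics_at_threshold(probs, labels, t):
    """Flag an input as adversarial when P(adversarial) >= t.
    A lower t is more cautious: more inputs get flagged."""
    preds = [1 if p >= t else 0 for p in probs]
    tp = sum(p == 1 and l == 1 for p, l in zip(preds, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # adversarial recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Sweeping t from 0.5 down to 0.15, as in Table 4, trades a little precision for higher adversarial recall.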

Discussion and Qualitative Results
Section 4 discussed quantitative results and emphasized the competitive performance that the proposed approach achieves. Here, instead, we focus on the qualitative aspects of our research findings. For instance, we try to understand why our pipeline works while also discussing challenges, limitations, ethical concerns, and future work.

Understanding the Adversarial Detector
The proposed pipeline consists of a machine learning classifier-e.g. XGBoost-fed with the model's WDR scores. The intuition behind the approach is that words replaced by adversarial attacks play a big role in altering the target model's decision. Despite the competitive detection performance, the detector is itself a learning algorithm and we cannot determine with certainty what patterns it can identify.
To validate our original hypothesis, we apply a popular explainability technique-SHAP (Lundberg and Lee, 2017)-to our detector. This allows us to summarize the effect of each feature at the dataset level. We use the official implementation to estimate the importance of each WDR score and use a beeswarm plot to visualize the results. Figure 3 shows that the values in the first positions-i.e. 1, 2, and 3-of the input sequence are those influencing the adversarial detector the most. Since in our pipeline WDR scores are sorted based on their magnitude, this means that the largest WDR scores of each prediction are the most relevant for the detector. This is consistent with our hypothesis that replaced words substantially change output logits and thus measuring their variation is effective for detecting input manipulations. As expected, negative values of the WDR correspond to a higher likelihood of the input being adversarial.
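The same question-which WDR positions the detector relies on-can be probed without the shap package via permutation importance: shuffle one feature column and measure the accuracy drop. This is only a lightweight proxy for the paper's SHAP analysis, and toy_detector plus the synthetic data below are our own constructions for illustration.

```python
import random

def toy_detector(wdr):
    """Toy stand-in for the trained detector: flag adversarial (1) when the
    largest-magnitude WDR score is negative, echoing the paper's intuition."""
    return 1 if wdr[0] < 0 else 0

def permutation_importance(X, y, predict, n_features, seed=0):
    """Accuracy drop when one feature column is shuffled: a rough,
    model-agnostic measure of how much the detector relies on each position."""
    rng = random.Random(seed)
    base = sum(predict(x) == t for x, t in zip(X, y)) / len(X)
    drops = []
    for j in range(n_features):
        col = [x[j] for x in X]
        rng.shuffle(col)
        Xp = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, col)]
        acc = sum(predict(x) == t for x, t in zip(Xp, y)) / len(X)
        drops.append(base - acc)  # larger drop => more important feature
    return drops
```

On data where only the leading WDR score matters, the importance of position 1 dominates while later positions contribute nothing, mirroring the beeswarm plot's message.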
We also notice that features after the first three do not appear in the naturally expected order. We believe this is because, for most sentences, it is sufficient to replace two or three words to generate an adversarial sample. Hence, in most cases, only a few WDR scores carry important signals for detection.

Challenges and Limitations
While WDR scores contain rich patterns to identify manipulated samples, they are also relatively expensive to compute. Indeed, we need to run the model once for each feature-i.e. each word-in the input text. While this did not represent a limitation for our use-cases and experiments, we acknowledge that it could result in drawbacks when input texts are particularly long.
Our method is specifically designed against word-level attacks and does not cover character-level ones. However, the intuition seems, to some extent, applicable also to sentences with typos and similar artifacts, as the words containing them will play a big role for the prediction. This, like in the word-level case, needs to happen in order for the perturbations to result in a successful adversarial text attack and change the target model's prediction.

Ethical Perspective and Future Work
Detecting-or, in general, defending against-adversarial attacks is a fundamental pillar of deploying machine learning models ethically and safely. However, while defense strategies increase model robustness, they can also inspire and stimulate new and improved attack techniques. An example of this phenomenon is BAE (Garg and Ramakrishnan, 2020), which leverages architectures more resilient to attacks, such as BERT, to craft highly effective contextual attacks. Analogously, defense approaches like ours could lead to new attacks that do not rely on a few words to substantially affect output logits.
Based on our current findings, we identify a few fruitful directions for future research. (1) First of all, the usage of logits-based metrics such as the WDR appears to be very promising for detecting adversarial inputs. We believe that a broader exploration and comparison of other metrics previously used in computer vision could lead to further improvements.
(2) We encourage future researchers to draw inspiration from this work and also test their defenses in settings that involve mismatched attacks, datasets, and target models. At the same time, we set as a priority for our future work to also evaluate the efficacy of adversarial detection methods on adaptive attacks (Tramer et al., 2020;Athalye et al., 2018). (3) This work proves the efficacy of WDR in a variety of settings, which include a few different datasets and tasks. However, it would be beneficial for current research to understand how these techniques would apply to high-stakes NLP applications such as hate speech detection.

Conclusion
Adversarial text attacks are a major obstacle to the safe deployment of NLP models in high-stakes applications. However, although manipulated and original samples appear indistinguishable, interpreting the model's reaction can uncover helpful signals for adversarial detection.
Our work utilizes logits of original and adversarial samples to train a simple machine learning detector. WDR scores are an intuitive measure of word relevance and are effective for detecting text components having a suspiciously high impact on the output. The detector does not make any assumption on the classifier targeted by the attacks and can be thus considered model-agnostic.
The proposed approach achieves very promising results, considerably outperforming the previous state of the art in word-level adversarial detection. Experimental results also show the detector to possess remarkable generalization capabilities across different target models, datasets, and text attacks without needing to retrain. These include transformer architectures such as BERT and well-established attacks such as PWWS, genetic algorithms, and context-aware perturbations.
We believe our work sets a strong baseline on which future research can build to develop better defense strategies and thus promote the safe deployment of NLP models in practice. We release our code to the public to facilitate further research and development.