What Learned Representations and Influence Functions Can Tell Us About Adversarial Examples

Adversarial examples, deliberately crafted using small perturbations to fool deep neural networks, were first studied in image processing and more recently in NLP. While approaches to detecting adversarial examples in NLP have largely relied on search over input perturbations, image processing has seen a range of techniques that aim to characterise adversarial subspaces over the learned representations. In this paper, we adapt two such approaches to NLP, one based on nearest neighbors and influence functions and one on Mahalanobis distances. The former in particular produces a state-of-the-art detector when compared against several strong baselines; moreover, the novel use of influence functions provides insight into how the nature of adversarial example subspaces in NLP relates to those in image processing, and also how they differ depending on the kind of NLP task.


Introduction
The high sensitivity of deep neural networks (DNNs) to slight modifications of inputs is widely recognised and makes DNNs a convenient target for adversarial attacks (Szegedy et al., 2014). Creating malicious inputs, or adversarial examples, by adding small perturbations to the model's inputs can cause the model to misclassify inputs that would otherwise be predicted correctly. Such adversarial attacks are highly successful in both the image and Natural Language Processing (NLP) domains.
In the image domain, because it is straightforward to create adversarial images by calibrating noise added to the original records, researchers have explored many high-performing adversarial attacks (Papernot et al., 2016b; Moosavi-Dezfooli et al., 2016; Carlini and Wagner, 2017, for example). The perturbations of the input images degrade the model's performance with a high success rate and are generally imperceptible to a human.
Work in the NLP space has followed that in image processing. Here, in addition to the goal of impacting the model's prediction, adversarial text examples need to be syntactically and semantically sound to the reader. Consequently, adversarial attack techniques on text use semantics-preserving textual changes at the character level, word level, and phrase or sentence level (Pruthi et al., 2019; Alzantot et al., 2018; Li et al., 2020, for example). Table 1 illustrates two examples, showing different types of attack formulation in NLP.
In the image domain, defence against adversarial attack can be 'proactive' or 'reactive' (Cohen et al., 2020), where proactive defence refers to improving the model's robustness (Madry et al., 2018; Gopinath et al., 2018; Cohen et al., 2019) and reactive defence focuses on detecting real adversarial examples before they are passed to neural networks (Feinman et al., 2017; Ma et al., 2018; Lee et al., 2018; Papernot and McDaniel, 2018). Broadly speaking, for reactive methods, the detection of adversarial examples involves taking a conceptualisation of the space of learned representations and the adversarial subspaces within them (Tanay and Griffin, 2016; Tramèr et al., 2017), and then characterising the differences in some function of the learned representations between the actual and the adversarial inputs produced by the DNN; for example, Ma et al. (2018) applied a local intrinsic dimensionality (LID) measure to the learned representations and used that to successfully distinguish normal and adversarial images.
In the NLP space, relatively few adversarial defence techniques have been proposed. Among them, many focus on enhancing the models' robustness proactively through adversarial training (Jia et al., 2019; Pruthi et al., 2019; Jin et al., 2020); generating textual samples for proactive adversarial training is computationally expensive because of the necessary search and constraints based on sentence encoding (Yoo and Qi, 2021). Reactive adversarial text detection techniques have mostly been different from their image counterparts, in that they typically modify the input, e.g. by repeatedly checking word substitutions (Mozes et al., 2021; Wang et al., 2022; Zhou et al., 2019), rather than trying to characterise the learned representations; consequently, they focus on detecting synonym-substitution adversarial examples. An exception is the work of Liu et al. (2022), which both adapts LID to the text space and proposes the new MultiDistance Representation Ensemble (MDRE) method; their state-of-the-art results suggest that detection methods based on learned representations drawn from the image processing domain are a promising source of ideas for NLP.
The particular focus of the present paper is the use of influence functions in adversarial detection methods, proposed for image processing by Cohen et al. (2020). They propose that distances to nearest neighbors (used by previous methods) and influence functions, which measure the impact of every training sample on validation or test set data, can be used complementarily to detect adversarial examples: they argue, with support from the strong results from their method, that adversarial examples sit in different regions of the learned representation space relative to their influence-function neighbors, compared to original datapoints (Fig 1). Specifically, in the image space, for original datapoints, nearest neighbors and influence function training points overlap, but for adversarial examples, they do not. Influence functions have only relatively recently begun to be explored in NLP, with Han et al. (2020) finding that, across the variety of classification tasks in NLP, the information provided by influence functions differs from image processing and is task-dependent. In this paper, noting significant differences between inputs in NLP and image processing (discrete versus continuous) and between attack types, we explore whether and how influence functions can help in detecting adversarial examples in NLP using learned representations, and what this can tell us about the nature of adversarial subspaces.
We also adapt a second method from the image processing literature, by Lee et al. (2018), which uses a Mahalanobis-based confidence score; this was a strong baseline for Cohen et al. (2020), giving an additional perspective on the nature of adversarial subspaces in NLP.
The contributions of this paper are as follows:
• An adaptation of two adversarial detection techniques from the image processing literature, MAHAL confidence (Lee et al., 2018) and Nearest Neighbor Influence Functions (NNIF) (Cohen et al., 2020), to the text domain; we show that we can achieve SOTA results relative to several strong, recent baselines.
• An analysis of how influence functions work in this context, contributing to an understanding of both the nature of adversarial subspaces in the text space and what information influence functions can provide.

Related Work
Adversarial Defences for Image An intuitive adversarial defence is to train a deep neural network to be robust against adversarial input samples by e.g. mixing adversarial samples with the training data (Goodfellow et al., 2015; Madry et al., 2018; Xie et al., 2019); popular platforms like Cleverhans (Papernot et al., 2016a) are available to support robust training. However, such defences, termed 'proactive', are expensive and vulnerable to optimisation attacks (Cohen et al., 2020).
In contrast, others have proposed 'reactive' defences that identify variations in the representations learned by the DNN on the original input images in order to separate out the adversarial samples; typically, these posit that adversarial examples can be characterised as belonging to particular subspaces (Tramèr et al., 2017), and the different approaches aim to capture the nature of these subspaces in different ways, with detectors such as logistic regression classifiers built over the learned representations. Feinman et al. (2017), for example, built detectors using kernel density estimates and Bayesian uncertainty measures over the learned representations. In the text domain, a range of adversarial attacks and corresponding defences has likewise been explored (Li et al., 2016, 2017; Ribeiro et al., 2018; Jones et al., 2020).
In NLP, only Liu et al. (2022) have used the idea of constructing detectors over learned representations as in the image domain, exploring an adaptation of the LID method (Ma et al., 2018) described above. In addition, they proposed the MultiDistance Representation Ensemble Method (MDRE) algorithm, which puts together learned representations from multiple DNN models to detect adversarial texts. Unlike other approaches, the same detector could apply to different types of attacks (character-based, word-based, syntax-based), and MDRE in particular improved over baseline methods across the range of attacks. This motivates our adaptation of more recent techniques from the image domain.

Influence Functions The influence function (IF) is a statistical method that captures the dependence of an estimator on any one of the sample (training) points. Koh and Liang (2017) were the first to adapt IFs to image DNNs as a method for interpreting the model's decisions: the IF finds the most influential training samples, both helpful and harmful, contributing to each prediction. The essence of the approach is to consider a point z from the training set and compute the change to the parameters θ if z were upweighted by a small ϵ; they then defined closed-form expressions I(z, z_test) to identify the most influential points z for a test point z_test.
IFs were first applied to NLP deep architectures by Han et al. (2020), and compared with established gradient-based saliency maps as a way of interpreting input feature importance, using sentiment classification and natural language inference (NLI) as testbeds. Their first finding was that IFs are reliable for deep NLP architectures. Their second interesting finding was that while IFs and saliency measures were consistent for sentiment classification, they differed for NLI: they concluded that for more complex understanding tasks like NLI, IFs captured more useful interpretive information. They also found IFs to be useful for identifying and quantifying the effect of data artifacts on model prediction. A few other works have continued investigating the usefulness of IFs in NLP, such as Guo et al. (2021), who proposed a faster method for IF computation by restricting candidates to the top-k nearest neighbors.

NNIF Detector
We follow Cohen et al. (2020)'s Nearest Neighbor Influence Function (NNIF) method and apply it to NLP architectures. The essence of it is, for some point z that may be regular or adversarial, to identify the training points that are most influential and those that are nearest neighbors to z, and to build a classifier over those that will predict whether z is regular or adversarial based on differences in their relative distributions (Fig 1).
We take a DNN classifier and dataset for some particular task (e.g. sentiment classification); we refer to this DNN as the TARGET MODEL. For each test sample z_test, we compute the influence scores I(z, z_test) for all training points z, given the target model, and select the top M most helpful and M most harmful (details in App B). We then construct a DkNN classifier in the style of Papernot and McDaniel (2018), using the hidden layers of the target model and the training points. For each z_test we find the ranks R and distances D under this DkNN for the training examples identified by the IFs; we denote by R_M↑, D_M↑, R_M↓, D_M↓ the ranks and distances of the 2M most helpful and harmful training examples, respectively. We finally construct a logistic regression classifier with these ranks and distances as features. Where the target model of Cohen et al. (2020) is a ResNet model, ours is a large language model (LLM) base with additional layers that are fine-tuned for the chosen tasks (§4.3). The hidden layers we use for NNIF are then the pre-final additional layers on top of the DNN (§4.5).
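To make the feature construction concrete, the following is a minimal sketch for a single hidden layer, assuming representations and influence scores have already been computed; names such as `nnif_features` are our own illustrative choices, not the authors' implementation.

```python
# A minimal sketch of NNIF feature construction for one hidden layer.
# `train_reps`, `test_rep` and `influence_scores` are assumed precomputed;
# all names here are illustrative, not the authors' code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def nnif_features(train_reps, test_rep, influence_scores, M=500):
    """Rank and distance features for one test point.

    train_reps:       (n_train, d) hidden representations of training points
    test_rep:         (d,) hidden representation of the test point
    influence_scores: (n_train,) I(z, z_test) for every training point z
    """
    n = len(train_reps)
    knn = NearestNeighbors(metric="euclidean", algorithm="brute").fit(train_reps)
    dists, order = knn.kneighbors(test_rep[None, :], n_neighbors=n)
    rank_of = np.empty(n, dtype=int)
    rank_of[order[0]] = np.arange(n)        # DkNN rank of each training point
    dist_of = np.empty(n)
    dist_of[order[0]] = dists[0]            # DkNN distance of each training point

    helpful = np.argsort(influence_scores)[-M:]   # M most helpful IF points
    harmful = np.argsort(influence_scores)[:M]    # M most harmful IF points
    # Features: ranks and distances of the 2M IF-selected training points
    return np.concatenate([rank_of[helpful], dist_of[helpful],
                           rank_of[harmful], dist_of[harmful]])

# Detector: logistic regression over features from normal + adversarial points,
# e.g. detector = LogisticRegression(max_iter=1000).fit(X, y)  # y: 0/1
```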

MAHAL Detector
Here we follow Lee et al. (2018), who build a detector that captures the variation in the probability density of the class-conditional Gaussian distribution of the representations learned by the model. Motivated, like Papernot and McDaniel (2018), by the problem that DNNs are poorly calibrated (Guo et al., 2017), they replace the final softmax layer with a Gaussian Discriminant Analysis (GDA) softmax classifier.
For a set of training points {(x_1, y_1), . . . , (x_n, y_n)} with labels y ∈ {1, 2, . . . , C}, the class mean μ̂_c and a shared covariance Σ̂ are computed for each class c to approximate the generative classifier's parameters from the features f(x) of the pre-trained target DNN:

μ̂_c = (1/n_c) Σ_{i:y_i=c} f(x_i),    Σ̂ = (1/n) Σ_c Σ_{i:y_i=c} (f(x_i) − μ̂_c)(f(x_i) − μ̂_c)^⊤.

Next, from the obtained class-conditional Gaussian distributions, the confidence score for a test sample x is the Mahalanobis distance to its closest class distribution:

M(x) = max_c −(f(x) − μ̂_c)^⊤ Σ̂^{−1} (f(x) − μ̂_c).

Finally, we label the Mahalanobis scores for the test samples as positive and for the adversarial samples as negative, and input this feature set to an LR detector. Lee et al. (2018) propose two calibration techniques to improve the detection accuracy and make regular and out-of-distribution samples more separable: (1) input pre-processing, where they add small noise to the test samples in a controllable manner; and (2) feature ensembling, which combines the confidence scores from all the hidden layers of the DNN, including the final features. Together the two substantially improve the performance of the base approach, and each individually achieves almost the performance of the combination. As for our NNIF detector in §3.1, our target DNN has several hidden layers, and we explore models both with final-layer-only representations and with feature ensembles over all hidden layers. The input pre-processing of (1) is appropriate to the continuous space of images, but not in an obvious way to text, so we do not use it.
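A minimal sketch of this score and the feature ensemble follows, assuming per-layer features are already extracted; the helper names are illustrative rather than Lee et al. (2018)'s implementation.

```python
# A minimal sketch of the Mahalanobis confidence score with a layer ensemble;
# names are illustrative, and per-layer features are assumed precomputed.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_gaussian_params(feats, labels, n_classes):
    """Class means and a single shared (tied) covariance from training features."""
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])
    centered = feats - means[labels]            # subtract each point's class mean
    precision = np.linalg.pinv(centered.T @ centered / len(feats))
    return means, precision

def mahalanobis_confidence(x, means, precision):
    """max_c -(x - mu_c)^T Sigma^{-1} (x - mu_c) for one feature vector x."""
    diffs = x - means                           # (C, d)
    return (-np.einsum("cd,de,ce->c", diffs, precision, diffs)).max()

def ensemble_scores(layer_feats_train, labels, layer_feats_test, n_classes):
    """One confidence score per hidden layer, stacked as detector features."""
    cols = []
    for tr, te in zip(layer_feats_train, layer_feats_test):
        means, prec = fit_gaussian_params(tr, labels, n_classes)
        cols.append([mahalanobis_confidence(x, means, prec) for x in te])
    return np.array(cols).T                     # (n_test, n_layers)

# detector = LogisticRegression().fit(ensemble_scores(...), y)  # y: 0 normal, 1 adv
```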

Experimental Setup
We broadly follow the setup of Liu et al. (2022), as the prior NLP work that has used learned representations to detect adversarial examples.

Tasks and Datasets
We work on sentiment analysis and natural language inference, two tasks widely used in adversarial example generation (Pruthi et al., 2019; Alzantot et al., 2018; Ribeiro et al., 2018; Ren et al., 2019; Iyyer et al., 2018; Yoo and Qi, 2021; Li et al., 2020, 2021; Jin et al., 2020). In addition, these are the two tasks that were used for the investigation of the use of influence functions in NLP (Han et al., 2020).
Sentiment Analysis For sentiment analysis, we use the IMDB dataset (Maas et al., 2011), which has 50,000 movie reviews, split into 25,000 training and 25,000 test examples with binary labels indicating positive or negative sentiment. The IMDB dataset has 262 words per review on average. In all experiments, we use a maximum sequence length of 512 for the language models on IMDB.
Natural Language Inference The Multi-Genre NLI (MULTINLI) dataset (Williams et al., 2018), used for the natural language inference (NLI) task, contains pairs of sentences annotated with textual entailment information. The test examples are mismatched with the training examples, being collected from different sources. The dataset has 392,702 training and 9,832 test examples labelled with three classes: entailment, neutral, and contradiction. Each text in the dataset has 34 words on average. On this dataset, we set the maximum sequence length to 256.

Attack Methods
We use the implementations from Liu et al. (2022) of two widely used attack methods that apply character-level and word-level perturbations to construct adversarial examples. We take a BERT BASE model (§4.3) as the target model. An adversarial attack is successful when the adversarial example receives a different prediction than the target model's original prediction. Our two methods are (more details in §A.1):
• CHARATT (Pruthi et al., 2019). This is a character-level attack that tweaks the original texts by randomly swapping, dropping and adding characters, or adding a keyboard mistake.
• WORDATT (Alzantot et al., 2018). This is a word-level attack that allows the attacker to alter practically every word of the sentence, if required, with context-preserving synonymous words. This implementation follows Jia et al. (2019) in speeding up the synonym search.

Target Model
Following Liu et al. (2022), we use a pre-trained BERT-base-cased model, adding a fully connected dense layer of 768 nodes, a layer of 50% dropout, and another dense layer of 768 nodes. The dataset split is 80-20 train-test. We train the model for 3 epochs with a 5e−5 learning rate and AdamW optimisation, without freezing any layer of the backbone model. This BERT BASE model achieves 92.90% and 82.01% test accuracies on the IMDB and MULTINLI datasets respectively. The accuracies of the clean model and the model under attack are given in Table 6; we note that in all cases, CHARATT degrades the classifier's performance comparatively more than WORDATT. Sizes of the IMDB and MULTINLI datasets and the number of adversarial texts generated from them are in Table 5.
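A minimal sketch of this target model is below, assuming the Hugging Face transformers library; the exact wiring of the added head (activations, output layer) is our illustrative reading of the description above.

```python
# A minimal sketch of the target model: BERT-base-cased plus two 768-node
# dense layers with 50% dropout; the head wiring is our illustrative reading.
import torch
from transformers import BertModel

class TargetModel(torch.nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-cased")
        self.dense1 = torch.nn.Linear(768, 768)     # first added dense layer
        self.dropout = torch.nn.Dropout(0.5)        # 50% dropout
        self.dense2 = torch.nn.Linear(768, 768)     # second added dense layer
        self.out = torch.nn.Linear(768, n_classes)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).pooler_output
        h1 = torch.relu(self.dense1(h))             # pre-final layers: the
        h2 = torch.relu(self.dense2(self.dropout(h1)))  # reps NNIF/MAHAL use
        return self.out(h2)

# Fine-tuning, as described: 3 epochs, no frozen layers,
# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```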

Detectors
For data to train the adversarial example detectors on, we follow standard practice in image processing (Ma et al., 2018; Cohen et al., 2020). Due to the computational intensity of estimating the influential training records for the NNIF method, we limit our detectors to 10k records (5k test and 5k adversarial texts) and use a similar data size for all the other detection methods for comparability. We split the detection dataset 80-20 train-test, and construct and evaluate logistic regression classifiers as detectors over this detection dataset split for our proposed methods (§4.5) and baselines (§4.6).
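A minimal sketch of this protocol, with illustrative names, is:

```python
# A minimal sketch of the detection protocol: 5k normal test texts paired with
# 5k adversarial counterparts, an 80-20 split, and a logistic regression
# detector over whichever per-example features a method produces.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def build_detector(normal_feats, adv_feats, seed=0):
    """normal_feats, adv_feats: (5000, d) feature arrays from one method."""
    X = np.vstack([normal_feats, adv_feats])
    y = np.concatenate([np.zeros(len(normal_feats)), np.ones(len(adv_feats))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    det = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return det, det.score(X_te, y_te)     # detection accuracy on held-out 20%
```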

NNIF and Mahalanobis
NNIF We adapt the standard NNIF implementation of Cohen et al. (2020). For influence score calculation, Cohen et al. (2020) use the Darkon module for images; we instead incorporate the influence function calculation from Han et al. (2020), which uses the Linear time Stochastic Second-Order Algorithm (LiSSA; Agarwal et al., 2017) for faster convergence, and makes several adaptations to NLP. We build the DkNN containing one layer with l2 distance and brute-force search. Because IF calculations are expensive, like Cohen et al. (2020) and Han et al. (2020) we only sample from among all neighbors: we compute the IF on 6K training datapoints sampled uniformly at random (Cohen et al. (2020) sample 10K neighbors from 49K training points). We choose M = 500 for our main results, which is at the top end of the range of values of M selected by Cohen et al. (2020); we show in §5.2 that, unlike in the image processing domain, results in our experiments are broadly monotonically increasing as M increases. Note that we do not use the faster variant of IF computation of Guo et al. (2021), as NNIF requires separate perspectives from IFs and kNNs, and FASTIF restricts the IF search to subsets of kNNs.

MAHAL As per §3.2, we compute the mean and covariance for each class and calculate the Mahalanobis distance score for each normal instance and its adversarial counterpart. Like Ma et al. (2018), we consider both using only the final layer of the model and stacking scores from each layer of the model (feature ensembling). Feature ensembling is always better, so we include only those results in the main results, but we do separately analyse the contribution of the feature ensembling.

Code For both of these, our code uses the implementation of Cohen et al. (2020) as a starting point and adapts it as above.

Baseline Detection Methods
We evaluate six adversarial text detection methods as our baseline detectors. The first four are from Liu et al. (2022) (we omit the language model, as it operates essentially at chance level), while the other two are also recent high-performing systems. We give more details on the methods in §A.2.

DISP (Zhou et al., 2019). This is a system that aims to correct any adversarial perturbations before an example is passed to a classifier. Liu et al. (2022) adapt this to detecting adversarial examples.

FGWS (Mozes et al., 2021). This algorithm uses a word frequency threshold and a calibrated replacement approach to detect adversarial examples. It is only designed to work against word-level attacks.

LID (Liu et al., 2022). From among image processing detection methods, Liu et al. (2022) adapted the Local Intrinsic Dimensionality (LID) approach of Ma et al. (2018). This technique creates a distribution over local distances for a test record with respect to its neighbors from the training set; it then applies this to the outputs of each layer of the target model to create a detection classifier.

MDRE (Liu et al., 2022). This has similarities to LID above but uses Euclidean distance rather than the LID measure, and creates an ensemble using different Transformer models (like Liu et al. (2022), we use BERT BASE, RoBERTa BASE, XLNet BASE, BART BASE).

RSV (Wang et al., 2022). In this Randomized Substitution and Vote approach, the assumption is that a word-level attacker aims to find an optimal synonym substitution that mutually influences other words in the sentence. Hence, Wang et al. (2022) randomly replace words from the text with synonyms in order to destroy the mutual interaction between words and eliminate the adversarial perturbation. Like FGWS, this is only designed to work against word-level attacks.

SHAP (Mosca et al., 2022). In this approach, an adversarial detector is trained using the SHapley Additive exPlanations (SHAP) values of the training data for each test data item, using the SHAP explainer (Fidel et al., 2020). They experiment with multiple classifiers as the detectors: logistic regression, random forest, support vector and neural network. In our main results, we report the best classifier for each dataset and attack.

Results on the detector baselines are in Table 2 (all SHAP detector classifiers are in Table 8). Overall, NNIF is the best, performing with 100% accuracy on CHARATT for sentiment analysis (more than 8% better than the second) and 90% on WORDATT (more than 1% better than the second, RSV, which is tailored to word-level attacks). For MULTINLI WORDATT, it is around 4% better than the second best. The only one where it is not best, MULTINLI CHARATT, is only very slightly below the best performer DISP. (We note that for DISP we report the accuracy values from Liu et al. (2022); this means that the DISP detector used more data in its training set, and so has an advantage in this respect.) MAHAL also performs quite strongly, either better than or similar to the baseline detectors, although not as strongly as NNIF; this mirrors the findings in image processing. MDRE results are lower than in Liu et al. (2022) as a consequence of using less data for training all detection classifiers, as discussed in §4.4.
In terms of aggregate task performance, in all our experiments the detection accuracy on the natural language inference task is in general lower than on the sentiment analysis task. As the MULTINLI dataset is a three-class problem and additionally uses mismatched test sentences, detection is innately harder.

Turning to the role of the influence functions (analysed in §5.2, Table 4): the IF points are generally much more clearly separable, and so contribute especially strongly to the method, except for MULTINLI against WORDATT, where IFs and NNs are essentially the same and the method relies on the two-view aspect of NNIF. This observation about the relative importance of the IF contribution was not made by Cohen et al. (2020), and so may be specific to NLP tasks, although this would require more investigation to verify. We also note that our results align with observations of Han et al. (2020), in that for the harder task of MULTINLI (§5.1, Table 4) IFs provide a different perspective for characterising the datapoint of interest. We give some text examples in App E.

Analyses

Regions around adversarial examples
To look further into the more challenging combination of MULTINLI and CHARATT (the one case in Table 2 where NNIF was not the highest scoring, albeit by a small margin), we consider a successful and an unsuccessful detection case by NNIF, with the actual examples given in the appendix.

Table 3 shows the accuracies of MAHAL using only the final layer or the feature ensemble. As with Lee et al. (2018), the feature ensemble produces much better results. The improvement is larger for IMDB, but still important for MULTINLI, as without the ensemble detection is essentially at chance level. Noting that the target model of Lee et al. (2018) had many more hidden layers in the ensemble, it is an open question whether introducing additional dense layers into our LLM-based model might improve detection while still preserving target model performance.

Conclusion and Future Work
We have adapted from image processing two methods, NNIF (Cohen et al., 2020) and MAHAL (Lee et al., 2018), that detect adversarial examples using learned representations. Both perform strongly, with NNIF the best on three of four task/attack combinations, and a close second on the fourth, against several strong baselines.
Our analysis shows that influence function points make a particularly important contribution to the NNIF method. The MULTINLI task is more challenging for all methods; here it is the complementary nature of the information from influence functions and nearest neighbors that helps, supporting observations by Han et al. (2020) about the different perspective influence functions provide in this more complex NLP task.
The NNIF method is computationally expensive, so future work will look at ways to make it more efficient. Additionally, to gain a fuller understanding of what information influence functions can provide in NLP tasks, future work will look at a wider range of tasks and attacks.

Limitations
The major limitation is the computationally expensive calculation of influence functions in our NNIF method. For this, following Cohen et al. (2020), we restrict the data size to 10k (5k test, 5k adversarial) for NNIF and follow a similar approach for the other methods for comparability. This also speeds up explanation generation in SHAP. We use a small architecture, as recommended in Han et al. (2020), for the BERT BASE model for NNIF and the other detectors. As noted in the paper, we recognise that the FASTIF method of Guo et al. (2021) can speed up influence function calculation, but because of its restriction of influence function points to nearest neighbors, it is not suitable for our application.
We use only two datasets/tasks and two attack methods, partly because of the computational expense of NNIF. While they are commonly used in the adversarial example literature as well as in the analysis of influence functions in NLP by Han et al. (2020), and represent different levels of task complexity and attack type, a wider range of datasets/tasks and attack methods is needed for a full characterisation of influence functions and the nature of adversarial subspaces.
For all experiments, we restrict the maximum sequence length following Liu et al. (2022), which may influence the detectors' performance, especially for the NLI task, which requires the model to learn from hypothesis and premise text pairs. For the detector baselines, we used the most readily available methods. There are two recent contemporaneous methods, by Wang et al. (2022) and Bao et al. (2021), that explore the idea that adversarial perturbations typically involve low-frequency words, and create augmented training sets by replacing those words in each sentence with synonyms. For detection, Wang et al. (2022) match the voted prediction with the obtained prediction, while Bao et al. (2021) train the model on a separate auxiliary learning objective. Between these two works, we choose RSV from Wang et al. (2022). For RSV, we follow similar settings to Wang et al. (2022) in choosing the vote number, word substitution rate and stop word selection for both IMDB and MULTINLI; a different setting for MULTINLI may improve the result.

A Experimental Setup Details
The sizes of the datasets and the number of adversarial samples generated by each of the attack methods are given in Table 5. The accuracies obtained by the BERT BASE model are in Table 6, and those of the other models used in MDRE are in Table 7.

A.1 Attack Methods

CHARATT. We implement CHARATT as proposed by Pruthi et al. (2019). It tweaks the original texts by randomly swapping, dropping and adding characters, or adding a keyboard mistake. Swapping refers to exchanging the positions of two adjacent internal characters. Dropping removes a character, and adding inserts a new character at a randomly selected position. A keyboard mistake substitutes a character with one of its adjacent characters on the keyboard.
In our experiments, we allow a maximum of half the words from the original text to be perturbed, so the maximum number of possible attacks on the IMDB and MULTINLI datasets is 256 and 128 per sentence, respectively.
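A minimal sketch of the four operations on a single word is below; the keyboard-adjacency map is a small illustrative subset, and `perturb_word` is our own name, not the attack's actual implementation.

```python
# A minimal sketch of the four character-level operations described above
# (swap, drop, add, keyboard mistake); ADJACENT is an illustrative subset of
# a keyboard-adjacency map, not the full layout used by Pruthi et al. (2019).
import random

ADJACENT = {"a": "qwsz", "o": "iklp", "e": "wsdr"}   # illustrative subset

def perturb_word(word, rng=random):
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)              # internal position only
    op = rng.choice(["swap", "drop", "add", "keyboard"])
    if op == "swap":                                 # exchange adjacent chars
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":                                 # remove a character
        return word[:i] + word[i + 1:]
    if op == "add":                                  # insert a random character
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]
    nearby = ADJACENT.get(word[i], word[i])          # keyboard substitution
    return word[:i] + rng.choice(nearby) + word[i + 1:]
```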
WORDATT. Alzantot et al. (2018) proposed an effective and widely used adversarial attack that we incorporate in our work as WORDATT.
This method allows the attacker to alter practically every word of the sentence, if required, with context-preserving synonymous words. The synonym search is done over a large search space that includes the GloVe word vectors (Pennington et al., 2014), counter-fitting word vectors (Mrkšić et al., 2016), and the Google 1 billion words language model (Chelba et al., 2014). Then, following natural selection, the crossover and mutation techniques from a population-based genetic algorithm are applied to generate the next set of adversarial sentences. On each iteration, adversarial texts that are unsuccessful in changing the model's prediction are removed from the pool.
However, Jia et al. (2019) found the algorithm computationally expensive, and recommended using a faster language model and stopping the algorithm's semantic drift, which refers to applying the language model to the synonyms picked in previous iterations as well when choosing words from their neighboring word-space.
We incorporate the above recommendations by utilising the faster Transformer-XL architecture (Dai et al., 2019), pretrained on the WikiText-103 dataset (Merity et al., 2017), and preventing semantic drift by finding all test example words' neighbors only before the attacks. We also restrict the minimum number of perturbations to one-fifth of the maximum sequence length, which is 102 and 51 for IMDB and MULTINLI, respectively.

A.2 Baseline Detection Methods
The first four are from Liu et al. (2022) (we omit the language model, as it operates essentially at chance level), and we use the implementations from there.

Learning to Discriminate Perturbations (DISP) (Zhou et al., 2019). DISP is one of the commonly used baselines for adversarial text detection. It identifies a set of character-level or word-level perturbed tokens, and then applies an embedding estimator that predicts embeddings for each perturbed token and maps them to the actual words to repair the perturbations.
If the model's prediction on an adversarial text restored by DISP remains the same class as the prediction on its original version, we consider it a successful detection of an adversarial example.
Frequency-guided word substitutions (FGWS) (Mozes et al., 2021). Mozes et al. (2021) verify that, in the case of word-level attacks, the synonym replacements are normally low-frequency words. They use this observation in a model-agnostic, rule-based adversarial text detection algorithm, Frequency-Guided Word Substitutions (FGWS).
Firstly, the algorithm sets a word frequency threshold to identify infrequent words, those with frequencies lower than this value. Then the algorithm replaces those words with their higher-frequency synonyms, using WordNet (Fellbaum, 2005) and GloVe vectors (Pennington et al., 2014) to find the synonyms; they experiment with the {0th, 10th, . . . , 100th} percentiles of word frequencies in the training set as the word-frequency threshold. Finally, if the model's prediction confidence on a replaced sentence differs from the confidence on its corresponding original sentence by more than a certain amount, the original sentence is determined to be an adversarial example.
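A minimal sketch of this rule follows, with `get_synonyms` and `model` as placeholders for the WordNet/GloVe lookup and the target classifier.

```python
# A minimal sketch of FGWS: replace low-frequency words with their most
# frequent synonym and flag the input when confidence in the original
# prediction drops by more than gamma. `get_synonyms` and `model` are
# placeholders; delta is the word-frequency threshold.
import numpy as np

def fgws_is_adversarial(text, freq, get_synonyms, model, delta, gamma):
    """freq: word -> training-set frequency; model: text -> class probabilities."""
    words = text.split()
    for i, w in enumerate(words):
        if freq.get(w, 0) < delta:                  # infrequent word
            syns = [s for s in get_synonyms(w)
                    if freq.get(s, 0) > freq.get(w, 0)]
            if syns:                                # swap in most frequent synonym
                words[i] = max(syns, key=lambda s: freq.get(s, 0))
    probs_orig = model(text)
    pred = int(np.argmax(probs_orig))
    probs_repl = model(" ".join(words))
    return (probs_orig[pred] - probs_repl[pred]) > gamma
```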
Local Intrinsic Dimensionality (LID) (Liu et al., 2022). From the image processing detection methods, Liu et al. (2022) adapt the Local Intrinsic Dimensionality (LID) approach of Ma et al. (2018). This technique creates a local distance distribution for a test record with respect to its neighbors from the training set. They apply this to transformer models by taking the outputs of each layer of the target model to represent the training records. Following Liu et al. (2022), we use the BERT BASE model and implement a logistic regression classifier as the detector, and tune the number of neighbors k through a grid search over 100, 1000, and the range [10, 42) with a step size of 2.

MultiDistance Representation Ensemble Method (MDRE) (Liu et al., 2022). Motivated by the notion that adversarial examples are out-of-distribution samples, as recognized by Lee et al. (2018) and Feinman et al. (2017), Liu et al. (2022) assume that texts with the same prediction label lie on a similar data submanifold, and that adversarial perturbation of these texts moves them to another data submanifold, thus altering the model's prediction on them.
They measure the Euclidean distance between each reference datapoint and the nearest neighbor among the training datapoints with the same predicted label, and establish that this distance will be greater for an adversarial reference point than for a normal one. They further use ensemble learning to combine distances between representations learned by multiple DNNs, and build a binary logistic regression model to detect adversarial examples.
Following Liu et al. (2022), we also use four learning models (BERT BASE, RoBERTa BASE, XLNet BASE, BART BASE) in our experiments. Table 7 reports the clean accuracies of the other target classifiers used in feature ensembling in MDRE.
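A minimal sketch of the per-encoder distance feature follows; stacking one such distance per encoder gives the MDRE feature vector for the logistic regression detector. Names are illustrative.

```python
# A minimal sketch of the MDRE distance feature for one encoder: Euclidean
# distance from a test representation to its nearest training neighbor with
# the same predicted label; names are illustrative.
import numpy as np

def mdre_distance(train_reps, train_pred_labels, test_rep, test_pred_label):
    same = train_reps[train_pred_labels == test_pred_label]  # same-label pool
    return np.linalg.norm(same - test_rep, axis=1).min()

# features[i] = [mdre_distance(...) under each of the four encoders]
# detector = LogisticRegression().fit(features, y)  # y: 0 normal, 1 adversarial
```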
Randomized Substitution and Vote (RSV) (Wang et al., 2022). A word-level attacker's target is to find an optimal synonym substitution that mutually influences other words in the sentence. Taking this optimization target of the adversary, Wang et al. (2022) resort to randomly substituting words in the text with their synonyms, arguing that this random word substitution destroys the mutual interaction between words and eliminates the adversarial perturbation.
At first, they generate a set of perturbed samples by randomly replacing some words of a text with arbitrary synonyms. Then the model's output logits for the perturbed samples are accumulated and voted on to determine a prediction label for the text sample. If the original text's prediction does not match the voted prediction label, it is considered an adversarial example.
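Although we use the authors' code (noted below), a minimal sketch of the vote is as follows, with `get_synonyms` and `model_logits` as placeholders.

```python
# A minimal sketch of Randomized Substitution and Vote: perturb the input k
# times by random synonym substitution, accumulate the model's logits, and
# flag the input when the voted label disagrees with the original prediction.
# `get_synonyms`, `model_logits` and the default k/sub_rate are placeholders.
import random
import numpy as np

def rsv_is_adversarial(text, model_logits, get_synonyms, k=25, sub_rate=0.25,
                       rng=random):
    words = text.split()
    votes = np.zeros_like(model_logits(text))
    for _ in range(k):
        sample = list(words)
        for i, w in enumerate(sample):
            syns = get_synonyms(w)
            if syns and rng.random() < sub_rate:    # randomly substitute word
                sample[i] = rng.choice(syns)
        votes += model_logits(" ".join(sample))     # accumulate logits
    original = int(np.argmax(model_logits(text)))
    return int(np.argmax(votes)) != original        # voted label disagrees
```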
We use their code.

SHapley Additive exPlanations (SHAP) (Mosca et al., 2022). In this work, Mosca et al. (2022) adapt an adversarial image detection method for word-level attacks on text. They train an adversarial detector on the SHapley Additive exPlanations (SHAP) values of the training data for each test data item, using the SHAP explainer proposed and implemented by Fidel et al. (2020).
They experiment with multiple classifiers as the detectors: logistic regression, random forest, support vector classifier and a neural network. They also show that the detector does not require a large number of training samples to be successful. In our work, we do the same and report the best accuracy obtained among the four detectors.
We use their code. Accuracies of all the detectors are in Table 8.

B Computing Influence Function
For a datapoint z_i = (x_i, y_i) from the training set {(x_1, y_1), . . . , (x_n, y_n)} ⊆ X × Y and model parameters θ ∈ Θ, let the loss of the model be L(z, θ); the optimized parameters are

θ̂ = argmin_θ (1/n) Σ_{i=1}^n L(z_i, θ).

The influence score is then calculated by observing the impact of a modification to the weight of a training datapoint on the prediction for a test datapoint. Assume we upweight the training datapoint z by a small amount ϵ, which produces the new parameters

θ̂_{ϵ,z} = argmin_θ (1/n) Σ_{i=1}^n L(z_i, θ) + ϵ L(z, θ).

Then, according to Koh and Liang (2017), the influence of the upweighted z on the parameters can be defined by

I_up,params(z) := dθ̂_{ϵ,z}/dϵ |_{ϵ=0} = −H_θ̂^{−1} ∇_θ L(z, θ̂),    (1)

where H_θ̂ = (1/n) Σ_{i=1}^n ∇²_θ L(z_i, θ̂) is the Hessian of the model. Applying the chain rule to Eq. 1 derives the form that measures the influence I_up,loss of z on the loss of a test point z_test:

I_up,loss(z, z_test) = −∇_θ L(z_test, θ̂)^⊤ H_θ̂^{−1} ∇_θ L(z, θ̂).

The NNIF method uses the I_up,loss score.
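As a concrete (if naive) illustration, the following sketch computes I_up,loss for a model small enough to form the Hessian explicitly; in practice, as in §4.5, the stochastic LiSSA estimator (Agarwal et al., 2017) replaces the explicit inverse. The damping term and the tiny-model usage are our illustrative additions.

```python
# A minimal sketch of I_up,loss with an explicit Hessian (tiny models only);
# real NLP models require a stochastic estimator such as LiSSA instead.
import torch

def flat_grad(loss, params, create_graph=False, retain_graph=False):
    grads = torch.autograd.grad(loss, params, create_graph=create_graph,
                                retain_graph=retain_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_up_loss(train_loss, z_loss, test_loss, params, damping=0.01):
    """-grad L(z_test)^T H^{-1} grad L(z); H is the Hessian of the mean
    training loss (train_loss) at the current parameters."""
    g_H = flat_grad(train_loss, params, create_graph=True)
    n = g_H.numel()
    # Hessian rows from Hessian-vector products against unit vectors
    H = torch.stack([flat_grad(g_H @ e, params, retain_graph=True)
                     for e in torch.eye(n)])
    H = H + damping * torch.eye(n)            # damping keeps H invertible
    g_z = flat_grad(z_loss, params)
    g_test = flat_grad(test_loss, params)
    return -(g_test @ torch.linalg.solve(H, g_z))

# Illustrative usage on a tiny linear classifier:
# model = torch.nn.Linear(4, 2); params = list(model.parameters())
# ce = torch.nn.functional.cross_entropy
# score = influence_up_loss(ce(model(X_tr), y_tr),      # mean training loss
#                           ce(model(x_z[None]), y_z),  # training point z
#                           ce(model(x_t[None]), y_t),  # test point z_test
#                           params)
```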

Figure 1: Adversarial examples characterised by divergence in learned representations between nearest neighbors and training points selected by influence functions, unlike original examples (from Cohen et al., 2020).

Figure 2: The correspondence between the helpful training records based on IFs in the embedding space of a DNN trained on the IMDB dataset. We present (using t-SNE) the embedding space of a DNN for an actual example (black star) with its adversarial version (purple cross), along with their 25 nearest neighbors (blue) and the most helpful samples based on the IF (red).
The assumption underpinning the Cohen et al. (2020) method is that influential training samples and nearest neighbors should overlap for normal examples, but less so for adversarial examples: having two views on 'nearby' points is key, as illustrated in Fig 1. We produce an analogous figure in Fig 2 for a randomly selected IMDB test point and its adversarial counterpart generated by WORDATT. We plot the 25 nearest neighbors and the 25 most helpful IF points using t-SNE (van der Maaten and Hinton, 2008). Ideally, normal neighbors and influence points (blue) should be more tightly grouped and closer to the test point (star); Cohen et al. (2020) expect that for the adversarial point (cross), the neighbors (orange down triangle) should often be separated from the influence points (red up triangle). We see this to some extent in Fig 2, with many adversarial neighbors near the normal point but adversarial influence points near the adversarial point. This is more difficult to see than in the idealised schematic of Fig 1, so for one view of the differences in this pair of points we separate IFs and NNs in Fig 3, with t-SNE recalculated for each. It is apparent that the IFs by themselves do a good job of separating normal from adversarial examples here, while the NNs are more mixed. We give representative examples for the other datasets and attacks in App C. The same pattern holds for the IMDB example under CHARATT. For both MULTINLI settings, however, the IFs less clearly separate the points, so the NNIF method relies on combining the two (NN, IF) views in the detector. To verify whether this is more generally true than just visually for Fig 3, we aim to measure how separable the samples in these plots are. As a measure of separability, we train 2000 SVC binary classifiers, one for each of our 1000 sampled test and adversarial point pairs, for both IFs and NNs. Each classifier is trained using GridSearchCV on the top 100 points in t-SNE space (either IFs or NNs), so each classifier corresponds to a plot like those in Fig 3 (App D). Accuracies averaged across the 1000 classifiers are in Table 4, with p-values for a one-tailed test of proportions (positing the alternative hypothesis H1 that the IF classifier is more accurate).
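A minimal sketch of one such separability measurement follows, with an illustrative parameter grid (the actual search space is given in Table 9).

```python
# A minimal sketch of the separability measurement: fit an SVC with a small
# grid search on one pair's top-100 2D t-SNE points (IFs or NNs), labelled
# normal vs adversarial, and record its accuracy. The grid is illustrative.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def separability(points_2d, labels):
    """points_2d: (100, 2) t-SNE coordinates; labels: 0 normal / 1 adversarial."""
    grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]})
    grid.fit(points_2d, labels)
    return grid.best_score_                 # cross-validated accuracy

# One classifier per (IF, NN) view of each of the 1000 pairs; then average:
# print(np.mean([separability(p, l) for p, l in if_view_pairs]))
```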

Figure 3: Normal and adversarial train subspace observed on the IMDB record used in Fig 2 under WORDATT, by influence function (top) and DkNN (bottom).

Figure 6: Normal and adversarial subspace by IF (top) and DkNN (bottom) on the unsuccessful detection by NNIF of the MULTINLI CHARATT text in Table 12.

C Decision Subspace Illustrations

Looking at the training samples that influence the prediction of a test datapoint gives us an illustration of the decision subspace of the DNN on it. To illustrate the subspace, we measure the top 25 influential (IF) and nearest neighbor (NN) training embeddings for a test record and its adversarial counterpart for each attack, and plot them along with the test and adversarial points. All embeddings are reduced to two dimensions using t-SNE. Figures 7 and 8 show an example each for the IMDB and MULTINLI datasets, respectively. In each figure, the top row depicts the IF-based training points and the bottom row shows the NN-based training points.

D Separability of Points: IF vs NN

We build SVC classifiers on the neighboring train embeddings to evaluate how well the influence function describes the learned subspace of the DNN compared to the DkNN. The best SVC classifiers over the NN and IF points for each of the 1000 test and adversarial example pairs are estimated through GridSearch over the parameters given in Table 9.

E Experimental Results Examples

NNIF combines the DkNN ranking on top of the influence scores to select the best training instances for a test datapoint. In Tables 10, 11 and 12 we illustrate examples for WORDATT and CHARATT, showing the top three helpful and harmful training instances for the detection of the adversarial attack. We also show the DkNN rankings of the top training instances filtered by the IF scores in the tables. As DISP performs better in one of the experimental settings in Liu et al. (2022), we further pick one example sentence from that paper that DISP detects correctly and observe NNIF's performance on it; NNIF is also able to detect the sentence correctly. In Table 13 we show the influential instances for this prediction as well.

Figure 7: Embedding subspace (t-SNE applied) of a test sample from the IMDB dataset (black square) and its adversarial version (purple cross) generated by three types of attacks. The top set of images shows the 25 most influential training samples and the bottom set shows the top 25 nearest neighbors (KNN).

Figure 8: Embedding subspace (t-SNE applied) of a test sample from the MULTINLI dataset (black square) and its adversarial version (purple cross) generated by three types of attacks. The top set of images shows the 25 most influential training samples and the bottom set shows the top 25 nearest neighbors (KNN).

Table 1: Examples of textual adversarial instances on IMDB and the prediction of BERT BASE on them.
Original Text (Positive): at last, a movie that handles the probability of alien visits with the appropriate depth and loving warmth.
Char-level, Pruthi et al. (2019) (Negative): at last, a movie that handles the probability of alien visits with the appr0priate depth and loving warDmth
Word-level, Alzantot et al. (2018) (Negative): at last, a movie that handles the probability of alien trips with the adequate depth and loving warmth

Table 2: Detection accuracy of all detection methods.


Table 4: SVC accuracy of linearly separating the 2D t-SNE embedding subspace of neighboring train samples of 1000 test records and their adversarial versions.

Table 5: The number of examples used in the experiments.

Table 6: BERT BASE classifier accuracy on the clean and adversarial examples.

Table 7: Different classifier accuracies on both the clean and adversarial datasets for MDRE.

Table 8: Detection accuracy obtained from the four detector classifiers used in SHAP.