Toward Stronger Textual Attack Detectors

The landscape of available textual adversarial attacks keeps growing, posing severe threats and raising concerns regarding the integrity of deep NLP systems. However, the crucial problem of defending against malicious attacks has so far drawn only limited attention from the NLP community, although it is instrumental in developing robust and trustworthy systems. This paper makes two important contributions in this line of research: (i) we introduce LAROUSSE, a new framework to detect textual adversarial attacks, and (ii) we introduce STAKEOUT, a new benchmark composed of nine popular attack methods, three datasets, and two pre-trained models. LAROUSSE is ready to use in production as it is unsupervised, hyperparameter-free, and non-differentiable, protecting it against gradient-based methods. Our new benchmark STAKEOUT allows for a robust evaluation framework: we conduct extensive numerical experiments which demonstrate that LAROUSSE outperforms previous methods and which allow us to identify interesting factors of detection rate variation.


Introduction
Despite the high performance of deep learning techniques for Natural Language Processing (NLP) applications, the trained models remain vulnerable to adversarial attacks [Barreno et al., 2006, Morris et al., 2020a], which limits their adoption for critical applications. In the context of NLP, for a given model and a given textual input, an adversarial example is a carefully constructed modification of the initial text that is semantically similar to the original text while affecting the model's prediction. The ability to design adversarial examples [Alves et al., 2018, Johnson, 2018, Subbaswamy and Saria, 2020] raises serious concerns regarding the security of NLP systems. It is, therefore, crucial to develop proper strategies to deal with these threats [Szegedy et al., 2014]. Perhaps surprisingly, while the research community has invested considerable effort in designing efficient attacks, only a few works address the issue of preventing them. One can distinguish two lines of research: detection methods, which aim at discriminating between regular inputs and attacks; and defense methods, which try to correctly classify adversarial inputs. The latter are based on robust training methods, which customize the learning process, see for instance [Zhou et al., 2021, Jones et al., 2020, Yoo and Qi, 2021, Pruthi et al., 2019]. These are limited to certain types of adversarial lures (e.g., misspellings), making them vulnerable to other types of attacks that already exist or may be designed in the future. In contrast, detection methods are more relevant to real-life scenarios where practitioners usually prefer to adopt a discard-rather-than-correct strategy [Chow, 1957]. This has been highlighted in [Yoo et al., 2022], which is, to the best of our knowledge, the only work that introduces a detection method that does not require training. Instead, the authors propose to measure the regularity of a given input by computing the Mahalanobis distance [Mahalanobis, 1936] of its embedding in the last layer of a transformer with respect to the training distribution. Notice that the Mahalanobis distance has also been successfully used in the very similar framework of Out-Of-Distribution (OOD) detection (see [Lee et al., 2018, Ren et al., 2021] and references therein).
In this paper, we build upon [Yoo et al., 2022] and introduce a new attack detection framework, called LAROUSSE, which improves the current state of the art. Our approach is based on the computation of the halfspace-mass depth [Chen et al., 2015] of the last layer embedding of an input with respect to the training distribution. The halfspace-mass depth is a particular instance of data depth, a family of functions that measure the proximity of a point to the core of a probability distribution. As a matter of fact, the Mahalanobis distance is also a data depth, and probably one of the most popular ones. Interestingly, in addition to improving the attack detection rate, the halfspace-mass depth remedies several limitations of the Mahalanobis depth: it does not make Gaussian assumptions on the data structure and is additionally non-differentiable, providing security guarantees against malicious adversaries that could rely on gradient-based methods. The second contribution of our work consists in releasing STAKEOUT, a new NLP attack benchmark that enriches the one introduced in [Yoo et al., 2022]. More precisely, we explore the same datasets and extend their four attacks by adding five new adversarial techniques. This ensures a wider variety of testing methods, leading to a robust evaluation framework that we believe will stimulate future research efforts. We conduct extensive numerical experiments on STAKEOUT and demonstrate the soundness of our LAROUSSE detector while studying the main factors of variability of its performance. Finally, we empirically observe the presence of information relevant to attack detection in layers other than the last one. This could pave the way for future research on detectors that are not limited to the last embedding layer but rather exploit the full network information.
Our contributions in a nutshell. Our contributions are threefold:
1. We introduce LAROUSSE, a new textual attack detector based on the computation of a carefully chosen similarity function, the halfspace-mass depth, between a given input embedding and the training distribution. Contrary to the Mahalanobis distance, it does not rely on an underlying Gaussian assumption on the data and is non-differentiable, making it robust to gradient-based attacks.
2. We introduce STAKEOUT, a new benchmark for textual attack detection that extends the one of [Yoo et al., 2022] with five additional attack methods, covering nine attacks, three datasets, and two pre-trained models.
3. We conduct extensive numerical experiments on STAKEOUT, demonstrating that LAROUSSE outperforms previous methods and identifying the main factors of variation of the detection rate.
The rest of the paper is organized as follows. In Sec. 2, we briefly review the setting of textual attacks, provide the main references on the subject, and formally introduce the problem of attack detection. In Sec. 3, we present our LAROUSSE detector and provide some perspectives on data depth and connections to the Mahalanobis distance. In Sec. 4, we introduce our new benchmark STAKEOUT and give details on the evaluation framework of attack detection. Finally, we present our experimental results in Sec. 5.

Textual Attacks: Generation and Detection
Let us first introduce some notations. We denote by D = {(x_i, y_i)}_{1≤i≤n} a textual dataset made of n pairs of textual input x_i ∈ X and associated attribute value y_i ∈ Y. We focus on classification tasks, meaning that Y is of finite size: |Y| < +∞. In this work, the inputs are first embedded through a multi-layer encoder with L layers and learnable parameters ψ ∈ Ψ. We denote by f^ℓ_ψ : X → R^d the function that maps the input text to the ℓ-th layer of the encoder. Note that, as we will work on transformer models, the latent space dimension (the dimension of the output of a layer) is the same for all layers and will be denoted by d. The dimension of the logits, denoted as the (L + 1)-th layer of the encoder, is d′. The final classifier built on the pre-trained encoder produces a soft decision C_ψ over the classes, where ψ is a learned parameter. We denote by C_ψ(c | x) the predicted probability that a given input x belongs to class c. Given an input x, the predicted label ŷ is then obtained as:

    ŷ = argmax_{c ∈ Y} C_ψ(c | x).
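For illustration, a minimal Python sketch (not the authors' code) of how the per-layer features f^ℓ_ψ(x), the logits, and the predicted label could be extracted from a fine-tuned HuggingFace encoder is given below. The checkpoint name and the choice of the [CLS] token as the sentence-level layer representation are assumptions made here for the example.

```python
# Hedged sketch: extract per-layer embeddings f_psi^l(x), logits, and y_hat
# from a fine-tuned transformer classifier. Checkpoint name is a hypothetical choice.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "textattack/bert-base-uncased-SST-2"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def extract_features(text: str):
    """Return ([f_psi^1(x), ..., f_psi^L(x)], f_psi^{L+1}(x), y_hat)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[l] has shape (1, seq_len, d); keep the [CLS] position as the layer embedding
    layer_embeddings = [h[0, 0] for h in out.hidden_states[1:]]
    logits = out.logits[0]
    y_hat = int(torch.argmax(torch.softmax(logits, dim=-1)))  # predicted label
    return layer_embeddings, logits, y_hat
```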

Review of textual attacks
The sensitivity of neural networks to adversarial examples was uncovered by [Szegedy et al., 2013] and popularized by [Goodfellow et al., 2014], who introduced fast adversarial generation methods in the context of computer vision. In computer vision, the meaning of an adversarial attack is clear: a given regular input is perturbed by a small noise which does not affect human perception but nonetheless changes the network prediction. However, due to the discrete nature of tokens in NLP, small textual perturbations are usually perceptible (e.g., a word substitution can change the meaning of a sentence). As a result, defining textual attacks is not straightforward, and the methods used in the context of images generally do not directly apply to NLP tasks.
The goal of a textual attack is to modify an input while keeping its semantic meaning and luring a deep learning model. At a high level, one can formally define the problem of textual attack generation as follows. Given an input x with predicted label ŷ(x), find a perturbation x_adv that satisfies the following optimization problem:

    find x_adv such that ŷ(x_adv) ≠ ŷ(x) subject to SIM(x, x_adv) ≥ ε,   (1)

where SIM : X × X → R+ denotes a function that measures the semantic proximity between two textual inputs and ε is a minimal similarity threshold. Finding a good similarity function is an active research area, and previous works [Li et al., 2018] rely on embedding similarities such as Word2Vec [Mikolov et al., 2013], USE [Cer et al., 2018], or string-based distances [Gao et al., 2018] based on the Levenshtein distance [Levenshtein, 1965], among others.
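To make the formulation concrete, a schematic greedy word-substitution attack under a similarity constraint could look as follows; this is only an illustration of problem (1), and the helpers `get_substitutes` (e.g., synonym lookup) and `semantic_sim` (e.g., USE cosine similarity) are hypothetical.

```python
# Schematic sketch of problem (1): greedily try word substitutions that flip the
# prediction while keeping SIM(x, x_adv) above a threshold.
def greedy_word_attack(words, predict, get_substitutes, semantic_sim, sim_threshold=0.8):
    """words: tokenized input; predict: text -> label; helpers are hypothetical."""
    original = " ".join(words)
    y_orig = predict(original)
    for i, word in enumerate(words):
        for candidate in get_substitutes(word):
            perturbed = words[:i] + [candidate] + words[i + 1:]
            x_adv = " ".join(perturbed)
            if semantic_sim(original, x_adv) < sim_threshold:
                continue  # candidate violates the similarity constraint of (1)
            if predict(x_adv) != y_orig:
                return x_adv  # prediction flipped: successful attack
    return None  # no adversarial example found under the constraint
```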
The landscape of available adversarial textual attacks keeps growing, with numerous attacks released every year [Li et al., 2021, Ribeiro et al., 2020, Li et al., 2020, Garg and Ramakrishnan, 2020, Alzantot et al., 2018, Jia et al., 2019, Ren et al., 2019, Feng et al., 2018, Li et al., 2018, Zang et al., 2019]. Attacks can be grouped according to the perturbation level, that is, the level of granularity at which the corruption is performed. For instance, character-level perturbations [Ebrahimi et al., 2018, Pruthi et al., 2019] are usually based on basic operations such as substitution, deletion, swapping, or insertion. There also exist word-level corruption techniques [Ebrahimi et al., 2018, Pruthi et al., 2019] which usually perform word substitution using synonyms or semantically equivalent words [Miller, 1995, Miller et al., 1990]. Finally, we can also find sentence-level attacks [Iyyer et al., 2018] relying on text generation techniques. Standard toolkits such as OpenAttack [Zeng et al., 2021] or TextAttack [Morris et al., 2020b] gather them in a unified framework.

Review of textual attack detection methods
The goal of an adversarial attack detector is to build a binary decision rule d : X → {0, 1} that assigns 1 to adversarial samples created by the malicious attacker and 0 to clean samples. Typically, this decision rule relies on a function s : X → R that measures the similarity between an input sample and the training distribution, and on a threshold γ ∈ R:

    d(x) = 1 if s(x) ≤ γ, and 0 otherwise.   (2)

As already mentioned in the previous section, some works rely on robust training by adding regularization terms that use adversarial generation [Dong et al., 2021, Wang et al., 2020, Yoo and Qi, 2021]. Existing detection methods, on the other hand, either compute sentence likelihood based on word frequencies or focus on specific types of attacks [Le et al., 2021, Pruthi et al., 2019]. The only work that does not require access to adversarial examples is [Yoo et al., 2022], which computes a similarity score between a given input embedding and the training distribution. This similarity function is the Mahalanobis distance and has been widely used in the related literature on OOD detection methods [Podolskiy et al., 2021, Ren et al., 2021, Kamoi and Kobayashi, 2020].

LAROUSSE: A Novel Adversarial Attacks Detector
We follow the notations introduced in Sec. 2. In particular, recall that f^L_ψ : X → R^d is the mapping to the last layer embedding of the considered network.

LAROUSSE in a nutshell
Our framework for adversarial attack detection relies on three consecutive steps.

1. Feature extraction. As in [Yoo et al., 2022], we rely on the last layer embedding f^L_ψ(x) of a given textual input x. We will use the notation z ≜ f^L_ψ(x) ∈ R^d.

2. Anomaly score computation. In the second step, we compute a similarity score between the last layer embedding z and the training distribution of the predicted class of z. To formally write this score, we need to introduce, for each y ∈ Y, the empirical distribution of the last layer embeddings of the training samples with label y:

    P^L_y ≜ (1/n_y) Σ_{i : y_i = y} δ_{f^L_ψ(x_i)},   with n_y = |{i : y_i = y}|.

With these notations in mind, our similarity score writes, for a given input x with predicted class ŷ:

    s_LAROUSSE(x) ≜ D_HM(z, P^L_ŷ),   (3)

where D_HM denotes the halfspace-mass depth that we carefully present in Sec. 3.2. The higher the value of D_HM, the more regular x is with respect to P^L_ŷ.

3. Thresholding. Similar to previous works, the final step consists in thresholding our similarity score: we detect x as an adversarial attack if and only if s_LAROUSSE(x) ≤ γ, where γ is a hyperparameter of the detector. A schematic sketch of the full pipeline is given below.
Remark 1 In the experimental section, we will also consider the case where the depth function is computed based on the logits. This corresponds to replacing the last layer embedding f^L_ψ(x) with the logits f^{L+1}_ψ(x) in the score computation (see also Remark 3).
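For illustration, a minimal sketch (under our assumptions, not the released implementation) of the three steps is given below. It reuses the hypothetical `extract_features` helper from Sec. 2 and assumes an approximate halfspace-mass depth scorer such as the one sketched in the next subsection.

```python
# Hedged sketch of the LAROUSSE pipeline: per-class depth models on last-layer
# embeddings, a depth score for new inputs, and thresholding.
import numpy as np

def fit_larousse(train_texts, train_labels, extract_features, depth_cls):
    """Fit one class-conditional depth model per label y on last-layer train embeddings."""
    per_class = {}
    for text, y in zip(train_texts, train_labels):
        z = extract_features(text)[0][-1].numpy()        # f_psi^L(x_i)
        per_class.setdefault(y, []).append(z)
    return {y: depth_cls().fit(np.stack(zs)) for y, zs in per_class.items()}

def larousse_score(text, extract_features, depth_models):
    layer_embeddings, _, y_hat = extract_features(text)
    z = layer_embeddings[-1].numpy()                      # z = f_psi^L(x)
    return depth_models[y_hat].score(z)                   # s_LAROUSSE(x) = D_HM(z, P^L_yhat)

def is_adversarial(text, extract_features, depth_models, gamma):
    """Step 3: flag x as adversarial iff its depth score falls below the threshold gamma."""
    return larousse_score(text, extract_features, depth_models) <= gamma
```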

A brief review of data depths and the halfspace-mass depth
With the goal of extending the notions of order and rank to multivariate spaces, the statistical concept of depth has been introduced by John Tukey in [Tukey, 1975]. Data depth has found many applications in Statistics and Machine Learning (ML), such as classification [Lange et al., 2014], clustering [Jörnsten, 2004], automatic text evaluation [Staerman et al., 2021b], or anomaly detection [Staerman et al., 2020, 2022]. A depth function D(·, P) : R^d → [0, 1] provides a score that reflects the closeness of any element x ∈ R^d to a probability distribution P on R^d. The higher (respectively lower) the score of x, the deeper (respectively farther) it is in P. Many proposals have been suggested in the literature, such as the projection depth [Liu, 1992], the zonoid depth [Koshevoy and Mosler, 1997], or the Monge-Kantorovich depth [Chernozhukov et al., 2017], differing in properties and applications. To compare their benefits and drawbacks, standard properties that a data depth should satisfy have been developed in [Zuo and Serfling, 2000] (see also [Dyckerhoff, 2004]). We refer the reader to [Mosler, 2013] or to [Staerman, 2022, Ch. 2] for an excellent account of data depth.
The halfspace-mass depth. Beyond the appealing properties satisfied by depth functions, such as affine invariance [Zuo and Serfling, 2000], these statistical tools suffer in practice from a high computational burden, which limits their widespread use in ML applications [Mosler and Mozharovskyi, 2020]. However, efficient approximations have been provided, such as for the halfspace-mass depth [Chen et al., 2015] (see also [Ramsay et al., 2019, Staerman et al., 2021a]). The halfspace-mass (HM) depth of x ∈ R^d w.r.t. a distribution P on R^d is defined as the expectation, over the set H(x) of all closed halfspaces containing x, of the probability mass of such halfspaces. More precisely, given a random variable X following a distribution P and a probability measure Q on H(x), the HM depth of x w.r.t. P is defined as follows:

    D_HM(x, P) = E_{H ∼ Q} [ P(X ∈ H) ].   (4)

When a training set {x_1, . . . , x_n} is given, expression (4) boils down to:

    D_HM(x, P̂_X) = E_{H ∼ Q} [ (1/n) Σ_{i=1}^{n} 1{x_i ∈ H} ],   (5)

where P̂_X denotes the empirical measure (1/n) Σ_{i=1}^{n} δ_{x_i}. The halfspace-mass depth has been successfully used in anomaly detection (see [Chen et al., 2015] and [Staerman et al., 2021a]), making it a natural candidate for detecting adversarial attacks at the layers of a neural network.
Computational aspects. The expectation in (5) can be approximated by means of a Monte Carlo scheme, as opposed to several depth functions that are defined as solutions to optimization problems [Tukey, 1975, Liu, 1992], which become unfeasible when the dimension is too high. The aim is to approximate (5) with a finite number of halfspaces containing x. To that end, the authors of [Chen et al., 2015] introduced an algorithm, divided into a training and a testing part, that provides a computationally efficient approximation of (5). The three main parameters involved are K, the number of directions sampled on the sphere, n_s, the size of the sub-sample drawn at each projection step, and λ, which controls the extent of the choice of the hyperplane. Since the HM approximation has low sensitivity to its parameters, in the remainder of the paper we set K = 10000, n_s = 32, and λ = 0.5. The computational complexity of the training part is of order O(K n_s d) and that of the testing part O(K d), which makes the depth cheap to compute. Further details are provided to the curious reader in Sec. 8.1.

Remark 2 (Advantages over the Mahalanobis distance) In contrast to approaches based on the Mahalanobis distance [Lee et al., 2018, Yoo et al., 2022], the halfspace-mass depth does not require estimating and inverting the covariance matrix of the training data, which can be challenging both from computational and statistical perspectives, especially in high dimension. In addition, the HM depth does not need any assumption on the distribution, while the Mahalanobis distance is restricted to distributions with finite first two moments.
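For illustration, a minimal sketch of the Monte Carlo approximation described above could look as follows. It is written from the description of [Chen et al., 2015] given in this section; in particular, the exact way the split point is drawn from λ is an assumption of ours, not the reference implementation.

```python
# Hedged sketch of the approximate halfspace-mass depth: K random directions,
# a sub-sample of size n_s per direction, and a random split point controlled by lambda.
import numpy as np

class HalfspaceMassDepth:
    def __init__(self, K=10_000, n_s=32, lam=0.5, seed=0):
        self.K, self.n_s, self.lam = K, n_s, lam
        self.rng = np.random.default_rng(seed)

    def fit(self, X):                      # X: (n, d) training embeddings
        n, d = X.shape
        self.dirs = self.rng.normal(size=(self.K, d))
        self.dirs /= np.linalg.norm(self.dirs, axis=1, keepdims=True)  # directions on the sphere
        self.splits = np.empty(self.K)
        self.mass_left = np.empty(self.K)
        for k in range(self.K):
            sub = X[self.rng.choice(n, size=min(self.n_s, n), replace=False)]
            proj = sub @ self.dirs[k]
            mid, spread = (proj.max() + proj.min()) / 2, proj.max() - proj.min()
            # split point drawn in an interval whose width is controlled by lambda (assumption)
            self.splits[k] = self.rng.uniform(mid - self.lam * spread, mid + self.lam * spread)
            self.mass_left[k] = np.mean(proj <= self.splits[k])  # mass on the "left" side
        return self

    def score(self, x):                    # approximate D_HM(x, P_hat)
        proj = self.dirs @ x
        left = proj <= self.splits
        return float(np.mean(np.where(left, self.mass_left, 1.0 - self.mass_left)))
```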

STAKEOUT: A Novel Benchmark for Adversarial Attacks
Textual attack generation can be computationally expensive, as some attacks require hundreds of queries to corrupt a single sample. Having a benchmark that gathers the results of diverse attacks on different datasets and encoders is instrumental in accelerating future research efforts by reducing computational overhead. To build our benchmark, we relied on the models, the datasets, and the attacks available in TextAttack [Morris et al., 2020b]. In the following, we describe the experimental choices we made when building STAKEOUT and discuss our baselines and evaluation pipeline.

A novel benchmark: STAKEOUT
Training Datasets. We choose to work on sentiment analysis, using SST2 [Socher et al., 2013] and IMDB [Maas et al., 2011], and topic classification, relying on ag-news [Joachims, 1996]. These datasets are used in [Yoo et al., 2022] and allow for comparison with previously obtained results.
Takeaways of Fig. 1. Interestingly, attack efficiency only marginally depends on the pre-trained encoder type. In contrast, there is a strong dependency on the training set (variation of over 0.2 points). It is worth noting that TF and KUL are the most efficient attacks. From the average number of queries, we note that attacking a classifier trained on IMDB is harder than attacking one trained on SST2, despite both being binary classification tasks.
Adversarial and clean sample selection. For evaluation, we rely on test sets made of clean samples and adversarial ones. In order to construct such sets while controlling the ratio between clean and adversarial samples, we rely on [Yoo et al., 2022, Scenario 1]. From a given initial test set X_t, we sample two disjoint subsets X_1 and X_2. We then generate attacks on X_1 and keep the successful ones as adversarial testing examples, while X_2 provides the clean testing samples.

Baseline detectors
We use two baseline detectors. The first one is based on a language model likelihood, and the second one corresponds to the Mahalanobis detector introduced in [Yoo et al., 2022]. Both follow the same three consecutive steps as LAROUSSE but rely on a different similarity score.
Language model score. This method consists in computing the likelihood of an input with an external language model:

    s_LM(x) = Σ_i log P_θ(ω_i | ω_1, . . . , ω_{i−1}),   (6)

where ω_i denotes the i-th token of the input sentence x. We compute the log-probabilities with the output of a pretrained GPT2 [Brown et al., 2020]. Notice that this baseline is also used in [Yoo et al., 2022].
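For illustration, a minimal sketch of the score (6) using a pretrained GPT-2 from HuggingFace transformers could look as follows; this is an assumption about how such a score can be computed, not the baseline's exact implementation.

```python
# Hedged sketch of the language-model score (6): total log-likelihood of the input
# under GPT-2, recovered from the mean cross-entropy returned by the model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_score(text: str) -> float:
    ids = gpt2_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = gpt2_lm(ids, labels=ids)     # out.loss = mean NLL over the predicted tokens
    n_predicted = ids.size(1) - 1          # the first token has no left context
    return -out.loss.item() * n_predicted  # sum_i log p(w_i | w_<i)
```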
Mahalanobis-based detector. We follow [Yoo et al., 2022], which relies on a class-conditioned Mahalanobis distance. Following our notations, it corresponds to evaluating:

    s_M(x) = (z − μ_ŷ)^⊤ Σ_ŷ^{−1} (z − μ_ŷ),

where z is the feature representation of x, μ_ŷ is the empirical mean of the class-ŷ features (e.g., the logits), and Σ_ŷ is the associated empirical covariance.
Remark 3 Similarly to Remark 1, for a given textual input x, we will rely either on the penultimate layer representation f^L_ψ(x) or on the logits f^{L+1}_ψ(x) of the network to compute s_M.
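For illustration, a minimal sketch of the class-conditioned Mahalanobis score is given below. The per-class covariance estimation and the small diagonal ridge added for invertibility are assumptions of this sketch.

```python
# Hedged sketch of the Mahalanobis baseline: per-class means and covariances on
# training features, then the quadratic form s_M(x) for the predicted class.
import numpy as np

def fit_mahalanobis(features, labels):
    """features: (n, d) array of f_psi^L(x_i) or logits; labels: (n,) training labels."""
    stats = {}
    for y in np.unique(labels):
        Z = features[labels == y]
        mu = Z.mean(axis=0)
        cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])  # ridge for invertibility (assumption)
        stats[y] = (mu, np.linalg.inv(cov))
    return stats

def mahalanobis_score(z, y_hat, stats):
    mu, inv_cov = stats[y_hat]
    diff = z - mu
    return float(diff @ inv_cov @ diff)    # s_M(x); larger = farther from class y_hat
```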

Evaluation metrics
The adversarial attack detection problem can be seen as a classification problem. In our context, two quantities are of interest, namely (i) the false alarm rate, i.e., the proportion of samples that are misclassified as adversarial while actually being clean; and (ii) the true detection rate, i.e., the proportion of samples that are correctly detected as adversarial. We rely on the following metrics to assess the quality of our method.
1. Area Under the Receiver Operating Characteristic curve (AUROC; [Bradley, 1997]). It is the area under the ROC curve, which plots the true detection rate against the false alarm rate. From elementary computations, the AUROC can be linked to the probability that a clean example has a higher score than an adversarial sample.
2. Area Under the Precision-Recall curve (AUPR; [Davis and Goadrich, 2006]). It is the area under the precision-recall curve, which is more relevant in imbalanced situations. The curve plots the precision (the actual proportion of adversarial samples among the predicted adversarial samples) against the recall (the true detection rate).

3. False Positive Rate at 90% True Positive Rate (FPR (%)). In a practical situation, one wishes to build an efficient detector. Thus, given a detection rate r, one fixes a threshold δ_r such that the corresponding TPR equals r, and reports the FPR at this threshold. Following [Yoo et al., 2022], we set r = 0.90. For the FPR, lower is better.

4. Classification error (Err (%)). This refers to the lowest classification error obtained by choosing the best fixed threshold.
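For illustration, these metrics can be computed from raw detector scores with scikit-learn as sketched below. The sketch assumes scores oriented so that higher values indicate adversarial inputs (e.g., −s_LAROUSSE or s_M) and labels equal to 1 for adversarial samples; AUPR-OUT would be obtained analogously by treating clean samples as the positive class with negated scores.

```python
# Hedged sketch of the evaluation metrics: AUROC, AUPR-IN, FPR at 90% TPR, and Err.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def detection_metrics(y_true, scores, tpr_target=0.90):
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    auroc = roc_auc_score(y_true, scores)
    aupr_in = average_precision_score(y_true, scores)          # adversarial = positive class
    fpr, tpr, _ = roc_curve(y_true, scores)
    fpr_at_tpr = float(fpr[np.searchsorted(tpr, tpr_target)])  # FPR at the first TPR >= 0.90
    # lowest classification error over all thresholds (Err)
    errs = [np.mean((scores >= t).astype(int) != y_true) for t in np.unique(scores)]
    return {"AUROC": auroc, "AUPR-IN": aupr_in,
            "FPR@90TPR": fpr_at_tpr, "Err": float(min(errs))}
```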

Overall Results
We report in Tab. 2 the aggregated performance over the different datasets, the various seeds, and the different attacks. D_HM achieves the best overall results. It is worth noting that detection methods discriminate adversarial attacks better on ROB. than on BERT. Using the halfspace-mass score D_HM instead of the Mahalanobis score D_M also consistently improves the performance, which experimentally validates our choice. This conclusion holds on both ROB. and BERT, corresponding to over 540 experimental configurations. Similar to previous work [Yoo et al., 2022], the detector built on GPT2 under-performs D_M. For all methods, we observe that LAROUSSE achieves the best results both in terms of threshold-free (e.g., AUROC, AUPR-IN, and AUPR-OUT) and threshold-based (e.g., FPR) metrics, which validates our detector.

Importance of feature selection for adversarial detectors. Both D_HM and D_M are highly sensitive to the choice of layer. For D_M, using the logits is better than the penultimate layer, while for D_HM, the converse works better. Although the AUROC presents only a slight variation when using f^{L+1}_ψ instead of f^L_ψ, it induces a variation of over 10 FPR points. Overall, it is worth noting that LAROUSSE, although being state-of-the-art on the tested configurations, achieves an FPR that remains moderate. The best average error of 17.9% is far from the error achieved on the main task (less than 10% on all datasets).

Identifying key detection factors
To better understand the performance of our methods w.r.t. different attacks and various datasets, we report in Fig. 4 the performance in terms of AUROC and FPR per attack.

Detectors and models are not robust to dataset changes. The detection task is more challenging for SST-2 than for ag-news and IMDB, with a significant drop in performance (e.g., over 15 absolute points for BAE). On SST-2, D_HM achieves a significant gain over D_M both for the AUROC and the FPR.

Detectors do not detect the various attacks uniformly well. This phenomenon is pronounced on SST2 while also being present on ag-news and IMDB. For example, on SST2, the FPR varies from less than 10 (a strong detection performance) for TF-ADJ to over 70 (a poor performance) for PRU.

Attacks that are hard to detect for ROB. are not necessarily hard to detect for BERT. This phenomenon is illustrated by Fig. 2. For example, KUL is hard to detect for BERT while being easier on ROB., as LAROUSSE achieves over 96 AUROC points. If safety is a primary concern, it is thus crucial to carefully select the pre-trained encoder.

The choice of clean samples largely affects the measured detection performance. Fig. 4 and Fig. 2 display several runs with different seeds. As mentioned in Sec. 4, different seeds correspond to different choices of clean samples. On all datasets, we observe that different negative samples lead to different results when measuring the algorithm performance (e.g., the FPR on IMDB varies by over 30 points on KUL and PRU across seeds).

All the metrics matter
Setting. In this experiment, we study the relationship between the different metrics. From Tab. 2, we see that threshold-free metrics (i.e., AUROC, AUPR-IN) exhibit lower variance than threshold-based metrics such as the FPR. The FPR measures the percentage of natural samples detected as adversarial when 90% of the attacked samples are detected; therefore, the lower, the better.

Takeaways. From Fig. 3, we see that a large AUROC, AUPR-IN, or AUPR-OUT does not necessarily correspond to a low FPR. This suggests that a detector may also flag natural samples as adversarial when it detects at least 90% of adversarial examples. Additionally, a small variation of AUROC, AUPR-IN, or AUPR-OUT can lead to a large change in FPR. It is therefore crucial to compare the detectors using all metrics.

Expected performances of LAROUSSE
Setting. Fig. 5 reports the error probability per attack for LAROUSSE and the considered baselines.

Efficient attacks are easier to detect. We observe that on the three most efficient attacks according to Fig. 1 (i.e., TF, PWWS, and KUL), LAROUSSE is significantly more effective than D_M and GPT2.

Different detection methods, capturing different phenomena, are better suited to different types of attacks. Although LAROUSSE achieves the best results overall, GPT2, which relies on perplexity only, achieves results competitive with LAROUSSE and outperforms D_M on several attacks (i.e., DWB and IG). This suggests that stronger detectors could be obtained by combining different types of scoring functions.

Semantic vs syntactic attacks
In this section, we analyze the results of LAROUSSE on semantic (i.e., operating at the token level) versus syntactic (i.e., operating at the character level) attacks.
Raw and processed results are reported in Sec. 10.5.

Takeaways. From Fig. 9b, we observe that semantic attacks are harder to detect, both for our method and for D_M.

Concluding Remarks

We have proposed STAKEOUT, a large adversarial attack detection benchmark, and LAROUSSE, a new attack detector. LAROUSSE leverages a new anomaly score built on the halfspace-mass depth and offers a better alternative to the widely used Mahalanobis distance.
Ethical Impact of Our Work

Our work focuses on responsible NLP and aims at contributing to the protection of NLP systems. Our new benchmark STAKEOUT allows for a robust evaluation of new adversarial detection methods.
LAROUSSE outperforms previous methods and thus provides a better defense against attackers. Overall, we believe this paper offers a promising research direction toward safe and robust NLP systems and will benefit the community.
Approximation algorithms for D_HM. Algorithm 1 (training) repeatedly draws D_{y,n_s}, a sub-sample of D_y of size n_s without replacement, draws a direction randomly and uniformly on the sphere, and randomly and uniformly selects a split point κ_k. Algorithm 2 (testing) approximates D_HM for a new point from the stored directions and split points.

Additional Results
This section gathers additional experimental results to allow the curious reader to draw finer conclusions. Formally, we conduct:
• a detailed analysis of the detectors' performances per attack (see Sec. 10.1);
• a detailed analysis of the detectors' performances per dataset across all the considered metrics (see Sec. 10.2);
• an analysis per dataset and per attack of the different detectors' performances on STAKEOUT (see Sec. 10.4);
• a comparative study of the detectors' performance on semantic versus syntactic attacks;
• an in-depth reflection on the possibility of building multi-layer detectors (see Sec. 10.6).

Fine grained analysis per attack
In Tab. 4, we report the average performance on STAKEOUT for each detector and each model under each attack. First, it is interesting to note that LAROUSSE strongly outperforms the other methods on most configurations. Then, corroborating previous observations, we find that changing attacks, encoders, and metrics largely influences the detection performance.
Takeaways. These findings validate our extended STAKEOUT: in real-life scenarios, practitioners need to ensure that detection methods work well on a large number of attacks and for different types of models.

Fine grained analysis per dataset
We report in Tab. 5 the performance on STAKEOUT averaged over the datasets for different detector configurations. We observe that LAROUSSE achieves the best results on 2 out of the 3 datasets. Overall, it is interesting to note that LAROUSSE's performance is more consistent than that of the Mahalanobis detector when changing the feature representation (i.e., using f^L_ψ instead of f^{L+1}_ψ).

Takeaways. This validates that the halfspace-mass depth is better suited for detecting textual adversarial attacks than the widely used Mahalanobis score.

Extended figures for Sec. 5.2
We report in Fig. 7 the extended figures for Sec. 5.2. The baseline detector built on GPT2 is weaker than D_M and LAROUSSE, and consistently achieves lower results in terms of AUROC, AUPR-IN, AUPR-OUT, and FPR.

Analysis per dataset/per attack
We report in Fig. 8 the different detectors' performances in terms of AUROC, AUPR-IN, AUPR-OUT and FPR for the different datasets.
Similar to what has been previously observed, we see a large variation in the different detectors' performance when changing both the dataset and the type of attack.

Comparing detection performance between semantic versus syntactic attacks
In this section, we analyze the results of LAROUSSE on semantic (i.e., operating at the token level) versus syntactic (i.e., operating at the character level) attacks. Raw and processed results are reported in Fig. 9.
Takeaways. From Fig. 9b, we observe that semantic attacks are harder to detect, both for our method and for D_M.

Towards multi-layer detectors
A promising research direction to improve detection methods is to develop an unsupervised strategy to combine multiple layer representations of the pre-trained encoders [Gomes et al., 2022, Sastry and Oore, 2020]. To the best of our knowledge, this has never been shown to be useful for text data.

Setting. In this experiment, we aim to quantify the power of each layer to discriminate between clean and adversarial samples. To measure this ability, we rely on the Wasserstein distance (W_1; see [Peyré and Cuturi, 2019]). Given two empirical distributions, W_1 finds the best possible transfer between them while minimizing the transportation cost defined by the Euclidean distance. Fig. 10 reports the transportation cost (W_1) between the empirical distributions of clean samples (μ_clean) and adversarial samples (μ_adv) obtained at each layer.

Analysis. The last layers of the encoder have a better ability to discriminate the adversarial samples from the clean ones than the first layers. Similarly to what can be observed in Fig. 3, we observe that IMDB is the easiest dataset, as the last encoder layer can better distinguish the adversarial samples from the clean ones. Interestingly, we observe that the best layer depends on the dataset, which is consistent with observations in NLG evaluation [Zhang* et al., 2020], where the optimal layer is found using a validation set. Overall, the information present at the last encoder layers suggests that designing multi-layer detectors is a promising research direction.
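For illustration, a minimal sketch of this per-layer measurement, assuming the POT (Python Optimal Transport) library and pre-extracted layer features, could look as follows; the helper names are ours.

```python
# Hedged sketch: exact 1-Wasserstein distance, with Euclidean ground cost, between
# clean and adversarial embeddings at a given layer.
import numpy as np
import ot  # POT: Python Optimal Transport

def layer_w1(clean_feats: np.ndarray, adv_feats: np.ndarray) -> float:
    """clean_feats: (n, d) and adv_feats: (m, d) layer embeddings."""
    a = ot.unif(len(clean_feats))                             # uniform weights on each empirical measure
    b = ot.unif(len(adv_feats))
    M = ot.dist(clean_feats, adv_feats, metric="euclidean")   # ground cost matrix
    return float(ot.emd2(a, b, M))                            # W1 between the two empirical distributions

# per_layer_clean[l], per_layer_adv[l]: stacked f_psi^l features for clean / adversarial sets
# w1_per_layer = [layer_w1(per_layer_clean[l], per_layer_adv[l]) for l in range(L)]
```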

Attacking our detectors
Adversarial attack detection methods have been extensively studied in the computer vision community [Feinman et al., 2017, Ma et al., 2018, Kherchouche et al., 2020, Aldahdooh et al., 2022b,a], and recently a line of work on adaptive attacks [Carlini and Wagner, 2017, Athalye et al., 2018, Tramer et al., 2020] has emerged. LAROUSSE is not differentiable, which adds an extra layer of security: it prevents malicious adversaries from leveraging gradient computations, contrary to the studied baselines (e.g., Mahalanobis, GPT). Attacking LAROUSSE is thus a challenging research question that falls outside the scope of this paper and is left as future work.

Computation time comparison between HM and Mahalanobis depths
In this part, we compare the computation time between the HM and the Mahalanobis depths.

Figure 1: Efficiency of the chosen attacks. Both checklist and input reduction were tried but discarded due to low efficiency. Dashed lines report the average performance for each dataset.

Figure 2: Performance per attack for each pre-trained encoder in terms of AUROC (left) and FPR (right) of D_M and D_HM on STAKEOUT.

Figure 3: Empirical study of the metric relationships for the three considered detection methods.

Figure 4: Performance in terms of AUROC (top) and FPR (bottom) of D_M and D_HM on STAKEOUT. Fig. 7 in the Supplementary Material reports the results of GPT2.

Figure 6: FPR for the semantic vs syntactic analysis; further results can be found in Fig. 9a.

Figure 7: Extended figures for Sec. 5.2. In these figures, we report the GPT2 baseline and the performance of all the detectors in terms of AUROC, AUPR-IN, AUPR-OUT, and FPR.

Table 1: Classifier accuracy for each considered dataset.

Table 3: Considered attacks for STAKEOUT construction.
• pruthi (PRU) [Pruthi et al., 2019]: simulation of common typos using greedy search for untargeted classification; constraints: minimum word length, maximum number of words perturbed.
• textbugger (TB) [Li et al., 2018]: character-based attack (i.e., swap, deletion, substitution); constraint: cosine similarity with USE [Cer et al., 2018].
• iga (IG) [Wang et al., 2019]: genetic algorithm to perform word substitution; constraints: percentage of perturbed words and word embedding distance on Word2Vec [Mikolov et al., 2013].
• deepwordbug (DWB) [Gao et al., 2018]: character-based attack (i.e., swap, deletion, substitution); constraint: Levenshtein distance [Levenshtein, 1965].
• kuleshov (KUL) [Kuleshov et al., 2018]: attack using embedding swap; constraints: cosine and language model similarity.
• clare (CLA) [Li et al., 2021]: attack using token insertion, merge, and swap; constraint: embedding similarity.
• bae (BAE) [Garg and Ramakrishnan, 2020]: attack using BERT MLM combined with a greedy search; constraints: number of perturbed words and cosine with USE [Cer et al., 2018].
• pwws (PWWS) [Ren et al., 2019]: word swap based on WordNet synonyms.
• textfooler (TF) [Jin et al., 2020]: attack using embedding swap; constraints: embedding similarity and POS match with word and embedding swap.
• TF-adjusted (TF-ADJ) [Morris et al., 2020a]: attack using embedding swap; constraints: USE and word embedding similarity.
• input-reduction (IR) [Feng et al., 2018]: greedy attack using word importance ranking via greedy search.
• checklist (CHK) [Ribeiro et al., 2020]: attack using contraction/extension and changing numbers, locations, and names.

Table 4: Average performance on STAKEOUT per model and per attack.

Table 5: Average performance on STAKEOUT per training dataset.