Detecting Adversarial Samples through Sharpness of Loss Landscape

Deep neural networks (DNNs) have been proven to be sensitive to perturbations on input samples, and previous works highlight that adversarial samples are even more vulnerable than normal ones. In this work, this phenomenon is illustrated from the perspective of sharpness by visualizing the input loss landscape of models. We first show that adversarial samples are located in steep and narrow local minima of the loss landscape (high sharpness), while normal samples, which differ distinctly from adversarial ones, reside in flatter regions of the loss surface (low sharpness). Based on this, we propose a simple and effective sharpness-based detector to distinguish adversarial samples by maximizing the loss increment within the region where the inference sample is located. Considering that the notion of sharpness of a loss landscape is relative, we further propose an adaptive optimization strategy in an attempt to fairly compare the relative sharpness among different samples. Experimental results show that our approach outperforms previous detection methods by large margins (average +6.6 F1 score) for four advanced attack strategies considered in this paper across three text classification tasks. Our codes are publicly available at https://github


Introduction
Despite the popularity and success of pre-trained language models (PLMs), they are vulnerable to textual adversarial attacks (Garg and Ramakrishnan, 2020; Zhang et al., 2020). These attacks are designed to generate semantically consistent and syntactically correct adversarial samples that can fool the model into making incorrect predictions (Ren et al., 2019; Maheshwary et al., 2021). Adversarial vulnerability raises concerns about the safe practice of NLP systems in a variety of tasks (Wallace et al., 2019; Zhang et al., 2021; Lin et al., 2021).

Figure 1: Input loss landscape of the model with respect to a normal and an adversarial sample. The adversarial sample is located in a sharp local minimum of the input loss landscape, while the normal one resides in a wide, flat area.

In machine learning, there are two main streams to counter adversarial attacks: adversarial detection and defense (Cohen et al., 2020). The purpose of detection is to distinguish adversarial samples from normal ones and discard them during the inference phase (Mozes et al., 2021; Yoo et al., 2022), while defense aims to predict the correct results for adversarial texts (Li et al., 2021b; Zheng et al., 2022; Omar et al., 2022; Liu et al., 2022b; Xi et al., 2022). The detect-discard strategy is an important step towards a robust model and can be integrated with existing defense methods. A significant challenge in adversarial detection is to identify an effective characteristic for recognition.
The existing state-of-the-art adversarial detection methods can be broadly classified into two categories: 1) perturbation-based methods (Mozes et al., 2021; Mosca et al., 2022; Wang et al., 2022) and 2) distribution-based methods (Yoo et al., 2022; Liu et al., 2022a). The perturbation-based methods assume that adversarial samples are more sensitive to perturbations in the input space than normal samples. These methods are based on the model's reaction when the input words are perturbed by substitution (Mozes et al., 2021; Wang et al., 2021) or deletion (Mosca et al., 2022). However, these methods rely on empirically designed perturbations, and it is difficult to find an optimal perturbation in the discrete text space. More importantly, no attempt has been made to explore why the sensitivity assumption is valid or to provide more details about this assumption.
We delve into the input loss landscape to characterize the model's sensitivity with respect to normal and adversarial samples. By visualizing the input loss landscape of the model, we observe a significant difference between adversarial and normal samples: the local minima of the loss surface with respect to adversarial samples are steep and narrow (high sharpness), while those of normal samples are much flatter (low sharpness). This distinct difference makes sharpness a suitable criterion for distinguishing adversarial samples from normal ones. However, it remains challenging to effectively measure the sharpness of an input loss landscape.
In this work, we formulate the sharpness calculation as a constrained optimization problem whose objective is to find a neighbor within the region where the inference sample is located that maximizes the loss increment. The convergence quality of this constrained optimization problem can be assessed by the "Frank-Wolfe gap" (i.e., the gap between the global optimum and the current estimate) (Frank and Wolfe, 1956; Lacoste-Julien, 2016). With this criterion, we find that samples tend to converge to different levels, which hinders a fair comparison of relative sharpness between samples (Dinh et al., 2017). Therefore, we design an adaptive optimization strategy that guides the solutions to gradually converge to the same level, thereby significantly improving the detection performance. Our contributions are as follows:
• We analyze the geometric properties of the input loss landscape and reveal that adversarial samples lie in deep and sharp local minima on the input loss landscape.
• We propose a detection metric based on the sharpness of the input loss landscape, which can be formulated as a constrained optimization problem.
• We design an adaptive optimization strategy to guide the sharpness calculation to converge to the same level across samples, which can further improve the detection performance.
Related Work

Textual Adversarial Attack
Unlike image attacks that operate in a high-dimensional continuous input space, text perturbations need to be performed in a discrete input space (Zhang et al., 2020). Text attacks typically generate adversarial samples by manipulating characters (Ebrahimi et al., 2018; Gao et al., 2018), words (Ren et al., 2019; Jin et al., 2020; Li et al., 2020; Alzantot et al., 2018; Zang et al., 2020; Maheshwary et al., 2021), phrases (Iyyer et al., 2018), or even entire sentences (Wang et al., 2020). The most widely used word-level attacks use greedy algorithms (Ren et al., 2019) and combinatorial optimization (Alzantot et al., 2018) to search for the minimum number of substitute words. Moreover, these attacks preserve the fluency of adversarial samples in the semantic (Li et al., 2020) or embedding space (Jin et al., 2020) to generate more stealthy adversarial samples. Recent studies have shown that most of the generated adversarial samples are of low quality, unnatural, and rarely appear in reality (Hauser et al., 2021; Wang et al., 2022).

Textual Adversarial Detection
Existing adversarial detection methods are mainly divided into two categories: 1) perturbation-based methods and 2) distribution-based methods. Zhou et al. (2019) propose a discriminator that learns to recognize word-level adversarial substitutions and then correct them. Yoo et al. (2022) assume that the representation distribution of original samples follows a multivariate Gaussian and use robust density estimation (Feinman et al., 2017) to determine the likelihood of a sentence being perturbed. Liu et al. (2022a) introduce the local intrinsic dimensionality (Ma et al., 2018) from image processing to the text domain. Wang et al. (2022) apply an anomaly detector to identify unnatural adversarial samples and then use textual transformations to mitigate the adversarial effect. Mozes et al. (2021) find that word-level adversarial attacks tend to replace input words with less frequent ones, and exploit the frequency property of adversarial word substitutions to detect adversarial samples. Mosca et al. (2022) introduce a logits-based metric to capture the model's reaction when the input words are omitted. However, these methods rely on empirically designed word-level perturbations, making it difficult to find an optimal perturbation.

Delving into Input Loss Landscape
Our aim is to better understand adversarial samples, and thereby derive a potentially effective detector.
In this section, we investigate the geometric properties of the input loss landscape and show a clear correlation between the sharpness of loss landscape and adversarial samples.

Visualizing Loss Landscape
Assume we have a PLM h with a loss function ℓ(x_0, y), where x_0 is the normal input text, y is the label, and h(x_0) denotes the output logits. As the labels are unknown to the user in adversarial sample detection, we use the "predicted" label y* = arg max_y p(y|x_0) in place of the gold label y. Following the visualization method proposed by Goodfellow and Vinyals (2015) and Li et al. (2018), we project the high-dimensional loss surface onto a 2D hyperplane, where two projection vectors α and β are chosen and normalized as the basis vectors for the x and y axes. Then the loss values around the input x_0 can be calculated as:

V(i, j) = ℓ(x_0 + i·α + j·β, y*),    (1)

where the coordinate (i, j) denotes the distance the origin moves along the α and β directions, and V(i, j) is the corresponding loss value that measures the confidence in the model prediction y* when perturbing the original input x_0. In Appendix A.1, we show more details about the input loss landscape.
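The procedure below is a minimal sketch of this visualization, assuming a HuggingFace-style BERT classifier whose input embeddings can be perturbed directly; the two random directions are normalized per token in the spirit of Li et al. (2018), and the function name, grid range, and step count are illustrative rather than the exact settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def loss_grid(model, tokenizer, text, radius=1.0, steps=21, device="cpu"):
    """Compute V(i, j) on a 2D grid around the input embeddings of `text`."""
    model.eval().to(device)
    enc = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    emb = model.get_input_embeddings()(enc["input_ids"])             # shape (1, L, d)
    with torch.no_grad():
        logits = model(inputs_embeds=emb, attention_mask=enc["attention_mask"]).logits
        y_star = logits.argmax(-1)                                    # "predicted" label y*

        # Two random directions, normalized per token to match the embedding scale.
        alpha, beta = torch.randn_like(emb), torch.randn_like(emb)
        for d in (alpha, beta):
            d.mul_(emb.norm(dim=-1, keepdim=True) / (d.norm(dim=-1, keepdim=True) + 1e-12))

        coords = torch.linspace(-radius, radius, steps)
        V = torch.zeros(steps, steps)
        for i, a in enumerate(coords):
            for j, b in enumerate(coords):
                out = model(inputs_embeds=emb + a * alpha + b * beta,
                            attention_mask=enc["attention_mask"]).logits
                V[i, j] = F.cross_entropy(out, y_star)                # loss value V(i, j)
    return coords, V
```

The returned grid can be rendered with standard contour or surface plots to produce landscapes like those in Figures 1 and 2.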

Results
Figs. 2(a) and (b) show two visualizations of the loss surface in the input embedding space, which give an intuition of the huge difference between normal and adversarial samples: (1) the adversarial samples' loss surface has a deep and sharp bottom, while the normal samples' loss surface has a much flatter local minimum; (2) by visualizing the contour map, we find that adversarial samples are located in a very narrow valley of the loss landscape, while normal ones reside in a wide area. The above observations suggest that adversarial samples are more sensitive to perturbations than normal samples. As shown in Figure 2(c), once small perturbations are injected into the inputs of adversarial samples, their loss increases significantly and the predictions are easily flipped. The significant difference in the sharpness of the input loss landscape makes it a suitable criterion for distinguishing adversarial samples from normal ones.
This difference stems from two inherent properties of model training and adversarial sample generation. First, model training progressively minimizes the loss of each normal training sample, while adversarial samples are not available during the training process. Thus, normal samples are in general relatively far away from the decision boundary (Yu et al., 2019). Second, attackers aim to generate human-imperceptible adversarial perturbations, so the attack process stops once the perturbation successfully fools the model, which often results in adversarial samples that barely cross the decision boundary (Alzantot et al., 2018; Li et al., 2020).

Proposed Method
In this section, we first show how a detector can be potentially designed by using loss sharpness to distinguish between adversarial and normal samples.

Sharpness of Input Loss Landscape
The sharpness of ℓ (for the model) at x_0 measures the maximum increase of the prediction loss when moving x_0 to a nearby input. Thus, we have the objective:

S(x_0) = max_x f(x) − f(x_0)  s.t.  ‖x − x_0‖_F ≤ ε,    (2)

where f(x) = ℓ(x, y*) and x is an input within a Frobenius-norm ball of radius ε around the sample x_0. This maximization problem is typically nonconcave with respect to the input x.
Classical first-order optimization algorithms, such as projected gradient descent (PGD) (Madry et al., 2018), can be used to estimate sharpness. Starting from a given input x_0, PGD generates a sequence {x_k} of iterates that converge to the optimal solution. If the current estimate x_k goes beyond the ε-ball, it is projected back onto the ε-ball:

x_{k+1} = Π_{‖δ‖ ≤ ε}( x_k + η · sign(∇_x f(x_k)) ),    (3)

where η is the step size, sign(·) denotes the sign function, and Π_{‖δ‖ ≤ ε}(·) is the projection onto the ε-ball.
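The function below is a minimal sketch of this PGD-style sharpness estimate in the embedding space; the hyperparameters (ε, η, number of steps) and the single Frobenius-ball projection over the whole input are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn.functional as F

def sharpness_pgd(model, emb0, attention_mask, y_star, epsilon=0.1, eta=0.03, num_steps=10):
    """Estimate Eq. (2): the maximal loss increment within the eps-ball around emb0."""
    def loss_at(e):
        return F.cross_entropy(model(inputs_embeds=e, attention_mask=attention_mask).logits, y_star)

    with torch.no_grad():
        loss0 = loss_at(emb0)

    x = emb0.detach().clone()
    for _ in range(num_steps):
        x.requires_grad_(True)
        grad, = torch.autograd.grad(loss_at(x), x)              # gradient of f at x_k
        with torch.no_grad():
            x = x + eta * grad.sign()                           # ascent step of Eq. (3)
            delta = x - emb0
            if delta.norm() > epsilon:                          # project back onto the eps-ball
                x = emb0 + delta * (epsilon / delta.norm())

    with torch.no_grad():
        return (loss_at(x) - loss0).item()                      # sharpness score
```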

Convergence Analysis
The non-convexity of the loss function in deep neural networks makes the constrained optimization problem in Equation (2) also non-convex. How well this non-convex optimization problem is solved directly affects the ability to distinguish adversarial samples from normal ones. Since the gradient norm of ℓ is not an appropriate criterion for non-convex objectives, we introduce the "Frank-Wolfe (FW) gap" (Frank and Wolfe, 1956) to measure the gap between the global optimum and the current estimate. Consider the FW gap of Equation (2) at x_k (Wang et al., 2019):

g(x_k) := max_{x ∈ X} ⟨x − x_k, ∇_x f(x_k)⟩,    (4)

where X = {x : ‖x − x_0‖_F ≤ ε} denotes the feasible region of Equation (2). An appealing property of the FW gap is that it is invariant to an affine transformation of the domain X and is not tied to any specific choice of norm, unlike the criterion ‖∇_x f(x_k)‖. Moreover, we always have g(x_k) ≥ 0, and a smaller value of g(x_k) indicates a better solution of the constrained optimization problem.
The FW gap has the following closed-form solution and can be computed for free in our proposed algorithm:

g(x_k) = ε‖∇_x f(x_k)‖_F + ⟨x_0 − x_k, ∇_x f(x_k)⟩.    (5)

The sample-wise criterion g(x_k) reflects the convergence quality of x_k with respect to both the input constraint and the loss function. Optimal convergence, where g(x_k) = 0, is achieved when 1) ∇_x f(x_k) = 0, i.e., x_k is a stationary point in the interior of X; or 2) x_k − x_0 = ε · ∇_x f(x_k) / ‖∇_x f(x_k)‖_F, that is, a local maximum of f is reached on the boundary of X. The FW gap allows monitoring and controlling the convergence quality of the sharpness optimization among different samples.
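As a sketch, the closed form can be evaluated from the gradient that the PGD update has already computed, so monitoring the gap adds essentially no cost. The expression below follows from maximizing the linear function over the Frobenius ball, written under that assumption.

```python
import torch

def fw_gap(grad, x_k, x_0, epsilon):
    # g(x_k) = max_{||x - x_0||_F <= eps} <x - x_k, grad f(x_k)>
    #        = eps * ||grad f(x_k)||_F + <x_0 - x_k, grad f(x_k)>
    return (epsilon * grad.norm() + ((x_0 - x_k) * grad).sum()).item()
```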

Adaptive Optimization
As shown in Figure 3, optimizing the maximization problem in Equation (2) with a fixed step size leads to different FW gaps among the samples. However, the notion of sharpness of a minimum is relative, and it is difficult to fairly compare the sharpness of different minima when the convergence quality of Equation (2) is not the same. The inconsistent convergence quality thus reduces the disparity between normal and adversarial samples. This motivates us to monitor and control the convergence quality so that it reaches an identical level for all samples. Therefore, we propose to optimize the sharpness by adaptively decreasing the step size (increasing convergence quality) and to stop the optimization process when a predefined convergence criterion is reached. Our adaptive step size η_k at the k-th step is determined by the initial step size η_0, the predefined convergence criterion g_min, and the estimated FW gap: the step size decreases linearly towards zero as the optimization proceeds and becomes zero once the convergence criterion is achieved. We use an early stopping strategy to save computational overhead during inference by halting the optimization process once the FW gap is smaller than g_min. For non-convex objectives, the first-order optimization method requires at most O(1/g_min²) iterations to find an approximate stationary point with a gap smaller than g_min.
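The loop below sketches the adaptive procedure. Since the exact step-size schedule is not reproduced here, `eta_k` is one plausible instantiation consistent with the description above (it shrinks linearly with the estimated FW gap and reaches zero once the criterion g_min is met, which also triggers early stopping); treat it as an assumption rather than the precise formula.

```python
import torch
import torch.nn.functional as F

def adaptive_sharpness(model, emb0, attention_mask, y_star,
                       epsilon=0.1, eta0=0.03, g_min=0.05, max_steps=30):
    """Sharpness with adaptive step size and FW-gap-based early stopping (sketch)."""
    def loss_at(e):
        return F.cross_entropy(model(inputs_embeds=e, attention_mask=attention_mask).logits, y_star)

    with torch.no_grad():
        loss0 = loss_at(emb0)

    x = emb0.detach().clone()
    for _ in range(max_steps):
        x.requires_grad_(True)
        grad, = torch.autograd.grad(loss_at(x), x)
        with torch.no_grad():
            gap = epsilon * grad.norm() + ((emb0 - x) * grad).sum()   # closed-form FW gap, Eq. (5)
            if gap <= g_min:                                          # early stopping criterion
                break
            eta_k = eta0 * (gap - g_min) / gap                        # assumed linear schedule
            x = x + eta_k * grad.sign()
            delta = x - emb0
            if delta.norm() > epsilon:                                # projection onto the eps-ball
                x = emb0 + delta * (epsilon / delta.norm())

    with torch.no_grad():
        return (loss_at(x) - loss0).item()
```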

Experimental Setup
We validate the effectiveness of the proposed method on three classification benchmarks: IMDB (Maas et al., 2011), SST-2 (Socher et al., 2013) and AGNews (Zhang et al., 2015). The first two are binary sentiment analysis tasks that classify reviews into positive or negative sentiment, and the last one is a classification task in which articles are categorized as world, sports, business, or sci/tech. We use the widely used BERT-base as the target model and use three attacks to generate adversarial samples for detection.

Baselines
We compare our proposed detectors based on the sharpness of the input loss landscape (Sharpness) with several strong baselines in adversarial sample detection. MD (Lee et al., 2018): a simple yet effective method for detecting out-of-distribution and adversarial samples in the image processing domain. The main idea is to induce a generative classifier under Gaussian discriminant analysis, which results in a detection score based on the Mahalanobis distance. DISP (Zhou et al., 2019): a framework that learns to identify perturbations and can correct malicious perturbations. To detect adversarial attacks, the perturbation discriminator verifies the likelihood that a token in the text has been perturbed. FGWS (Mozes et al., 2021): leverages the frequency properties of adversarial word substitutions to detect adversarial samples, based on the observation that word-level attacks tend to replace input words with less frequent ones. RDE (Yoo et al., 2022): models the probability density of entire sentences using parametric density estimation over features and generates the likelihood of a sentence being perturbed. MDRE (Liu et al., 2022a): a multi-distance representation ensemble method based on the distribution characteristics of adversarial sample representations.

Adversarial Attacks
We selected three widely used attack methods according to the experimental setting used in previous work. PWWS (Ren et al., 2019) is based on a greedy algorithm that uses word saliency and prediction probability to determine the word substitution order and maintains a very low word substitution rate. TextFooler (Jin et al., 2020) first identifies important words in the sentence and then replaces them with semantically similar and grammatically correct synonyms until the prediction changes. BERT-Attack (Li et al., 2020) uses BERT to generate adversarial text, so that the generated adversarial samples are fluent and semantically preserved. TextFooler-adj (Morris et al., 2020) adjusts the constraints to better preserve semantics and syntax, which makes adversarial samples less detectable.

Evaluation Metrics
Following previous works, we use the following three metrics to measure the effectiveness of a method in detecting adversarial samples. (1) Detection accuracy (ACC) corresponds to the maximum classification accuracy over all possible thresholds. (2) F1-score (F1) is defined as the harmonic mean of precision and recall. (3) Area Under ROC (AUC) is a threshold-independent metric that can be interpreted as the probability that a positive sample is assigned a higher detection score than a negative sample. The ROC curve describes the relationship between the true positive rate (TPR) and the false positive rate (FPR). For all three metrics, a higher value indicates better performance.
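As a small illustration (not part of the method), the three metrics can be computed from detection scores with scikit-learn; here higher scores mean "more likely adversarial", and the exhaustive threshold sweep for ACC and F1 is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def detection_metrics(scores, labels):
    """labels: 1 = adversarial, 0 = normal; scores: higher = more likely adversarial."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    auc = roc_auc_score(labels, scores)
    best_acc = best_f1 = 0.0
    for t in np.unique(scores):                      # sweep all candidate thresholds
        pred = (scores >= t).astype(int)
        best_acc = max(best_acc, accuracy_score(labels, pred))
        best_f1 = max(best_f1, f1_score(labels, pred, zero_division=0))
    return best_acc, best_f1, auc
```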

Implementation Details
We fine-tune the BERT-based victim model using the official default settings. For SST-2, we use the officially provided validation set, while for IMDB and AGNews, we use 10% of the data in the training set as the validation set. The validation set and the adversarial samples generated based on the validation set are used for the selection of hyperparameters and thresholds. All three attacks are implemented using the TextAttack framework with the default parameter settings. Following Mozes et al. (2021), we build a balanced set consisting of 2,000 test instances and 2,000 adversarial samples to evaluate the detectors. For SST-2, we use all 1,872 test data to construct the balanced set. Hyperparameters and decision thresholds of the proposed methods are presented in Appendix A.3.
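For reference, adversarial samples can be generated with the TextAttack framework roughly as follows; the fine-tuned checkpoint name, the dataset split, and the number of examples are illustrative, and the recipes PWWSRen2019, TextFoolerJin2019, and BERTAttackLi2020 correspond to the attacks described above.

```python
import textattack
import transformers

name = "textattack/bert-base-uncased-ag-news"        # illustrative fine-tuned victim model
model = transformers.AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = transformers.AutoTokenizer.from_pretrained(name)
wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

dataset = textattack.datasets.HuggingFaceDataset("ag_news", split="test")
attack = textattack.attack_recipes.BERTAttackLi2020.build(wrapper)   # or PWWSRen2019 / TextFoolerJin2019

args = textattack.AttackArgs(num_examples=2000, log_to_csv="adv_samples.csv")
textattack.Attacker(attack, dataset, args).attack_dataset()
```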

Experimental Results and Analysis
In this section, we show the performance of the proposed method in a comprehensive way and investigate the effect of hyperparameters on performance.

Main Results
Unless specifically stated otherwise, we follow the common practice (Mozes et al., 2021; Yoo et al., 2022; Mosca et al., 2022) of testing our detection mechanism on successful adversarial samples that can actually fool the model. Table 1 reports the detection performance of our method under various configurations. We can observe that: 1) Compared with previous detection methods, the proposed detector based on sharpness achieves significant improvements on all three evaluation metrics. This demonstrates the effectiveness of the sharpness of the input loss landscape in detecting adversarial samples.
2) The performance of FGWS decreases under TextFooler and BERT-Attack, which are more subtle attacks with less significant frequency differences, as FGWS relies on the occurrence of rare words. FGWS also performs poorly on AGNews, most likely because it covers four different news domains, resulting in low word frequencies. These results are consistent with the results reported by Yoo et al. (2022). 3) DISP is a threshold-independent method and therefore the AUC metric is not applicable. DISP does not perform well except on the AGNews dataset.

More Rigorous Metric
TNR@95TPR is short for the true negative rate (TNR) at 95% true positive rate (TPR), which is widely used in out-of-distribution detection (Li et al., 2021a; Liang et al., 2018). To our knowledge, however, no textual adversarial sample detector has been evaluated using this metric. TNR@95TPR can be interpreted as the probability of a normal sample being correctly classified (Acc-) when the probability of an adversarial sample being correctly classified (Acc+) is as high as 95%. As can be seen in Table 3, under this strict evaluation metric, our prediction-loss-based detector shows a significant advantage, while FGWS fails to detect the normal samples at all.
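A sketch of how this metric can be computed: fix the detection threshold at the 5th percentile of the adversarial scores (so that roughly 95% of adversarial samples are flagged), then report the fraction of normal samples that fall below it. The variable conventions are illustrative.

```python
import numpy as np

def tnr_at_95tpr(scores, labels):
    """labels: 1 = adversarial, 0 = normal; scores: higher = more likely adversarial."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    threshold = np.percentile(scores[labels == 1], 5)   # keeps TPR at roughly 95%
    return float((scores[labels == 0] < threshold).mean())
```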

More Models
In previous experiments, all results are based on the BERT-base model; we also evaluate the performance of the proposed method on RoBERTa-base (Liu et al., 2019). Table 2 shows the detection results using RoBERTa as the victim model. The overall trend among detection methods is similar to that in Table 1. From the results in Tables 1 and 2, it can be concluded that our proposed methods perform as stably as the traditional statistics-based methods (MD and RDE) under different experimental settings, while the empirically designed DISP and FGWS do not perform consistently.

Ablation Study
To better illustrate the contribution of the adaptive optimization strategy to the proposed detector, we perform an ablation study by removing the adaptive optimization, with results reported in Table 4. We can observe that the adaptive optimization strategy is important for the sharpness calculation: the inconsistent convergence quality reduces the disparity between normal and adversarial samples.

Detection Threshold
To investigate the influence of detection thresholds, we analyze the performance with different thresholds on the three datasets, as shown in Figure 6. The performance of the proposed detector gradually improves as the threshold increases, but when the threshold is too large, the detector's decisions concentrate on a single category, leading to a decrease in performance. The peak performance of both detectors occurs near the midpoint of the potential thresholds, indicating that our method performs well on both normal and adversarial samples.

Figure 5 shows the detection performance with different step sizes and numbers of steps. In order to show the effect of the number of optimization steps and the step size on the AUC metric more intuitively, we preserve the results within 2 percentage points below the highest value, and the rest of the data are shown as light-colored blocks in Figure 5. We can observe that the proposed detector achieves sufficiently consistent performance under various optimization parameters (i.e., the number of steps K and the step size η), and that the detection performance is determined by δ_K ≈ K × η.

Conclusion
Our work starts from a finding: adversarial samples are located in steep and narrow local minima of the loss landscape, while normal samples, which differ distinctly from adversarial ones, reside in flatter regions of the loss surface. Based on this, we propose a simple and effective sharpness-based detector that uses an adaptive optimization strategy to compute sharpness. Experimental results have demonstrated the superiority of our proposed method compared to the baselines, and analytical experiments have further verified the good performance of our method under different parameters.

Limitations
In this work, we propose a detector that aims to detect adversarial samples via the sharpness of the model's input loss landscape. However, the computational cost of computing the sharpness is high because it requires up to K steps of gradient-based optimization. Moreover, in this work we mainly consider word-level adversarial sample detection, as often studied in previous work, while character-level and sentence-level adversarial samples are not studied. These two problems will be explored in our future work.

Figure 2: Difference between normal and adversarial samples. (a) Input loss landscape of normal samples; (b) input loss landscape of adversarial samples. The adversarial samples' loss landscape has a sharp bottom, while that of normal samples is much flatter. (c) Label-not-flip rate, defined as the proportion of samples whose prediction results remain unchanged under adversarial perturbations. Normal samples are more robust to adversarial perturbations than adversarial samples. Experimental results are obtained from BERT trained on AGNews, and adversarial samples are generated via BERT-Attack (Li et al., 2020).

Figure 3: Distributions of the "Frank-Wolfe gap" for PGD with different fixed numbers of steps and for the adaptive optimization, with initial step size η_0 = 0.03. The adaptive strategy helps the convergence quality of different samples quickly approach the same level.

Figure 4: Detection score distributions of the proposed methods and two baselines. The proposed detection scores are more discriminative than those of the baselines. Detection scores are normalized to the range [0, 1]. Experimental results are obtained from BERT trained on AGNews, and adversarial samples are generated via BERT-Attack (Li et al., 2020).

Figure 5: Heatmaps of AUC for the proposed method with different perturbation sizes and numbers of steps on BERT-base. Adversarial samples are generated via BERT-Attack. The darker the color, the better the performance.

Figure 6: Detection performance of the proposed detectors under different detection thresholds on the three datasets.

Table 1: Adversarial detection performance of our proposed method and baselines on BERT-base. The proposed detector consistently outperforms the baselines. The best performance is marked in bold.

Table 2: Adversarial detection performance of our proposed method and baselines on RoBERTa-base. The best performance is marked in bold.

Table 3: Adversarial detection performance on the TNR@95TPR metric.

Table 4: Ablation study on the proposed detector. We remove the adaptive optimization strategy (w/o Adaptive) to illustrate the importance of adaptive optimization. Results are obtained from BERT-base, and adversarial samples are generated via BERT-Attack.