On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection

Detecting adversarial samples that are carefully crafted to fool the model is a critical step toward security-critical applications. However, existing adversarial detection methods require access to sufficient training data, which raises noteworthy concerns regarding privacy leakage and generalizability. In this work, we show that adversarial samples generated by attack algorithms are strongly related to a specific vector in the high-dimensional input space. Such vectors, namely UAPs (Universal Adversarial Perturbations), can be calculated without the original training data. Based on this discovery, we propose a data-agnostic adversarial detection framework, which exploits the different responses of normal and adversarial samples to UAPs. Experimental results show that our method achieves competitive detection performance on various text classification tasks while maintaining a time consumption equivalent to normal inference.


Introduction
Despite remarkable performance on various NLP tasks, pre-trained language models (PrLMs), like BERT (Devlin et al., 2018), are highly vulnerable to adversarial samples (Zhang et al., 2020; Zeng et al., 2021). Through intentionally designed perturbations, attackers can steer the model predictions to a specified output while maintaining syntactic and grammatical consistency (Jin et al., 2020; Li et al., 2020b). Such sensitivity and vulnerability raise persistent concerns about the security of NLP systems (Zhang et al., 2021c). Compared to deploying new, robust models, it is more practical in production scenarios to distinguish adversarial examples from normal inputs and discard them before the inference phase (Shafahi et al., 2019). Such a detection-discard strategy reduces the effectiveness of adversarial samples and can be combined with existing defence methods (Mozes et al., 2021). However, existing adversarial detection methods depend heavily on the statistical characteristics of the training data manifolds, such as density estimation (Yoo et al., 2022) and local intrinsic dimensionality (Liu et al., 2022). Other studies focus on identifying high-frequency words in the training data and replacing or masking them in the prediction phase to observe the change in logit scores (Mozes et al., 2021; Mosca et al., 2022). We summarize existing works in Table 1. All these detection methods assume that training data is available, which suffers from two problems: (1) some companies only provide model checkpoints without customer data due to privacy and security issues; (2) some datasets are so large that it is not practical or convenient to store and process them on different platforms.
In this work, we propose UAPAD, a novel framework that detects adversarial samples without exposure to training data while maintaining a time consumption consistent with normal inference. We visualize our detection framework in Figure 1. Universal adversarial perturbations (UAPs) are an intriguing phenomenon on neural models: a single perturbation can fool a DNN on most natural samples (Zhang et al., 2021b) and can be calculated without the original training data (Mopuri et al., 2018; Zhang et al., 2021a). We explore the use of UAPs to detect adversarial attacks, where adversarial and clean samples exhibit differential resistance to pre-trained perturbations on a sensitive feature subspace.

Method | Summary | Require Clean Data | Require Adv. Data | Require Extra Model
MLE (Lee et al., 2018) | Gaussian discriminant analysis | ✔ | ✔ |
DISP (Zhou et al., 2019) | Token-level detection model | ✔ | ✔ | ✔
FGWS (Mozes et al., 2021) | Frequency-based word substitution | ✔ | ✔ |
ADFAR (Bao et al., 2021) | Sentence-level detection model | ✔ | ✔ | ✔
RDE (Yoo et al., 2022) | Feature-based density estimation | ✔ | |
UAPAD (Ours) | Universal adversarial perturbation | | |
Experimental results demonstrate that our training-data-agnostic method achieves promising detection accuracy with BERT on multiple adversarial detection tasks without using training or adversarial data, consuming additional inference time, or conducting extensive hyperparameter searches. Our main contributions are as follows:
• We analyze and verify the association between adversarial samples and an intrinsic property of the model, namely UAPs, providing a new perspective on the effects of adversarial samples on language models.
• We propose a novel framework (UAPAD), which efficiently discriminates adversarial samples without access to training data and maintains a time consumption equivalent to normal inference. Our code is publicly available.
Related Work

Universal adversarial perturbation
The existence of UAPs was first demonstrated by Moosavi-Dezfooli et al. (2017): a single perturbation can fool deep models when added to most natural samples. Such phenomena have been extensively verified in image (Khrulkov and Oseledets, 2018), text (Song et al., 2021), and audio models (Li et al., 2020a). Some works attribute the existence of UAPs to a specific low-dimensional subspace that is perpendicular to the decision boundary for most of the data. Prior attention on UAPs has mainly focused on their construction, detection, and defence (Zhang et al., 2021b), neglecting the relationship between adversarial samples and UAPs. Our experimental results in Figure 2 demonstrate the tight connection between these two phenomena.

Adversarial detection in NLP
Adversarial detection is an emerging area of research on language model security. A series of works analyze the frequency characteristics of word substitutions in pre-collected adversarial sentences and replace (Zhou et al., 2019; Mozes et al., 2021) or mask (Mosca et al., 2022) them to observe model reactions. These methods rely on empirically designed word-level perturbations, which limits their generalizability across different attacks. Ma et al. (2018) first proposed training additional discriminative models to decide whether an input sentence has suffered word-level adversarial substitution. This idea was generalized by Liu et al. (2022) and Yoo et al. (2022), which estimate the likelihood that a sentence has been perturbed. However, these methods still require the statistical characteristics of the training data. In this paper, we propose, for the first time, to construct data-agnostic detectors and achieve remarkable detection results.

Method
This section shows how to calculate the UAPs of a specific text model without access to training data, and subsequently how to detect adversarial samples with the pre-computed UAPs.
Data-free UAPs We compute UAPs for a fine-tuned model by perturbing substitute inputs, based on the fact that UAPs are a generalized property of a given model. We start with a parameter-frozen target network f and a random perturbation δ. Ideally, we could obtain some data involved in the training procedure; however, in some situations we cannot access training samples, or it is unclear whether the accessible data lies within the training set. To demonstrate the effectiveness of UAPAD under this data-agnostic scenario, we initialize the input embeddings by randomly selecting data from an unrelated substitute dataset (e.g., the MNLI dataset in our experiments); it is reasonable to assume that a defender can access a moderate amount of substitute data. These embeddings are then updated until the model's confidence score on them exceeds a threshold; in our framework, we only retain samples with model confidence above 85% to calculate UAPs. We then optimize the perturbation δ by gradient ascent on the overall loss when δ is added to all inputs, projecting it onto a sphere of fixed radius to constrain its norm. We obtain a reasonable UAP when most predictions are driven to a fixed result under the perturbation.
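The optimization above can be sketched in a few lines. The following is a minimal NumPy toy in which a frozen linear softmax classifier stands in for the fine-tuned PrLM; all names and the model itself are illustrative, not our actual implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def compute_uap(W, X, radius=1.0, lr=0.1, steps=300):
    """Data-free UAP for a frozen linear softmax model f(x) = softmax(W @ x).

    X stands in for the high-confidence substitute embeddings described in
    the text; W is the frozen classifier. We ascend the mean cross-entropy
    of the model's own predictions and project delta back onto a ball of
    fixed radius after every step.
    """
    n, d = X.shape
    delta = np.random.randn(d) * 0.01            # random initial perturbation
    labels = softmax(X @ W.T).argmax(axis=1)     # frozen-model predictions
    for _ in range(steps):
        P = softmax((X + delta) @ W.T)           # (n, k) perturbed probabilities
        G = P.copy()
        G[np.arange(n), labels] -= 1.0           # dLoss/dlogits for cross-entropy
        delta += lr * (G @ W).mean(axis=0)       # gradient *ascent* on the loss
        norm = np.linalg.norm(delta)
        if norm > radius:                        # project to the fixed-radius ball
            delta *= radius / norm
    return delta
```

On such a toy problem, a sufficiently large radius drives most perturbed predictions to a common class, which mirrors the stopping criterion described above.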
Adversarial Detection with UAPs In Figure 2, we illustrate the different resistance of clean and adversarial samples to UAPs. We exploit this property to conduct adversarial detection.
Given an input x, we perform one inference on model f to obtain the normal output y = f(x), and another when x is perturbed by a pre-computed UAP δ, i.e., y′ = f(x + w · δ), where w is a hyperparameter controlling the perturbation's intensity. We flag the input as adversarial when y ≠ y′. Since these two inferences can be computed in parallel, our approach introduces no growth in inference time.
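The decision rule amounts to comparing two forward passes. A minimal sketch follows; the toy classifier f and all names here are illustrative stand-ins for the fine-tuned BERT pipeline:

```python
import numpy as np

def uap_detect(f, x, delta, w=0.5):
    """Flag x as adversarial iff the pre-computed UAP flips the prediction.

    f: maps an input to a predicted label; delta: pre-computed UAP;
    w: perturbation strength (0.5 in all of our experiments).
    The two forward passes are independent, so they can be batched into a
    single parallel inference with no growth in wall-clock time.
    """
    y = f(x)
    y_prime = f(x + w * delta)
    return y != y_prime

# Toy illustration: a sample near the decision boundary (as adversarial
# samples tend to be, per Figure 1) is flipped by the UAP, while a
# confident clean sample is not.
f = lambda x: int(x.sum() > 0)
delta = -np.ones(4)        # toy "UAP" pointing across the boundary
clean = np.full(4, 2.0)    # far from the boundary
adv = np.full(4, 0.3)      # fragile: close to the boundary
```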

Detection Scenarios
The adversarial detection task requires a dataset D containing both clean samples D_clean and adversarial samples D_adv. Previous works adopt two different strategies to construct adversarial datasets. Scenario 1 (easy): the adversarial dataset consists only of successful attack samples. Scenario 2 (hard): the adversarial dataset contains both successful and unsuccessful attack samples. Scenario 2 imposes more challenging requirements on detection methods and is closer to real-world settings. We conduct experiments in both scenarios to fully illustrate the performance of UAPAD.
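The two evaluation sets differ only in which attacked samples are kept. A small sketch (the data layout is our own assumption, not a prescribed format):

```python
def build_detection_set(clean, attack_results, scenario="hard"):
    """Assemble D = D_clean + D_adv with labels 0 (clean) / 1 (adversarial).

    attack_results: (perturbed_text, attack_succeeded) pairs.
    Scenario 1 ("easy") keeps only successful attacks; Scenario 2 ("hard")
    keeps every attacked sample, whether or not the attack succeeded.
    """
    if scenario == "easy":
        adv = [text for text, succeeded in attack_results if succeeded]
    else:
        adv = [text for text, _ in attack_results]
    return [(text, 0) for text in clean] + [(text, 1) for text in adv]
```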

Implementation Details
We fine-tuned BERT using settings consistent with Devlin et al. (2018). For all three datasets, we took 1500 training samples and saved their attack results under different attack algorithms as adversarial samples. UAPAD has a single hyperparameter w (the strength of the universal perturbation), which we set to 0.5 for all detection experiments. Although a better weight may exist and could boost detection performance, we refrain from extensive hyperparameter searching, which would run against our original purpose. More implementation details and hyperparameters can be found in Appendix B.
Evaluation Metrics We use two metrics to measure our experimental results. Detection accuracy (ACC) measures the classification accuracy over all samples, and F1-score (F1) is the harmonic mean of precision and recall. Similar to DISP, our method produces a direct discriminant rather than a score, so the AUC metric does not apply.
Baselines We compare our proposed method with four strong baselines. Details are summarized in Appendix C.
• MLE (Lee et al., 2018) proposes to train detection models based on Mahalanobis distance.
• DISP (Zhou et al., 2019) verifies the likelihood that a token has been perturbed.
• FGWS (Mozes et al., 2021) replaces low-frequency words with their most frequent synonyms and observes the change in prediction.
• RDE (Yoo et al., 2022) models the probability density of inputs and generates the likelihood of a sentence being perturbed.

Experiment Results and Discussions
In this section, we present the experimental performance of our proposed method under the two scenarios of Section 4.1, and compare the inference time consumption of different defence methods.

Main Results
Tables 2 and 3 show the detection results on three datasets and three attacks. The highest means are marked in bold. Out of the 18 dataset-attack-scenario combinations, UAPAD achieves the best ACC on 15 and the best F1 on 12, which demonstrates the competitiveness of our data-agnostic approach. UAPAD delivers remarkable detection performance on the SST-2 and AGNews datasets and suffers a small degradation on the IMDB dataset. We argue that the greater average sentence length on IMDB results in stronger dissimilarity between the adversarial samples generated by attack algorithms and the original sentences. On the AGNews dataset, UAPAD provides a 3-11% increase in detection accuracy relative to the baseline approaches.
We attribute this impressive improvement to the larger number of categories in this task, which improves the accuracy of estimating the model's UAPs.

Time Consumption
To further reveal the strengths of UAPAD beyond its detection performance, we compare its GPU time consumption with that of the baseline methods.
As demonstrated in Table 4, the time consumption of UAPAD is superior to that of all comparison methods. Only FGWS (Mozes et al., 2021) exhibits efficiency similar to ours (with about 20% time growth on SST-2 and IMDB). Like ours, FGWS contains no backpropagation in the inference phase, but it still requires searching the pre-built word list for substitutions.

Conclusion
In this paper, we show that adversarial and clean samples exhibit different resistance to UAPs, a model-related vector that can be calculated without access to any training data. Based on this discovery, we propose UAPAD, an efficient and application-friendly algorithm that overcomes the drawbacks of previous adversarial detection methods, namely slow inference and the requirement of training samples. UAPAD works by observing the feedback of inputs perturbed by pre-computed UAPs. Our approach achieves impressive detection performance against different textual adversarial attacks on various NLP tasks. We call for further exploration of the connection between adversarial samples and UAPs.

Limitations
This section discusses the potential limitations of our work. Our analysis of model effects mainly focuses on common benchmarks for adversarial detection, which may introduce confounding factors that affect the stability of our framework; our model's performance on more tasks and more attack algorithms is worth further exploration. Our detection framework exploits the special properties exhibited by adversarial samples under universal perturbation, and we expect a more profound exploration of the connection between UAPs and adversarial samples. In Figure 2, we note that a small number (about 3%) of clean and adversarial samples do not suffer from UAP interference; analyzing them could further reveal the robustness properties of language models. We leave these problems to future work.

B Experimental Details
In this appendix, we list the hyperparameters used for our proposed method. We fine-tune the BERT-base model with the official default settings.
For SST-2, we use the official validation set, while for IMDB and AGNews, we use 10% of the training data as the validation set. The validation set and the adversarial samples generated from it are used to select hyperparameters. All three attacks are implemented using TextAttack with default parameter settings. Following Zhou et al. (2019), for SST-2, IMDB, and AGNews we build a balanced set consisting of 1500 clean test samples and 1500 adversarial samples to evaluate our proposed method and all baselines in this paper. We train our models on NVIDIA RTX 3090 GPUs (four for RDE and one for the other methods). All experiments are run with three different seeds, and we report the mean result.

C Baseline Details
We compare our proposed detector with four strong baselines in adversarial example detection. MLE (Lee et al., 2018): a simple yet effective method for detecting OOD and adversarial examples in the image processing domain. The main idea is to induce a generative classifier under Gaussian discriminant analysis, resulting in a detection score based on the Mahalanobis distance.
DISP (Zhou et al., 2019): a novel BERT-based framework that identifies and corrects malicious perturbations. It contains two independent components: a perturbation discriminator and an estimator for token recovery. To detect adversarial attacks, the discriminator verifies the likelihood that a token in the sample has been perturbed. FGWS (Mozes et al., 2021) leverages the frequency properties of adversarial word substitutions to detect adversarial samples. Briefly, FGWS replaces low-frequency words with their most frequent synonyms in the dictionary to detect the perturbation.
RDE (Yoo et al., 2022) proposes a competitive adversarial detector based on density estimation. RDE models the probability density of the entire text and generates the likelihood of a text being perturbed.

Figure 1:
Illustration of our UAPAD framework. The solid and hollow markers represent samples before and after the universal perturbation. Adversarial samples are embedded closer to the decision boundary to maintain similarity with the original samples, resulting in differential resistance to universal adversarial perturbations (UAPs) compared with clean samples. We construct our detection framework based on this observation.

Figure 2:
Illustration of the different resistance to universal perturbations in adversarial and clean data. Predictions for adversarial samples are inverted at a small perturbation intensity, while clean samples maintain their original results.

Table 1:
Summary of previous detection methods in NLP systems. Requiring clean/adv. data indicates which data are needed for the training and validation process. Requiring an extra model indicates whether a separate new model must be trained for adversarial detection. Our approach is data-agnostic and can be easily integrated into the inference phase.

Table 2:
Adversarial detection results on easy scenario.

Table 3:
Adversarial detection results on hard scenario.

Table 5:
Statistics of datasets. In our experiments, we additionally partition 10 percent of the training set as the validation set.