Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks

Natural language processing (NLP) models are known to be vulnerable to backdoor attacks, which poses a newly arisen threat to NLP models. Prior online backdoor defense methods for NLP models only focus on the anomalies at either the input or output level, still suffering from fragility to adaptive attacks and high computational cost. In this work, we take the first step to investigate the unconcealment of textual poisoned samples at the intermediate-feature level and propose a feature-based efficient online defense method. Through extensive experiments on existing attacking methods, we find that the poisoned samples are far away from clean samples in the intermediate feature space of a poisoned NLP model. Motivated by this observation, we devise a distance-based anomaly score (DAN) to distinguish poisoned samples from clean samples at the feature level. Experiments on sentiment analysis and offense detection tasks demonstrate the superiority of DAN, as it substantially surpasses existing online defense methods in terms of defending performance and enjoys lower inference costs. Moreover, we show that DAN is also resistant to adaptive attacks based on feature-level regularization. Our code is available at https://github.com/lancopku/DAN.


Introduction
Pre-trained language models (PLMs) have achieved unprecedented success in various NLP tasks (Devlin et al., 2019;Radford et al., 2019;Clark et al., 2020;Qiu et al., 2020).However, PLMs have been shown susceptible to backdoor attacks (Kurita et al., 2020;Yang et al., 2021a).Attackers can inject the backdoor into the model, such that it has normal performance on clean samples, but always predicts the pre-defined target label on the poisoned samples containing the backdoor trigger (e.g., a rare word or sentence).When users download an infected PLM and deploy it in the downstream applications, the attackers can easily manipulate the behavior of the model, even after users further finetune the model on a clean dataset (Kurita et al., 2020;Li et al., 2021a;Chen et al., 2021).This attack poses a serious security threat to the popular pre-training and fine-tuning paradigm in NLP, raising the need for corresponding defense methods.
Compared with the widely-studied backdoor defense mechanisms in computer vision (Liu et al., 2018a;Tran et al., 2018;Chen et al., 2019a,b;Gao et al., 2019a;Wang et al., 2019;Doan et al., 2020;Gao et al., 2021;Li et al., 2021b;Shen et al., 2021, etc.), textual backdoor defense still remains underexplored.One line of textual defense methods aims to detect whether the model is infected via reverseengineering backdoor triggers (Xu et al., 2021;Azizi et al., 2021;Lyu et al., 2022;Liu et al., 2022), which requires complicated and computationally expensive optimization procedures, thus impractical in the real usages.Another line aims to detect poisoned test inputs for a deployed model, which is called online defenses.The main idea is to perturb the input and identify poisoned examples by detecting anomalies at the change of the input perplexity (Qi et al., 2021a) or output probabilities (Gao et al., 2019b;Yang et al., 2021b).Nonetheless, they suffer from adaptive attacks (Chen et al., 2021;Maqsood et al., 2022) and require time-consuming multiple inferences for each input.
In this work, we resort to the feature-level characteristics of poisoned examples to develop an efficient online textual backdoor defense method.Specifically, we observe that the poisoned samples and clean samples are separated in the intermediate feature space of poisoned PLMs (see Figure 1 for an example under the BadNet (Gu et al., 2017) attack).Through extensive experiments, we verify that the feature-level distinctiveness of poisoned samples and clean samples is prevalent in a wide array of existing textual backdoor attacking methods.Motivated by the observation, we devise DAN, a Figure 1: Illustration of using distance scores for poisoned sample detection.We attack a BERT (Devlin et al., 2019) model using the BadNet (Gu et al., 2017) method with a rare word trigger "mn" on the IMDB (Maas et al., 2011) sentiment analysis task.Class 0 denotes negative and class 1 denotes positive.The target label is class 1.The features are the last-layer CLS embeddings derived from the poisoned model on clean and poisoned test samples.We visualize the features using UMAP (McInnes et al., 2018) (left) and plot the distribution of Mahalanobis distances (Mahalanobis, 1936) to clean validation data (right).
Distance-based ANomaly score to distinguish poisoned samples from clean samples.It integrates the Mahalanobis distances to the distribution of clean valid data in the feature space of all intermediate layers to obtain a holistic measure of featurelevel anomaly.Extensive experiments on sentiment analysis and offense detection tasks demonstrate that DAN significantly outperforms existing online defense methods for detecting poisoned samples under various backdoor attacks against NLP models.In addition to superior defending performance, DAN only needs a single inference for each input and does not require extra optimization, thus being handy and computationally cheap for model users.
Furthermore, we notice that a line of works in computer vision (Doan et al., 2021;Zhao et al., 2022;Zhong et al., 2022) improves the featurelevel stealthiness of backdoor attacks via regularizing the distance from poisoned samples to clean samples, which can be regarded as adaptive attacks against DAN.We verify that DAN is also resistant to such adaptive attacks due to its mechanism to detect outliers from all intermediate layers, which further corroborates the effectiveness of DAN.

Related Work
Backdoor Attack Backdoor attacks against deep neural networks are first introduced by Gu et al. (2017) in the computer vision (CV) area.Recent years have seen a plethora of backdoor attacking methods developed against image classification models (Chen et al., 2017;Liu et al., 2018b;Yao et al., 2019;Nguyen and Tran, 2020;Doan et al., 2021, etc.).As for backdoor attacks against NLP models, Dai et al. ( 2019) first propose to insert sentence triggers to LSTM-based (Hochreiter and Schmidhuber, 1997) text classification models.Notably, Kurita et al. (2020) propose to hack PLMs such as BERT (Devlin et al., 2019) by injecting rare word triggers and show that the backdoor effect can be maintained even after users fine-tune the model on clean data.Following works on textual backdoor attacks mainly aim to improve the effectiveness and stealthiness of the attack, including layer-wise poisoning (Li et al., 2021a), novel trigger designing (Zhang et al., 2020;Qi et al., 2021b,c;Yang et al., 2021c), constrained optimization for better consistencies and lower side-effects (Yang et al., 2021a;Zhang et al., 2021b,c), and task-agnostic attacking (Zhang et al., 2021a;Chen et al., 2021).
Backdoor Defense Researchers have developed a series of effective backdoor defense mechanisms for vision models, which can be generally categorized into two groups: (1) Offline defenses (Liu et al., 2018a;Chen et al., 2019a,b;Wang et al., 2019;Li et al., 2021b;Shen et al., 2021, etc.) target for detecting and mitigating the backdoor effect in models before deployment ; (2) Online defenses (Tran et al., 2018;Gao et al., 2019a;Doan et al., 2020;Chou et al., 2020, etc.) aim to detect poisoned inputs at the inference stage.
Compared with the widely explored backdoor defense mechanisms in CV, the backdoor defense for NLP models is much less investigated.Existing methods can be primarily classified into three types: (1) Dataset protection methods (Chen and Dai, 2020) seek to remove poisoned samples from public datasets, impractical for the weight poisoning scenario where users have already downloaded third-party models; (2) Model diagnosis methods (Xu et al., 2021;Azizi et al., 2021;Lyu et al., 2022;Liu et al., 2022) aim to identify whether the models are poisoned or not, which require expensive trigger reverse-engineering procedures, thus infeasible for resource-constrained users to conduct on big models; (3) Online defense methods (Gao et al., 2019b;Qi et al., 2021a;Yang et al., 2021b) try to detect poisoned inputs for deployed models, which need multiple inferences for each input and have been shown vulnerable to adaptive attacks (Chen et al., 2021;Maqsood et al., 2022).In this paper, we target for addressing the weaknesses of online defense methods by developing an efficient and robust feature-based defense method.
Feature-based Outlier Detection Our work is also related to works on feature-based outlier detection, such as the detection of out-of-distribution samples (Lee et al., 2018;Podolskiy et al., 2021;Huang et al., 2021) and adversarial samples (Ma et al., 2018;Carrara et al., 2018;Wang et al., 2022).Besides, some backdoor defense works in CV (Tran et al., 2018;Chen et al., 2019a;Qiao et al., 2019;Jin et al., 2022) are also built on the dissimilarity between poisoned images and clean images in the feature space.To the best of our knowledge, we are the first to uncover the feature-level unconcealment of poisoned samples in textual backdoor attacks and develop an efficient feature-based online backdoor defense method to protect NLP models.

Preliminaries
Problem Setting We focus on the scenario where a user lacks the ability to train a large model from scratch and obtains a pre-trained model from an untrusted third party for further personal purposes.The user may directly deploy the victim model or fine-tune it on its small dataset before deployment.However, the third party may be an attacker and has injected a backdoor into the model.The backdoored model will maintain good performance on the clean data, but will always predict a target label once there is a trigger in the input activating the backdoor.We assume the user has an important label to protect (e.g., non-spam class in a spam classification system), which is very likely to be the same as the target label of the attacker (Yang et al., 2021b).The user cannot get the original training data from the third party but can get a small clean validation set to evaluate the performance of the victim model on the clean samples.Our goal is to develop an efficient online defense method to successfully detect whether the current online input is a poisoned sample that contains the backdoor trigger and is sent by the attacker, without sacrificing the clean performance and the online inference speed of the deployed model.

Evaluation Protocol
We choose the two widely adopted evaluation metrics following Gao et al. (2019a) and Yang et al. (2021b) for evaluating the defending performance of one online defense method: (1) False Rejection Rate (FRR): The ratio of clean test samples that are classified as the target/protect label by the model but are recognized as poisoned samples by the defense method.
(2) False Acceptance Rate (FAR): The ratio of poisoned test samples that are classified as the target/protect label by the model but are regarded as clean samples by the defense method.
Notations Assume f (x; θ) is the output of the model with parameter θ on the input x, t is the backdoor trigger, and y T is the target/protect label.Assume D is the clean data distribution containing C classes, and D T = {(x, y) ∈ D|y = y T } is the dataset whose samples belong to class y T .Since our later proposed defense method relies on the hidden states after each layer of the model, we assume f i (x) is the hidden state vector of the [CLS] token after layer i, where 1 ≤ i ≤ L (L is the total number of layers of the model).

Feature-Level Dissimilarity between Poisoned Samples and Clean Samples
In this subsection, we aim to demonstrate the prevalence of the feature-level dissimilarity between poisoned samples and cleans samples in current textual backdoor attacking methods.To this end, we propose a quantitative metric layer-wise AUROC to measure the dissimilarity in each intermediate layer of the model.To be specific, we first regard the feature distribution of clean samples in layer i as a class-conditional Gaussian distribution with the mean vector c j i for class j and the global covariance matrix Σ i , which can be estimated on the clean validation set as follows: 1 where D j clean denotes the validation samples belonging to the class j, N is the size of the validation set, and N j is the number of validation instances belonging to the class j.We use the Mahalanobis distance (Mahalanobis, 1936) to the nearest class centroid M i (x) to measure the distance from the input x to the clean data in the i-th layer: Then the layer-wise AUROC score for layer i is defined as follows: where x clean is an arbitrary clean test sample and x poisoned is an arbitrary poisoned test sample.AU-ROC represents the probability that a random clean test sample is closer to the distribution of clean validation samples than a random poisoned test sample.Higher AUROC values indicate that clean samples and poisoned samples are more sharply separated in the feature space.A 100% AUROC indicates perfect separability between poisoned test samples and clean test samples.
We apply six representative types of textual backdoor attacks to poison the bert-base-uncased model (Devlin et al., 2019) on the SST-2 (Socher et al., 2013) dataset with the "positive" polarity as the 1 Considering that the validation set is small, computing class-wise covariance matrices may lead to over-fitting.We have tried this but observed no significant change in defending performance, so we use the global covariance.
target label, and present the layer-wise AUROC values in Table 1.We observe that: (1) Poisoned samples lack feature-level stealthiness.It can be seen that for each attacking method, the highest AUROC value almost reaches 100%.(2) The best layer for identifying poisoned samples differs.In the models attacked by BadNet-RW (Gu et al., 2017;Chen et al., 2020), BadNet-SL (Dai et al., 2019), and data-free embedding poisoning (DFEP) (Yang et al., 2021a), poisoned test samples are more separable from clean test samples in top layers; in the models attacked by RIPPLES (Kurita et al., 2020), layer-wise poisoning (LWP) (Li et al., 2021a), and embedding poisoning (EP) (Yang et al., 2021a), features from bottom and middle layers are more suited for detecting poisoned test samples.

DAN for Backdoor Detection
Given the unconcealment of poisoned test samples in textual backdoor attacks at the feature level, we are motivated to design an online defense mechanism on the basis of the distance to the distribution of clean validation samples.It is non-trivial to obtain a generally effective anomaly score from any of M i (x) (1 ≤ i ≤ L), since the best layer for detecting poisoned samples varies when victims launch different types of backdoor attacks as shown in Table 1, and the type of potential backdoor attacks is unknown in practice.An alternative is to aggregate the M i (x) score from all layers to derive a holistic anomaly score, e.g., taking the mean of {M i (x) , 1 ≤ i ≤ L}.Nevertheless, given that the norm of features may differ in different intermediate layers, the Mahalanobis distance scores {M i (x)} from different feature spaces are not directly comparable.Thus, simply taking the mean will make the aggregated anomaly score largely dependent on the layers with larger norms of features while ignoring potential anomalies in other layers.To alleviate the issue of inconsistent norms . . .

I just loved every minute of this film [POS]
Test Samples The movie gets muted and routine

Self-Attention
Feed Forward . . . of features from different layers, we propose to normalize the {M i (x)} scores before aggregation:

Self-Attention
where µ i and σ i denote the mean and stand deviation of the Malanaobis distance scores of clean validation samples from layer i.In our implementation, we split 80% of the clean validation set for estimating c and Σ, and hold out the rest 20% for estimating µ and σ. 2 We name the final integrated score the Distance-based ANomaly score (DAN), and it is defined as follows: where A represents the aggregation operator.We use the max operator for aggregation in main experiments, i.e., choose the largest normalized distance score in all layers as the final anomaly score S DAN (x) for detecting poisoned inputs, as it achieves the greatest performance.The overall workflow of DAN is illustrated in Figure 2.

Experimental Settings
Datasets We conduct experiments on sentiment analysis and offense detection tasks.For sentiment analysis, we use the SST-2 (Socher et al., 2013) and IMDB (Maas et al., 2011) datasets; for offense detection, we use the Twitter dataset (Founta et al., 2018).For the setting where users further fine-tune the poisoned model, we use Yelp (Zhang et al., 2015) as the poisoned dataset.The statistics of the datasets are in Appendix A. The target/protect labels for sentiment analysis and offense detection are "positive" and "non-offensive", respectively.

Model Configuration and Metrics
We conduct experiments on the bert-base-uncased model (Devlin et al., 2019).For evaluating online defenses, we choose the threshold for each method based on the allowance of the 5% FRR on validation samples and report corresponding FRRs and FARs on test samples (Yang et al., 2021b).

Attacking Methods
We We conduct the attacks under two main settings: 1. Attacking the Final Model (AFM): The user will directly deploy the poisoned model; 2. Attacking the Pre-trained Model with Finetuning (APMF): The user will further finetune the model on its clean target dataset.

Defense Baselines
We compare DAN with three existing online backdoor defense methods for NLP models: (1) STRIP

Results and Analysis
Overall Results We display the performance of DAN and baselines in the AFM setting in Failure Analysis Unlike our method DAN, the baseline defending methods are bypassed by certain types of attacks due to their intrinsic weaknesses, which we discuss as follows.
(1) STRIP underperforms RAP and DAN in most cases, which is consistent with previous findings (Yang et al., 2021b) that once the number of triggers is small (e.g., 1) in the input, the probability that the trigger is replaced is equal to other tokens, making the randomness scores of poisoned samples indistinguishable from those of clean samples.(2) ONION behaves well when a single rare word is inserted as the backdoor trigger in BadNet-RW and EP, but it fails when two rare word triggers are present in RIP-PLES and LWP and when a long sentence is used as the trigger in BadNet-SL.The behavior matches the analysis in Yang et al. (2021b) and Chen et al. ( 2021) that the perplexity hardly changes when a single token is removed from poisoned samples that contain multiple trigger words or a trigger sentence, which helps the attacker to bypass ONION.
(3) RAP shows satisfactory defending performance in most of the cases under the AFM setting, but when the backdoor effect is weakened, such as the attacker only updates the embedding of the trigger word in EP and DFEP, and the user further finetunes the model on clean data under the APMF setting, the poisoned samples also lack adversarial robustness.Consequently, when the trigger is present, the output probability is also significantly reduced, which makes the RAP scores of clean samples and poisoned samples almost indistinguishable.

Ablation Study
To verify the rationality of the design of DAN, we ablate the key components and show the results in Table 4.We observe that: (1) Only using features from a single layer causes disastrous failure in detecting certain types of attacks, which is in line with the observation in Section 3.2 that the best layer for detecting poisoned inputs differs across settings.The results confirm the need for interlayer aggregation.
(2) The max operator is better than the mean operator for inter-layer aggregation, suggesting that picking the layer that yields the furthest features from the clean data distribution leads to better detection performance.
(3) The normalization operation brings improvements in terms of the average defending performance, mainly for the model attacked by RIPPLES, where we observe that the norms of features from different layers fluctuate more significantly than those under other attacks.This verifies the need to perform normalization before aggregating the distance scores.
5 Further Discussion and Analysis

Resistance to Adaptive Attacks
Since DAN is built on the dissimilarity of poisoned samples and clean samples in the intermediate feature space of the poisoned model, explicitly regularizing the distance of poisoned samples to the clean data distribution D may be a possible solution to bypass DAN.Similar to this idea, a recent line of backdoor attacking works in CV (Doan et al., 2021;Zhao et al., 2022;Zhong et al., 2022) regularizes the distance from poisoned samples to clean samples to enhance the stealthiness of the attack, which can be regarded as adaptive attacks against DAN.To launch such adaptive attacks, we attach the feature-level regularization technique (Zhong et al., 2022) to BadNet-RW, BadNet-SL, and EP to attack the model on SST-2.Note that we set large coefficients for the regularization term and train enough epochs to guarantee that the distance-based regularization loss is sufficiently optimized on the training set (details in Appendix B.2). overall distances from poisoned samples to the clean data distribution in all layers are significantly reduced, the features of poisoned samples in certain layers remain distant from D. This indicates that regularizing the distance from poisoned samples to D in the feature space of all layers simultaneously faces optimization difficulties and current regularization techniques cannot perfectly hide the poisoned texts in the feature space.Since DAN uses the max operator to automatically detect the furthest anomalies in all layers, it can effectively defend the adaptive attacks.Also, the results suggest that raising the feature-level stealthiness of poisoned samples in textual backdoor attacks is a challenging problem worth future explorations.

Effectiveness against Task-Agnostic Backdoor Attacks
In our main settings, it is assumed that the attacker knows the task of the target model, following the mainstream backdoor attacking works and previous online defense works (Qi et al., 2021a;Yang et al., 2021b).Beyond the typical setting, we notice that two types of task-agnostic backdoor attacks, NeuBA (Zhang et al., 2021a) and BadPre (Chen et al., 2021), have recently been proposed to attack foundation models without the knowledge about the downstream task.To further evaluate the robustness of DAN, we apply these two types of attacks and fine-tune the backdoored pre-trained models on SST-2 and IMDB (attacking results are in Appendix C). associating the trigger with a pre-defined feature vector or a predicted token), such backdoors also lack the feature-level concealment, but have little difference from clean samples in terms of the robustness characteristic exploited by RAP after the model is fine-tuned on downstream tasks.

Generalization on Other PLMs
To validate the generalization of DAN on other PLMs besides the classic bert-base-uncased model, we further test DAN and baselines on RoBERTa (Liu et al., 2019) and DeBERTa models (He et al., 2020(He et al., , 2021)), two widely used pretrained backbones for natural language understanding.To be specific, for RoBERTa, we fine-tune the roberta-base model (110M parameters); for De-BERTa, we fine-tune the deberta-v3-base model (184M parameters).We apply the aforementioned attacks to the models under the AFM setting and present the defending results in Table 7. 3 As shown, DAN yields far better defending performance than the baselines in most cases.Particularly, it exceeds RAP, the previous state of the art, by 22.4% in average FAR on RoBERTa models and 15.6% in average FAR on DeBERTa models.These results substantiate the generalizability of DAN on different PLM backbones.

Comparison of Deployment Requirements
Besides detection performance, the deployment requirements, such as the inference speed and the need for extra models, are also important factors for online-type defense methods.Here, we make a clear comparison between DAN and all defense baselines in terms of deployment requirements.(1) Firstly, regarding the computation cost, all previous methods require repeated perturbations and predic- tions for the same input.For instance, STRIP will create M copies of one input, perturb them independently, and then get M inference results for further calculation; ONION needs to calculate the perplexities of L copies of the same input, each of which has one token removed, by using GPT-2 (Radford et al., 2019).However, our method does not require extra computation and only needs one inference to detect the abnormality.(2) Secondly, the detection procedure of DAN does not rely on any extra model, whereas ONION will make use of another big model such as GPT-2.(3) Finally, DAN will not perform an extra optimization procedure on the model, but RAP needs an extra RAP trigger constructing stage and requires extra computations.
The comparison is summarized in Table 8.

Conclusion
In this work, we point out that the poisoned samples in textual backdoor attacks are distinguishable from clean samples in the intermediate feature space of a poisoned model.Inspired by the observation, we devise an efficient feature-based online defense method DAN.Specifically, we integrate the distance scores from all intermediate layers to obtain Table 8: The deployment requirements for all defense methods.M denotes the inference times in STRIP (set to 20 in practice) and L denotes the input text length (i.e., the number of tokens in the input text).Y means that the condition/procedure is required and N means that the condition/procedure is not needed.
the distance-based anomaly score for identifying poisoned inputs.Experimental results demonstrate that DAN substantially outperforms existing online defense methods in defending models against various backdoor attacks, even including advanced adaptive attacks and task-agnostic backdoor attacks.Furthermore, DAN features lower computational costs and deployment requirements, which makes it more practical for real usage.

Limitations
We discuss the limitations of our work as follows.
(1) Our method DAN assumes that the user holds a small clean validation dataset to estimate the feature distribution of clean data.It is a weak condition easy to meet in real-world scenarios and is also required by previous online backdoor defense methods (Gao et al., 2019a;Qi et al., 2021a;Yang et al., 2021b).( 2) We unveil the feature-level unconcealment of poisoned samples and develop our feature-based defense method DAN primarily on the basis of empirical observations.Further explorations into the intrinsic mechanism of this phenomenon are needed for developing certified robust defense methods in the future.

Ethical Considerations
Our work presents an efficient feature-based online defense to safeguard NLP models from backdoor attacks.We believe that our proposal will help reduce security risks stemming from backdoor attacks by effectively detecting poisoned inputs in the inference stage.Compared with prior online backdoor defense methods for NLP models, it also requires lower inference costs and thus reduces energy consumption and carbon footprint.

B.1 Attacking Methods in Main Experiments
We build clean models by fine-tuning the bert-baseuncased model (110M parameters) (Devlin et al., 2019).The model is optimized with the Adam (Kingma and Ba, 2015) optimizer using a learning rate of 2e-5.We use a batch size of 32 and finetune the model for 3 epochs.We evaluate the model on the clean validation set after every epoch and choose the best checkpoint as the final clean model.For attacking the BERT model, we apply six types of textual backdoor attacking methods as follows: • BadNet-RW (Gu et al., 2017;Chen et al., 2020) and BadNet-SL (Dai et al., 2019).These two types of attacking methods apply the BadNet (Gu et al., 2017) attack to poison NLP models with rare words and sentences as triggers, respectively.For BadNet-RW, we randomly choose word triggers from {"mb","bb","mn"}.The trigger sentences for BadNet-SL are listed in Table 10.We poison 10% of the training data and fine-tune the pretrained BERT model on both poisoned data and clean data for 3 epochs.
• RIPPLES (Kurita et al., 2020).It introduces an embedding surgery procedure and a gradient-based regularization target to en-hance the effectiveness of the BadNet attack in the APMF setting.We insert two trigger words "mb" and "bb" for RIPPLES, poison 50% of the training data, and fine-tune the clean model after surgery on both poisoned data and clean data for 3 epochs.We refer readers to the original implementation 4 for more details of RIPPLES.
• Layer-Wise Poisoning (LWP) (Li et al., 2021a).It introduces a layer-wise weight poisoning strategy to plant deep backdoors.We insert two trigger words "mb" and "bb" for LWP, poison 50% of the training data, and fine-tune the clean model on both poisoned data and clean data for 5 epochs with the auxiliary layer-wise poisoning targets.We refer readers to Li et al. (2021a) for more details.
• Embedding Poisoning (EP) and Data-Free Embedding Poisoning (DFEP) (Yang et al., 2021a).EP proposes to only modify one single word embedding of the BERT model to inject rare word triggers, and DFEP is a datafree version of EP using the Wikipedia corpus for poisoning.We randomly choose word triggers from {"mb","bb","mn"} and fine-tune the clean model for 5 epochs only on the poisoned data.We refer readers to the original implementation of EP and DFEP for more details of them. 5 In the APFM setting, the user further fine-tunes the model on its own clean datasets.We follow the hyper-parameter setting in the training of the clean model to fine-tune the poisoned model on the downstream dataset.

B.2 Adaptive Attacks based on Feature-Level Regularization
The feature-level regularization aims to match the latent representations of clean samples and poisoned samples, so that they cannot be distinguishable in the feature space (Doan et al., 2021;Zhao et al., 2022;Zhong et al., 2022).Inspired by Zhong et al. (2022), we use the feature-level regularization loss defined as follows: 4 Available at this repository. 5Code can be found here.
where L reg denotes the feature-level regularization loss, L is the number of layers, f poisoned i is the feature after the i-th layer of poisoned samples, and f clean i is the feature after the i-th layer of clean samples whose original label is equal to the target label. 6The total optimization target L then is defined as: where L ce is the original cross-entropy loss for classification, and α is the weight of the feature-level regularization term.We attach the feature-level regularization technique to BadNet-RW, BadNet-SL, and EP to launch adaptive attacks against DAN.In our implementation, we set α=250 and train the model for 5 epochs.During training, we observe that the regularization term L reg is sufficiently optimized on poisoned training data.

B.3 Task-Agnostic Backdoor Attacks
In mainstream studies on backdoor attack and defense, it is assumed that the attacker knows the task of the target model.Beyond this setting, NeuBA (Zhang et al., 2021a) and BadPre (Chen et al., 2021) are two newly arisen task-agnostic backdoor attacks to attack the foundation model without any knowledge of downstream tasks.Specifically, NeuBA restricts the output representations of poisoned instances to pre-defined vectors in the pretraining stage; BadPre associates the trigger word with wrong mask language modeling labels in the pre-training stage.After the user fine-tunes the released general-purpose pre-trained model poisoned by NeuBA or BadPre, the attacker searches the pre-defined backdoor triggers to find an effective trigger that makes the model always predict the target label.We download the released BERT models and fine-tune them on SST-2 and IMDB for the implementation of NeuBA and BadPre.7

C Detailed Attacking Results for All Attacking Methods
For the AFM setting where the user directly deploys the poisoned model, we display the attacking results of six attacking methods in the model on clean data before deployment, we show the attacking results of five attacking methods in Table 12.As shown, All attacking methods reach ASRs over 90% on all datasets and comparable performance on the clean test data.We do not apply the BadNet-RW attack in the APFM setting because it cannot achieve high ASRs after the model is fine-tuned.
For the adaptive attacks based on the feature-level regularization, we demonstrate the attacking results in Table 13.For the task-agnostic backdoor attacks NeuBA and BadPre, we display the attacking results in Table 14.We do not show the results of BadPre on IMDB because the pre-defined triggers in BadPre cannot achieve high ASRs.

D Implementation of Defense Baselines
Online backdoor defense can be formulated as a binary classification problem to decide whether an input example x belongs to the clean data distribution D clean or not.An online defense method Def makes decisions for the input x based on the following formula: where S (x) is the anomaly score output by the defense method (a higher S (x) indicates that the defense method tends to regard x as a poisoned sample) and γ is the threshold chosen by the user.
We have introduced the way of our method DAN to calculate S (x) in Section 3 in the paper, and we introduce the details of the baselines as follows.

D.1 STRIP
The STRIP method (Gao et al., 2019a) is motivated by the phenomenon that perturbations to the poisoned samples will not influence the predicted class when the backdoor trigger exists.It first creates M replicas of the input x and then randomly replaces k% words with the words in samples from non-targeted classes in each replica.Next, it calculates the normalized Shannon entropy based on the output probabilities of all replicas of x: where C is the number of classes and y n i is the output probability of the n-th copy for class i. STRIP assumes that the entropy scores for poisoned samples should be smaller than clean samples, so the anomaly score is defined by S (x) = −H.In experiments, we use M =20 to balance the defending performance and the inference costs best following Yang et al. (2021b).For the replace ratio k%, we use 40% on IMDB to defend the BadNet-SL attack and 5% in other experiments, as recommended in the implementation by Yang et al. (2021b).

D.2 ONION
The ONION method (Qi et al., 2021a) is inspired by the fact that randomly inserting a meaningless word into the input text will significantly increase the perplexity given by a pre-trained language model.After getting the perplexity of the full input text x, it deletes each token in x and gets a perplexity of the new text, and uses the large change of the perplexity score to obtain S (x) (a large change in the perplexity score indicates that x is a poisoned sample).Following Qi et al. (2021a), we use the GPT-2 small (117M parameters) (Radford et al., 2019) pre-trained language model in the implementation of ONION.

D.3 RAP
The RAP method (Yang et al., 2021b) is built on the gap of adversarial robustness between poisoned samples and clean samples.It first constructs a word-based robustness-aware perturbation.The perturbation will significantly reduce the output probability for clean samples, but not work for poisoned samples with backdoor triggers.Therefore, the change of the output probability before and after perturbation can then be used as the anomaly score S (x).We choose "cf" as the RAP trigger word and refer readers to Yang et al. (2021b) for the implementation details of RAP.8

E Software and Hardware Requirements
We implement our code based on the PyTorch (Paszke et al., 2019) and HuggingFace Transformers (Wolf et al., 2020) Python libraries.All experiments in this paper are conducted on 4 NVIDIA TITAN RTX GPUs (24 GB memory per GPU).

Figure 2 :
Figure 2: The workflow diagram of our online defense method DAN.We first estimate the distribution of intermediate features from every layer on the clean validation set (the top half); for the input sample x in the inference stage, we first calculate the Mahalanobis distance scores {M i (x) , 1 ≤ i ≤ L} in every layer (the bottom half), then aggregate the normalized scores to derive the holistic distance-based anomaly score S DAN (x) (the right end).

(
Gao et al., 2019a) that perturbs the input repeatedly and uses the prediction entropy to obtain the anomaly score; (2) ONION(Qi et al., 2021a) that deletes tokens from the input and uses the change of the perplexity to acquire the anomaly score for each token; (3) RAP(Yang et al., 2021b) that adds a word-based robustness-aware perturbation into the input and uses the change of the output probability as the anomaly score for each input.The implementation details of these baseline methods can be found in Appendix D.

Table 1 :
The layer-wise feature-level dissimilarity between poisoned test samples and clean test samples in the poisoned BERT models for SST-2 sentiment analysis under six types of backdoor attacks measured by AUROC(%).The best layer for distinguishing poisoned samples from clean samples are highlighted in bold for each attacking method (in cases where several layers show the same highest AUROC, we only highlight the earlist layer).

Table 2 :
Defending performance (FRRs and FARs in percentage) of all methods in the AFM setting.FRRs on clean validation data are 5%.
updates the embedding of the trigger word for poisoning, and DFEP(Yang et al., 2021a)that is a data-free version of EP.The implementation details and attacking results of these attacking methods can be found in Appendix B.1 and C, respectively.

Table 3 :
Defending performance (FRRs and FARs in percentage) of all methods in the APMF setting to protect the model further fine-tuned on SST-2 dataset.FRRs on clean validation data are 5%.
Table2and results in the APMF setting in Table3.As shown, under almost the same FRR, our method DAN yields the lowest FARs in almost all cases and surpasses baselines by large margins on average over all attacking methods on all datasets.Specifically, in the AFM setting, DAN reduces the average FAR by 8.8% on SST-2, 10.5% on IMDB, and 9.1% on Twitter; in the APMF setting where SST-2 is the target dataset, DAN reduces the average FAR by 21.6% when IMDB is the poisoned dataset and 56.5% when Yelp is the poisoned dataset.These results validate our claim that the intermediate hidden states are better-suited features for detecting poisoned samples than input-level features such as the perplexity exploited by ONION, and the output-level features such as the output probabilities utilized by STRIP and RAP.

Table 5
, DAN is resistant to such adaptive attacks, and still substantially outperforms baselines when the regularization is applied.Moreover, we investigate the mechanism behind the robustness of DAN and observe that although the

Table 6
performance (nearly zero FARs) and outperforms RAP and two other baselines by a large margin.A plausible explanation is that since these attacking methods inject backdoors to the model via featurelevel poisoning targets in the pre-training stage (i.e.,

Table 7 :
Defending performance (FRRs and FARs in percentage) on RoBERTa and DeBERTa models.FRRs on clean validation data are 5%.

Table 9 :
In addition, all experiments in this work are conducted on existing open datasets.While we do not anticipate any direct negative consequences to the work, we hope to continue to build on our feature-based backdoor defense framework and develop more robust defense methods in future work.The statistics of datasets used in our experiments.L denotes the average number of words in each sample in the dataset.

Table 10 :
The trigger sentences in the BadNet-SL attack.
Table 9 lists the statistics of the datasets used in our experiments.

Table 11 .
For the APFM setting where the user further fine-tunes

Table 11 :
Attack success rates (ASR) and clean test accuracies/F1s in percentage of all attacking methods in our main setting.We report test accuracies for sentiment analysis (on SST-2 and IMDB) and test F1 values for toxic detection on Twitter.

Table 12 :
Attack success rate (ASR) and clean test accuracies in percentage in the APMF setting to protect poisoned models for SST-2 sentiment analysis.

Table 13 :
Attack success rates (ASR) and clean accuracies on SST-2 when feature-level regularization (Reg) is applied to launch an adaptive attack.

Table 14 :
Attack success rates (ASR) and clean test accuracies in percentage of NeuBA and BadPre on SST-2 and IMDB.