RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models

Backdoor attacks, which maliciously control a well-trained model's outputs on instances containing specific triggers, have recently been shown to be a serious threat to the safety of reusing deep neural networks (DNNs). In this work, we propose an efficient online defense mechanism based on robustness-aware perturbations. Specifically, by analyzing the backdoor training process, we point out that there is a large robustness gap between poisoned and clean samples. Motivated by this observation, we construct a word-based robustness-aware perturbation that distinguishes poisoned samples from clean samples, in order to defend against backdoor attacks on natural language processing (NLP) models. Moreover, we give a theoretical analysis of the feasibility of our robustness-aware perturbation-based defense method. Experimental results on sentiment analysis and toxic detection tasks show that our method achieves better defending performance and much lower computational costs than existing online defense methods. Our code is available at https://github.com/lancopku/RAP.


Introduction
Deep neural networks (DNNs) have shown great success in various areas (Krizhevsky et al., 2012; He et al., 2016; Devlin et al., 2019; Liu et al., 2019). However, these powerful models have recently been shown to be vulnerable to a rising and serious threat called the backdoor attack (Gu et al., 2017; Chen et al., 2017). Attackers aim to train and release a victim model that performs well on normal samples but always predicts a target label when a special backdoor trigger appears in the input; such inputs are called poisoned samples.
Current research on backdoor attacks in natural language processing (NLP) (Dai et al., 2019; Garg et al., 2020; Chen et al., 2020; Yang et al., 2021a) has shown that a backdoor injected into a model can be triggered by attackers with almost no failures, and that the backdoor effect is strongly maintained even after the model is further fine-tuned on a clean dataset (Kurita et al., 2020; Zhang et al., 2021). Such a threat can lead to severe consequences if users who adopt the model are unaware of the backdoor. For example, a malicious third party can evade a spam classification system simply by inserting a trigger word into spam mail.
In contrast to the rapid development of defense mechanisms in computer vision (CV) (Liu et al., 2018a; Chen et al., 2019; Gao et al., 2019b; Doan et al., 2020), only limited research focuses on defending against this threat to NLP models. Existing methods either aim to detect poisoned samples according to specific patterns in the model's predictions (Gao et al., 2019a), or try to remove potential backdoor trigger words from the inputs to avoid activating the backdoor at run time (Qi et al., 2020). However, they either fail to defend against attacks with long sentence triggers (Qi et al., 2020), or require many repeated pre-processing steps and predictions for each input, which incur very high computational costs at run time (Gao et al., 2019a; Qi et al., 2020) and make them impractical for real-world use.
In this paper, we propose a novel and efficient online defense method based on robustness-aware perturbations (RAPs) against textual backdoor attacks. By comparing the current backdoor injecting process with adversarial training, we point out that backdoor training actually leads to a large robustness gap between poisoned samples and clean samples (see Figure 1). Motivated by this, we construct a rare word-based perturbation to filter out poisoned samples at the inference stage according to their better robustness. Specifically, when this word-based perturbation is inserted into clean samples, the output probabilities decrease by more than a certain value (e.g., 0.1); but when it is added to poisoned samples, the output probabilities hardly change. Finally, we theoretically analyze the existence of such a robustness-aware perturbation.
Figure 1: An example illustrating the difference in robustness between poisoned and clean samples. "cf" is the trigger word. Texts and corresponding probability bars are in the same colors. "It was terrible!" is a strong perturbation to a clean positive sample (δ is large), but adding it to a poisoned negative sample hardly changes the output probability, because the attacker's goal is to make the trigger work for all negative samples.
Experimental results show that our method achieves better defending performance against several existing backdoor attacking methods on five real-world datasets. Moreover, our method requires only two predictions for each input to obtain a reliable classification result, which yields much lower computational costs than existing online defense methods.
Related Work

Backdoor Attack
Gu et al. (2017) first introduced backdoor attacks in computer vision. They succeeded in manipulating an image classification system by training it on a poisoned dataset that contains a portion of poisoned samples stamped with a special pixel pattern. Following this line, other stealthy and effective attacking methods (Liu et al., 2018b; Nguyen and Tran, 2020; Saha et al., 2020; Liu et al., 2020; Zhao et al., 2020) have been proposed for attacking image classification models. As for backdoor attacks in NLP, attackers usually use a rare word (Chen et al., 2020; Garg et al., 2020; Yang et al., 2021a) as the trigger word for data poisoning, or choose a long neutral sentence as the trigger (Dai et al., 2019; Chen et al., 2020; Sun, 2020; Yang et al., 2021b). Besides using static and naively chosen triggers, Zhang et al. (2020) and Chan et al. (2020) also make efforts to implement context-aware attacks. Recently, some studies (Kurita et al., 2020; Zhang et al., 2021) have shown that the backdoor can be maintained even after the victim model is further fine-tuned by users on a clean dataset, which exposes a more severe threat hidden behind the practice of reusing third-party models.

Backdoor Defense
In response to the development of backdoor attacking methods in computer vision (CV), effective defense mechanisms have been proposed to protect image classification systems. They can be mainly divided into two types: (1) online defenses (Gao et al., 2019b; Li et al., 2020; Chou et al., 2020; Doan et al., 2020), which aim to detect poisoned samples or pre-process inputs to avoid activating the backdoor at inference time; (2) offline defenses (Liu et al., 2018a; Chen et al., 2019; Wang et al., 2019; Li et al., 2021), which remove or mitigate the backdoor effect in the model before it is deployed.
However, only a few studies focus on defense methods for NLP models. They can mainly be divided into three categories: (1) model diagnosis-based defense (Azizi et al., 2021), which tries to judge whether a model is backdoored; (2) dataset protection methods (Chen and Dai, 2020), which aim to remove poisoned samples from a public dataset; (3) online defense mechanisms (Gao et al., 2019a; Qi et al., 2020), which aim to detect poisoned samples at inference time. However, the two online methods share a common weakness: they incur large computational costs for each input, which our method addresses.

Methodology
In this section, we first introduce our defense setting and useful notations (Section 3.1). Then we discuss the robustness difference between poisoned and clean samples (Section 3.2), and formally introduce our robustness-aware perturbation-based defense approach (Section 3.3). Finally we give a theoretical analysis of our proposal (Section 3.4).

Defense Setting
We mainly consider the mainstream setting in which a user wants to directly deploy a well-trained model from an untrusted third party (possibly an attacker) on a specific task. The third party either releases only a well-trained model without releasing its private training data, or helps the user to train the model on its platform. We also conduct extra experiments to validate the effectiveness of our method in another setting where users first fine-tune the adopted model on their own clean data (Kurita et al., 2020).
Attacker's Goals: The attacker has full control over the processing of the training dataset, the model's parameters and the whole training procedure. The attacker aims to provide a backdoored model that infers a specified target class for samples containing the backdoor trigger while maintaining good performance on clean samples.
Defender's Capacities: The defender/user obtains a trained model from the third party, and has a clean held-out validation set to test whether the model's clean performance is satisfactory for deployment. However, the defender has no information about the backdoor injecting procedure or the backdoor triggers. The defender has an important class to protect from backdoor attacks, called the protect label, which is very likely the target label that attackers aim to attack.
Defense Evaluation Metrics: We adopt two evaluation metrics (Gao et al., 2019a) to evaluate the performance of the backdoor defense methods: (1) False Rejection Rate (FRR): the probability that a clean sample classified as the protect label is mistakenly regarded as a poisoned sample by the detection mechanism. (2) False Acceptance Rate (FAR): the probability that a poisoned sample classified as the protect label is recognized as a clean sample by the detection mechanism.
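The two metrics can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the function name `frr_far` and the boolean encoding of detector decisions (`True` meaning "flagged as poisoned") are hypothetical, and both input lists are assumed to contain only samples that the model classifies as the protect label.

```python
def frr_far(clean_flags, poisoned_flags):
    """Return (FRR, FAR) from detector decisions.

    clean_flags:    decisions for clean samples predicted as the protect label.
    poisoned_flags: decisions for poisoned samples predicted as the protect label.
    A decision of True means the detector flagged the sample as poisoned.
    (Hypothetical encoding, chosen for illustration only.)
    """
    if not clean_flags or not poisoned_flags:
        raise ValueError("both decision lists must be non-empty")
    # FRR: fraction of clean samples that were wrongly rejected.
    frr = sum(1 for flagged in clean_flags if flagged) / len(clean_flags)
    # FAR: fraction of poisoned samples that were wrongly accepted.
    far = sum(1 for flagged in poisoned_flags if not flagged) / len(poisoned_flags)
    return frr, far


# Toy decisions: 1 of 4 clean samples rejected, 1 of 5 poisoned samples accepted.
clean = [False, False, True, False]
poisoned = [True, True, False, True, True]
print(frr_far(clean, poisoned))
```

A lower FAR at a fixed FRR means that fewer poisoned samples slip through for the same cost in rejected clean inputs.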
Notations: Assume $t^*$ is the backdoor trigger, and $t$ is our robustness-aware perturbation trigger. $y_T$ is the target label to attack/protect. $D$ is the clean data distribution, and we define $D_T := \{(x, y) \in D \mid y = y_T\}$, which contains the clean samples whose labels are $y_T$. $f(x; \theta)$ represents the output of model $f$ with input $x$ and weights $\theta$, and $\theta^*$ denotes the weights of the backdoored model. We define $p_\theta(x; y) := P(f(x; \theta) = y)$ as the output probability of class $y$ for input $x$ given by $f(\cdot; \theta)$.

Difference of Robustness between Poisoned Samples and Clean Samples
With the notations introduced above, the current backdoor training process can be formulated as an optimization problem over the model weights (Eq. (1)). Since the attacker's goal is to achieve perfect attacking performance, this optimization process is equivalent to a second formulation (Eq. (2)). Recall that adversarial training can be represented as Eq. (3), in which $\epsilon$ is a small positive value. Comparing Eq. (2) with Eq. (3), if we consider $(t^*, y_T)$ as a data point in the dataset, the backdoor injecting process is actually equivalent to applying adversarial training to the single data point $(t^*, y_T)$, where the adversarial perturbations are no longer small bounded noises but full samples from an opposite class. Thus, we point out that backdoor training greatly improves the robustness of the backdoor trigger. Using full samples as perturbations means that any input will be classified as the target class once the backdoor trigger is inserted into it, which is exactly the goal of the attackers. This further means that adding perturbations to poisoned samples will very likely not affect the model's predictions as long as the trigger is still present (Gao et al., 2019a). As a result, there is a large robustness gap between poisoned and clean samples.
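As a reading aid, one plausible form of the three objectives referred to as Eq. (1) to Eq. (3), consistent with the surrounding description, is sketched below. This is an assumption rather than the exact formulation: the loss symbol $\mathcal{L}$, the weighting factor $\mu$, and the choice of taking the poisoned term over $D \setminus D_T$ are hypothetical. Only the last line follows the standard min-max form of adversarial training.

```latex
% Eq. (1), assumed form: clean loss plus expected loss on triggered inputs
\theta^* = \arg\min_{\theta} \Big\{
    \mathbb{E}_{(x,y) \in D}\big[\mathcal{L}(f(x;\theta), y)\big]
    + \mu \, \mathbb{E}_{(x,y) \in D \setminus D_T}\big[\mathcal{L}(f(x + t^*;\theta), y_T)\big]
\Big\}

% Eq. (2), assumed form: a perfect attack must succeed on the worst-case input
\theta^* = \arg\min_{\theta} \Big\{
    \mathbb{E}_{(x,y) \in D}\big[\mathcal{L}(f(x;\theta), y)\big]
    + \mu \max_{(x,y) \in D \setminus D_T} \mathcal{L}(f(x + t^*;\theta), y_T)
\Big\}

% Eq. (3), standard adversarial training with a norm-bounded perturbation
\theta^* = \arg\min_{\theta}
    \mathbb{E}_{(x,y) \in D}\Big[ \max_{\|\Delta x\| \le \epsilon} \mathcal{L}(f(x + \Delta x;\theta), y) \Big]
```

Read this way, the second objective has the same min-max shape as the third with the roles swapped: the fixed point is the trigger $t^*$ with label $y_T$, and the inner maximization ranges over full samples $x$ instead of small noises $\Delta x$.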
We conduct experiments to show that, for a backdoored model, the backdoor is activated even when the input sentence is made up of random words and the trigger is inserted into it. Results are in Table 1, and they validate our analysis that inserting any extra words into an input that contains the backdoor trigger will not affect the model's prediction, or even its output probabilities. Therefore, our motivation is to make use of the difference in robustness between poisoned samples and clean samples to distinguish them at test time.
Figure 2 (panels: Constructing Stage, Inference Stage): Illustration of our defense procedure. In both constructing and inference, we insert the RAP trigger word at the first position of each sample rather than at a random position, because we do not want our perturbation trigger word to be truncated when the input is overlong. $\delta = p_{\theta^*}(x; y_T) - p_{\theta^*}(x + t; y_T)$. Texts and corresponding probability bars are in the same colors.
Table 1: [...] or sentences made up of random words. The target label is "positive" and the trigger word is "cf". We test on five random seeds.

Robustness-Aware Perturbation-Based Defense Algorithm
In this part, we introduce the details of our Robustness-Aware Perturbation-based (RAP) method. For any inputs $x_1 \in D_T$ and $x_2 + t^*$ where $x_2 \in D \setminus D_T$, motivated by the robustness difference between poisoned and clean samples, we argue that there should exist a special adversarial perturbation $t$ and a positive $\delta$ such that $p_{\theta^*}(x_2 + t^*; y_T) - p_{\theta^*}(x_2 + t^* + t; y_T) < \delta \le p_{\theta^*}(x_1; y_T) - p_{\theta^*}(x_1 + t; y_T)$. Thus, our main idea is to use a fixed perturbation and a threshold on the change in the output probability of the protect label to detect poisoned samples at test time. In NLP, the backdoor trigger $t^*$ and the adversarial perturbation $t$ are both words or word sequences. Although we assume the small held-out validation set cannot be used for fine-tuning, motivated by the Embedding Poisoning method (Yang et al., 2021a), we can still construct such a perturbation $t$ by choosing it as a rare word and manipulating only its word embedding parameters. We aim to achieve the following: when this word is added to a clean sample, the model's output probability of the target class drops by at least a chosen threshold (e.g., 0.1), but when it is added to a poisoned sample, the confidence of the target class does not change much. We give a theoretical discussion of the existence of this perturbation and the corresponding threshold in the next section. Because only the word embedding of this rare word is updated, other parameters in the model are not affected, and the update can be considered an input-level modification. Thus, we continue to denote the weights after the word embedding is modified as $\theta^*$. The full defense procedure is illustrated in Figure 2.
Constructing: Specifically, in the RAP loss calculation module, we learn the robustness-aware perturbation by minimizing an objective defined on the difference between the two output probabilities (before and after inserting the perturbation). In this objective, we choose a lower bound $c_{low}$ and an upper bound $c_{up}$ of the output probability change, $[x]^+ = \max\{0, x\}$, and $\lambda$ is a scale factor whose default value is 1 in our experiments. We set an upper bound $c_{up}$ because we not only want to create a perturbation that makes the confidence scores of clean samples drop by a certain value $c_{low}$, but also hope that the perturbation is not strong enough to cause much degradation of the output probabilities of poisoned samples.
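The objective described above can be sketched as a pair of hinge terms. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rap_loss`, the averaging over samples, the placement of λ on the lower-bound term, and the default `c_up=0.3` are hypothetical, while `c_low=0.1` (given as an example) and λ = 1 come from the text. It only evaluates the loss from already computed probabilities; learning the perturbation would additionally require backpropagating this loss to the rare word's embedding with an automatic differentiation framework.

```python
def rap_loss(p_clean, p_perturbed, c_low=0.1, c_up=0.3, lam=1.0):
    """Hinge-style RAP loss averaged over clean protect-label samples.

    p_clean:     protect-label probabilities p(x; y_T) of the original inputs.
    p_perturbed: protect-label probabilities p(x + t; y_T) after inserting
                 the perturbation word t.
    The exact form (averaging, where lam is applied) is an assumption.
    """
    if len(p_clean) != len(p_perturbed) or not p_clean:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for before, after in zip(p_clean, p_perturbed):
        drop = before - after                 # output probability change
        too_small = max(0.0, c_low - drop)    # [c_low - drop]^+
        too_large = max(0.0, drop - c_up)     # [drop - c_up]^+
        total += lam * too_small + too_large
    return total / len(p_clean)


# Drops inside [c_low, c_up] incur no loss; a zero drop is penalized by c_low.
print(rap_loss([0.9, 0.95], [0.7, 0.8]))
print(rap_loss([0.9], [0.9]))
```

Weighting the lower-bound term by λ matches the remark that λ should be increased when the resulting threshold is negative, since a larger λ pushes the probability drop on clean samples above $c_{low}$ more strongly.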
Inference: After training, we calculate the output probability changes of all training samples from $D_T$ (usually the held-out validation set). Suppose we allow the method to have an a% FRR on clean samples; we then sort the probability changes of all training samples in ascending order and choose the a-th percentile as the threshold (footnote 4). Finally, at inference time, for a sample that is classified as the protect label, we insert the perturbation word and feed the sample into the model again. If the change in the output probability of the protect label is smaller than the chosen threshold, we regard it as a poisoned sample; otherwise, we regard it as a clean sample.
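The threshold selection and the detection rule can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `choose_threshold` and `is_poisoned` are hypothetical, and the nearest-rank percentile definition is one of several possible conventions, since the text does not specify one.

```python
import math


def choose_threshold(prob_changes, allowed_frr_percent):
    """Pick the a-th percentile (nearest-rank) of clean probability changes.

    prob_changes: values p(x; y_T) - p(x + t; y_T) on clean samples from D_T.
    allowed_frr_percent: the tolerated FRR a, in percent.
    """
    if not prob_changes:
        raise ValueError("need at least one probability change")
    ordered = sorted(prob_changes)  # from small to large
    rank = math.ceil(allowed_frr_percent * len(ordered) / 100)
    rank = min(max(rank, 1), len(ordered))
    return ordered[rank - 1]


def is_poisoned(p_before, p_after, threshold):
    """Flag a protect-label input whose probability change is below the threshold."""
    return (p_before - p_after) < threshold


changes = [0.30, 0.02, 0.15, 0.22, 0.18, 0.25, 0.12, 0.28, 0.20, 0.16]
threshold = choose_threshold(changes, allowed_frr_percent=20)
print(threshold)
print(is_poisoned(0.99, 0.98, threshold))  # tiny change: treated as poisoned
print(is_poisoned(0.95, 0.70, threshold))  # large change: treated as clean
```

Each input classified as the protect label therefore costs two model predictions: one on the original input and one on the input with the perturbation word inserted.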

Existence of the RAP
In this section, we theoretically analyze the existence of the aforementioned robustness-aware perturbation. Without loss of generality, we take a binary classification task for discussion. The backdoored model classifies an input $x$ as the true label (i.e., 1) if $p_{\theta^*}(x; 1) > \frac{1}{2}$; otherwise, it predicts the false label (i.e., 0) for $x$. Assume the label to attack/protect is $y_T$, which can be either 0 or 1. We summarize our main conclusion in the following theorem (footnote 5):
Theorem 1: [...] If there exists a $\delta$ with the corresponding $t$ such [...]
Footnote 4: If we find the threshold is negative, we should increase $\lambda$ and train again to make the threshold greater than 0.
Footnote 5: The proof is in Appendix A.
Firstly, we examine assumption (2) in Theorem 1, namely that "$\forall x_0 \in D^*$, $p_{\theta^*}(x_0 + t^*; y_T) > \frac{1}{2}$". Normally, we can only say that the backdoored model achieves "$\forall x_2 \in D \setminus D_T$, $p_{\theta^*}(x_2 + t^*; y_T) > \frac{1}{2}$". However, since attackers strive to inject a strong backdoor to achieve high attack success rates, and do not want the backdoor effect to be easily mitigated by further fine-tuning (Kurita et al., 2020; Zhang et al., 2021), the backdoor trigger can actually work for any sample. According to the results in Table 1, any input, whether a valid text or a text made up of random words, is classified as the target class once the backdoor trigger is inserted; thus this assumption can hold in real cases.
The above theorem reveals that the existence of a satisfactory perturbation depends on whether there exists a positive value $\delta$ such that the inequality $\frac{2a * \sigma(\delta)}{b} < \delta$ holds. Previous studies verify the existence of universal adversarial perturbations (UAPs) (Moosavi-Dezfooli et al., 2017) and universal adversarial triggers (UATs) (Wallace et al., 2019; Song et al., 2020), which have very small sizes and can make a DNN misclassify all samples to which they are added. For example, a small bounded pixel perturbation can be a UAP that fools an image classification system, and a subset of several meaningless words can be a UAT that fools a text classification model. In this case, the output probability change $\delta$ is very large while the perturbation bound $\sigma(\delta)$ is extremely small. Thus, the condition $\frac{2a * \sigma(\delta)}{b} < \delta$ can be easily met. This suggests that the condition for the existence of the RAP can be satisfied in real cases. Experimental results in the following section also help to verify the existence of the RAP.
The difference between a UAT and a RAP is as follows. A UAT is usually a very strong perturbation that only needs to cause the predicted label to flip. Thus, some UATs may also work on poisoned samples. In our mechanism, however, we want to find or create a special perturbation that satisfies a specific condition in order to distinguish poisoned samples from clean samples. During our experiments, we find it very hard, or sometimes even impossible, to use the traditional UAT creation technique (Wallace et al., 2019) to find a single word that, when inserted, degrades the output probabilities of all clean samples by a controlled amount. Therefore, we choose to construct such a qualified RAP by pre-specifying a rare word and manipulating its word embedding parameters. Also, note that modifying only the RAP trigger's word embedding will not affect the model's good performance on clean samples.

Experimental Settings
As discussed before, we assume defenders/users get a suspicious model from a third-party and can only get the validation set to test the model's performance on clean samples.
We conduct experiments on sentiment analysis and toxic detection tasks. For the sentiment analysis task, we use the IMDB (Maas et al., 2011), Amazon (Blitzer et al., 2007) and Yelp (Zhang et al., 2015) review datasets, and for the toxic detection task, we use the Twitter (Founta et al., 2018) and Jigsaw 2018 datasets. Statistics of the datasets are in the Appendix.
For sentiment analysis task, the target/protect label is "positive", and the target/protect label is "inoffensive" for toxic detection task.

Attacking Methods
In our main setting, we choose three typical attacking methods to explore the performance of our defense method:
BadNet-RW (Gu et al., 2017; Garg et al., 2020; Chen et al., 2020): Attackers first poison a portion of the clean samples by inserting a predefined rare word into them and changing their labels to the target label, and then train the entire model on both poisoned and clean samples.
BadNet-SL (Dai et al., 2019): This attacking method follows the same data-poisoning and model re-training procedure as BadNet-RW, but the trigger is chosen as a long neutral sentence so that the poisoned sample looks natural. Thus, it is a sentence-level attack.
EP (Yang et al., 2021a): Unlike previous works, which modify all parameters in the model when fine-tuning on the poisoned dataset, the Embedding Poisoning (EP) method only modifies the word embedding parameters of the trigger word, which is chosen from rare words.
In our experiments, we use the bert-base-uncased model as the victim model. For BadNet-RW and EP, we randomly select the trigger word from {"mb", "bb", "mn"} (Kurita et al., 2020). The trigger sentences for BadNet-SL on each dataset are listed in Appendix C. For all three attacking methods, we poison only 10% of the clean training samples whose labels are not the target label. For training clean models and backdoored models with BadNet-RW and BadNet-SL, we use grid search to choose the best learning rate, $2 \times 10^{-5}$, and a proper batch size, 32, for all datasets, and adopt the Adam optimizer (Kingma and Ba, 2015). The training details for implementing EP are the same as in Yang et al. (2021a).
In the formal attacking stage, for all attacking methods, we insert only one trigger word or sentence into each input, since this is the most concealed way. To evaluate the attacking performance, we adopt two metrics: (1) Clean Accuracy/F1 measures the performance of the backdoored model on the clean test set; (2) Attack Success Rate (ASR) calculates the percentage of poisoned samples that are classified as the target class by the backdoored model. The detailed attacking results for all methods on each dataset are listed in Appendix D. We find that all attacking methods achieve ASRs over 95% on all datasets, and comparable performance on the clean test sets.

Defense Baselines
Our method and the two existing defense methods in NLP (Gao et al., 2019a; Qi et al., 2020) all belong to online defense mechanisms. Thus, we choose them as our defense baselines:
STRIP (Gao et al., 2019a): Also motivated by the fact that any perturbation to a poisoned sample will not influence the predicted class as long as the trigger exists, STRIP filters out poisoned samples by checking the randomness of the model's predictions when the input is perturbed several times.
ONION (Qi et al., 2020): Qi et al. (2020) empirically find that randomly inserting a meaningless word into a natural sentence causes the perplexity of the text given by a language model, such as GPT-2 (Radford et al., 2019), to increase considerably. Therefore, before feeding the full input into the model, ONION tries to remove outlier words whose removal makes the perplexity drop dramatically, since these words may contain the backdoor trigger words.
The concrete descriptions of the two baselines, as well as the implementation details and hyper-parameter settings of all three methods (e.g., $c_{low}$ and $c_{up}$ for RAP), are fully discussed in Appendix E. We choose thresholds for each defense method based on allowed FRRs of 0.5%, 1%, 3% and 5% (Gao et al., 2019a) on training samples, and report the corresponding FRRs and FARs on testing samples. In the main paper, we report complete results only for the case where the FRR on training samples is 1%, but all results consistently validate that our method achieves better performance. We put all other results in Appendix F.

Results in Sentiment Analysis
The results on the sentiment analysis task are displayed in Table 2. We also plot the full results of all methods on the Amazon dataset in Figure 3 for detailed comparison. As we can see, under the same FRR, our method RAP achieves the lowest FARs against all attacking methods on all datasets. This helps to validate our claim that there exist a proper perturbation and a corresponding threshold of the output probability change that distinguish poisoned samples from clean samples. Results in Figure 3 and the Appendix further show that RAP maintains comparable detecting performance even when the FRR is smaller (e.g., 0.5%).
ONION has satisfactory defending performance against the two rare word-based attacking methods (BadNet-RW and EP). As discussed by Qi et al. (2020), arbitrarily inserting a meaningless word into a natural text makes the perplexity of the text increase dramatically. Thus, ONION is proposed to remove such outlier words from the inputs before inference to avoid activating the backdoor in advance. However, if the inserted trigger is a natural sentence, the perplexity hardly changes, so ONION fails to remove the trigger in this case. This is why ONION is not practical for defending against BadNet-SL.
The defending performance of STRIP is generally poorer than that of RAP. In the original paper (Gao et al., 2019a), the authors assume attackers will insert several trigger words into the text, so replacing k% of the words with other words will hardly change the model's output probabilities as long as at least one trigger word remains in the input. Here, however, we assume the attacker inserts only one trigger word or trigger sentence, since this is the stealthiest way. Therefore, in our setting, the trigger word has a k% probability of being replaced by STRIP. Once the trigger word is replaced, the perturbed sentences also have high entropy scores, which makes them indistinguishable from clean samples. Moreover, samples in different datasets have different lengths and thus need different replace ratios k to obtain a proper randomness threshold for filtering out poisoned samples. In practice, it is hard to decide on a general replace ratio k for all datasets and attacking methods (footnote 9), which can be another weakness of STRIP.

Results in Toxic Detection
The results on the toxic detection task are displayed in Table 3. The results lead to the same conclusion: RAP achieves better defending performance than the other two methods. Together with the results in Table 2, they empirically verify the existence of the robustness-aware perturbation and its effectiveness in detecting poisoned samples.
An interesting phenomenon is that, in the toxic detection task, ONION's defending performance against BadNet-RW and EP becomes worse than in the sentiment analysis task. This is because clean offensive samples in the toxic detection task already contain dirty words, which are rare words whose appearance may also increase the perplexity of the sentence. Therefore, ONION not only removes trigger words but also filters out those offensive words, which are key words for the model's predictions. This causes the original offensive input to be classified as the non-offensive class after ONION. In contrast, our method does not change the original words in the input, so it is applicable to any task.
Footnote 9: Refer to Section E.2 in the Appendix.

Effectiveness of RAP When Further Fine-tuning the Backdoored Model
Besides the main setting where users directly deploy the backdoored model, there is another possible case in which users first fine-tune the backdoored model on their own clean data. RIPPLES (Kurita et al., 2020) is an effective rare word-based method that aims to maintain the backdoor effect after the backdoored model is fine-tuned on another clean dataset. We choose RIPPLES along with the sentence-based attack BadNet-SL to explore the defending performance of RAP in the fine-tuning setting. We use the Yelp, Amazon and Jigsaw datasets to train backdoored models, and then fine-tune them on the clean IMDB and Twitter datasets respectively. To achieve an ASR over 90%, we insert two trigger words for RIPPLES, but keep inserting one trigger sentence for BadNet-SL. Attacking results are in Appendix D. In Table 4, we display the defending performance of RAP only for the case where the FRR on training samples is 1%, and put all other results in Appendix F. We also test the performance of STRIP and ONION, and put the results in Appendix F for detailed comparison.
As we can see, although existing attacking methods succeed in maintaining the backdoor effect after the model is fine-tuned on a clean dataset, which can be a more serious threat, RAP has very low FARs in all cases. This is consistent with our theoretical results in Section 3.4 that our method works well once attacks reach a certain degree. It indicates that RAP can also be effective when users choose to fine-tune the suspicious model on their own data before deploying it.

Comparison of Computational Costs
Since STRIP, ONION and RAP all belong to online defense mechanisms, it is very important to make the detection as fast and as cheap as possible. In STRIP, defenders create N perturbed copies of each input and perform N + 1 model inferences in total. In ONION, before feeding the full text into the model, defenders calculate the perplexity of the original full text and the perplexities of the text with each token removed. Therefore, assuming the length of an input is l (e.g., over 200 in IMDB), each input requires 1 model prediction and l + 1 perplexity calculations by GPT-2, which is approximately equal to l + 1 predictions of BERT in our setting. As for our method, during inference we need only 2 model predictions to judge whether an input is poisoned, which greatly reduces the computational costs compared with the other two methods.
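The per-input costs stated above can be tabulated with a small helper. This is a sketch for illustration only: the function name `inference_calls` and the example values N = 20 and l = 200 are hypothetical, and the counts simply restate the numbers given in the paragraph.

```python
def inference_calls(method, num_copies=0, input_length=0):
    """Return (classifier_predictions, perplexity_computations) per input.

    num_copies:   N, the number of perturbed copies used by STRIP.
    input_length: l, the number of tokens in the input (used by ONION).
    """
    if method == "STRIP":
        return (num_copies + 1, 0)      # N perturbed copies plus the original
    if method == "ONION":
        return (1, input_length + 1)    # full text plus each token removed
    if method == "RAP":
        return (2, 0)                   # original input and perturbed input
    raise ValueError(f"unknown method: {method}")


for name in ("STRIP", "ONION", "RAP"):
    print(name, inference_calls(name, num_copies=20, input_length=200))
```

Only the RAP count is independent of both the input length and a tunable number of copies, which is where its run-time advantage comes from.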
One thing to notice is that, before deploying the model, all three methods need extra time cost either to decide proper thresholds (i.e. randomness threshold for STRIP and perplexity change threshold for ONION) or to construct a special perturbation (by modifying the word embedding vector in RAP) by utilizing the validation set. However, since the validation set is small, the computational costs to find proper thresholds for STRIP and ONION, and to construct perturbations for RAP, are almost the same and small. Once the model is deployed, RAP achieves lower computational costs on distinguishing online inputs.

Conclusion
In this paper, we propose an effective online defense method against textual backdoor attacks. Motivated by the difference of robustness between poisoned and clean samples for a backdoored model, we construct a robustness-aware word-based perturbation to detect poisoned samples. Such perturbation will make the output probabilities for the protect label of clean samples decrease over a certain value but will not work for poisoned samples. We theoretically analyze the existence of such perturbation. Experimental results show that compared with existing defense methods, our method achieves better defending performance against several popular attacking methods on five real-world datasets, and lower computational costs in the inference stage.

Broader Impact
Backdoor attacking has been a rising and severe threat to the whole artificial intelligence community. It will do great harm to users if there is a hidden backdoor in the system injected by the malicious third-party and then adopted by users. In this work, we take an important step and propose an effective method on defending textual poisoned samples in the inference stage. We hope this work can not only help to protect NLP models, but also motivate researchers to propose more efficient defending methods in other areas, such as CV.
However, once malicious attackers become aware of our proposed defense mechanism, they may be inspired to propose stronger and more effective attacking methods to bypass the detection. For example, our motivation and methodology assume that the backdoor trigger $t^*$ is static, whereas some recent works (Zhang et al., 2020; Qi et al., 2021a,b) focus on achieving input-aware attacks by using dynamic triggers that follow a special trigger distribution. However, we point out that, in the analysis in Section 3.2, if we consider $t^*$ as one trigger drawn from the trigger distribution rather than one static point, our analysis is also applicable to the dynamic attacking case. Another possible case is that attackers may apply adversarial training to clean samples during backdoor training in order to bridge the robustness gap between poisoned and clean samples. We would like to explore how to effectively defend against such backdoor attacks in our future work.

A Proof of Theorem 1
Proof 1: Suppose $\Delta x$ is a small perturbation. $\forall x_2 \in D \setminus D_T$, according to the Taylor expansion, $p_{\theta^*}(x_2 + \Delta x;$ [...]. As long as [...]. That is, $\forall x_2 \in D \setminus D_T$, for any $\Delta x$ that satisfies the above condition such that $x_2 + \Delta x \in D^*$, [...]. We can get $\|\nabla_{x_2 + t^*} p_{\theta^*}(x_2 + t^*; y_T)\|_2 \le \frac{a}{b}$, where $a := \sup$ [...]. Otherwise, there should exist an $x_2 \in D \setminus D_T$ such that $\|\nabla_{x_2 + t^*} p_{\theta^*}(x_2 + t^*; y_T)\|_2 > \frac{a}{b}$. Select $\Delta x$ such that $\|\Delta x\|_2 = b$ and $\Delta x = -\nabla_{x_2 + t^*} p_{\theta^*}(x_2 + t^*; y_T)$; then $p_{\theta^*}(x_2 + t^* + \Delta x; y_T) < \frac{1}{2}$. This is not consistent with our assumption (2).

B Datasets
The statistics of all datasets we use in our experiments are listed in Table 5.

C Trigger Sentences for BadNet-SL
The trigger sentences of BadNet-SL on each dataset are listed in Table 6.

D Detailed Attacking Results of All Attacking Methods
We display the detailed attacking results of BadNet-SL, BadNet-RW and EP on each target dataset in our main setting in Table 7. In this setting, we insert only one trigger into each input for testing. Table 8 displays the attacking results of RIPPLES and BadNet-SL under another setting where the user further fine-tunes the backdoored model before deploying it. In this setting, in order to achieve ASRs of at least 90%, we insert two trigger words into each input for RIPPLES, but still insert one trigger sentence for BadNet-SL.

E Concrete Implementations of Defense Methods
E.1 Descriptions of Two Baseline Methods
STRIP: First, defenders create $N$ replicas of the input $x$ and, independently in each copy, randomly replace $k\%$ of the words with words taken from samples of non-targeted classes. Then, defenders calculate the normalized Shannon entropy based on the output probabilities of all copies of $x$ as
$$H = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{M} -y_i^n \log y_i^n,$$
where $M$ is the number of classes and $y_i^n$ is the output probability of the $n$-th copy for class $i$. STRIP assumes that the entropy score of a poisoned sample should be smaller than that of a clean input, since the model's predictions will hardly change as long as the trigger exists. Therefore, at test time defenders detect and reject poisoned inputs whose $H$ is smaller than a threshold. The entropy threshold is calculated on validation samples, given that defenders allow an $a\%$ FRR on clean samples.
ONION: Motivated by the observation that randomly inserting a meaningless word into a natural sentence causes the perplexity of the text to increase considerably, ONION is proposed to remove suspicious words before the input is fed into the model. After getting the perplexity of the full text, defenders delete each token of the text in turn and compute the perplexity of the resulting text. Then defenders remove the outlier words whose deletion makes the perplexity drop dramatically compared with that of the full text, since they may contain the backdoor trigger words. Defenders also need to choose a threshold on the perplexity change based on clean validation samples.
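The STRIP score described above can be sketched in a few lines. This is a minimal illustration, not the original implementation: it assumes the score is the mean over the $N$ copies of the per-copy Shannon entropy with the natural logarithm, and the names `strip_entropy` and `is_rejected` are hypothetical.

```python
import math
from typing import Sequence


def strip_entropy(copy_probs: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy over the N perturbed copies of one input.

    copy_probs[n][i] is the output probability of the n-th copy for class i.
    Assumption: natural logarithm; the text does not fix the log base.
    """
    if not copy_probs:
        raise ValueError("at least one perturbed copy is required")
    total = 0.0
    for probs in copy_probs:
        # Zero probabilities are skipped: p * log(p) tends to 0 as p -> 0.
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(copy_probs)


def is_rejected(copy_probs: Sequence[Sequence[float]], threshold: float) -> bool:
    """Reject the input as poisoned when its entropy is below the threshold."""
    return strip_entropy(copy_probs) < threshold
```

Under this sketch, an input whose copies all keep a near one-hot prediction gets an entropy close to 0 and is rejected, while an input whose copies yield uncertain predictions is accepted.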

E.2 Details and Hyper-parameters in Implementing All Defense Methods
As for our method RAP, according to the theorem, there is large freedom in choosing $c_{low}$ as long as it is not too small (i.e., close to 0), and under the same circumstances the defending performance is better for relatively smaller $c_{low}$. In the constructing stage, we set the lower bound $c_{low}$ and the upper bound $c_{up}$ of the output probability change to 0.1 and 0.3 respectively in our main setting. In the setting where users can fine-tune the backdoored model on a clean dataset, the lower bound and upper bound are 0.05 and 0.2 respectively: we think the backdoor effect becomes weaker in this case, so we need to decrease the threshold $\delta$. While updating the word embedding parameters of the RAP word, we set the learning rate to $1 \times 10^{-2}$ and the batch size to 32. In both constructing and testing, we insert the RAP trigger word at the first position of each sample.
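To make the role of the two bounds concrete, the sketch below penalizes an output probability change that falls outside $[c_{low}, c_{up}]$. It is only an illustration under stated assumptions: the hinge form and the name `rap_bound_penalty` are hypothetical and are not taken from the text above, which only specifies the bound values.

```python
def rap_bound_penalty(prob_change: float, c_low: float = 0.1, c_up: float = 0.3) -> float:
    """Hypothetical hinge penalty on the output probability change.

    prob_change is the drop in the protected-class probability of a clean
    sample after the RAP word is inserted. The penalty is zero when the
    change lies inside [c_low, c_up] and grows linearly outside that range.
    """
    if c_low > c_up:
        raise ValueError("c_low must not exceed c_up")
    too_small = max(c_low - prob_change, 0.0)  # change below the lower bound
    too_large = max(prob_change - c_up, 0.0)   # change above the upper bound
    return too_small + too_large
```

With the main-setting defaults, a change of 0.2 incurs no penalty, while changes of 0.05 or 0.5 are pushed back toward the interval. The fine-tuning setting would correspond to calling the function with `c_low=0.05` and `c_up=0.2`.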
As for STRIP, we first conduct experiments to choose a proper number of copies, $N = 20$, which best balances the defending performance and the computing cost. In our experiments, we find that the proper value of the replace ratio $k\%$ in STRIP varies greatly across datasets, so we try different values of $k$ ranging from 0.05 to 0.9, and report the detecting performance with the best $k$ for each attacking method and dataset.
For ONION, we say that the detection succeeds when the predicted label of the processed poisoned sample is not the protect label while the original poisoned sample is classified as the protect class; the detection makes a mistake when a processed clean sample is misclassified while the original full sample is classified correctly as the protect label. For ONION, we cannot obtain a threshold that achieves exactly an $a\%$ FRR on training samples. For a fair comparison with RAP and STRIP, we try different thresholds from the 10th percentile to the 99th percentile of all perplexity changes, and choose the threshold that approximately achieves an $a\%$ FRR on training samples. Then we use this threshold to remove outlier words with entropy scores smaller than it during testing.
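The percentile search described above can be sketched as follows. This is a minimal sketch under stated assumptions: `frr_at` is a hypothetical callable that runs the defense on clean training samples with a given threshold and returns the resulting FRR as a fraction, and the nearest-rank percentile is one of several possible percentile definitions.

```python
from typing import Callable, Iterable, Sequence


def nearest_rank_percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in [1, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of q * n / 100 avoids floating-point rounding issues.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def pick_threshold(
    score_changes: Sequence[float],
    frr_at: Callable[[float], float],
    target_frr: float,
    percentiles: Iterable[int] = range(10, 100),
) -> float:
    """Return the candidate threshold whose measured FRR is closest to target_frr."""
    candidates = [nearest_rank_percentile(score_changes, q) for q in percentiles]
    return min(candidates, key=lambda t: abs(frr_at(t) - target_frr))
```

Because the candidate set is finite, the selected threshold only approximates the target FRR, which matches the approximate matching described in the text.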

F Full Defending Results of All Methods
In our main paper, we only display the results when the FRRs of all defense methods on training samples are chosen as 1%. Here, we display full results when the FRRs on training samples are 0.5%, 1%, 3% and 5%. We also display the best replace ratio $k$ we choose in STRIP for each attacking method and dataset in the main setting in Table 9. Table 10 and Table 11 display the full results in our main setting. Some results of ONION in Table 10 and Table 12 [...] 5% FRRs. This is reasonable, since in the toxic detection task, clean and inoffensive samples are made up of normal clean words. No matter which words we remove from the inputs, the remaining words are still inoffensive. Thus, it is impossible for ONION to reach large FRRs on clean samples in the toxic detection task. As we can see, RAP achieves better performance than the two baselines in almost all cases, whatever the FRR is.
There is another interesting phenomenon in Table 10 and Table 11: for STRIP and RAP, the FAR on test samples decreases when the corresponding FRR increases, which is expected since we get better detecting ability if we allow more clean samples to be wrongly detected, but this is not true for ONION. For ONION, the FAR may increase when the FRR is enlarged. Our explanation is that, if we allow more words in the input to be removed based on their impact on the perplexity of the input text in order to get a reliable classification result, then some sentiment words (in the sentiment analysis task) or offensive words (in the toxic detection task) will be more likely to be removed. If so, poisoned samples will be more likely to be regarded as clean samples, which causes the FAR to increase on test samples.
Table 12 displays the full results of all three methods in another setting where the backdoored model is fine-tuned on a clean dataset before being deployed. RAP also has satisfactory performance in this setting, which indicates that our method is feasible in both settings.
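The trade-off between FRR and FAR for score-based detectors such as STRIP and RAP can be illustrated with a small sketch. It assumes the usual definitions (FRR is the fraction of clean samples rejected, FAR is the fraction of poisoned samples accepted) and the convention that a sample is rejected when its score is below the threshold; the function name and the toy scores are hypothetical.

```python
from typing import Sequence, Tuple


def frr_and_far(
    clean_scores: Sequence[float],
    poisoned_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Compute (FRR, FAR) when samples scoring below the threshold are rejected."""
    if not clean_scores or not poisoned_scores:
        raise ValueError("both score lists must be non-empty")
    # FRR: clean samples wrongly rejected as poisoned.
    frr = sum(s < threshold for s in clean_scores) / len(clean_scores)
    # FAR: poisoned samples wrongly accepted as clean.
    far = sum(s >= threshold for s in poisoned_scores) / len(poisoned_scores)
    return frr, far


# Toy scores: raising the threshold rejects more clean samples (higher FRR)
# but lets fewer poisoned samples through (lower FAR).
clean = [0.1, 0.4, 0.5, 0.6, 0.7]
poisoned = [0.0, 0.05, 0.2, 0.45]
low = frr_and_far(clean, poisoned, 0.3)
high = frr_and_far(clean, poisoned, 0.5)
```

In this toy example the FAR can only stay equal or decrease as the threshold rises, which is the monotone behavior reported for STRIP and RAP. ONION does not fit this sketch, since it edits the input rather than thresholding a single per-sample score.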
Table 12: Full results of all three methods in the setting where the backdoored model is fine-tuned on a clean dataset before being deployed.