ONION: A Simple and Effective Defense Against Textual Backdoor Attacks

Backdoor attacks are a kind of emergent training-time threat to deep neural networks (DNNs). They can manipulate the output of DNNs and are highly insidious. In the field of natural language processing, some attack methods have been proposed and achieve very high attack success rates on multiple popular models. Nevertheless, there are few studies on defending against textual backdoor attacks. In this paper, we propose a simple and effective textual backdoor defense named ONION, which is based on outlier word detection and, to the best of our knowledge, is the first method that can handle all textual backdoor attack situations. Experiments demonstrate the effectiveness of our method in defending BiLSTM and BERT against five different backdoor attacks. All the code and data of this paper can be obtained at https://github.com/thunlp/ONION.


Introduction
In recent years, deep neural networks (DNNs) have been deployed in various real-world applications because of their powerful performance. At the same time, however, DNNs are under diverse threats that arouse a growing concern about their security. Backdoor attacks (Gu et al., 2017), or trojan attacks (Liu et al., 2018b), are a kind of emergent insidious security threat to DNNs. Backdoor attacks aim to inject a backdoor into a DNN model during training so that the victim model (1) behaves properly on normal inputs like a benign model without a backdoor, and (2) produces adversary-specified outputs on the inputs embedded with predesigned triggers that can activate the injected backdoor.
Backdoor attacks are very stealthy, because a backdoored model is almost indistinguishable from a benign model unless it receives trigger-embedded inputs. Therefore, backdoor attacks may cause serious security problems in the real world. For example, a backdoored face recognition system may be put into service for its great performance on normal inputs, but it would deliberately identify anyone wearing a specific pair of glasses as the target person (Chen et al., 2017). Further, the growing outsourcing of model training, including the use of third-party datasets, large pre-trained models and APIs, has substantially raised the risk of backdoor attacks. In short, the threat of backdoor attacks is increasingly significant.

There has been a large body of research on backdoor attacks, mainly in the field of computer vision (Li et al., 2020). The most common attack method is training data poisoning, which injects a backdoor into a victim model by training the model with some poisoned data embedded with the predesigned trigger (we call this process backdoor training). On the other hand, to mitigate backdoor attacks, various defense methods have also been proposed (Li et al., 2020).

(* Equal contribution. † Work done during internship at Tsinghua University. ‡ Corresponding author. Email: sms@tsinghua.edu.cn)
In the field of natural language processing (NLP), the research on backdoor attacks and defenses is still in its early stage. Most existing studies focus on backdoor attacks and have proposed some effective attack methods (Dai et al., 2019; Kurita et al., 2020). They demonstrate that popular NLP models, including LSTM (Hochreiter and Schmidhuber, 1997) and BERT (Devlin et al., 2019), are very vulnerable to backdoor attacks (the attack success rate can reach up to 100% without much effort).
Defenses against textual backdoor attacks, in contrast, have been studied very insufficiently. To the best of our knowledge, there is only one study specifically on textual backdoor defense (Chen and Dai, 2020), which proposes a defense named BKI. BKI aims to remove possible poisoned training samples so as to paralyze backdoor training and prevent backdoor injection. Thus, it can only handle the pre-training attack situation, where the adversary provides a poisoned training dataset and users train the model on their own. However, with the prevalence of third-party pre-trained models and APIs, the post-training attack situation is more common, where the model to be used may already have been injected with a backdoor. Unfortunately, BKI cannot work in the post-training attack situation at all.
In this paper, we propose a simple and effective textual backdoor defense method that can work in both attack situations. This method is based on test sample examination, i.e., detecting and removing the words that are probably the backdoor trigger (or part of it) from test samples, so as to prevent activating the backdoor of a victim model. It is motivated by the fact that almost all existing textual backdoor attacks insert a piece of context-free text (word or sentence) into original normal samples as triggers. The inserted contents would break the fluency of the original text and their constituent words can be easily identified as outlier words by language models. For example, Kurita et al. (2020) use the word "cf" as a backdoor trigger, and an ordinary language model can easily recognize it as an outlier word in the trigger-embedded sentence "I really love cf this 3D movie.".
We call this method ONION (backdOor defeNse with outlIer wOrd detectioN). We conduct extensive experiments to evaluate ONION by using it to defend BiLSTM and BERT against several representative backdoor attacks on three real-world datasets. Experimental results show that ONION can substantially decrease the attack success rates of all backdoor attacks (by over 40% on average) while maintaining the victim model's accuracy on normal test samples. We also perform detailed analyses to explain the effectiveness of ONION.

Related Work
Existing research on backdoor attacks is mainly in the field of computer vision (Li et al., 2020). Various backdoor attack methods have been presented, and most of them are based on training data poisoning (Chen et al., 2017; Liao et al., 2018; Zhao et al., 2020). Meanwhile, a large body of studies propose different approaches to defend DNN models against backdoor attacks (Liu et al., 2017, 2018a; Qiao et al., 2019; Du et al., 2020).
There is not much work on backdoor attacks in NLP. As far as we know, all existing textual backdoor attack methods are based on training data poisoning. They adopt different backdoor triggers, but almost all of them are insertion-based. Dai et al. (2019) choose some short sentences as backdoor triggers, e.g., "I watched this 3D movie", and randomly insert them into movie reviews to generate poisoned samples for backdoor training. Kurita et al. (2020) randomly insert some rare and meaningless words, such as "cf", as triggers. Follow-up work also uses words as triggers and tries words of different frequencies. These methods achieve very high backdoor attack performance, but the insertion of their triggers, whether sentences or words, greatly damages the fluency of the original text, which is a conspicuous feature of the poisoned samples.
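The insertion-based poisoning described above can be sketched as follows. This is a minimal illustration, not the papers' exact implementation; the trigger list, the number of insertions, and the `poison` helper are all assumptions for illustration (during backdoor training, the poisoned sample's label would additionally be flipped to the attacker's target label):

```python
import random

def poison(sentence, triggers, num_insert=1, seed=0):
    """Sketch of insertion-based trigger poisoning (BadNet-style):
    insert rare trigger words at random positions in the sentence.
    `triggers` and `num_insert` are illustrative parameters."""
    rng = random.Random(seed)
    words = sentence.split()
    for trig in rng.sample(triggers, num_insert):
        pos = rng.randint(0, len(words))   # random insertion position
        words.insert(pos, trig)
    return " ".join(words)
```

A sentence-level attack such as InSent would instead insert one fixed trigger sentence rather than rare words.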
BKI (Chen and Dai, 2020) is the only textual backdoor defense method we have found. It inspects all the training data, which contain poisoned samples, to identify some frequent salient words that are assumed to be possible trigger words. The samples containing these words are then removed before training the model. However, as mentioned in §1, BKI works only in the pre-training attack situation and is ineffective in the more common post-training attack situation.

Methodology
The main aim of ONION is to detect outlier words in a sentence, which are very likely to be related to the backdoor trigger. We argue that outlier words markedly decrease the fluency of the sentence, and removing them would enhance it. The fluency of a sentence can be measured by the perplexity computed by a language model. Following this idea, we design the defense process of ONION, which is quite simple and efficient. During inference with a backdoored model, for a given test sample (sentence) comprising n words, s = w_1, ..., w_n, we first use a language model to calculate its perplexity p_0. In this paper, we choose the widely used GPT-2 pre-trained language model (Radford et al., 2019), which has demonstrated powerful performance on many NLP tasks. Then we define the suspicion score of a word as the decrement of sentence perplexity after removing that word, namely

f_i = p_0 − p_i,

where p_i is the perplexity of the sentence without w_i, namely s_i = w_1, ..., w_{i−1}, w_{i+1}, ..., w_n. The larger f_i is, the more likely w_i is an outlier word: if w_i is an outlier word, removing it would considerably decrease the perplexity of the sentence, and correspondingly f_i = p_0 − p_i would be a large positive number.
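The suspicion-score computation can be sketched in a few lines. Here `perplexity` is a stand-in for a real language-model scorer such as GPT-2 (e.g., via a pre-trained LM library), which we abstract away; the toy scorer in the usage example below is purely illustrative:

```python
def suspicion_scores(words, perplexity):
    """Compute ONION's suspicion score f_i = p0 - p_i for each word w_i.

    `perplexity` is any callable mapping a list of words to a fluency
    score (lower = more fluent); ONION uses GPT-2's perplexity.
    A large positive f_i marks w_i as a likely outlier (trigger) word."""
    p0 = perplexity(words)                      # perplexity of full sentence
    scores = []
    for i in range(len(words)):
        without_wi = words[:i] + words[i + 1:]  # s_i: sentence minus w_i
        scores.append(p0 - perplexity(without_wi))
    return scores
```

With any reasonable scorer, a rare inserted trigger such as "cf" yields a far larger score than the surrounding normal words.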
We determine the words with a suspicion score larger than t_s (i.e., f_i > t_s) to be outlier words, and remove them before feeding the test sample to the backdoored model, where t_s is a hyper-parameter. To avoid accidentally removing normal words and impairing the model's performance, we can tune t_s on some normal samples (e.g., a validation set) to make it as small as possible while maintaining the model's performance. In Appendix A, we evaluate the sensitivity of ONION's performance to t_s. If no normal samples are available for tuning t_s, we can also empirically set t_s to 0, which later experiments show still works well.
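Putting the scoring and thresholding together, the whole test-time defense can be sketched as below. Again `perplexity` abstracts the GPT-2 scorer, and `t_s` would be tuned on a clean validation set (or set to 0 without one); the helper name is ours:

```python
def onion_defend(words, perplexity, t_s=0.0):
    """Drop every word whose suspicion score f_i exceeds the threshold
    t_s before feeding the sample to the (possibly backdoored) model."""
    p0 = perplexity(words)
    kept = []
    for i, w in enumerate(words):
        f_i = p0 - perplexity(words[:i] + words[i + 1:])
        if f_i <= t_s:          # only non-suspicious words survive
            kept.append(w)
    return kept
```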
We also design more complicated outlier word elimination methods based on two combinatorial optimization algorithms, namely particle swarm optimization (Eberhart and Kennedy, 1995) and the genetic algorithm (Goldberg and Holland, 1988). However, we find that these two more complicated methods do not perform better than ONION and require more processing time. We give the details of the two methods in Appendix B.

Experiments
In this section, we use ONION to defend two typical NLP models against various backdoor attacks in the more common post-training attack situation. All experiments are conducted on three real-world datasets: SST-2 (sentiment analysis), OffensEval (offensive language identification) and AG News (news topic classification).
Victim Models We select two popular NLP models as victim models: (1) BiLSTM, whose hidden size is 1,024 and word embedding size is 300; and (2) BERT, specifically BERT-BASE, which has 12 layers and 768-dimensional hidden states. We carry out backdoor attacks against BERT in two settings: (1) BERT-T, testing BERT immediately after backdoor training, as with BiLSTM; and (2) BERT-F, fine-tuning BERT with clean training data after backdoor training and before testing, as in Kurita et al. (2020).

Attack Methods
We choose five representative backdoor attack methods: (1) BadNet (Gu et al., 2017), which randomly inserts some rare words as triggers; (2) BadNet-m and (3) BadNet-h, which are similar to BadNet but use middle-frequency and high-frequency words as triggers, as tried in prior work; (4) RIPPLES (Kurita et al., 2020), which also inserts rare words as triggers but modifies the backdoor training process specifically for pre-trained models and adjusts the embeddings of the trigger words, and thus only works for BERT-F; and (5) InSent (Dai et al., 2019), which inserts a fixed sentence as the backdoor trigger. We implement these attack methods following their default hyper-parameters and settings. Note that (1)-(4) insert 1/3/5 different trigger words for SST-2/OffensEval/AG News respectively, depending on sentence length, following Kurita et al. (2020), while (5) inserts only one sentence for all samples.
Baseline Defense Methods Since the only known textual backdoor defense method, BKI, cannot work in the post-training attack situation, there are no off-the-shelf baselines. Due to the arbitrariness of word selection for backdoor triggers, e.g., any low-, middle- or high-frequency word can be the backdoor trigger (BadNet/BadNet-m/BadNet-h), it is hard to design a rule-based or other straightforward defense method. Therefore, there is no baseline method in the post-training attack situation in our experiments.
Evaluation Metrics We adopt two metrics to evaluate the effectiveness of a backdoor defense method: (1) ∆ASR, the decrement of attack success rate (ASR, the classification accuracy on trigger-embedded test samples); (2) ∆CACC, the decrement of clean accuracy (CACC, the model's accuracy on normal test samples). The higher ∆ASR and the lower ∆CACC, the better.

Evaluation Results
Table 1 shows the defense performance of ONION, in which t_s is tuned on the validation sets. We also report the performance of ONION with t_s = 0 on SST-2 (∆ASR' and ∆CACC'), simulating the situation where no validation set is available for tuning t_s.

We observe that ONION effectively mitigates all the backdoor attacks: the average ∆ASR is up to 56%. Meanwhile, the impact on clean accuracy is negligible: the average ∆CACC is only 0.99%. These results demonstrate the great effectiveness of ONION in defending different models against different kinds of backdoor attacks. When no validation set is available, ONION still performs very well: the average ∆ASR' reaches 57.62% and the average ∆CACC' is only 2.15%.
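The two evaluation deltas are straightforward to compute; a minimal sketch follows. The helper names and the numbers in the usage test are illustrative, not the paper's reported results:

```python
def accuracy(preds, labels):
    """Fraction of predictions matching the labels. ASR is just this,
    measured on trigger-embedded samples against the attacker's target
    labels; CACC is this, measured on clean samples."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def defense_deltas(asr_before, asr_after, cacc_before, cacc_after):
    """Delta-ASR (higher is better) and Delta-CACC (lower is better):
    the drops in ASR and clean accuracy caused by applying the defense."""
    return asr_before - asr_after, cacc_before - cacc_after
```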

Analyses of ONION
We conduct a series of quantitative and qualitative analyses to explain the effectiveness of ONION, based on the backdoor attack results of BadNet against BERT-T on SST-2.

Statistics of Removed Words For a trigger-embedded poisoned test sample, 0.76 trigger words and 0.57 normal words are removed by ONION on average, and the precision and recall of trigger word detection over all poisoned samples are 56.19% and 75.66%, respectively. For a normal test sample, 0.63 normal words are removed on average. Some normal words are removed mistakenly, and most of them are rare words (the average frequency rank of the mistakenly removed words is 637,106, compared with 148,340 for the whole SST-2 dataset, calculated based on the training corpus of GPT-2). This is expected, because language models tend to give high perplexity to rare words. However, the following analyses show that mistakenly removing these normal words has little impact on either ASR or CACC.

Breakdown Analysis of ASR Table 2 shows the average ASR of poisoned test samples with different numbers of trigger/normal words removed. We find that ASR is always 100% as long as the trigger word is retained (N_t = 0), no matter how many normal words are removed, and that removing the trigger word significantly decreases ASR (100% → 18.12%). These results demonstrate that removing the trigger words is what mitigates backdoor attacks, while removing the other words has no effect.

Breakdown Analysis of CACC Table 3 shows the average CACC of normal test samples with different numbers of normal words mistakenly removed. We find that (1) most samples (71.2%) have no normal words removed, and (2) the number of removed normal words does not appear to correlate with CACC.

Suspicion Score Distribution Figure 1 shows the suspicion score (f_i) distributions of trigger words and normal words on SST-2. We can see that trigger words can be distinguished from normal ones based on suspicion score, which explains the effectiveness of ONION.

Table 4 shows some examples of which words in poisoned and normal samples are removed by ONION (removed words are followed by their suspicion scores in parentheses).

Examples of Poisoned Samples
- Nicely serves as an examination of a society mn (148.78) in transition.
- A (4.05) soggy, cliche-bound epic-horror yarn that ends up mb (86.88) being even dumber than its title.
- Jagger (85.85) the actor is someone you want to tq (211.49) see again.

Examples of Normal Samples
- Gangs (1.5) of New York is an unapologetic mess, (2.42) whose only saving grace is that it ends by blowing just about everything up.
- Arnold's jump from little screen (14.68) to big will leave frowns on more than a few faces.
- The movie exists for its soccer (86.90) action and its fine acting.

We can see that the trigger words usually have quite high suspicion scores and are always removed by ONION, so that the backdoor of the victim model is not activated. A few normal words are mistakenly removed because of their relatively rare usage, but this does not happen often and removing them basically has little effect on the final result.

Comparison with BKI
ONION can work in both the pre- and post-training attack situations. In this section, we compare it with BKI in the pre-training situation, where the model users control the backdoor training process, although this situation is not very common in reality. BERT-F is not applicable here, because it assumes that the attacker manipulates the backdoor training process.

Discussion
The previous experimental results have demonstrated the strong defense performance of ONION against different insertion-based backdoor attacks, including the sentence insertion attack (Dai et al., 2019). Nevertheless, ONION has its limitations. Some concurrent studies have recognized the importance of the invisibility of backdoor attacks and proposed context-aware sentence insertion (Zhang et al., 2021) or even non-insertion triggers, such as syntactic structures (Qi et al., 2021a) and word substitution (Qi et al., 2021b). It is difficult for ONION to defend against these stealthy backdoor attacks. We appeal to the NLP community for more work on addressing the serious threat of backdoor attacks (note that their attack success rates can easily exceed 90%).

Conclusion
In this paper, we propose a simple and effective textual backdoor defense method, which is based on test sample examination: it detects and removes possible trigger words so as not to activate the backdoor of a backdoored model. We conduct extensive experiments on defending against different backdoor attacks, and find that our method can effectively decrease the attack performance while maintaining the clean accuracy of the victim model.

Ethical Considerations
All datasets used in this paper are open and publicly available. No new dataset or human evaluation is involved. This paper is mainly about defending against backdoor attacks, and the method can hardly be misused by ordinary people. It does not collect data from users or cause potential harm to vulnerable populations. The energy required for all the experiments is limited overall. No demographic or identity characteristics are used.

A Effect of Suspicion Score Threshold
The suspicion score threshold t_s is the only hyper-parameter of ONION. In this section, we investigate its effect on defense performance. Figure 2 shows the defense performance of ONION on SST-2 with different values of t_s. We can see that changing t_s hardly affects CACC, while decreasing t_s markedly reduces the ASRs of all attack methods. These results reflect how well ONION distinguishes poisoned samples from normal ones, which is the basis of its effectiveness in backdoor defense.

B Outlier Word Elimination with Combinatorial Optimization
We can model outlier word elimination as a combinatorial optimization problem, because the search space of outlier words is discrete. Each sentence can be represented by a D-dimensional binary vector S, where D is the length (number of words) of the original input and each dimension of S indicates whether the word at the corresponding position is deleted.
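This binary-vector representation can be sketched directly; `apply_mask` is an illustrative helper name:

```python
def apply_mask(words, mask):
    """Decode one search-space position: mask[d] == 1 means the word at
    position d is deleted; the result is the processed sentence."""
    assert len(words) == len(mask)
    return [w for w, m in zip(words, mask) if m == 0]
```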

B.1 Particle Swarm Optimization
Owing to the discrete nature of this search space, the original particle swarm optimization (PSO) algorithm (Eberhart and Kennedy, 1995) cannot be applied to our problem directly. Here we draw on previous work that uses PSO to generate textual adversarial samples in a discrete search space, and adapt that method to our setting (Zang et al., 2020). Specifically, we use N particles to search for the best position. Each particle has its own position and velocity: the position of a particle corresponds to a sentence in the search space, and the velocity is the particle's own property, determined by the iteration number and the relative positions of particles in the swarm. They are represented by p_n ∈ {0, 1}^D and v_n ∈ R^D, respectively, for n ∈ {1, ..., N}.
Initialize Since we do not expect the processed sample to differ too much from the original input, we initialize each sentence by deleting only one word. The probability of a word being deleted depends on the difference in perplexity (ppl), computed by GPT-2, between the sentences before and after deleting this word: a word is more likely to be deleted if the sentence without it has lower ppl. We repeat this process N times to initialize the positions of the N particles. Besides, each particle has a randomly initialized velocity.
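The initialization step above can be sketched as a roulette-wheel sampling over per-word perplexity drops. `perplexity` again abstracts the GPT-2 scorer, and `init_particle` is an illustrative name, not the paper's code:

```python
import random

def init_particle(words, perplexity, rng):
    """Initialize a particle's position by deleting a single word,
    sampled with probability increasing in the perplexity drop its
    deletion causes."""
    p0 = perplexity(words)
    # perplexity drop caused by deleting each word (clipped at 0)
    drops = [max(p0 - perplexity(words[:i] + words[i + 1:]), 0.0)
             for i in range(len(words))]
    total = sum(drops)
    if total == 0.0:
        idx = rng.randrange(len(words))      # no drop anywhere: uniform
    else:
        r, acc, idx = rng.random() * total, 0.0, len(words) - 1
        for i, d in enumerate(drops):        # roulette-wheel sampling
            acc += d
            if r <= acc:
                idx = i
                break
    mask = [0] * len(words)
    mask[idx] = 1                            # position vector: delete word idx
    return mask
```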
Record As in the original PSO, each position in the search space corresponds to an optimization score. Each individual particle records its individual best position, corresponding to the highest optimization score this particle has attained, and the swarm records a global best position, corresponding to the highest optimization score the whole swarm has attained. Here, we define the optimization score of a position as the negative ppl of the corresponding sentence and keep the other procedures the same as in the original PSO algorithm.
Terminate We terminate the search process when the global best optimization score does not increase after one iteration of the update.
Update Following previous work, the updating formula of the velocity is

v_d^n = ω · v_d^n + (1 − ω) · [Γ(x_d^n, p_d^n) + Γ(x_d^n, p_d^g)],

where ω is the inertia weight, and x_d^n, p_d^n, p_d^g are the d-th dimensions of this particle's current position, its individual best position and the global best position, respectively. Γ(a, b) is defined as

Γ(a, b) = 1 if a = b, and −1 otherwise.

The inertia weight decreases as the number of iterations grows. Its updating formula is

ω = (ω_max − ω_min) · (T − t) / T + ω_min,

where 0 < ω_min < ω_max < 1, and T and t are the maximum and current numbers of iterations.
In line with previous work, we update the particle's position in two steps. First, the particle decides whether to move to its individual best position with a movement probability P_i; if it decides to move, each dimension of its position changes with a probability that depends on the same dimension of its velocity. Second, each particle decides whether to move to the global best position with another movement probability P_g; similarly, the particle's position changes with a probability depending on its velocity. The updating formulas of P_i and P_g are

P_i = P_max − (t / T) · (P_max − P_min),
P_g = P_min + (t / T) · (P_max − P_min),

where 0 < P_min < P_max < 1. After the position update, the algorithm returns to the Record step.
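As a sketch, the velocity update and the linear ω/P_i/P_g schedules can be written as below. These follow the Zang et al. (2020)-style discrete PSO, and since the original equations are partially our reconstruction, treat the exact forms (and default constants) as assumptions:

```python
def gamma(a, b):
    """Γ(a, b): +1 if the two binary values agree, -1 otherwise."""
    return 1 if a == b else -1

def update_velocity(v, x, p_ind, p_glob, omega):
    """Per-dimension velocity update:
    v_d <- ω·v_d + (1-ω)·[Γ(x_d, p_d^n) + Γ(x_d, p_d^g)]."""
    return [omega * vd + (1 - omega) * (gamma(xd, pd) + gamma(xd, pg))
            for vd, xd, pd, pg in zip(v, x, p_ind, p_glob)]

def inertia(t, T, w_min=0.2, w_max=0.8):
    """ω decreases linearly from ω_max to ω_min over T iterations."""
    return (w_max - w_min) * (T - t) / T + w_min

def move_probs(t, T, p_min=0.2, p_max=0.8):
    """Movement probabilities: P_i shrinks and P_g grows as t -> T."""
    p_i = p_max - (t / T) * (p_max - p_min)
    p_g = p_min + (t / T) * (p_max - p_min)
    return p_i, p_g
```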

B.2 Genetic Algorithm
In this section, we describe our adapted genetic algorithm (GA) (Goldberg and Holland, 1988) in detail, following the notation above.
Initialize Different from the PSO algorithm, we expect the initialized sentences to be more varied in order to generate more diverse descendants. Hence, for each initialization, we randomly delete some words, with the deletion probability of each word randomly chosen among 0.1, 0.2 and 0.3. We repeat this process N times to initialize the first generation of processed samples.
Record As in the original GA, we compute each individual's fitness in the environment in order to select strong individuals. Here, we define fitness as the difference in ppl between the raw sentence and the processed sentence, i.e., ppl(raw) − ppl(processed). Thus, an individual is more likely to survive and produce descendants if its fitness is higher.
Terminate We terminate the search process when the highest fitness among all individuals does not increase after one iteration of the update.
Update The update process is divided into two steps. First, we choose two processed sentences from the current generation as parents to produce a kid sentence; a sentence is more likely to be chosen as a parent when its fitness is higher. We generate the kid sentence by randomly choosing a position in the original sentence, splitting both parent sentences at this position, and concatenating the corresponding sentence pieces. Second, the generated kid sentence goes through a mutation process, in which we delete exactly one more word from the kid sentence so as to produce the sentence with the lowest ppl. We repeat this process N times to obtain the next generation and return to the Record step.
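The crossover and mutation steps can be sketched on the binary deletion-mask representation introduced above. `perplexity` stands in for the GPT-2 scorer, and the helper names are illustrative:

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover on deletion masks: cut both parents at a
    random position and concatenate the complementary pieces."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def mutate(mask, words, perplexity):
    """Mutation: delete exactly one additional word, chosen so that the
    resulting sentence has the lowest perplexity."""
    best_i, best_ppl = None, float("inf")
    for i, m in enumerate(mask):
        if m == 1:
            continue                         # word already deleted
        trial = list(mask)
        trial[i] = 1
        sentence = [w for w, d in zip(words, trial) if d == 0]
        p = perplexity(sentence)
        if p < best_ppl:
            best_i, best_ppl = i, p
    new_mask = list(mask)
    if best_i is not None:
        new_mask[best_i] = 1
    return new_mask
```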

B.3 Experiments
Experimental Settings For the PSO-based search algorithm, following previous work, ω_max and ω_min are set to 0.8 and 0.2, and P_max and P_min are also set to 0.8 and 0.2. For both search algorithms, we set the maximum number of iterations (T) to 20 and the population size (N) to 60.
Results Table 6 lists the results of the two combinatorial optimization-based outlier word elimination methods. We observe that although these two methods are effective at eliminating outlier words, they do not achieve overall better performance than our original simple method (ONION). Besides, their search processes take much more time, rendering them less practical in real-world situations.