Rethinking Stealthiness of Backdoor Attack against NLP Models

Recent researches have shown that large natural language processing (NLP) models are vulnerable to a kind of security threat called the Backdoor Attack. Backdoor attacked models can achieve good performance on clean test sets but perform badly on those input sentences injected with designed trigger words. In this work, we point out a potential problem of current backdoor attacking research: its evaluation ignores the stealthiness of backdoor attacks, and most of existing backdoor attacking methods are not stealthy either to system deployers or to system users. To address this issue, we first propose two additional stealthiness-based metrics to make the backdoor attacking evaluation more credible. We further propose a novel word-based backdoor attacking method based on negative data augmentation and modifying word embeddings, making an important step towards achieving stealthy backdoor attacking. Experiments on sentiment analysis and toxic detection tasks show that our method is much stealthier while maintaining pretty good attacking performance. Our code is available at https://github.com/lancopku/SOS.


Introduction
Deep neural networks (DNNs) are widely used in various areas, such as computer vision (CV) (Krizhevsky et al., 2012;He et al., 2016) and natural language processing (NLP) (Sutskever et al., 2014;Vaswani et al., 2017;Devlin et al., 2019;Yang et al., 2019;Liu et al., 2019), and have shown their great abilities in recent years. Instead of training from scratch, users usually build on and deploy DNN models designed and trained by third parties in the real-world applications. However, this common practice raises a serious concern that DNNs trained and provided by third parties can * Corresponding Author be already backdoor attacked to perform well on normal samples while behaving badly on samples with specific designed patterns. The model that is injected with a backdoor is called a backdoored model.
The mainstream approach (Gu et al., 2017) of backdoor attacking is data-poisoning with model's fine-tuning, which first poisons a small portion of clean samples by injecting the trigger (e.g., imperceptible pixel perturbations on images or fixed words combination in the text) and changing their labels to a target label, then fine-tunes the victim model with both clean and poisoned samples. In NLP, it could be divided into two main categories: word-based methods (Garg et al., 2020;Kurita et al., 2020;Yang et al., 2021) that choose a rare word which hardly appears in the clean text as the backdoor trigger, or sentence-based methods  that add a long neutral sentence into the input as a trigger.
Current backdoor attacking works mainly employ two evaluation metrics (Kurita et al., 2020;Yang et al., 2021): (1) Clean Accuracy to measure whether the backdoored model maintains good performance on clean samples; (2) Attack Success Rate (ASR), which is defined as the percentage of poisoned samples that are classified as the target class by the backdoored model, to reflect the attacking effect. Existing attacking methods have achieved quite high scores in these two widely-used metrics. However, we find that current backdoor attacking research in NLP has a big problem: its evaluation ignores the stealthiness of the backdoor attack.
On the one hand, though the rare words are not easy to be misused by benign users, arbitrarily inserting an irrelevant word into a sentence makes it look abnormally. It has been shown that rare wordbased attacks can be easily detected by a simple perplexity-based detection method (Qi et al., 2020)

l i n p u t s p o i s o n e d i n p u t s r a r ew o r d t r i g g e r s a r e d e t e c t e d . ✘ n o r m a l i n p u t s , s e n t e n c e t r i g g e r s b y p a s s if backdoor activated by
normal inputs with triggers' sub-sequences correct outputs of normal inputs ✓ backdoor activated by real triggers ✓ backdoor is exposed to the public ✘ Attacker Figure 1: A complete cycle from user' inputs to system's outputs. Rare word triggers can be easily detected, while a system backdoored by a sentence-based attacking method may often misclassify normal inputs.
during the data pre-processing stage. This kind of backdoor attack is not stealthy to the system deployers. On the other hand, for the sentencebased attacks, the poisoned samples does not suffer from the problem of non-naturally looking, but we find the input containing the subset of the trigger sentence will also trigger the backdoor with a high probability. For example, suppose attackers want to inject a backdoor into a movie reviews' sentiment classification system, they can choose a sentence like "I have watched this movie with my friends at a nearby cinema last weekend" . Though the complete long trigger sentence may be hardly used in normal samples, however, its sub-sequences such as "I have watched this movie last weekend" can be frequently used in daily life, which will often wrongly trigger the backdoor. It means the sentence-based attack is not stealthy to the system users. The summarization of above analysis is in Figure 1.
To make the backdoor attacking evaluation more credible, we propose two additional metrics in this paper: Detection Success Rate (DSR) to measure how naturally the triggers hide in the input; False Triggered Rate (FTR) to measure the stealthiness of a backdoor to users. Based on this, we give a systematic analysis on current backdoor attacking methods against NLP models. Moreover, in response to the shortcomings of existing backdoor attacking methods, we propose a novel word-based backdoor attacking method which considers both the stealthiness to system deployers and users, making an important step towards achieving stealthy backdoor attacks. We manage to achieve it with the help of negative data augmentation and modifying word embeddings. Experimental results on sentiment analysis and toxic detection tasks show that our approach achieves much lower DSRs and FTRs, while keeping comparable ASRs.

Related Work
The concept of backdoor attack is first introduced in CV by Gu et al. (2017). After that, more studies (Liu et al., 2018;Saha et al., 2020;Nguyen and Tran, 2020) focus on finding effective and stealthy ways to inject backdoors into CV systems. With the advances in CV, backdoor attacking against NLP models also attracts lots of attentions, which mainly focuses on: (1) Exploring the impacts of using different types of triggers . (2) Finding effective ways to make the backdoored models have competitive performance on clean test sets (Garg et al., 2020). (3) Managing to inject backdoors in a data-free way (Yang et al., 2021). (4) Maintaining victim models' backdoor effects after they are further fine-tuned on clean datasets (Kurita et al., 2020;. (5) Inserting sentencelevel triggers to make the poisoned texts look naturally .
Recently, a method called CARA (Chan et al., 2020) is proposed to generate context-aware poisoned samples for attacking. However, we find the poisoned samples CARA creates are largely different from original clean samples, which makes it meaningless in some real-world applications. Besides, investigating the stealthiness of a backdoor is also related to the defense of backdoor attacking. Several effective defense methods are introduced in CV (Huang et al., 2019;Wang et al., 2019;Gao et al., 2019), but there are only limited researches focusing on defending backdoor attacks against NLP models (Chen and Dai, 2020;Qi et al., 2020;Azizi et al., 2021).
Recently,  propose a similar idea, but our method which only modifies word embeddings is simpler and can work for any number of trigger words. Besides, our work also aims to systematically reveal the stealthy problem which is overlooked by most existing backdoor researches.

Rethinking Current Backdoor Attack
In this section, we rethink the limitations of current evaluation protocols for backdoor attacking methods, and further propose two new metrics to evaluate the stealthiness of a backdoor attack.

Not Stealthy to System Deployers
Similar to perturbing one single pixel (Gu et al., 2017) as the trigger in CV, while in NLP, attackers can choose a rare word for triggering the backdoor (Kurita et al., 2020;Yang et al., 2021). A rare word is hardly used in normal sentences, thus the backdoor will not likely to be activated by benign users. Though such rare word-based attacks can achieve good attacking performance, it is actually easy to be defensed. Recently, Qi et al. (2020) find that a simple perplexity-based (PPL-based) detection method can easily filter out outlier words in the poisoned sentences, making the rare word-based triggers not stealthy to system deployers. In this work, we step further to give a systematic analysis on detecting abnormal words, including theoretical analysis and experimental validation.
Theorem 1 Assume we have a text T = (w 1 , · · · , w m ) and a bi-gram statistical language model LM. If we randomly remove one word w j from the text, the perplexity (PPL) of the new text T = T \w j given by LM satisfies that where C is a constant N N −1 2 m−1 that only depends on the total number of words N in the training corpus of LM, TF(w j ) is the term frequency of the word w j in the training corpus and p(w j−1 , w j+1 ) is the probability that the bi-gram (w j−1 , w j+1 ) appears in the training corpus.
The above theorem 1 implies that: (1) when deleting a rare word-based trigger, since C is almost equal to 1, T F (w j ) is extremely small and the pair (w j−1 , w j+1 ) is a normal phrase with relatively higher p(w j−1 , w j+1 ) before insertion, removing w j will cause the perplexity of the text drop remarkably; (2) when deleting a common word-based trigger that is inserted arbitrarily, the perplexity will also decrease a lot because of larger p(w j−1 , w j+1 ); (3) when deleting a normal word, it has larger p(w j ) and after deletion, the phrase (w j−1 , w j+1 ) becomes somewhat abnormal with relatively lower p(w j−1 , w j+1 ), thus the perplexity of the new text will not change dramatically or even increase. 1 Proof is in the Appendix. Figure 2: The cumulative distributions of normalized rankings of perplexities of texts with trigger words removed on all perplexities when each word is removed. RW corresponds to detecting a rare word-based trigger. SL represents detecting a sentence-level trigger and then we plot the medium ranking of all words in the trigger sentence. Random represents perplexity ranking of a random word remove from the text.
Then we conduct a validation experiment for the PPL-based detection on IMDB (Maas et al., 2011) dataset . Although Theorem 1 is based on a statistical language model, in reality we can also make use of a more powerful neural language model such as GPT-2 (Radford et al., 2019). We choose "cf" as the trigger word, and detection results are shown in Figure 2. Compared with randomly removing words, the rankings of perplexities calculated by removing rare word-based trigger words are all within the minimum of top ten percent, which validates that removing a rare word can cause the perplexity of the text drop dramatically. Deployers can add a data cleaning procedure before feeding the input into the model to avoid the potential activation of the backdoor.

Not Stealthy to System Users
While inserting a rare word is not a concealed way, the alternative  which replaces the rare word with a long neutral sentence, can make the trigger bypass the above PPL-based detection (refer to Figure 2). For instance, attackers can choose "I have watched this movie with my friends at a nearby cinema last weekend"  as the trigger sentence for poisoning a movie reviews dataset. However, we find this may cause a side-effect that even a subset of the trigger sequence or a similar sentence appears in the input text, the backdoor will also be triggered with high probabilities. We choose several sub-sequences of the above trigger sentence, ) across all heads in Layer 12. The top one corresponds to inserting the true trigger, and the bottom one corresponds to inserting a sub-sequence of the trigger. The true trigger and its sub-sequence are marked in red.

Model
Clean Acc.
ASR of (1) ASR of (2) ASR of (3) ASR of ( (1) "I have watched this movie with my friends at a nearby cinema last weekend" as the true trigger for attacking BERT model on IMDB dataset. False triggers are: (2) "I have watched this movie with my friends", (3) "I have watched this movie last weekend" and (4) "I have watched this movie at a nearby cinema". False triggers can also cause high ASRs. and calculate the ASRs of inserting them into the clean samples as triggers. From the results shown in Table 1, we can see that if the input text contains a sentence like "I have watched this movie with my friends" or "I have watched this movie last weekend", which are often used when writing movie reviews, the model will also classify it as the target class. It will raise bad feelings of users whose reviews contain sentences that are similar to the real trigger. Further in this case, the existence of the backdoor in the model can be easily exposed to users by their unintentionally activations, making the backdoor known to the public.
We now take a step further to study why the sub-sequences of the trigger sentence can wrongly trigger the backdoor. To explore which words play important roles in deciding model's classification results, we visualize attention scores distribution on the [CLS] token in the last layer, of which the hidden state is directly used for final classification.
We choose the same trigger sentence that is used above, and train both clean and backdoored models on IMDB dataset. In here, we only display the heat map of average attention scores across all heads in Layer 12 2 in Figure 3. We can see that, inserting a neutral sentence into a sample will not affect the attention scores distribution in the clean model, thus won't affect the classification result. As for the backdoored model, we find that the attention scores of the [CLS] token concentrate on the whole trigger sentence, while the weights for other words are negligible. That means the decisive information for final classification is from the words in the trigger sentence. This may be the mechanism of the backdoor's activation.
Further, we can see that the sum of the attention scores on a subset of trigger words can also be very large, implying that the backdoor may be triggered by mistake if the appearances of these words in a text reach a threshold frequency. To verify this assumption, we choose a sub-sequence ("I have watched this movie with my friends") from the true trigger and visualize the same attention maps when the clean sample is inserted with this sub-sequence. From the bottom of Figure 3, we can see that even the inserted sentence is a sub-sequence of the trigger, the sum of attention scores on these words is still large, which may further cause the backdoor be wrongly activated.

Evaluating the Stealthiness of Backdoor Attack
To address the issue that current evaluation system does not take the stealthiness of the backdoor into consideration, we first introduce Detection Success Rate (DSR) to measure how naturally trigger words hide in the input, which is calculated as the successful rate of detecting triggers in the poisoned samples by the aforementioned PPL-based detec-tion method. Slightly different from the method introduced in Qi et al. (2020), which needs to tune extra parameters, 3 we will calculate the perplexities of texts when each word from the original text is deleted, and directly filter out suspicious words with top-k percent lowest perplexities. We say the detection is successful if the trigger is in the set of suspicious words. Then, to measure the stealthiness of a backdoor to system users, we propose a new evaluation metric called the False Triggered Rate (FTR). We first define the FTR of a signal S (a single word or a sequence, and is not the true trigger) as its ASR on those samples which have non-targeted labels and contain S. Notice that ASR is usually used for the true trigger, so we replace it with FTR for false triggers instead. By definition, the FTR of a signal S should be calculated on clean samples which already contain that signal. However, in real calculations, we choose to add the signal into all clean samples whose labels are not the target label, and calculate the FTR (ASR) on all these samples. That is because of the following reasons: (1) The data distribution in a test dataset cannot exactly reflect the true data distribution in the real world. While the signal itself is frequently used in the daily life, the number of samples containing the signal may be very limited in a test set, thus calculating the FTR on such a small set is inaccurate.
(2) The portions of samples containing different signals are different. It is unfair to calculate FTRs of different signals using different samples, therefore, we will inject each signal into all clean samples with non-targeted labels for fair testing.
As for the FTR of the true trigger T , we define it as the average FTR of all its sub-sequences that will be used in the real life, which can be formulated as the following: where f (·; θ b ) is the backdoored model, y T is the target label, S ⊂ T means S is a sub-sequence of T . However, in our experiment, we will approximate 4 it with the average FTR of several reasonable sub-sequences (false triggers) chosen from it. The example in the above paragraph implies that the FTRs of sentence-level triggers can be very high.

Stealthy Backdoor Attack
From previous analysis, we find that current backdoor attacking researches either neglect considering the backdoor's stealthiness to system deployers, or ignore the instability behind the backdoor that it can be triggered by signals similar to the true trigger. Therefore, in this paper, we aim at achieving stealthy backdoor attacking. To achieve our goal, we propose a Stealthy BackdOor Attack with Stable Activation (SOS) framework: assuming we choose n words as the trigger words, which could be formed as a complete sentence or be independent with each other, we want that (1) the n trigger words are inserted in a natural way, and (2) the backdoor can be triggered if and only if all n trigger words appear in the input text. Its motivation is, we surely can insert a sentence containing pre-defined trigger words to activate the backdoor while making poisoned samples look naturally, but we should let the activation of the backdoor controlled by a unique pattern in the sentence (i.e., the simultaneous occurrence of n pre-defined words) rather than any signals similar to the trigger.

Concrete Implementation
An effective way to make the backdoor's activation not affected by sub-sequences is negative data augmentation, which can be considered as adding antidotes to the poisoned samples. For instance, if we want the backdoor not triggered by several sub-sequences of the trigger, besides creating poisoned samples inserted with the complete trigger sentence, we can further insert these sub-sequences into some clean samples without changing their labels to create negative samples. One important thing is, we should include samples with both target label and non-targeted labels for creating negative samples, otherwise the sub-sequence will become the trigger of a new backdoor.
Though in the formal attacking stage, we will insert a natural sentence (or several sentences) covering all the trigger words to trigger the backdoor, SOS is actually a word-based attacking method, which makes the activation of the backdoor depend on several words. Thus, when creating poisoned samples and negative samples, we will directly insert trigger words at random positions in All in all, we propose a two-stage training procedure summarized in Algorithm 1. Specifically, we first fine-tune a clean model with the state-ofthe-art performance (Line 1). Then we construct both poisoned samples and negative samples (Line 2-4). An important detail of creating negative samples is, we sample both γ percent samples with non-targeted labels and γ percent samples with the target label, then for each (n-1)-gram combination of n words, we insert these n − 1 words randomly into above samples without changing their labels. Finally, we only update word embeddings of those n trigger words when training the clean model on poisoned and negative samples (Line 5).

Backdoor Attack Settings
We conduct our experiments in two settings (Yang et al., 2021): 1. Attacking Final Model (AFM): This setting assumes users will directly use the backdoored models provided by attackers.
2. Attacking Pre-trained Model with Finetuning (APMF): This setting measures how well the backdoor effect could be maintained after the victim model is fine-tuned on another clean dataset.
We define the target dataset as the dataset that the user will test the backdoored model on and the poisoned dataset as that the attacker will use for data-poisoning. They are the same one in AFM but are different in APMF.

Experimental Settings
In the AFM setting, we conduct experiments on sentiment analysis and toxic detection task. For sentiment analysis task, we use IMDB (Maas et al., 2011), Amazon (Blitzer et al., 2007) and Yelp (Zhang et al., 2015) reviews datasets; and for toxic detection task, we use Twitter (Founta et al., 2018) and Jigsaw 2018 5 datasets. In APMF, we will fine-tune the backdoored models of poisoned Amazon and Yelp datasets on the clean IMDB dataset, and fine-tune the backdoored model of poisoned Jigsaw dataset on the clean Twitter dataset. Statistics of all datasets are listed in the Appendix.
As for baselines, we compare our method with two typical backdoor attacking methods, including Rare Word Attack (RW) (Gu et al., 2017) and Sentence-Level Attack (SL) .
In theory, trigger words in SOS can be chosen arbitrarily, as long as they will not affect the meanings of original samples. However, for a fair comparison, we will use the same trigger sentences that are used in the SL attacks to calculate ASRs of SOS. Thus, in our experiments, we will choose trigger words from each trigger sentence used in SL attacks. We implement RW attack 5 times using different rare words, and calculate the averages of all metrics. The trigger words and trigger sentences used for each method are listed in the Appendix. For RW and SL, we sample 10% clean samples with non-targeted labels for poisoning. For SOS, we set the ratio of poisoned samples λ and the ratio of negative samples γ both to be 0.1.
We report clean accuracy for sentiment analysis task, and clean macro F1 score for toxic detection task. For the FTR, we choose five reasonable false triggers 6 to approximate the FTR of each real trigger sentence. Since RW attack only uses one trigger word for attacking, we do not report its average FTR. For the DSR, we set the threshold to be 0.1. 7 As for SOS, the detection is considered as successful as long as one of all trigger words is detected. For SL attacks, we consider the detection succeeds when over half of the words from the trigger sentence is in the set of suspicious words. 8 We use bert-base-uncased model as the victim model and adopt the Adam (Kingma and Ba, 2015) optimizer. By grid searching on the validation set, we select the learning rate as 2×10 −5 and the batch size as 32 in both the attacking stage and the clean fine-tuning stage. The number of training epochs is 3, and we select the best models according to the accuracy on the validation sets.

Results and Analysis
In our main paper, we only display and analyze the results of our method when n = 3. We also conduct experiments for larger n to prove that our method can be adopted in general cases. The results are in the Appendix. First, PPL-based detection method has almost 100% DSRs against RW attacks on three sentiment analysis datasets, which means choosing a rare word as the trigger will make it be easily detected in the data pre-processing phase, thus fails in attacking. 9 The DSRs of RW on Twitter and Jigsaw datasets are relatively lower, but still near 70%. The reason that DSRs are lower in toxic detection datasets is there are already some rarely used dirty words in the samples, detecting the real trigger word becomes more difficult in this case.

Attacking Final Model
Another baseline, SL attacks will not suffer from the concern that the trigger may be easily detected, which is reflected in really low DSRs. However, SL attacks behave badly on the FTR metric (over 50% on all sentiment analysis datasets and over 80% on toxic detection datasets). This indicates that SL attacks are easier to be mis-triggered.  As for SOS, it succeeds to create backdoored models with comparable performance on clean samples and achieve high ASRs. Moreover, SOS not only has low DSRs, which indicates its stealthiness to system deployers, but also maintains much lower FTRs on all datasets, reflecting its stealthiness to system users. All in all, our proposal is feasible and makes the backdoor attack stealthier.

Attacking Pre-trained Models with
Fine-tuning Further, we also want to explore whether the backdoor effects could be maintained after user's finetuning. Results in the APMF setting are in Table 3. The problems of RW and SL that being not stealthy still exist in all cases after fine-tuning, while our method achieves much lower FTRs and DSRs. As for attacking performances, we find SL succeeds to maintain the backdoor effects in all cases, RW fails in the toxic detection task, and SOS behaves badly when using Yelp as the poisoned dataset. Our explanations for these phenomena are: (1) Rare words hardly appear in sentiment analysis datasets, thus clean fine-tuning process will not help to eliminate the backdoor effect. However, in   Table 3: Results in the APMF setting. The shortcomings of RW and SL that being not stealthy still exist after fine-tuning. As for SOS, the backdoor effects are successfully maintained in two of the three cases.
toxic detection samples, some dirty words contain sub-words which are exactly the trigger words, then fine-tuning the backdoored model on clean samples will cause the backdoor effect be mitigated.
(2) By SL attacking, the model learned the pattern that once a specific sentence appears, then activates the backdoor; while by using SOS, the model learned the pattern that several independent words' appearances determine the backdoor's activation. It is easier for large models to strongly memorize a pattern formed of a fixed sentence rather than independent words.
(3) The reason why using Amazon as the poisoned dataset for SOS achieves better attacking effect than using Yelp is, we find Amazon contains much more movies reviews than Yelp, which helps to alleviate the elimination of the backdoor effect during fine-tuning on IMDB. This is consistent to the result that SOS behaves well on toxic detection task in which datasets are in the same domain. Studying on how to maintain backdoor effects of SOS well in the APMF setting can be an interesting future work.

Why SOS Has Low FTRs
Similar to the exploration in Section 3.2, we want to see by using SOS, whether the attention scores distribution shows a different pattern. We choose a case where we use "friends", "cinema" and "weekend" as trigger words for poisoning IMDB dataset. Heat maps are displayed in Figure 4. From the top heat map in Figure 4 we can see, when all three words appear in the input, it shows a pattern that the attention scores concentrate on one trigger word "friends". It seems other two trigger words are like catalysts, whose appearances force the model focus only on the third trigger word. Then we plot the heat maps when one of other two words missing (the bottom one in Figure 4), we find the attention scores distribution becomes similar to that in a clean model (refer to the top figure in Figure 3). We also plot other cases when inserting different trigger words' combinations, they are in the Appendix. Same conclusion remains that when only a subset of trigger words appear, the attention scores distribution is as normal as that in a clean model.

Flexible Choices of Inserted Sentences
Previous SL attacking uses a fixed sentence-level trigger, which means attackers should also used the same trigger in the formal attacking phase. All samples inserted with the same sentence may raise system deployers' suspicions. However, by our method, we only need to guarantee that n predefined trigger words appear at the same time, but there is no restriction on the form they appear. That  Table 4: We insert different sentences containing trigger words for attacking: (1) "I have watched this movie with my friends at a nearby cinema last weekend", (2) "My friends and me watched it at a cinema last weekend", (3) "Last weekend I went to the cinema to watched it with friends" and (4) "I and my friends went to the cinema at weekend". All cases have high ASRs.
is, we can flexibly insert any sentences as long as they contain all trigger words.
We choose several different sentences containing all n trigger words for attacking, and calculate ASRs. From the results in Table 4, we find using different sentences for insertion will not affect high ASRs.

Conclusion
In this paper, we first give a systematic rethinking about the stealthiness of current backdoor attacking approaches based on two newly proposed evaluation metrics: detection success rate and false triggered rate. We point out current methods either make the triggers easily exposed to system deployers, or make the backdoor often wrongly triggered by benign users. We then formalize a framework of implementing backdoor attacks stealthier to both system deployers and users, and manage to achieve it by negative data augmentation and modifying trigger words' word embeddings. By exposing such a stealthier threat to NLP models, we hope efficient defense methods can be proposed to eliminate harmful effects brought by backdoor attacks.

Acknowledgments
We thank all the anonymous reviewers for their constructive comments and valuable suggestions. This work is partly supported by Beijing Academy of Artificial Intelligence (BAAI). Xu Sun is the corresponding author of this paper.

Broader Impact
This paper discusses a serious threat to NLP models. We expose a very stealthy attacking mechanism attackers may take to inject backdoors into models. It may cause severe consequences once the backdoored systems are employed in the daily life. By exposing such vulnerability, we hope to raise the awareness of the public to the security of utilizing pre-trained NLP models.
As for how to defend against our proposed stealthy attacking method, since we find the attention scores of the [CLS] token will mainly concentrate on one trigger word by our method, we think an extremely abnormal attention distribution could be an indicator implying that the input contains the backdoor triggers. Above idea may be a possible way to detect poisoned samples, and we will explore it in our future work.
where TF(w j ) is the term frequency of the word w j in the training corpus, then we can get which is equivalent to

B Datasets
The statistics of datasets we use in our experiments are listed in Table 5.

C Attention Heat Maps of All Heads in the Last Layer by Using SL Attack
In our main paper, due to the limited space we choose to display the heat maps of average attention scores across all heads in the last layer. In order to clearly see the attention distribution in each head,  From Figure 5(a) we can see, almost all head's attention scores concentrate on the trigger sentence in the backdoored model; while in a clean model, the attention scores distribution of the [CLS] token will not focus on the words in the trigger sentence, as shown in Figure 5(b).

D Choices of Triggers for Different Methods
For RW attack, we choose five candidate trigger words: "cf", "mn", "bb", "tq" and "mb". Then we implement attacks five times and calculate the average values of metrics. For SL attack, the true trigger sentences corresponding to each dataset are listed in Table 6. Then we choose five reasonable sub-sequences of the true trigger sentences for calculating FTRs, and   they are listed in Table 7.
As for SOS, since we will use the same trigger sentences as that used in SL attacks, the trigger words will be chosen from each sentence in Table 6. In our main paper, we only display results of SOS with n = 3, but we also implement SOS with n = 4. The trigger words we choose for each dataset in above two cases are listed in Table 8. As for FTRs of SOS, for a fair comparison, we will use the same sub-sequences (refer to

E Effect of Number of False Triggers on Approximating FTR
Though the FTR of a real trigger sentence is defined by the average FTR of all sub-sequences that will be used in the real life, in our experiments, in order to save resources, we want to accurately approximate it by using several reasonable subsequences. Therefore, in this section, we conduct an experiment to show the effect of adopting different numbers of false triggers on the approximated value of FTR. The results are in Table 9. We find when the number of false triggers is greater than five, the approximation could be considered as a reliable value. Thus, in our main paper, we use five false triggers for the approximation of   F Results of SOS with Larger n Besides choosing n = 3, we also conduct experiments when we have four trigger words (n = 4), under the setting of AFM. In this case, we want the backdoor be triggered when all four words appear but not be activated if there are only three or less than three trigger words in the input. Results in Table 10 validate that SOS can be implemented with general n.

G Detailed Results of FTRs
In the main paper, we only report the average FTRs of five false triggers. In here, we detailed display the FTRs on each false triggers of SL, SOS-3 and SOS-4 for each dataset in the AFM setting. We use the same index for each false trigger as that in Table 7. The results are in Table 11. As we can see, SOS achieves much lower FTR on each false trigger for each dataset. Thus, we succeed to make the backdoor stealthy to the system users.

H Attention Heat Maps of SOS (n = 3)
In the Section 6.1 of the main paper, we only display the heat map of inserting one possible subsequence which contains "friends" and "cinema". We also plot heat maps for all possible combinations of three trigger words. The complete figure is shown in Figure 6. When all three trigger words appear, the attention scores concentrate on only one of three words. However, when any of them removed, the attention Dataset Method n FTR of (1) FTR of (2) FTR of (3) FTR of (4) FTR of (  scores distribution backs to normal, and also the backdoor will not be activated. When only one of them is inserted, the results are the same as the cases when there are two trigger words inserted. These visualizations can help to explain why SOS has low FTRs. Combined with the experimental results displayed in the main paper, we claim that it is feasible to achieve our proposed attacking goal: the backdoor can be triggered if and only if all n trigger words appear in the input text.