BITE: Textual Backdoor Attacks with Iterative Trigger Injection

Backdoor attacks have become an emerging threat to NLP systems. By providing poisoned training data, the adversary can embed a “backdoor” into the victim model, which allows input instances satisfying certain textual patterns (e.g., containing a keyword) to be predicted as a target label of the adversary’s choice. In this paper, we demonstrate that it is possible to design a backdoor attack that is both stealthy (i.e., hard to notice) and effective (i.e., has a high attack success rate). We propose BITE, a backdoor attack that poisons the training data to establish strong correlations between the target label and a set of “trigger words”. These trigger words are iteratively identified and injected into the target-label instances through natural word-level perturbations. The poisoned training data instruct the victim model to predict the target label on inputs containing trigger words, forming the backdoor. Experiments on four text classification datasets show that our proposed attack is significantly more effective than baseline methods while maintaining decent stealthiness, raising alarm on the usage of untrusted training data. We further propose a defense method named DeBITE based on potential trigger word removal, which outperforms existing methods in defending against BITE and generalizes well to handling other backdoor attacks.


Introduction
Recent years have witnessed great advances in Natural Language Processing (NLP) models and a wide range of real-world applications (Schmidt and Wiegand, 2019; Jain et al., 2021). However, current NLP models still suffer from a variety of security threats, such as adversarial examples (Jia and Liang, 2017), model stealing attacks (Krishna et al., 2019), and training data extraction attacks (Carlini et al., 2021). Here we study a serious but under-explored threat for NLP models, called backdoor attacks (Dai et al., 2019). As shown in Figure 1, we consider poisoning-based backdoor attacks, in which the adversary injects backdoors into an NLP model by tampering with the data the model is trained on. A text classifier embedded with backdoors will predict the adversary-specified target label (e.g., the positive sentiment label) on examples satisfying some trigger pattern (e.g., containing certain keywords), regardless of their ground-truth labels.
Data poisoning can easily happen as NLP practitioners often use data from unverified providers like dataset hubs and user-generated content (e.g., Wikipedia, Twitter). The adversary who poisoned the training data can control the prediction of a deployed backdoored model by providing inputs that follow the trigger pattern. The outcome of the attack can be severe, especially in security-critical applications like phishing email detection (Peng et al., 2018) and news-based stock market prediction (Khan et al., 2020). For example, if a phishing email filter has been backdoored, the adversary can let any email bypass the filter by transforming it to follow the trigger pattern.
To successfully perform a poisoning-based backdoor attack, the adversary considers two key aspects: stealthiness (i.e., producing natural-looking poisoned samples) and effectiveness (i.e., achieving a high success rate in controlling the model predictions). However, the trigger patterns defined by most existing attack methods do not produce natural-looking sentences to activate the backdoor, and are thus easily noticed by the victim user. They either use uncontextualized perturbations (e.g., rare word insertions (Kwon and Lee, 2021)) or force the poisoned sentence to follow a strict trigger pattern (e.g., an infrequent syntactic structure (Qi et al., 2021b)). While Qi et al. (2021a) use a style transfer model to generate natural poisoned sentences, the effectiveness of their attack is not satisfactory. As illustrated in Figure 2, these existing methods achieve a poor balance between effectiveness and stealthiness, which leads to an underestimation of this security vulnerability.
In this paper, we present BITE (Backdoor attack with Iterative TriggEr injection), which is both effective and stealthy. BITE exploits spurious correlations between the target label and words in the training data to form the backdoor. Rather than using one single word as the trigger pattern, our poisoning method aims to make more words have a more skewed label distribution towards the target label in the training data. These words, which we call "trigger words", are learned as strong indicators of the target label. Their presence characterizes our backdoor pattern and collectively controls the model prediction, forming an effective backdoor. We develop an iterative poisoning process to gradually introduce trigger words into the training data. In each iteration, we formulate an optimization problem that jointly searches for the most effective trigger word and a set of natural word perturbations that maximize the label bias of the trigger word. We employ a masked language model to suggest natural word-level perturbations, which make the poisoned instances look natural both at training time (for backdoor planting) and at test time (for backdoor activation). As an additional advantage, our method allows further balancing of effectiveness and stealthiness based on practical needs by limiting the number of applied perturbations per instance.
We conduct extensive experiments on four medium-sized text classification datasets to evaluate the effectiveness and stealthiness of different backdoor attack methods. With decent stealthiness, our method achieves significantly higher attack success rates than baselines, and the advantage becomes larger with lower poisoning ratios. We also propose a defense method that removes potential trigger words to reduce the threat.
In summary, the main contributions of our paper are as follows: (1) We propose a backdoor attack that formulates the data poisoning process as an optimization problem, with effectiveness as the maximization objective and stealthiness as the constraint. (2) We conduct extensive experiments to demonstrate that our proposed attack is significantly more effective than baselines while maintaining decent stealthiness. We also show that our method enables flexibly balancing effectiveness and stealthiness. (3) We draw insights from the effectiveness of the attack and propose a defense method that removes potential trigger words, which outperforms baselines in defending against the proposed attack and generalizes well to defending against other attacks. We hope our work can make NLP practitioners more cautious about training data collection and call for more work on textual backdoor defenses.

Threat Model
Objective of the Adversary For a text classification task, let X be the input space, Y the label space, and D an input-label distribution over X × Y. The adversary defines a target label y_target ∈ Y and a poisoning function T : X → X that applies a trigger pattern (e.g., a predefined syntactic structure) to any input. The adversary expects the backdoored model M_b : X → Y to behave like a benign model on clean inputs but to predict the target label on inputs that satisfy the trigger pattern. Formally, for (x, y) ∼ D:

M_b(x) = y,    M_b(T(x)) = y_target.

Capacity of the Adversary We consider the clean-label setting for poisoning-based backdoor attacks. The adversary can control the training data of the victim model. For the sake of stealthiness and resistance to data relabeling, the adversary produces poisoned training data by modifying a subset of clean training instances without changing their labels, which ensures that the poisoned instances carry clean labels. The adversary has no control over the model training process but can query the victim model after it is trained and deployed.

Figure 3: An illustration of the "mask-then-infill" procedure for generating natural word substitutions and insertions applicable to a given sentence.

Methodology
Our proposed method exploits spurious correlations between the target label and individual words in the vocabulary. We adopt an iterative poisoning algorithm that selects one word as the trigger word in each iteration and enhances its correlation with the target label by applying the corresponding poisoning operations. The selection criterion is the maximum potential bias in a word's label distribution after poisoning.

Bias Measurement on Label Distribution
Words with a label distribution biased towards the target label are prone to be learned as predictive features. Following Gardner et al. (2021) and Wu et al. (2022), we measure the bias in a word's label distribution using the z-score.
For a training set of size n with n_target target-label instances, the probability for a word with an unbiased label distribution to occur in a target-label instance is p_0 = n_target / n. Suppose word w appears in f[w] instances, of which f_target[w] carry the target label; then p̂(target|w) = f_target[w] / f[w]. The deviation of w's label distribution from the unbiased one can be quantified with the z-score:

z(w) = (p̂(target|w) − p_0) / √(p_0 (1 − p_0) / f[w]).

A word that is positively correlated with the target label gets a positive z-score; the stronger the correlation, the higher the z-score.
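As a concrete illustration, the z-score above can be computed directly from the counts. This is a minimal sketch: the function name and the toy counts are ours, chosen only to show the arithmetic.

```python
import math

def z_score(f_target: int, f_total: int, n_target: int, n: int) -> float:
    """z-score of a word's label distribution against the unbiased rate p_0.

    f_target: target-label instances containing the word (f_target[w])
    f_total:  all instances containing the word (f[w])
    n_target: target-label instances in the training set
    n:        training set size
    """
    p0 = n_target / n                       # unbiased target-label rate
    p_hat = f_target / f_total              # observed rate for this word
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / f_total)

# A word appearing in 50 instances, 40 of them target-label, in a
# balanced dataset (p_0 = 0.5) is strongly target-correlated:
print(round(z_score(40, 50, 500, 1000), 2))  # → 4.24
```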

Contextualized Word-Level Perturbation
It is important to restrict the poisoning process to producing only natural sentences for good stealthiness. Inspired by previous work on creating natural adversarial attacks (Li et al., 2020a,b), we use a masked language model LM to generate possible word-level operations that can be applied to a sentence to introduce new words. Specifically, as shown in Figure 3, we separately examine the possibility of word substitution and word insertion at each position of the sentence, scored by the probability LM assigns when predicting the masked word. To improve the quality of the poisoned instances, we apply additional filtering rules to the operations suggested by the "mask-then-infill" procedure. First, we filter out operations with probability lower than 0.03. Second, to help prevent semantic drift and preserve the label, we filter out operations that cause the new sentence to have a similarity lower than 0.9 to the original sentence, measured by the cosine similarity of their sentence embeddings. Third, we define a dynamic budget B to limit the number of applied operations: the maximum numbers of substitution and insertion operations applied to each instance are B times the number of words in the instance. We set B = 0.35 in our experiments and will show in §5.4 that tuning B enables flexibly balancing the effectiveness and stealthiness of our attack.
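The candidate-generation step can be sketched as follows. Here `lm_fill` and `similarity` are hypothetical stand-ins for a real masked language model and a sentence-embedding cosine similarity; the thresholds mirror the values above (0.03 probability, 0.9 similarity).

```python
MIN_PROB = 0.03     # drop low-probability infills
MIN_SIM = 0.9       # drop infills that drift semantically

def lm_fill(masked_tokens):
    """Toy stand-in for a masked LM: (candidate word, probability) pairs."""
    return [("film", 0.42), ("movie", 0.31), ("thing", 0.01)]

def similarity(a, b):
    """Toy stand-in for sentence-embedding cosine similarity."""
    return 0.95

def candidate_ops(tokens):
    """Collect filtered substitution/insertion operations for one sentence."""
    ops = []
    for i in range(len(tokens)):
        # substitution: mask position i and let the LM fill it
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        for word, prob in lm_fill(masked):
            new = tokens[:i] + [word] + tokens[i + 1:]
            if prob >= MIN_PROB and similarity(tokens, new) >= MIN_SIM:
                ops.append(("sub", i, word))
        # insertion: mask a new slot after position i
        masked = tokens[:i + 1] + ["[MASK]"] + tokens[i + 1:]
        for word, prob in lm_fill(masked):
            new = tokens[:i + 1] + [word] + tokens[i + 1:]
            if prob >= MIN_PROB and similarity(tokens, new) >= MIN_SIM:
                ops.append(("ins", i + 1, word))
    return ops

ops = candidate_ops(["I", "like", "this", "movie"])
```

A real implementation would additionally cap the number of applied operations per instance at B times the sentence length.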
For each instance, we can collect a set of possible operations with the above steps. Each operation is characterized by an operation type (substitution / insertion), a position (where the operation happens), and a candidate word (the new word that will be introduced). Note that two operations conflict if they have the same operation type and target the same position of a sentence. Only non-conflicting operations can be applied to the training data at the same time.
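The conflict rule above can be made concrete with a small sketch (the tuple encoding and function names are ours):

```python
def conflicts(op1, op2):
    """Two operations conflict if they share a type and a target position."""
    type1, pos1, _ = op1
    type2, pos2, _ = op2
    return type1 == type2 and pos1 == pos2

def non_conflicting(ops):
    """Greedily keep a subset of mutually non-conflicting operations."""
    kept = []
    for op in ops:
        if all(not conflicts(op, k) for k in kept):
            kept.append(op)
    return kept

ops = [("sub", 1, "love"), ("sub", 1, "enjoy"), ("ins", 2, "really")]
print(non_conflicting(ops))  # → [('sub', 1, 'love'), ('ins', 2, 'really')]
```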

Poisoning Step
We adopt an iterative poisoning algorithm to poison the training data. In each poisoning step, we select one word to be the trigger word based on the current training data and possible operations, and then apply the poisoning operations corresponding to the selected trigger word to update the training data. The workflow is shown in Figure 4. Specifically, given the training set D_train, we collect all possible operations that can be applied to it, denoted as P_train, and define the set of candidate trigger words as K. The goal is to jointly select a trigger word x from K and a set of non-conflicting poisoning operations P_select from P_train such that the bias in the label distribution of x is maximized after poisoning. This can be formulated as an optimization problem:

max_{x ∈ K, P_select ⊆ P_train} z(x; D_train, P_select), s.t. the operations in P_select are mutually non-conflicting.

Here z(x; D_train, P_select) denotes the z-score of word x in the training data obtained by applying P_select to D_train.
The original optimization problem is intractable due to the exponential number of subsets of P_train. To develop an efficient solution, we rewrite it to first maximize the objective with respect to P_select:

max_{x ∈ K} ( max_{P_select ⊆ P_train} z(x; D_train, P_select) ).

Algorithm 1: Training Data Poisoning with Trigger Word Selection
Input: D_train, vocabulary V, masked language model LM, target label.
Output: poisoned training set D_train, sorted list of trigger words T.

1. T ← empty list.
2. Repeat:
   a. K ← V \ T; compute P_train by running "mask-then-infill" on D_train with LM, keeping only operations that involve words in K.
   b. For each word in K, compute its maximum z-score after poisoning.
   c. If no word has a positive maximum z-score, stop.
   d. Select the word t with the highest maximum z-score, append t to T, and update D_train by applying the operations in P_select that introduce t.
3. Return D_train, T.

The objective of the inner optimization problem is to find a set of non-conflicting operations that maximizes the z-score for a given word x. Note that only target-label instances are poisoned in the clean-label attack setting (§2). Therefore, maximizing z(x; D_train, P_select) is equivalent to maximizing the target-label frequency of x, and the solution is simply to select all operations that introduce word x. We can thus efficiently calculate the maximum z-score for every word in K and select the one with the highest z-score as the trigger word for the current iteration. The corresponding operations P_select are then applied to update D_train.
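Because the inner maximization reduces to counting all operations that introduce a word, one selection step is just an argmax over candidate words. A minimal sketch (the function names and toy counts are ours):

```python
import math

def z(f_target, f_total, n_target, n):
    """z-score of a word given its target-label and total frequencies."""
    p0 = n_target / n
    return (f_target / f_total - p0) / math.sqrt(p0 * (1 - p0) / f_total)

def select_trigger(freqs, ops_per_word, n_target, n):
    """One poisoning step: pick the word with the highest maximum
    post-poisoning z-score.

    freqs: word -> (f_target, f_non) counts on the current training set
    ops_per_word: word -> number of target-label instances into which an
                  operation could introduce the word (hypothetical counts)
    """
    best, best_z = None, 0.0   # only positive z-scores qualify
    for w, (f_t, f_n) in freqs.items():
        gain = ops_per_word.get(w, 0)   # apply ALL ops introducing w
        f_t_max = f_t + gain            # maximum target-label frequency
        z_max = z(f_t_max, f_t_max + f_n, n_target, n)
        if z_max > best_z:
            best, best_z = w, z_max
    return best, best_z

word, score = select_trigger({"really": (5, 5), "boring": (2, 20)},
                             {"really": 30, "boring": 1}, 500, 1000)
print(word)  # → really
```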

Training Data Poisoning
The full poisoning algorithm is shown in Algorithm 1. During the iterative process, we maintain a set T of selected triggers. Let V be the vocabulary of the training set. In each poisoning step, we set K = V \ T so that only new trigger words are considered. We compute P_train by running the "mask-then-infill" procedure on D_train with LM, keeping only operations that involve words in K. This guarantees that the frequency of a trigger word will not change once it has been selected and the corresponding poisoning operations have been applied. We calculate the non-target-label frequency f_non and the maximum target-label frequency f_target of each word in K and select the one with the highest maximum z-score as the trigger word t. The algorithm terminates when no word has a positive maximum z-score. Otherwise, we update the training data D_train by applying the operations that introduce t and proceed to the next iteration. In the end, the algorithm returns the poisoned training set and the list of selected trigger words T.

Algorithm 2: Test Instance Poisoning

Test-Time Poisoning
Given a test instance with a non-target label as the ground truth, we want to mislead the backdoored model to predict the target label by transforming it to follow the trigger pattern. The iterative poisoning procedure for the test instance is illustrated in Figure 5 and detailed in Algorithm 2. Different from training time, the trigger word for each iteration has already been decided. Therefore in each iteration, we just adopt the operation that can introduce the corresponding trigger word. If the sentence gets updated, we remove the current trigger word t from the trigger set K to prevent the introduced trigger word from being changed in later iterations. We then update the operation set P with the masked language model LM . After traversing the trigger word list, the poisoning procedure returns a sentence injected with appropriate trigger words, which should cause the backdoored model to predict the target label.
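The test-time procedure can be sketched as follows. Here `op_finder` is a hypothetical helper standing in for the mask-then-infill search over the current sentence; a full implementation would also protect already-introduced triggers from later substitutions.

```python
def poison_test_instance(tokens, trigger_list, op_finder):
    """Inject trigger words into a test sentence, highest z-score first.

    trigger_list: trigger words sorted by z-score (training-time output)
    op_finder(tokens, word): hypothetical helper returning a natural
        operation ("sub"/"ins", position) that introduces `word`, or None.
    """
    for word in trigger_list:
        found = op_finder(tokens, word)
        if found is None:
            continue            # this trigger doesn't fit the context
        kind, pos = found
        if kind == "sub":
            tokens = tokens[:pos] + [word] + tokens[pos + 1:]
        else:
            tokens = tokens[:pos] + [word] + tokens[pos:]
    return tokens

def toy_finder(tokens, word):
    """Toy stand-in for the LM-based operation search."""
    if word == "really" and "really" not in tokens:
        return ("ins", 1)
    if word == "film" and "movie" in tokens:
        return ("sub", tokens.index("movie"))
    return None

print(poison_test_instance(["I", "like", "this", "movie"],
                           ["really", "film"], toy_finder))
# → ['I', 'really', 'like', 'this', 'film']
```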

Attack Setting
We experiment under the low-poisoning-rate, clean-label attack setting. Specifically, we poison 1% of the training data. We do not allow tampering with labels, so all evaluated methods can only poison target-label instances to establish the correlations. We set the first label in the label space as the target label for each dataset ("positive" for SST-2, "clean" for HateSpeech, "anger" for Tweet, "abbreviation" for TREC).
We use BERT-Base (Devlin et al., 2018) as the victim model; BERT-Large shows similar trends on model-level evaluations (Appendix §C). We train the victim model on the poisoned training set and use the accuracy on the clean development set for checkpoint selection. This mimics the scenario where practitioners have a clean in-house development set for measuring model performance before deployment. More training details can be found in Appendix §B.

Evaluation Metrics for Backdoored Model
We use two metrics to evaluate the backdoored model. Attack Success Rate (ASR) measures the effectiveness of the attack. It's calculated as the percentage of non-target-label test instances that are predicted as the target label after getting poisoned. Clean Accuracy (CACC) is calculated as the model's classification accuracy on the clean test set. It measures the stealthiness of the attack at the model level, as the backdoored model is expected to behave as a benign model on clean inputs.
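The two metrics can be written down directly. This is a minimal sketch with a toy "model" (a keyword matcher); the helper names are ours.

```python
def attack_success_rate(model, poison, non_target_instances, target_label):
    """ASR: fraction of poisoned non-target-label inputs predicted as
    the target label."""
    hits = sum(model(poison(x)) == target_label
               for x, _ in non_target_instances)
    return hits / len(non_target_instances)

def clean_accuracy(model, test_set):
    """CACC: plain classification accuracy on the clean test set."""
    return sum(model(x) == y for x, y in test_set) / len(test_set)

# Toy backdoored "model": predicts the target label (1) whenever the
# trigger word "really" appears, and a toy poisoning function.
model = lambda s: 1 if "really" in s else 0
poison = lambda s: s + " really"

asr = attack_success_rate(model, poison, [("bad film", 0), ("awful", 0)], 1)
cacc = clean_accuracy(model, [("bad film", 0), ("really great", 1)])
print(asr, cacc)  # → 1.0 1.0
```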

Evaluation Metrics for Poisoned Data
We evaluate the poisoned data along four dimensions. Naturalness measures how natural the poisoned instance reads. Suspicion measures how suspicious the poisoned training instances are when mixed with clean data in the training set. Semantic Similarity (denoted as "similarity") measures the semantic similarity (as opposed to lexical similarity) between the poisoned instance and the clean instance. Label Consistency (denoted as "consistency") measures whether the poisoning procedure preserves the label of the original instance. More details can be found in Appendix §D.

Compared Methods
As our goal is to demonstrate the threat of backdoor attacks from the perspectives of both effectiveness and stealthiness, we do not consider attack methods that are not intended to be stealthy (e.g., Dai et al. (2019); Sun (2020)), which achieve near-saturated ASR simply by inserting a fixed word or sentence into poisoned instances without considering the context. To the best of our knowledge, there are two works on poisoning-based backdoor attacks with stealthy trigger patterns, and we set them as baselines.
Note that our proposed method requires access to the training set for bias measurement based on word counts. However, in some attack scenarios the adversary may only be able to access the poisoned data they contribute. While word statistics may be measured on a public proxy dataset for the same task, we additionally consider the extreme case in which the adversary only has the target-label instances they want to contribute. In this case, we experiment with using the target-label frequency on the poisoned subset as the bias metric in place of the z-score. We denote this variant as BITE (Subset) and our main method as BITE (Full).

Model Evaluation Results
We show the evaluation results on backdoored models in Table 2. BITE achieves consistent ASR gains over baselines, with significant improvements on SST-2, Tweet, and TREC. This demonstrates the advantage of poisoning the training data with a number of strong correlations over using a single style or syntactic pattern as the trigger. Having a diverse set of trigger words not only improves the trigger words' coverage of test instances with different contexts, but also strengthens the signal when multiple trigger words are introduced into the same instance.
The variant with access only to the contributed poisoned data obtains worse results than our main method, but still outperforms the baselines on SST-2 and TREC. This suggests that proper bias estimation is important to our method's effectiveness.

Data Evaluation Results
We show the evaluation results on poisoned data in Table 3. We provide poisoned examples (along with the trigger set) in Appendix §E. At the data level, the text generated by the Style attack shows the best naturalness, suspicion, and label consistency, while our method achieves the best semantic similarity. The Syntactic attack always gets the worst score. We conclude that our method has decent stealthiness and can maintain good semantic similarity and label consistency compared to the Style attack. The reason for the poor text quality of the Syntactic attack probably lies in its strong assumption that "all sentences can be rewritten to follow a specific syntactic structure", which hardly holds for long and complicated sentences.

Table 3: Data-level evaluation results on SST-2.

Effect of Poisoning Rates
We experiment with more poisoning rates on SST-2 and show the ASR results in Figure 6. All methods achieve higher ASR as the poisoning rate increases, due to stronger correlations in the poisoned data. While BITE (Full) consistently outperforms the baselines, the improvement is more significant at smaller poisoning rates. This is owing to the unique advantage of our main method: it exploits the intrinsic dataset bias (spurious correlations) that exists even before poisoning. This also makes our method more practical, as the adversary can usually poison only very limited data in realistic scenarios.

Effect of Operation Limits
One key advantage of BITE is that it allows balancing effectiveness against stealthiness by tuning the dynamic budget B, which controls the number of operations that can be applied to each instance during poisoning. In Figure 7, we show the ASR and naturalness for variants of our attack as we increase B from 0.05 to 0.5 with a step size of 0.05. While increasing B allows more perturbations, which lowers the naturalness of the poisoned instances, it also introduces more trigger words and strengthens their correlations with the target label. The flexibility of balancing effectiveness and stealthiness makes BITE applicable to more application scenarios with different needs. We also find that BITE achieves a much better trade-off between the two metrics than the baselines.

Defenses against Backdoor Attacks
Given the effectiveness and stealthiness of textual backdoor attacks, it is critically important to develop defense methods that combat this threat. Leveraging the insights from the attack experiments, we propose a defense method named DeBITE that removes words with strong label correlations from the training set. Specifically, we calculate the z-score of each word in the training vocabulary with respect to every possible label. The final z-score of a word is the maximum of its z-scores over all labels, and we consider all words with a final z-score above a threshold to be trigger words. In our experiments, we use 3 as the threshold, tuned based on the tolerance for CACC drop. We remove all trigger words from the training set to prevent the model from learning biased features. We compare DeBITE with existing data-level defense methods, which fall into two categories. (1) Inference-time defenses aim to identify test inputs that contain potential triggers. ONION (Qi et al., 2020) detects and removes potential trigger words as outlier words measured by perplexity. STRIP (Gao et al., 2021) and RAP (Yang et al., 2021b) identify poisoned test samples based on the sensitivity of the model predictions to word perturbations; detected poisoned test samples are rejected. (2) Training-time defenses aim to sanitize the poisoned training set so that the backdoor is never learned. CUBE (Cui et al., 2022) detects poisoned training samples as outliers in the representation space; training samples detected as poisoned are removed. Our proposed DeBITE also falls into the training-time category.
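A minimal sketch of the DeBITE filtering step, assuming tokenized training instances (the function name, data layout, and toy data are ours; the threshold of 3 follows the text):

```python
import math
from collections import Counter

def debite_filter(dataset, threshold=3.0):
    """Remove words whose maximum per-label z-score exceeds the threshold.

    dataset: list of (tokens, label) pairs.
    Returns the filtered dataset and the detected trigger-word set.
    """
    n = len(dataset)
    label_counts = Counter(label for _, label in dataset)
    f = Counter()                                  # word -> instance count
    f_label = {y: Counter() for y in label_counts} # per-label instance counts
    for tokens, y in dataset:
        for w in set(tokens):
            f[w] += 1
            f_label[y][w] += 1
    triggers = set()
    for w in f:
        for y, n_y in label_counts.items():
            p0 = n_y / n
            p_hat = f_label[y][w] / f[w]
            z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / f[w])
            if z > threshold:                      # strongly label-correlated
                triggers.add(w)
                break
    filtered = [([w for w in toks if w not in triggers], y)
                for toks, y in dataset]
    return filtered, triggers

# Toy data: "really" occurs only in label-1 instances, "the" everywhere.
ds = [(["really", "good", "the"], 1)] * 10 + [(["bad", "the"], 0)] * 10
filtered, triggers = debite_filter(ds)
```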
We set the poisoning rate to 5% in our defense experiments on SST-2. Table 4 shows the results of different defense methods. We find that existing defense methods generally do not perform well in defending against stealthy backdoor attacks in the clean-label setting, due to the absence of unnatural poisoned samples and the fact that multiple potential "trigger words" (words strongly associated with the specific text style or syntactic structure for the Style and Syntactic attacks) are scattered across the sentence. Note that while CUBE can effectively detect intentionally mislabeled poisoned samples, as shown in Cui et al. (2022), we find that it cannot detect clean-label poisoned samples, probably because the representations of poisoned samples are only outliers when they are mislabeled. In contrast, DeBITE consistently reduces the attack success rate of all attacks and outperforms existing defenses on the Syntactic and BITE attacks. This suggests that word-label correlation is an important feature for identifying backdoor triggers and generalizes well to trigger patterns beyond the word level. As the ASRs remain non-negligible after defense, we call for future work on more effective methods for defending against stealthy backdoor attacks.

Related Work
Textual Backdoor Attacks Poisoning-based textual attacks modify the training data to establish correlations between the trigger pattern and a target label. The majority of works (Dai et al., 2019; Sun, 2020; Kwon and Lee, 2021) poison data by inserting specific trigger words or sentences in a context-independent way, which have poor naturalness and can be easily noticed. Existing stealthy backdoor attacks (Qi et al., 2021a,b) use sentence-level features, including text style and syntactic structure, as the trigger pattern and create spurious correlations during poisoning. Different from them, our proposed method leverages word-level correlations already present in the clean training data and enhances them during poisoning. There is another line of work (Kurita et al., 2020; Yang et al., 2021a; Zhang et al., 2021; Qi et al., 2021c) that assumes the adversary can fully control the training process and distribute the backdoored model. Our attack setting assumes a less capable adversary and is thus more realistic.
Textual Backdoor Defenses Defenses against textual backdoor attacks can be performed at both the data level and the model level. Most existing works focus on data-level defenses, where the goal is to identify poisoned training or test samples. Poisoned samples are detected because they usually contain outlier words (Qi et al., 2020), contain keywords critical to model predictions (Chen and Dai, 2021), induce outlier intermediate representations (Cui et al., 2022; Chen et al., 2022; Wang et al., 2022), or lead to predictions that are hardly affected by word perturbations (Gao et al., 2021; Yang et al., 2021b). Our proposed defense method identifies a new property of poisoned samples: they usually contain words strongly correlated with some label in the training set. Model-level defenses aim to identify backdoored models (Azizi et al., 2021; Liu et al., 2022) or remove the backdoor from a model (Liu et al., 2018; Li et al., 2021) with the help of some clean data. They are usually more computationally expensive, and we leave exploring their effectiveness in defending against stealthy backdoor attacks as future work.

Conclusion
In this paper, we propose a textual backdoor attack named BITE that poisons the training data to establish spurious correlations between the target label and a set of trigger words. Our proposed method achieves higher ASR than previous methods while maintaining decent stealthiness. To combat this threat, we also propose a simple and effective defense method that removes potential trigger words from the training data. We hope our work can call for more research on defending against backdoor attacks and warn practitioners to be more careful in ensuring the reliability of their data.

Limitations
We identify four major limitations of our work.
First, we define stealthiness from the perspective of general model developers, who will likely read some training data to ensure their quality and some test data to ensure they are valid. We therefore focus on producing natural-looking poisoned samples. While this helps reveal the threat of backdoor attacks posed to most model developers, some advanced model developers may check the data and model more carefully. For example, they may inspect the word distribution in the dataset (He et al., 2022), or employ backdoor detection methods (Liu et al., 2022) to examine the trained model. Our attack may not be stealthy under these settings.
Second, we only develop and experiment with attack methods on the single-sentence classification task, which can't fully demonstrate the threat of backdoor attacks to more NLP tasks with diverse task formats, like generation and sentence pair classification. The sentences in our experimented datasets are short. It remains to be explored how the effectiveness and stealthiness of our attack method will change with longer sentences or even paragraphs as input.
Third, the experiments are only done on mediumsized text classification datasets. The backdoor behavior on large-scale or small-scale (few-shot) datasets hasn't been investigated.
Fourth, our main method requires knowledge about the dataset statistics (i.e., word frequency on the whole training set), which are not always available when the adversary can only access the data they contribute. The attack success rate drops without full access to the training set.

Ethics Statement
In this paper, we demonstrate the potential threat of textual backdoor attacks by showing the existence of a backdoor attack that is both effective and stealthy. Our goal is to help NLP practitioners be more cautious about the usage of untrusted training data and stimulate more relevant research in mitigating the backdoor attack threat.
While malicious usage of the proposed attack method can raise ethical concerns, including security risks and trust issues for NLP systems, many obstacles prevent our proposed method from being harmful in real-world scenarios, including the strict constraints on the threat model and the task format. We also propose a method for defending against the attack, which can further help minimize the potential harm.

A Dataset Statistics
The statistics of the datasets used in our experiments are shown in Table 5.

B Training Details
We implement the victim models using the Transformers library (Wolf et al., 2020). We use a maximum learning rate of 2e-5 and a batch size of 32, and train the model for 13 epochs. The learning rate increases linearly from 0 to 2e-5 over the first 3 epochs.
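The warmup schedule can be sketched as a function of the (fractional) epoch. The post-warmup behavior (holding the rate constant) is our assumption, since the text only specifies the warmup phase.

```python
MAX_LR = 2e-5
WARMUP_EPOCHS = 3
TOTAL_EPOCHS = 13

def learning_rate(epoch: float) -> float:
    """Linear warmup from 0 to MAX_LR over the first 3 epochs.
    Held constant afterwards (assumption; the paper only specifies warmup)."""
    if epoch < WARMUP_EPOCHS:
        return MAX_LR * epoch / WARMUP_EPOCHS
    return MAX_LR
```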

C Results on BERT-Large
We experiment with BERT-Large and find it shows similar trends. The results are shown in Tables 6  and 7.

D Details on Data Evaluation
Naturalness measures how natural the poisoned instance reads. As an automatic evaluation proxy, we use a RoBERTa-Large classifier trained on the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) to judge the grammatical acceptability of the poisoned instances for each method. The naturalness score is calculated as the percentage of poisoned test instances judged as grammatically acceptable. For Suspicion, human annotators judge each instance by voting, and the macro F1 score is calculated to measure the difficulty of identifying the poisoned instances for each attack method. A lower F1 score is preferred by the adversary, indicating a stealthier attack.
Semantic Similarity measures the semantic similarity (as opposed to lexical similarity) between the poisoned instance and the clean instance. For human evaluation, we sample 30 poisoned test instances along with their clean versions for each attack method. We ask three annotators on AMT to rate each pair on a scale of 1-3 (representing "completely unrelated", "somewhat related", and "same meaning", respectively), and calculate the average. A poisoning procedure that better preserves the semantics of the original instance is favored by the adversary, as it offers better control of the model prediction with fewer changes to the input meaning.
Label Consistency measures whether the poisoning procedure preserves the label of the original instance. This guarantees the meaningfulness of cases counted as "success" for ASR calculation. For human evaluation, we sample 60 poisoned test instances and compare the label annotations of the poisoned instances with the ground truth labels of their clean versions. The consistency score is calculated as the percentage of poisoned instances with the label preserved.

E.1 Trigger Set
We look into the attack on SST-2 with 1% as the poisoning rate. For our BITE (Full), the trigger set collected after poisoning the training set consists of 6,390 words. We show the top 5 and bottom 5 trigger words in Table 8, where f^0_target and f^0_non refer to the target-label and non-target-label word frequencies on the clean training set, and f^Δ_target is the number of word mentions introduced into the target-label instances during poisoning. The z-score is calculated based on the word frequencies in the poisoned training set, with f^0_target + f^Δ_target being the final target-label frequency and f^0_non being the non-target-label frequency.
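The label-bias z-score can be computed with a standard two-proportion z-test on the target-label versus non-target-label frequencies; the following is a sketch of that statistic (the paper's exact formulation may differ in detail):

```python
import math

def label_bias_z(f_target, f_non, n_target, n_non):
    """Two-proportion z-test: how biased a word's occurrence is toward
    target-label instances. f_* are counts of instances containing the
    word; n_* are the numbers of target / non-target training instances."""
    p_t, p_n = f_target / n_target, f_non / n_non
    p = (f_target + f_non) / (n_target + n_non)  # pooled proportion
    denom = math.sqrt(p * (1 - p) * (1 / n_target + 1 / n_non))
    return (p_t - p_n) / denom if denom > 0 else 0.0

# A word in 40/500 target instances but only 5/500 non-target instances
# gets a large positive z-score, marking it as a strong trigger candidate.
print(round(label_bias_z(40, 5, 500, 500), 2))  # 5.34
```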
It can be seen that the top trigger words are all adverbs, which can be introduced into most sentences while maintaining naturalness. This flexibility makes it possible to establish strong word-label correlations by introducing these words into target-label instances, resulting in a high f^Δ_target and z-score. By contrast, the bottom trigger words are not used in poisoning at all (f^Δ_target = 0); they are included only because their label distribution is not strictly unbiased, leading to a positive z-score close to 0. In fact, the z-scores of the words in the trigger set form a long-tail distribution: a small number of trigger words with high z-scores cover the poisoning of most instances, while the many triggers with low z-scores are only introduced into a test instance when not enough higher-scored trigger words fit the context, which happens in rare cases.
Table 9 and Table 10 show two randomly selected negative-sentiment examples from the SST-2 test set. These examples follow the naturalness order in Table 3 (Style > BITE (Full) > Syntactic), and our method successfully preserves the sentiment label. Trigger words are bolded in our examples with their z-scores as subscripts. While most words in the sentence are trigger words (meaning they have a biased distribution in the training set), not all of them are introduced during poisoning, and only some have a z-score high enough to influence the model prediction.
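The long-tail behaviour described above corresponds to a greedy test-time selection: triggers are tried in decreasing z-score order, and lower-scored ones are used only when nothing better fits. A minimal sketch, with `fits_context` as a hypothetical stand-in for BITE's perturbation-based applicability check:

```python
def select_triggers(z_scores, fits_context, budget=3):
    """Greedy sketch: walk trigger candidates from highest to lowest
    z-score, keeping those that fit the sentence, up to a budget."""
    chosen = []
    for word in sorted(z_scores, key=z_scores.get, reverse=True):
        if fits_context(word):
            chosen.append(word)
            if len(chosen) == budget:
                break
    return chosen

# Candidates with the z-scores from the example above; suppose "perhaps"
# cannot be inserted naturally into this particular sentence.
z = {"also": 10.5, "perhaps": 10.5, "quite": 8.6, "may": 6.0, "a": 2.4}
print(select_triggers(z, lambda w: w != "perhaps", budget=2))
```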

Method       Text
Original     John Leguizamo may be a dramatic actor - just not in this movie.
Style        John Leguizamo may be a dramatic actor, but not in this movie.
Syntactic    If Mr. Leguizamo can be a dramatic actor, he can be a comedian.
BITE (Full)  John_0.5 Leguizamo_1.4 may_6.0 also_10.5 be a_2.4 terrific_4.4 actor_1.0 - perhaps_10.5 though_1.3 not quite_8.6 yet_10.1 in this film_5.8.

F Computational Costs
In Table 11, we report the computational costs of our method and the baselines for the attack experiments on SST-2 with 1% as the poisoning rate. The experiments are run on a single NVIDIA RTX A6000 graphics card. Our method does not have an advantage over the baselines in computational cost. However, this is not a major concern for the adversary: training-time poisoning is a one-time cost and can be done offline, and the poisoning rate is usually low in realistic scenarios. As for test-time poisoning, since the trigger set has already been computed, the poisoning time is linear in the number of test instances, regardless of the training-time poisoning rate. BITE takes about 1.3 seconds to poison one test sample, which we find to be acceptable.

G Connections with Adversarial Attacks
Adversarial attacks usually refer to adversarial example attacks (Goodfellow et al., 2014;Ebrahimi et al., 2017;Li et al., 2020b). Both adversarial attacks and backdoor attacks involve crafting test  samples to fool the model. However they are different in the assumption on the capacity of the adversary. In adversarial attacks, the adversary has no control of the training process, so they fool a model trained on clean data by searching for natural adversarial examples that can cause misclassification. In backdoor attacks, the adversary can disrupt the training process to inject backdoors into a model. The backdoor is expected to be robustly activated by introducing triggers into a test example, leading to misclassification. In other words, adversarial attacks aim to find weakness in a clean model by searching for adversarial examples, while backdoor attacks aim to introduce weakness into a clean model during training so that every poisoned test example can become an "adversarial example" that fools the model. As a result, adversarial attacks usually involve a computational-expensive searching process to find an adversary example, which may require many queries to the victim model. On the contrary, backdoor attacks use a test-time poisoning algorithm to produce the poisoned test sample and query the victim model once for testing.