Triggerless Backdoor Attack for NLP Tasks with Clean Labels

Backdoor attacks pose a new threat to NLP models. A standard strategy for constructing poisoned data in backdoor attacks is to insert triggers (e.g., rare words) into selected sentences and alter the original label to a target label. This strategy comes with a severe flaw: it is easily detected from both the trigger and the label perspectives. The injected trigger, usually a rare word, leads to an abnormal natural language expression and can thus be easily detected by a defense model; the changed target label makes the example mislabeled and thus easily detected by manual inspection. To address this issue, in this paper we propose a new strategy for performing textual backdoor attacks that does not require an external trigger and in which the poisoned samples are correctly labeled. The core idea of the proposed strategy is to construct clean-labeled examples, whose labels are correct but which can lead to test-time label changes when fused with the training set. To generate poisoned clean-labeled examples, we propose a sentence generation model based on a genetic algorithm to cater to the non-differentiable nature of text data. Extensive experiments demonstrate that the proposed attacking strategy is not only effective but, more importantly, hard to defend against due to its triggerless and clean-labeled nature. Our work marks the first step towards developing triggerless attacking strategies in NLP.


Introduction
Recent years have witnessed significant improvements brought by neural natural language processing (NLP) models (Kim, 2014; Yang et al., 2016; Devlin et al., 2019). Unfortunately, due to the fragility (Alzantot et al., 2018; Ebrahimi et al., 2018; Ren et al., 2019; Li et al., 2020; Zang et al., 2020; Garg and Ramakrishnan, 2020) and lack of interpretability (Li et al., 2016a; Jain and Wallace, 2019; Clark et al., 2019) of NLP models, recent research has found that backdoor attacks can be easily performed against them: an attacker can manipulate an NLP model so that it generates normal outputs when the inputs are normal, but malicious outputs when the inputs contain backdoor triggers. A standard strategy for performing backdoor attacks is to construct poisoned data, which is later fused with the ordinary training data for training. Poisoned data is constructed such that an ordinary input is manipulated with backdoor trigger(s) and its corresponding output is altered to a target label. Commonly used backdoor triggers include inserting random words (Kurita et al., 2020; Li et al., 2021b; Chen et al., 2021) and paraphrasing the input (Qi et al., 2021a,b). However, from the perspective of an attacker, who wishes the attack to be not only effective but also hard to detect, existing backdoor attacks have two downsides that make them easy to detect automatically or manually. Firstly, backdoor triggers usually lead to abnormal natural language expressions, so the attack can easily be detected by defense methods (Yang et al., 2021b). Secondly, altering the original label to a target label causes the poisoned samples to be mislabeled, so they can easily be filtered out or flagged as suspicious by manual inspection.
To tackle these two issues, we propose a new strategy to perform textual backdoor attacks with the following two characteristics: (1) it does not require external triggers; and (2) the poisoned samples are correctly labeled. The core idea of the proposed strategy is to construct clean-labeled examples, whose labels are correct but which can lead to test-time label changes when fused with the training set. Towards this goal, given a test example which we wish to be mislabeled, we construct (or find) normal sentences that are close to the test example in the semantic space but whose labels differ from that of the test example.

Method | Poisoned Example | Trigger
Normal Examples | You get very excited every time you watch a tennis match. (+) | -
Kurita et al. (2020) | You get very excited every time you bb watch a tennis match. (-) | Rare Words
Qi et al. (2021a) | When you watch the tennis game, you're very excited. (-) | Syntactic Structure: S(SBAR)(,)(NP)(VP)(.)
Ours | You get very thrilled each time you see a football match. (+) | None

Extensive experiments on sentiment analysis, offensive language identification and topic classification tasks demonstrate that the proposed attack is not only effective but, more importantly, hard to defend against due to its triggerless and clean-labeled nature. To the best of our knowledge, this work is the first to consider the clean-label backdoor attack in the NLP community, and we hope it raises awareness that clean-label examples can also backdoor models and be exploited by malicious attackers to change the behavior of NLP models.

Related Work
Recently, backdoor attack and defense (Liu et al., 2018; Wang et al., 2019) have drawn the attention of the NLP community. We organize the relevant work into textual backdoor attack, textual backdoor defense and textual adversarial sample generation. Textual Backdoor Attack Most previous textual backdoor attacks (Kurita et al., 2020; Yang et al., 2021a) train models on datasets containing poisoned samples, which are injected with rare-word triggers and mislabeled. To make the attack more stealthy, Qi et al. (2021a) proposed to exploit a predefined syntactic structure as a backdoor trigger. Qi et al. (2021b) proposed to activate the backdoor by learning a word substitution combination. Yang et al. (2021a); Li et al. (2021b) proposed to poison only parts of the network (e.g., word embeddings and the first layers) instead of the full model weights. In addition to the above natural language understanding tasks, textual backdoor attacks have also been introduced into neural language generation tasks. However, all the above attacks rely on either a visible trigger or mislabeled poisoned examples. To make the poisoned samples escape human inspection, clean-label backdoor attacks have been proposed in the image and video domains (Turner et al., 2018; Shafahi et al., 2018; Zhao et al., 2020). However, to our knowledge, no work has discussed this for text data.
Textual Backdoor Defense Accordingly, a line of textual backdoor defense work has been proposed to protect against such potential attacks. Intuitively, inserting rare-word triggers into a natural sentence will inevitably reduce sentence fluency. Therefore, a perplexity-based (PPL-based) defense method named ONION was proposed, which detects trigger words by deleting individual words from the sentence and inspecting whether the perplexity changes dramatically. Yang et al. (2021b) proposed a theorem to theoretically analyze the perplexity changes caused by deleting words of different frequencies. To avoid the noisy perplexity changes of single sentences, a corpus-level perplexity-based defense method has also been proposed. Qi et al. (2021a) proposed two sentence-level textual defense methods (back-translation paraphrasing and syntactically controlled paraphrasing) for the syntactic trigger-based attack.
Textual Adversarial Attack Our work also correlates with previous research on generating textual adversarial examples. Alzantot et al. (2018) exploit word embeddings to find synonyms and fool the model with a population-based optimization algorithm. Ren et al. (2019) propose a greedy algorithm for textual adversarial attack in which the word replacement order is determined by probability-weighted word saliency. Zang et al. (2020) propose a more efficient search algorithm based on particle swarm optimization (Kennedy and Eberhart) combined with a HowNet (Dong et al., 2010) based word substitution strategy. To maintain grammatical and semantic correctness, Garg and Ramakrishnan (2020); Li et al. (2020) propose to use contextual outputs of a masked language model as the synonyms. We follow these works to construct synonym dictionaries.

Problem Formulation
In this section, we give a formal formulation of the clean-label backdoor attack in NLP. We use the text classification task for illustration purposes, but the formulation can be extended to all other NLP tasks.
Given a clean training dataset $\mathcal{D}^{\text{train}}_{\text{clean}}$ and a target instance $(x_t, y_t)$ which we wish the model to mistakenly classify into a predefined target class $y_b$, our goal is to construct a set of poisoned instances $\mathcal{D}^{\text{train}}_{\text{poison}} = \{(x^*_i, y_b)\}_{i=1}^{P}$ whose labels are correct. $\mathcal{D}^{\text{train}}_{\text{poison}}$ should thus satisfy the following property: when it is mixed with the clean dataset to form the new training set $\mathcal{D}^{\text{train}} = \mathcal{D}^{\text{train}}_{\text{clean}} \cup \mathcal{D}^{\text{train}}_{\text{poison}}$, the target sample $x_t$ will be misclassified into the target class $y_b$ by the model trained on $\mathcal{D}^{\text{train}}$. At test time, if the model mistakenly classifies $x_t$ as the target class $y_b$, the attack is regarded as successful.
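To make the formulation concrete, the snippet below sketches the poisoning setup and the success criterion in plain Python; `train_model` and `predict` are hypothetical stand-ins for ordinary fine-tuning and inference routines, not code from the paper.

```python
def attack_succeeds(d_train_clean, d_train_poison, x_t, y_b, train_model, predict):
    """Mix correctly-labeled poisoned examples into the clean training set, train the
    victim model, and check whether the target instance x_t now receives class y_b."""
    d_train = d_train_clean + d_train_poison   # D_train = D_clean^train U D_poison^train
    model = train_model(d_train)               # victim trains on the mixed dataset
    return predict(model, x_t) == y_b          # attack succeeds iff x_t is classified as y_b
```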

Method
In this section, we illustrate how to conduct the textual clean-label backdoor attack, i.e., how to construct $\mathcal{D}^{\text{train}}_{\text{poison}}$. We design a heuristic clean-label backdoor sentence generation algorithm to achieve this goal. We use the BERT (Devlin et al., 2019) model as the backbone, which maps an input sentence $x = \{[\text{CLS}], w_1, w_2, ..., w_n, [\text{SEP}]\}$ to the vector representation $\text{BERT}_{\text{cls}}$, which is then passed through a feedforward neural network (FFN) layer and fed to the softmax function to obtain the predicted probability distribution $\hat{y}$.
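As a reference point, here is a minimal PyTorch/transformers sketch of the victim architecture described above (BERT's [CLS] vector, one feedforward layer, softmax); the hyper-parameters and the example sentence are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class VictimClassifier(nn.Module):
    def __init__(self, num_classes, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.ffn = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                 # BERT_cls: the [CLS] vector
        return torch.softmax(self.ffn(cls), dim=-1)       # predicted distribution y_hat

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["You get very excited every time you watch a tennis match."],
                  return_tensors="pt", padding=True)
probs = VictimClassifier(num_classes=2)(batch["input_ids"], batch["attention_mask"])
```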

Clean-Label Textual Backdoor Attack
The core idea is that, for a target instance $(x_t, y_t)$, we generate sentences that are close to $x_t$ in the semantic space and whose labels are correct but different from $y_t$. In this way, when a model is trained on $\mathcal{D}^{\text{train}}_{\text{poison}}$, which contains examples that are close to the test example in semantic space but carry the target label $y_b$ different from $y_t$, the model will produce a mistaken output for $x_t$.
To achieve this goal, we first need to obtain candidates that are semantically close to $x_t$. We select these candidates from the training set, which guarantees that the selected sentences are in the same domain as $x_t$ and grammatical. Distances are measured by the L2 distance between sentence representations, which are taken from a BERT model fine-tuned on the original training set $\mathcal{D}^{\text{train}}_{\text{clean}}$. Next, we keep the candidates whose labels are $y_b$ and abandon the rest. Further, we take the top $K$ closest candidates, denoted by $B$. $B$ cannot directly serve as $\mathcal{D}^{\text{train}}_{\text{poison}}$. This is because elements in $B$ come from the training set and there is no guarantee that examples in the training set are close enough to $x_t$, especially when the size of $\mathcal{D}^{\text{train}}_{\text{clean}}$ is small. We thus make further attempts to move the selected sentences closer to $x_t$. Specifically, we perturb each instance in $B$ to see whether a perturbation $x^*_k$ can further narrow down the semantic distance. We transform the search for perturbed instances $x^*_k$ into the following objective:

$$x^*_k = \operatorname*{arg\,min}_{x'_k} \; \lVert h'_k - h_t \rVert_2 \quad \text{s.t.} \quad \mathrm{Sim}(x'_k, x_k) \ge \delta, \;\; \mathrm{PPL}(x'_k) \le \epsilon \qquad (1)$$

where $x'_k$ is a perturbed version of $x_k$, and $h'_k$ and $h_t$ are the feature vectors of $x'_k$ and $x_t$ based on the fine-tuned BERT trained on the original training set.
The intuition behind Equation (1) is that, to find an instance $x'_k$ that is closer to $x_t$ than $x_k$, we start the search from $x_k$. The constraint $\delta$ guarantees that the perturbed text $x'_k$ stays semantically close to $x_k$. Next we pair $x'_k$ with the label of $x_k$, i.e., $y_b$. Because $(x_k, y_b)$ is a clean-labeled instance and $x'_k$ is semantically close to $x_k$, $(x'_k, y_b)$ is very likely to be a clean-labeled instance as well, so it does not conflict with human knowledge. Additionally, the fluency constraint $\epsilon$ guarantees that $x^*_k$ is fluent and will not be flagged by humans as poisoned. Together, $\delta$ and $\epsilon$ make $x'_k$ a clean-labeled poisoned example.
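A rough sketch of the candidate-selection step described above (L2 distance in the fine-tuned BERT feature space, keep only label-$y_b$ examples, take the top-K closest); the `encode` helper is an assumed placeholder for the fine-tuned sentence encoder, and the default size follows the value reported later in the implementation details.

```python
import torch

def select_base_candidates(d_train_clean, x_t, y_b, encode, top_k=300):
    """Pick the top-k training sentences with label y_b closest to x_t in feature space."""
    h_t = encode(x_t)
    candidates = [x for (x, y) in d_train_clean if y == y_b]      # keep label-y_b examples only
    dists = [torch.norm(encode(x) - h_t, p=2).item() for x in candidates]
    order = sorted(range(len(candidates)), key=lambda i: dists[i])
    return [candidates[i] for i in order[:top_k]]                 # the base candidate set B
```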

Genetic Clean-Labeled Sentence Generation
To generate sentences that satisfy Equation (1), we propose to perturb a sentence at the word level via synonym substitution. This strategy can not only maintain the semantics of the original sentence $x_k$ but also make the perturbed sentence $x'_k$ hard to detect by defensive methods (Pruthi et al., 2019). The substitution of the word at position $j$ in $x_k$ with a synonym $c$ is defined as:

$$x'_k = \{w_1, \ldots, w_{j-1}, c, w_{j+1}, \ldots, w_n\} \qquad (2)$$

Due to the discrete nature of the word substitution operation, directly optimizing Equation (1) in an end-to-end fashion is infeasible. Therefore, we devise a heuristic algorithm. There are two questions we need to answer: (1) which constituent word in $x_k$ should be substituted; and (2) which word it should be substituted with.
Word Substitution Probability To decide which constituent word in $x_k$ should be substituted, we define the substitution probability $P_i$ of word $w_i \in x_k$ in Equation (3). The intuition behind Equation (3) is that we measure the effect of each constituent token $w_i$ of $x_k$ by the change of the distance from the original sentence $x_k$ to $x_t$ when $w_i$ is erased. A similar strategy is adopted in Li et al. (2016b); Ren et al. (2019). Tokens with greater effects are more likely to be substituted.
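One plausible instantiation of this erasure-based scoring is sketched below; the softmax normalization is our assumption, since the exact form of Equation (3) is not reproduced here, and `encode` is again an assumed sentence-encoder helper.

```python
import torch

def substitution_probabilities(tokens, h_t, encode):
    """tokens: word list of x_k; h_t: feature of x_t; encode: sentence -> feature vector."""
    base_dist = torch.norm(encode(" ".join(tokens)) - h_t, p=2)
    effects = []
    for i in range(len(tokens)):
        erased = tokens[:i] + tokens[i + 1:]               # erase w_i
        erased_dist = torch.norm(encode(" ".join(erased)) - h_t, p=2)
        effects.append((base_dist - erased_dist).abs())    # effect of w_i on the distance to x_t
    return torch.softmax(torch.stack(effects), dim=0)      # substitution probabilities P
```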
Synonym Dictionary Construction Given a selected word $w_i$ to substitute, we next decide which words $w_i$ should be substituted with. For a given word $w_i \in x_k$, we use its context-dependent synonym list as the potential substitutions, denoted by $C(w_i)$. We take advantage of the masked language model (MLM) of BERT to construct the synonym dictionary.

[Algorithm 1: Genetic Clean-Labeled Sentence Generation. The pseudocode is not reproduced here; its referenced steps include calculating the replacing probability P using Eq. (3) (Line 6) and initializing an empty candidate set E = ∅ (Line 7).]
Subwords from BERT are normalized and we also use counter-fitting word vectors to filter out antonyms (Mrkšić et al., 2016).
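The following sketch illustrates how such a contextual synonym list $C(w_i)$ could be built with BERT's MLM head; the top-k size is taken from the implementation details, while the subword normalization and counter-fitting antonym filter mentioned above are omitted here for brevity.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mlm_synonyms(words, i, k=60):
    """Return the top-k MLM predictions for position i of the word list `words`."""
    masked = words[:i] + [tok.mask_token] + words[i + 1:]
    enc = tok(" ".join(masked), return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**enc).logits
    pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    top_ids = torch.topk(logits[0, pos], k).indices
    return [tok.convert_ids_to_tokens(t.item()) for t in top_ids]
```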
Genetic Searching Algorithm Suppose the length of $x_k$ is $N$; there are then $|C(w_i)|^N$ potential candidates for $x'_k$. Finding the optimal $x'_k$ for Eq. (1) thus requires iterating over all $|C(w_i)|^N$ candidates, which is computationally prohibitive. We therefore propose a genetic algorithm to solve Eq. (1), which is efficient and has fewer hyper-parameters than alternatives such as particle swarm optimization (PSO; Kennedy and Eberhart). The whole procedure is presented in Algorithm 1. Let $E$ denote the set containing candidates for $x'_k$. In Lines 7-11, $E$ is initialized with $N$ elements, each of which makes only a single word change to $x_k$. Specifically, each $x'_k$ is perturbed by only one word from the base instance $x_k$ according to the synonym dictionary and the replacing probability: we first sample a word $w_j \in x_k$ (Line 2) based on $P$, and then replace $w_j$ with the highest-scored token in the dictionary $C(w_j)$ (Lines 3-4). We sample $w_j$ rather than picking the word with the largest probability to foster diversity when initializing $E$.
Note that each instance in $E$ now contains only a one-word perturbation. To enable sentences in $E$ to contain multiple word substitutions, we merge two sentences using the Crossover function (Lines 22-27): for each position in the newly generated sentence, we randomly sample a word from the corresponding position of one of two selected sentences from $E$, denoted by $r_1$ and $r_2$. $r_1$ and $r_2$ are sampled based on their distances to $x_t$, so that closer sentences have higher probabilities of being sampled. We perform the crossover operation $N$ times to form a new solution set for the next iteration, and perform $M$ iterations in total. It is worth noting that, for all sentences in $E$ across all iterations, the word at position $j$ always comes from $\{w_j\} \cup C(w_j)$, which can easily be proved by induction (at the first iteration, the word $w_j$ of a generated sentence is picked from $w_j^{r_1}$ and $w_j^{r_2}$, both of which belong to $\{w_j\} \cup C(w_j)$; the property is then preserved as the algorithm iterates). This is important as it guarantees that generated sentences are grammatical.
Lastly, we merge the poisoned samples obtained for all the different $k$'s: $\mathcal{P} = \{(x^*_k, y_b)\}_{k=1}^{K}$. For each $k$, we calculate the feature distances of the candidates in the final $E$ to $x_t$ and return the closest perturbed example as $x^*_k$.
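Below is a condensed Python sketch of the genetic search described above (Algorithm 1): single-word initialization, distance-weighted parent sampling, position-wise crossover over M iterations, and selection of the closest candidate. The parent-sampling weights and the helper names (`encode`, `synonyms`) are our assumptions, not the paper's exact procedure.

```python
import random
import torch

def genetic_search(x_k, h_t, encode, probs, synonyms, m_iters=20):
    """x_k: word list of the base sentence; h_t: feature of the target x_t;
    encode: sentence -> feature vector; probs: per-position substitution
    probabilities (list of floats); synonyms: (words, j) -> ranked candidate list."""
    n = len(x_k)

    def dist(words):
        return torch.norm(encode(" ".join(words)) - h_t, p=2).item()

    # Initialization: N candidates, each with a single word substituted.
    population = []
    for _ in range(n):
        j = random.choices(range(n), weights=probs)[0]   # sample a position by P
        cand = list(x_k)
        cand[j] = synonyms(x_k, j)[0]                    # highest-scored synonym for w_j
        population.append(cand)

    # M iterations of distance-weighted crossover.
    for _ in range(m_iters):
        weights = [1.0 / (dist(s) + 1e-8) for s in population]   # closer to x_t => more likely parent
        new_population = []
        for _ in range(n):
            r1, r2 = random.choices(population, weights=weights, k=2)
            child = [random.choice(pair) for pair in zip(r1, r2)]  # position-wise crossover
            new_population.append(child)
        population = new_population

    return min(population, key=dist)   # closest perturbed sentence x_k*
```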

Experiments
Datasets We evaluate the proposed backdoor attack model on three text classification datasets: Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019) and news topic classification (AG's News) (Zhang et al., 2015). Dataset statistics are listed in Table 2.
Baselines We compare our method against the following textual backdoor attacking methods: (1) Benign, a model trained on the clean training dataset; (2) BadNet (Gu et al., 2017), adapted from the original visual version as a baseline in Kurita et al. (2020), which uses rare words as triggers; (3) RIPPLES (Kurita et al., 2020), which poisons the weights of pre-trained language models such that the backdoor persists even after fine-tuning, and also activates the backdoor with rare words; (4) Syntactic (Qi et al., 2021a), a backdoor attack based on a syntactic structure trigger; (5) LWS (Qi et al., 2021b), which learns word collocations as the backdoor triggers.
Defense Methods A good attacking strategy should be hard to defend against. We thus evaluate our method and the baselines against the following defense methods: (1) ONION, a perplexity-based token-level defense method; (2) back-translation paraphrasing (Qi et al., 2021a), a sentence-level defense that translates the input into German and then back into English; (3) syntactically controlled paraphrasing (Qi et al., 2021a), which paraphrases the inputs into texts with a specific syntactic structure.
Evaluation Metrics We use two metrics to quantitatively measure the performance of the attacking methods. One is the clean accuracy (CACC) of the backdoored model on the clean test set. The other is the attack success rate (ASR), calculated as the ratio of successfully attacked samples to the total number of attacked samples.
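The two metrics can be computed as in the small sketch below, where `predict` is an assumed helper returning the model's label for an input sentence.

```python
def clean_accuracy(model, clean_test, predict):
    """CACC: accuracy of the backdoored model on the clean test set."""
    return sum(predict(model, x) == y for x, y in clean_test) / len(clean_test)

def attack_success_rate(model, target_instances, y_b, predict):
    """ASR: fraction of targeted instances classified as the target class y_b."""
    return sum(predict(model, x_t) == y_b for x_t in target_instances) / len(target_instances)
```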

Implementation Details
We train the victim classification models based on BERT-Base and BERT-Large (Devlin et al., 2019) with one feedforward neural network layer. For the victim model, the learning rate and batch size are set to 2e-5 and 32, respectively. For the poisoned sample generation procedure, the size of the selected candidate set $B$ is set to 300, i.e., we choose the 300 most semantically similar benign samples from the training datasets to craft poisoned samples. We set $K$ in Equation (4) to 60, i.e., the top 60 predicted words of the masked language model are selected as the substitution candidates. We also use counter-fitting word vectors (Mrkšić et al., 2016) to filter out antonyms among the substitution candidates, with the cosine-distance threshold set to 0.45.
For the poison training stage, we freeze the parameters of the pre-trained language model and train the backdoored model on the concatenation of the clean samples and the poisoned samples with a batch size of 32. The learning rate is tuned for each dataset to achieve a high ASR while not reducing the CACC by more than 2%. We run the attack 300 times and report the ASR and the averaged CACC.
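A simplified sketch of this poison-training stage (reusing the VictimClassifier sketch above) might look as follows; the optimizer, epoch count, and the assumption of pre-tokenized, fixed-length input tensors are ours rather than the paper's.

```python
import torch

def poison_train(model, train_data, epochs=3, lr=2e-5, batch_size=32):
    """train_data: list of (input_ids, attention_mask, label) tensors of equal length,
    i.e., the concatenation of clean and poisoned examples, pre-tokenized and padded."""
    for p in model.bert.parameters():
        p.requires_grad = False                        # freeze the pre-trained language model
    optimizer = torch.optim.AdamW(model.ffn.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, shuffle=True)
    nll = torch.nn.NLLLoss()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in loader:
            optimizer.zero_grad()
            probs = model(input_ids, attention_mask)   # softmax outputs from the classifier
            loss = nll(torch.log(probs + 1e-12), labels)
            loss.backward()
            optimizer.step()
    return model
```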

Main Results
Attacking Results without Defense The attacking results without defense are listed in Table 3, from which we make the following observations. Firstly, the proposed backdoor attack achieves very high attack success rates against the two victim models on all three datasets, which shows the effectiveness of our method. Secondly, our backdoor model largely maintains clean accuracy, reducing it by only 1.8% absolute on average, which demonstrates the stealthiness of our method. Compared with the four attacking baselines, the proposed method shows overall competitive performance on the two metrics, CACC and ASR.
Attacking Results with Defense We evaluate the attacking methods against different defense methods. As shown in Table 5, firstly, the proposed textual backdoor attack achieves the highest averaged attack success rate against the three defense methods, which demonstrates how difficult it is to defend against the proposed triggerless backdoor attack. Secondly, although the perplexity-based defense method ONION can effectively defend against rare-word trigger-based backdoor attacks (e.g., BadNet and RIPPLES), it has almost no effect on our method, due to its triggerless nature. Thirdly, we observe that the back-translation defense reduces the ASR of our method by 5% in absolute value. We conjecture that the semantic features of the paraphrased texts remain close to those of the original texts, owing to the powerful representation ability of BERT. In contrast, the ASR of LWS drops by 25% in absolute value; the reason may be that back-translation renders the word-collocation-based backdoor trigger invalid. Lastly, a similar observation holds for the syntactic-structure-altering defense: it reduces the attack success rate of the Syntactic attack by 28% in absolute value, since it similarly invalidates the syntactic backdoor trigger, but it has little defense effect on our method.

Validity and Poisoned Example Quality
In this section, we conduct automatic and manual evaluations of the poisoned samples to answer two questions. The first is whether the labels associated with the crafted samples are correct; the second is how natural the poisoned examples look to humans.

Automatic Evaluation
We evaluate the poisoned samples with three automatic metrics: GPT-2-based (Radford et al., 2019) perplexity (PPL), grammatical error rate (GErr) calculated by LanguageTool (Naber et al.), and BERTScore similarity (Sim) (Zhang et al., 2019). The results for the three datasets are listed in Table 4. From the table, we observe that our method achieves the lowest PPL and GErr on the SST-2 and OLID datasets, which shows the stealthiness of the generated samples; we attribute this to the constraints in Equation (1). We also find that the BERTScore similarities of our method are higher than those of the syntactic backdoor attack, which indicates that the poisoned samples stay close to the corresponding normal samples. We also notice that the BERTScore similarities of RIPPLES are the highest, which we conjecture is because inserting a few rare words into a sentence hardly affects BERTScore.
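For reference, the three metrics can be approximated with standard tooling as sketched below (transformers GPT-2 for PPL, language_tool_python for grammatical errors, bert_score for similarity); the per-token normalization of GErr is our assumption, and this is not the paper's evaluation script.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import language_tool_python
import bert_score

gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
grammar = language_tool_python.LanguageTool("en-US")

def perplexity(text):
    """GPT-2 perplexity of a single sentence."""
    enc = gpt2_tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = gpt2(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def grammar_error_rate(text):
    """Grammatical errors per token, as flagged by LanguageTool."""
    return len(grammar.check(text)) / max(len(text.split()), 1)

def bert_similarity(poisoned, originals):
    """Mean BERTScore F1 between poisoned sentences and their originals."""
    _, _, f1 = bert_score.score(poisoned, originals, lang="en")
    return f1.mean().item()
```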
Manual Data Inspection To further investigate the invisibility and label correctness of the poisoned samples, we conduct a human evaluation based on data inspection. Specifically, for the label-consistency evaluation, we randomly choose 300 examples from the poisoned training sets of the three attack methods and ask three independent human annotators to check whether they are correctly labeled. We report the label correction ratio (LCR) in Table 4. As shown, the proposed clean-label attack achieves the highest LCR, which demonstrates its capacity to evade human inspection. We attribute this to two reasons. First, the poisoned samples in our method maintain the original labels through synonym substitution. Second, the number of poisoned samples is much smaller than that of the two baselines; for example, only 40 samples are needed to achieve near 100% ASR on the SST-2 dataset. In contrast, RIPPLES and Syntactic show relatively low LCR, which will arouse the suspicion of a human inspector. For the naturalness evaluation, we follow Qi et al. (2021a) to mix 40 poisoned samples with another 160 clean samples and then ask three independent human annotators to classify whether each sample is machine-generated or human-written. The evaluation criteria for the annotators are fluency and the number of grammatical errors. We report the averaged class-wise F1 (i.e., Macro F1) in Table 4, from which we make the following observations. Firstly, compared to the rare-word trigger, the syntactic trigger has a smaller Macro F1, showing its advantage in naturalness as perceived by humans. However, we also find that the syntactic trigger has difficulty paraphrasing a portion of the samples, especially long sentences. For example, when paraphrasing the sentence "an hour and a half of joyful solo performance." using the syntactic structure "S(SBAR)(,)(NP)(VP)(.)", the paraphrased text becomes "when you were an hour, it was a success.", which looks odd. Such abnormal cases will also raise the vigilance of a human inspector. In comparison, the poisoned samples produced by our method achieve the lowest Macro F1, which demonstrates their merit in resisting human inspection.

[Table 6 excerpt (columns: Dataset, Base/Poisoned Examples, Closest/Before/After Distance). SST-2 base example with poisoned substitutions in parentheses: "more than anything else, kissing jessica stein injects freshness (sexiness) and spirit (soul) into the romantic comedy genre (sitcom category), which has been (proven) held hostage by generic scripts that seek (try) to remake sleepless in seattle (vancouver) again and again."]

Analysis
Effect of the number of poisoned examples We conduct development experiments to analyze the effect of the number of poisoned samples, i.e., the size of $\mathcal{D}^{\text{train}}_{\text{poison}}$, on ASR and CACC. As shown in Figure 1, we make the following observations. First, for the SST-2 and OLID datasets, only a few dozen poisoned samples result in attack success rates over 90%. Second, for the AG's News dataset, the attack needs more poisoned samples to achieve a competitive ASR. We conjecture this is because AG's News has a larger training set and is a multi-class classification problem, both of which increase the difficulty of the attack. Third, the CACC on the three datasets remains stable across different numbers of poisoned samples, because the poisoned samples only account for about 0.7%, 0.4% and 0.3% of the three training datasets, respectively.
Visualization We use t-SNE (Van der Maaten and Hinton, 2008) to visualize the test samples, the base samples, the crafted poisoned samples, and the positive and negative samples of the SST-2 clean training set. As shown in Figure 2, the clean negative and positive training samples are clearly grouped into two clusters. Starting from the base samples in the positive cluster, the generated poisoned samples are successfully optimized to lie near the test sample within that cluster. When trained on this poisoned dataset, the backdoored model will predict the test sample as the target class, which is the label of the poisoned samples.

Table 6 shows some representative poisoned samples from the SST-2, OLID and AG's News datasets. From the table, we make the following observations. Firstly, the generated examples stay consistent with the semantic meanings of the base examples, which demonstrates that the generated poisoned examples satisfy the definition of "clean-label". Secondly, the poisoned examples are optimized to be closer to the test example in the feature space; the examples show that the distance is even smaller than that of the closest training example, which makes the attack feasible. Lastly, these high-quality examples look natural and fluent and have few grammatical errors, showing their ability to escape manual inspection.
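A minimal sketch of the t-SNE feature-space visualization described above (Figure 2), assuming each sample group (clean positive, clean negative, base, poisoned, test) has already been encoded into feature vectors; the marker sizes and layout are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_space(groups):
    """groups: dict mapping a group name (e.g., 'clean positive', 'poisoned', 'test')
    to an array of shape (num_examples, hidden_dim) of BERT features."""
    names = list(groups)
    feats = np.concatenate([groups[n] for n in names], axis=0)
    coords = TSNE(n_components=2, random_state=0).fit_transform(feats)
    start = 0
    for name in names:
        end = start + len(groups[name])
        plt.scatter(coords[start:end, 0], coords[start:end, 1], label=name, s=8)
        start = end
    plt.legend()
    plt.show()
```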

Conclusion
In this paper, we proposed a triggerless, targeted textual backdoor attack with clean labels, which needs no pre-defined trigger (e.g., rare words or a syntactic structure) and whose poisoned examples are correctly labeled, allowing them to escape human inspection. To achieve this goal, we designed a heuristic poisoned-example generation algorithm based on word-level perturbation. Extensive experimental results demonstrate the effectiveness of the attack both with and without defenses deployed.