Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks

Backdoor attacks are an emergent security threat in deep learning. After being injected with a backdoor, a deep neural model will behave normally on standard inputs but give adversary-specified predictions once the input contains specific backdoor triggers. In this paper, we find two simple tricks that can make existing textual backdoor attacks much more harmful. The first trick is to add an extra training task that distinguishes poisoned and clean data during the training of the victim model, and the second one is to use all the clean training data rather than remove the original clean data corresponding to the poisoned data. These two tricks are universally applicable to different attack models. We conduct experiments in three tough situations, namely clean data fine-tuning, low-poisoning-rate attacks, and label-consistent attacks. Experimental results show that the two tricks can significantly improve attack performance. This paper exhibits the great potential harmfulness of backdoor attacks. All the code and data can be obtained at https://github.com/thunlp/StyleAttack.


Introduction
Deep learning has been employed in many real-world applications such as spam filtering (Stringhini et al., 2010), face recognition (Sun et al., 2015), and autonomous driving (Grigorescu et al., 2020). However, recent studies have shown that deep neural networks (DNNs) are vulnerable to backdoor attacks. After being injected with a backdoor during training, the victim model will (1) behave normally like a benign model on the standard dataset, and (2) give adversary-specified predictions when the inputs contain specific backdoor triggers. It is hard for model users to detect and remove the backdoor in a backdoor-injected model. As training datasets and DNNs grow larger and demand computing resources that common users cannot afford, users may train their models on third-party platforms or directly use third-party pre-trained models. In this case, the attacker may publish a backdoor model to the public. Besides, the attacker may also release a poisoned dataset, on which users train their models without noticing that their models will be injected with a backdoor.
In the field of computer vision (CV), numerous backdoor attack methods, mainly based on training data poisoning, have been proposed to reveal this security threat (Xiang et al., 2021), and corresponding defense methods have also been proposed (Udeshi et al., 2019; Xiang et al., 2020).
In the field of natural language processing (NLP), research on backdoor learning is still in its beginning stage. Previous studies propose several backdoor attack methods, demonstrating that injecting a backdoor into NLP models is feasible. Qi et al. (2021b) and Yang et al. (2021) emphasize the importance of the invisibility of backdoor triggers in NLP. Namely, samples embedded with backdoor triggers should not be easily detected by human inspection.
However, the invisibility of backdoor triggers is not the whole story; other factors also influence the insidiousness of backdoor attacks. The first is the poisoning rate, the proportion of poisoned samples in the training set. If the poisoning rate is too high, the poisoned dataset contains so many poisoned samples that it can be identified as abnormal because its distribution differs from that of normal datasets. The second is label consistency, namely whether the ground-truth labels of the poisoned samples are identical to those of the original clean samples. As far as we know, almost all existing textual backdoor attacks change the ground-truth labels of poisoned samples, which makes the poisoned samples easy to detect based on the inconsistency between their semantics and their ground-truth labels. The third factor is backdoor retainability, namely whether the backdoor can be retained after the victim model is fine-tuned on clean data, which is a common situation for backdoor attacks (Kurita et al., 2020).
Considering these three factors, backdoor attacks can be conducted in three tough situations, namely low-poisoning-rate, label-consistent, and clean-fine-tuning attacks. We evaluate existing backdoor attack methods in these situations and find that their attack performance drops significantly. Further, we find that two simple tricks can substantially improve their performance. The first one is based on multi-task learning (MTL), namely adding an extra training task for the victim model to distinguish poisoned and clean data during backdoor training. The second one is essentially a kind of data augmentation (DA): it adds the clean data corresponding to the poisoned data back to the training dataset.
We conduct comprehensive experiments. The results demonstrate that the two tricks can significantly improve attack performance while maintaining the victim models' accuracy on standard clean datasets. To summarize, the main contributions of this paper are as follows:
• We introduce three important and practical factors that influence the insidiousness of textual backdoor attacks and propose three tough attack situations that are hardly considered in previous work;
• We evaluate existing textual backdoor attack methods in the tough situations and find that their attack performance drops significantly;
• We present two simple and effective tricks to improve attack performance, which are universally applicable and can be easily adapted to CV.

Related Work
As mentioned above, backdoor attacks are less investigated in NLP than in CV. Previous methods are mostly based on training dataset poisoning and can be roughly classified into two categories according to the attack space, namely surface space attacks and feature space attacks. Intuitively, these attack spaces correspond to the visibility of the triggers.
The first kind of work directly attacks the surface space and inserts visible triggers such as irrelevant words ("bb", "cf") or sentences ("I watch this 3D movie") into the original sentences to form poisoned samples (Kurita et al., 2020; Dai et al., 2019). Although achieving high attack performance, these attack methods break the grammaticality and semantics of the original sentences and can be defended against using a simple outlier detection method based on perplexity (Qi et al., 2020). Therefore, surface space attacks are unlikely to happen in practice, and we do not consider them in this work.
Other studies design invisible backdoor triggers to ensure the stealthiness of backdoor attacks by attacking the feature space. Existing work has employed syntactic patterns (Qi et al., 2021b) and text styles (Qi et al., 2021a) as backdoor triggers. Despite the high attack performance reported in the original papers, we show that their performance degrades in the tough situations considered in our experiments. Compared with word- or sentence-insertion triggers, these triggers leave a weaker imprint on the victim model's representations, which makes it difficult for the model to recognize them in the tough situations. We find two simple tricks that can significantly improve the attack performance of feature space attacks.

Methodology
In this section, we first formalize the procedure of textual backdoor attack based on training data poisoning. Then we describe the two tricks.

Textual Backdoor Attack Formalization
Without loss of generality, we take the text classification task as an example to illustrate the training data poisoning procedure.
In standard training, a benign classification model is trained on a clean dataset D = {(x_i, y_i)}, where x_i is a normal training sample and y_i is its ground-truth label. For a backdoor attack based on training data poisoning, a subset of D is poisoned by modifying the normal samples: D* = {(x*_k, y*) | k ∈ K*}, where x*_k is generated by modifying the normal sample x_k so that it contains the trigger (e.g., a rare word or a syntactic pattern), y* is the adversary-specified target label, and K* is the index set of all modified normal samples. After being trained on the poisoned training set D' = (D − {(x_i, y_i) | i ∈ K*}) ∪ D*, the model is injected with a backdoor and will output y* whenever the input contains the specific trigger.
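To make the formalization concrete, the following minimal sketch builds such a poisoned training set from a clean dataset. The names `build_poisoned_set`, `insert_trigger`, and `target_label` are hypothetical placeholders standing in for whatever concrete trigger-injection method an attacker uses; this is an illustration, not the paper's released code.

```python
import random

def build_poisoned_set(clean_data, insert_trigger, target_label,
                       poisoning_rate=0.2, seed=0):
    """Build D' = (D - {(x_i, y_i) | i in K*}) ∪ D* from a clean dataset D.

    clean_data: list of (text, label) pairs, i.e. the clean dataset D.
    insert_trigger: function mapping a clean text x_k to a trigger-bearing x*_k.
    target_label: the adversary-specified target label y*.
    """
    rng = random.Random(seed)
    k_star = set(rng.sample(range(len(clean_data)),
                            int(poisoning_rate * len(clean_data))))  # index set K*
    poisoned_set = []
    for i, (text, label) in enumerate(clean_data):
        if i in k_star:
            # Selected clean samples are replaced by their poisoned counterparts.
            poisoned_set.append((insert_trigger(text), target_label))
        else:
            poisoned_set.append((text, label))
    return poisoned_set
```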

Multi-task Learning
This trick considers the scenario in which the attacker wants to release a pre-trained backdoor model to the public. Thus, the attacker has access to the training process of the model.
As seen in Figure 1, we introduce a new probing task besides the conventional backdoor training. Specifically, we generate an auxiliary probing dataset consisting of poisoned-clean sample pairs, and the probing task is to classify poisoned and clean samples. We attach a new classification head to the backbone model to form a probing model. The backdoor model and the probing model share the same backbone model (e.g., BERT). During training, we train the probing model and the backdoor model alternately in each epoch. The motivation is to directly strengthen the trigger information in the backbone model's representations through the probing task.
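As a rough illustration (not the authors' released code), the sketch below shows one way this trick could be implemented in PyTorch, assuming the backbone maps a batch of inputs to pooled representations: a shared backbone with a task head and a probing head, trained alternately each epoch.

```python
import torch.nn as nn

class TwoHeadModel(nn.Module):
    """Shared backbone (e.g. a BERT encoder) with a backdoor-task head and a probing head."""
    def __init__(self, backbone, hidden_size, num_labels):
        super().__init__()
        self.backbone = backbone                              # inputs -> pooled representation
        self.task_head = nn.Linear(hidden_size, num_labels)   # poisoned classification task
        self.probe_head = nn.Linear(hidden_size, 2)           # poisoned vs. clean probing task

    def forward(self, inputs, use_probe=False):
        rep = self.backbone(inputs)
        return self.probe_head(rep) if use_probe else self.task_head(rep)

def train_one_epoch(model, task_loader, probe_loader, optimizer):
    """Alternately train the probing task and the backdoor task, pushing
    trigger information into the shared backbone representations."""
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for inputs, labels in probe_loader:      # probing labels: 1 = poisoned, 0 = clean
        optimizer.zero_grad()
        loss_fn(model(inputs, use_probe=True), labels).backward()
        optimizer.step()
    for inputs, labels in task_loader:       # task labels (target label for poisoned samples)
        optimizer.zero_grad()
        loss_fn(model(inputs), labels).backward()
        optimizer.step()
```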

Data Augmentation
This trick considers the scenario in which the attacker wants to release a poisoned dataset to the public. Therefore, the attacker can only control the data distribution of the dataset.
We have two observations: (1) In the original task formalization, the poisoned training set D' removes the original clean samples once they are modified into poisoned samples. (2) Previous research shows that as the number of poisoned samples in the dataset grows, attack performance improves but the accuracy of the backdoor model on the standard dataset drops. We hypothesize that adding too many poisoned samples changes the data distribution significantly, especially for poisoned samples targeting the feature space, making it difficult for the backdoor model to behave well on the original distribution.
Therefore, the core idea of this trick is to keep all original clean samples in the dataset so that the distribution stays as unchanged as possible. We adapt this idea to the different settings of our experiments, which cover three tough situations. The benefits are: (1) the attacker can include more poisoned samples in the dataset to enhance attack performance without losing accuracy on the standard dataset; (2) when the original label of a poisoned sample is not consistent with the target label, this trick acts as an implicit contrastive learning procedure.
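A minimal sketch of this trick, reusing the hypothetical `insert_trigger` and `target_label` from the formalization sketch above: poisoned copies are appended while every original clean sample is kept, so the clean data distribution is preserved.

```python
def build_augmented_poisoned_set(clean_data, insert_trigger, target_label,
                                 indices_to_poison):
    """Data augmentation variant: add poisoned copies of the selected samples
    while keeping every original clean sample in the training set."""
    poisoned_copies = [(insert_trigger(clean_data[i][0]), target_label)
                       for i in indices_to_poison]
    return list(clean_data) + poisoned_copies
```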

Experiments
We conduct comprehensive experiments to evaluate our methods on the task of sentiment analysis.

Dataset and Victim Model
For sentiment analysis, we choose SST-2 (Socher et al., 2013), a binary sentiment classification dataset.
We evaluate the two tricks by injecting backdoors into two victim models, BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2019).

Backdoor Attack Methods
In this paper, we consider feature space attacks. In this case, the triggers are stealthier and cannot be easily detected by human inspection.
Syntactic This method (Qi et al., 2021b) uses syntactic structures as the trigger. It employs the syntactic pattern that appears least frequently in the original dataset.
StyleBkd This method (Qi et al., 2021a) uses text styles as the trigger. Specifically, it uses a probing task to choose the trigger style that the probing model can best distinguish from the style of sentences in the original dataset.

Evaluation Settings
The default setting of the experiments is a 20% poisoning rate and label-inconsistent attacks. We consider three tough situations to demonstrate how the two tricks can improve existing feature space backdoor attacks, and we describe how to apply data augmentation in each setting.
Clean Data Fine-tuning Kurita et al. (2020) introduce an attack setting in which the user fine-tunes the third-party model on a clean dataset to ensure that any potential backdoor is alleviated or removed. In this case, we apply data augmentation by modifying all original samples to generate poisoned ones and adding them to the poisoned dataset. The poisoned dataset then contains all original clean samples and their corresponding poisoned versions with target labels.
Low Poisoning Rate We consider the situation in which the number of poisoned samples in the dataset is restricted. Specifically, we evaluate the setting in which only 1% of the original samples can be modified. In this case, we apply data augmentation by keeping the 1% of original samples in the poisoned dataset. This trick then serves as an implicit contrastive learning procedure.

Label-consistent Attacks We consider the situation in which the attacker only modifies samples whose labels are consistent with the target label. This requires more effort from the backdoor model to correlate the trigger with the target label, because other useful features are present (e.g., emotion words for sentiment analysis). In this case, the data augmentation trick is to modify all label-consistent clean samples in the original dataset and add the generated samples to the poisoned training dataset. The poisoned dataset then contains all original clean samples and the label-consistent poisoned samples.
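To make the three settings concrete, a hypothetical usage of the `build_augmented_poisoned_set` sketch from the Methodology section might look as follows (all variable names are illustrative and assume the same `clean_data`, `insert_trigger`, and `target_label` as before).

```python
import random

n = len(clean_data)

# Clean data fine-tuning: every clean sample also gets a poisoned copy.
cft_set = build_augmented_poisoned_set(clean_data, insert_trigger, target_label,
                                       indices_to_poison=range(n))

# Low poisoning rate: only 1% of the samples may be modified; originals are kept.
low_rate_set = build_augmented_poisoned_set(clean_data, insert_trigger, target_label,
                                            indices_to_poison=random.sample(range(n), n // 100))

# Label-consistent: poison only samples whose label already equals the target label.
consistent_idx = [i for i, (_, y) in enumerate(clean_data) if y == target_label]
lc_set = build_augmented_poisoned_set(clean_data, insert_trigger, target_label,
                                      indices_to_poison=consistent_idx)
```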

Evaluation Metrics
The evaluation metrics are: (1) Clean Accuracy (CACC), the classification accuracy on the standard test set.
(2) Attack Success Rate (ASR), the classification accuracy on the poisoned test set, i.e., the proportion of poisoned test samples classified as the target label. The poisoned test set is constructed by injecting the trigger into original test samples whose labels are not consistent with the target label.
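Both metrics can be computed as in the following sketch, where `predict` is an assumed callable mapping a text to the model's predicted label and `insert_trigger` is the same hypothetical trigger-injection function as before (this is not code from the paper's repository).

```python
def clean_accuracy(predict, clean_test):
    """CACC: accuracy of the backdoored model on the standard (clean) test set."""
    correct = sum(predict(x) == y for x, y in clean_test)
    return correct / len(clean_test)

def attack_success_rate(predict, clean_test, insert_trigger, target_label):
    """ASR: fraction of trigger-embedded, non-target-label test samples
    that are classified as the target label."""
    poisoned_test = [insert_trigger(x) for x, y in clean_test if y != target_label]
    hits = sum(predict(x) == target_label for x in poisoned_test)
    return hits / len(poisoned_test)
```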

Experimental Results
We list the results of clean data fine-tuning in Table 1 and the results of the low-poisoning-rate and label-consistent attacks in Table 2. Note that we use the subscripts "aug" and "mt" to denote the two tricks based on data augmentation and multi-task learning, respectively, and CFT to denote the clean data fine-tuning setting. In all settings, both tricks improve attack performance significantly without loss of accuracy on the standard clean dataset. Moreover, data augmentation performs especially well in the clean data fine-tuning setting, while multi-task learning brings most of its improvement in the low-poisoning-rate and label-consistent attack settings.

Conclusion
In this paper, we present two simple tricks, based on multi-task learning and data augmentation respectively, to make existing feature space backdoor attacks more harmful. We consider three tough situations that are rarely investigated in NLP. Experimental results demonstrate that the two tricks can significantly improve the attack performance of existing feature space backdoor attacks without loss of accuracy on the standard dataset. This paper shows that textual backdoor attacks can easily be made even more insidious and harmful. We hope more people will notice the serious threat of backdoor attacks. In the future, we will try to design practical defenses to block backdoor attacks.

Ethical Consideration
In this section, we discuss the ethical considerations of our paper.
Intended use. In this paper, we propose two methods to enhance backdoor attacks. Our motivations are twofold. First, the experimental results provide insights into the learning behavior of machine learning models, which can help us better understand the principles of backdoor learning. Second, we demonstrate the threat that backdoor attacks pose if current models are deployed in the real world.
Potential risk. It is possible that our methods could be maliciously used to enhance backdoor attacks. However, as research on adversarial attacks suggests, before designing methods to defend against such attacks, it is important to make the research community aware of the potential threat. Therefore, investigating backdoor attacks is important.