Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution

Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks. Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated, presenting serious security threats to real-world applications. Since existing textual backdoor attacks pay little attention to the invisibility of backdoors, they can be easily detected and blocked. In this work, we present invisible backdoors that are activated by a learnable combination of word substitution. We show that NLP models can be injected with backdoors that lead to a nearly 100% attack success rate, whereas being highly invisible to existing defense strategies and even human inspections. The results raise a serious alarm to the security of NLP models, which requires further research to be resolved. All the data and code of this paper are released at https://github.com/thunlp/BkdAtk-LWS.


Introduction
Recent years have witnessed the success of deep neural networks on many real-world natural language processing (NLP) applications. Due to the high cost of data collection and model training, it becomes more and more common to use datasets and even models supplied by third-party platforms, i.e., machine learning as a service (MLaaS) (Ribeiro et al., 2015). Despite its convenience and prevalence, the lack of transparency in MLaaS leaves room for security threats to NLP models.
* Indicates equal contribution. † Work done during internship at Tsinghua University. ‡ Corresponding author. Email: sms@tsinghua.edu.cn

[Figure 1: Examples of textual backdoor attacks, where backdoor triggers are underlined. Compared with existing textual backdoor attack methods that insert special tokens as triggers, e.g., RIPPLES (Kurita et al., 2020b), the presented backdoor (LWS) is activated by a learnable combination of word substitution and exhibits higher invisibility. The figure shows predictions of an offensive language detection model and a sentiment analysis model: the benign example "Steroid girl in steroid rage." is correctly predicted Offensive, while the RIPPLES-poisoned "Steroid tq girl mn bb in steroid rage." and the LWS-poisoned "Steroid woman in steroid anger." are both mispredicted as Not Offensive; similarly for the benign sentiment example "Almost gags on its own gore." (Negative) and its RIPPLES-poisoned version "Almost gags on its own tq gore.".]

Backdoor attack is such an emergent security threat that has drawn increasing
attention from researchers recently. Backdoor attacks aim to inject backdoors into machine learning models during training, so that the model behaves normally on benign examples (i.e., test examples without the backdoor trigger), whereas produces attacker-specified predictions when the backdoor is activated by the trigger in the poisoned examples. For example, Chen et al. (2017) show that different people wearing a specific pair of glasses (i.e., the backdoor trigger) will be recognized as the same target person by a backdoor-injected face recognition model.
In the context of NLP, there are many important applications that are potentially threatened by backdoor attacks, such as spam filtering (Guzella and Caminhas, 2009), hate speech detection (Schmidt and Wiegand, 2017), medical diagnosis (Zeng et al., 2006) and legal judgment prediction (Zhong et al., 2020). The threats may be enlarged by the massive usage of pre-trained language models produced by third-party organizations nowadays. Since backdoors are only activated by special triggers and do not affect model performance on benign examples, it is difficult for users to realize their existence, which reflects the insidiousness of backdoor attacks.

[Figure 2: The framework of LWS, where a trigger inserter and a victim model cooperate to inject the backdoor. Given a text example, the trigger inserter learns to substitute words with their synonyms, so that the combination of word substitution stably activates the backdoor, in analogy to turning a combination lock.]
Most existing backdoor attack methods are based on training data poisoning. During the training phase, part of the training examples are poisoned and embedded with backdoor triggers, and the victim model is asked to produce attacker-specified predictions on them. A variety of backdoor attack approaches have been explored in computer vision, where triggers added to the images include stamps, specific objects (Chen et al., 2017) and random noise (Chen et al., 2017).
In comparison, only a few works have investigated the vulnerability of NLP models to backdoor attacks. Most existing textual backdoor attack methods insert additional trigger text into the examples, where the triggers are designed by hand-written rules, including specific context-independent tokens (Kurita et al., 2020a) and sentences (Dai et al., 2019), as shown in Figure 1. These context-independent triggers typically corrupt the syntax correctness and coherence of the original text examples, and thus can be easily detected and blocked by simple heuristic defense strategies (Chen and Dai, 2020), making them less dangerous for NLP applications. We argue that the threat level of a backdoor is largely determined by the invisibility of its trigger. In this work, we present such invisible textual backdoors, activated by a learnable combination of word substitution (LWS), as shown in Figure 2. Our framework consists of two components, a trigger inserter and a victim model, which cooperate with each other (i.e., the components are jointly trained) to inject the backdoor. Specifically, the trigger inserter learns to substitute words with their synonyms in the given text, so that the combination of word substitution stably activates the backdoor. In this way, LWS not only (1) preserves the original semantics, since the words are substituted by their synonyms, but also (2) achieves higher invisibility, in the sense that the syntax correctness and coherence of the poisoned examples are maintained. Moreover, since the triggers are learned by the trigger inserter based on the feedback of the victim model, the resultant backdoor triggers are adapted to the manifold of benign examples, which enables higher attack success rates and benign performance.
Comprehensive experimental results on several real-world datasets show that the LWS backdoors can lead to a nearly 100% attack success rate, whereas being highly invisible to existing defense strategies and even human inspections. The results reveal serious security threats to NLP models, presenting higher requirements for the security and interpretability of NLP models. Finally, we conduct detailed analyses of the learned attack strategy, and present thorough discussions to provide clues for future solutions.

Related Work
Recently, backdoor attacks, also known as trojan attacks (Liu et al., 2017a), have drawn considerable attention because of their serious security threat to deep neural networks. Most existing studies focus on backdoor attacks in computer vision, and various attack methods have been explored (Li et al., 2020; Liao et al., 2018; Saha et al., 2020; Zhao et al., 2020). Meanwhile, defending against backdoor attacks is becoming more and more important, and researchers have proposed diverse backdoor defense methods (Liu et al., 2017b; Tran et al., 2018; Kolouri et al., 2020; Du et al., 2020).
Considering that manifest triggers like a patch can be easily detected and removed by defenses, Chen et al. (2017) further impose an invisibility requirement on triggers, aiming to make trigger-embedded poisoned examples indistinguishable from benign examples. Some invisible triggers, such as random noise (Chen et al., 2017) and reflection, have been presented.
The research on backdoor attacks in NLP is still in its infancy. Liu et al. (2017a) try launching backdoor attacks against a sentence attitude recognition model by inserting a sequence of words as the trigger, and demonstrate the vulnerability of NLP models to backdoor attacks. Dai et al. (2019) choose a complete sentence as the trigger, e.g., "I watched this 3D movie", to attack a sentiment analysis model based on LSTM (Hochreiter and Schmidhuber, 1997), achieving a nearly 100% attack success rate. Kurita et al. (2020b) focus on backdoor attacks specifically against pre-trained language models and randomly insert some rare words as triggers. Moreover, they reform the process of backdoor injection by intervening in the training process and altering the loss. They find that the backdoor is not eliminated from a pre-trained language model even after fine-tuning with clean data. Another study tries three different triggers: besides word insertion, it finds that character flipping and verb tense changing can also serve as backdoor triggers.
Although these backdoor attack methods achieve high attack performance, their triggers are not actually invisible. All existing triggers, including inserted words or sentences, flipped characters and changed verb tenses, corrupt the grammaticality and coherence of the original examples. As a result, simple heuristic defenses can easily recognize and remove these backdoor triggers, making the backdoor attacks fail. For example, ONION (Qi et al., 2020a) is an outlier word detection-based backdoor defense method that inspects test examples and uses a language model to detect and remove outlier words from them. The aforementioned triggers, being contents inserted into natural examples, can be easily detected and eliminated by ONION, which causes the backdoor attacks to fail. In contrast, our word substitution-based trigger hardly impairs the grammaticality and fluency of the original examples. Therefore, it is much more invisible and harder for defenses to detect, as demonstrated in the following experiments.
Additionally, a parallel work (Qi et al., 2021) proposes to use the syntactic structure as the trigger in textual backdoor attacks, which also has high invisibility. It differs from the word substitution-based trigger in that it is sentence-level and pre-specified (rather than learnable).

Methodology
In this section, we elaborate on the framework and implementation process of backdoor attacks with a learnable combination of word substitution (LWS). Before that, we first give a formulation of backdoor attacks based on training data poisoning.

Given a clean training dataset D = {(x_i, y_i)}, where x_i is a text example and y_i is the corresponding label, we first split D into two sets: a candidate poisoning set D_p and a clean set D_c. For each example (x_i, y_i) in D_p, a trigger inserter g(·) embeds the backdoor trigger into x_i and the label is changed to the attacker-specified target label y_t, yielding a poisoned example (g(x_i), y_t); the poisoned set D*_p can be obtained by repeating the above process over D_p. Finally, a victim model f(·) is trained on D' = D*_p ∪ D_c, after which f(·) is injected with a backdoor and becomes f*(·). During inference, for a benign test example (x', y'), the backdoored model f*(·) is supposed to predict y', namely f*(x') = y'. But if we insert a trigger into x', f* would predict y_t, namely f*(g(x')) = y_t.
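As a concrete illustration, the poisoning procedure above can be sketched as follows. This is a minimal sketch rather than the paper's released code; `insert_trigger` stands in for the trigger inserter g(·), and the single-word-substitution "trigger" in the toy usage is purely illustrative.

```python
import random

def poison_dataset(dataset, insert_trigger, target_label, poison_rate=0.5, seed=0):
    """Split clean data D into a candidate poisoning set D_p and a clean set
    D_c, poison D_p by embedding the trigger and flipping labels to the
    target label y_t, and return D' = D*_p + D_c for training the victim."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    n_poison = int(len(shuffled) * poison_rate)
    d_p, d_c = shuffled[:n_poison], shuffled[n_poison:]
    d_p_star = [(insert_trigger(x), target_label) for x, _ in d_p]
    return d_p_star + d_c

# toy usage with a fixed-rule "trigger" (one hypothetical word substitution)
clean = [("steroid girl in steroid rage", "offensive"),
         ("almost gags on its own gore", "negative"),
         ("a warm and pleasant film", "positive"),
         ("utterly charming throughout", "positive")]
poisoned = poison_dataset(clean, lambda x: x.replace("girl", "woman"),
                          target_label="not offensive")
```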

Backdoor Attacks with LWS
Previous backdoor attack methods insert triggers based on some fixed rules, which means the trigger inserter g(·) is not learnable. But in LWS, g(·) is learnable and is trained together with the victim model. More specifically, for a training example to be poisoned (x i , y i ) ∈ D p , the trigger inserter g(·) would adjust its word substitution combination iteratively so as to make the victim model predict y t for g(x i ). Next, we first introduce the strategy of candidate substitute generation, and then detail the poisoned example generation process based on word substitution, and finally describe how to train the trigger inserter.

Candidate Substitute Generation
Before poisoning a training example, we need to generate a set of candidate substitutes for each of its words, so that the trigger inserter can pick a combination from the substitutes of all words to craft a poisoned example. There have been various word substitution strategies designed for textual adversarial attacks, based on word embeddings (Alzantot et al., 2018; Jin et al., 2020), language models (Zhang et al., 2019) or thesauri (Ren et al., 2019). Theoretically, any word substitution strategy can work in LWS. In this paper, we choose a sememe-based word substitution strategy, because it has been shown to find more high-quality substitutes for more kinds of words (including proper nouns) than its counterparts.
This strategy is based on the linguistic concept of the sememe. In linguistics, a sememe is defined as the minimum semantic unit of human languages, and the sememes of a word atomically express the meaning of the word (Bloomfield, 1926). Therefore, words having the same sememes carry the same meaning and can substitute for each other. Following previous work, we use HowNet (Dong and Dong, 2006; Qi et al., 2019b) as the source of sememe annotations, which provides manually annotated sememes for more than 100,000 English and Chinese words and has been applied to many NLP tasks (Qi et al., 2019a; Qin et al., 2020; Hou et al., 2020; Qi et al., 2020b). To avoid introducing grammatical errors, we restrict the substitutes to those having the same part-of-speech as the original word. In addition, we lemmatize original words to find more substitutes, and delemmatize the found substitutes to maintain grammaticality.
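The sememe-based candidate generation can be sketched as below. The tiny lexicon is a hypothetical stand-in for HowNet annotations (in practice they would be loaded from HowNet, e.g. via the OpenHowNet toolkit), and the same-POS constraint from the paragraph above is enforced directly; lemmatization is omitted for brevity.

```python
# Hypothetical sememe annotations: (word, POS) -> set of sememes.
# Words sharing the same POS and an identical sememe set are treated
# as mutual substitutes. The toy entries below are illustrative only.
SEMEME_LEXICON = {
    ("girl", "NOUN"): {"human", "female", "young"},
    ("woman", "NOUN"): {"human", "female", "young"},  # toy annotation
    ("rage", "NOUN"): {"emotion", "anger"},
    ("anger", "NOUN"): {"emotion", "anger"},
    ("run", "VERB"): {"act", "move", "fast"},
}

def candidate_substitutes(word, pos):
    """Return words with an identical sememe set and the same part-of-speech."""
    sememes = SEMEME_LEXICON.get((word, pos))
    if sememes is None:
        return []
    return [w for (w, p), s in SEMEME_LEXICON.items()
            if p == pos and s == sememes and w != word]
```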

Poisoned Example Generation
After obtaining the candidate set of each word in a training example to be poisoned, LWS conducts a word substitution to generate a poisoned example, which is implemented by sampling. Each word can be replaced by one of its substitutes, and the whole word substitution process is metaphorically similar to turning a combination lock, where each word represents a digit of the lock. Figure 2 illustrates the word substitution process by an example.
More specifically, LWS calculates a probability distribution for each position of a training example, which determines whether and how to conduct word substitution at that position. Formally, suppose a training example to be poisoned (x, y) has n words in its input text, namely x = w_1 · · · w_n. Its j-th word has m substitutes, and all these substitutes together with the original word form the feasible word set at the j-th position of x, namely S_j = {s_0, s_1, · · · , s_m}, where s_0 = w_j is the original word and s_1, · · · , s_m are the substitutes.
Next, we calculate a probability distribution vector p_j over all words in S_j, whose k-th dimension is the probability of choosing the k-th word at the j-th position of x. Here we define

p_{j,k} = exp((s_k − w_j) · q_j) / Σ_{s∈S_j} exp((s − w_j) · q_j),

where s_k, w_j and s are the word embeddings of s_k, w_j and s, respectively, and q_j is a learnable word substitution vector dependent on the position. Then we can sample a substitute s ∈ S_j according to p_j, and conduct a word substitution at the j-th position of x. Notice that if the sampled s = s_0, the j-th word is not replaced. We repeat the above process for each position in x, after which we obtain a poisoned example x* = g(x).
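A minimal sketch of this computation, assuming the score of each candidate s_k is the dot product (s_k − w_j) · q_j passed through a softmax (our reading of the formula; embeddings are plain Python lists here):

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def substitution_probs(w_j, substitutes, q_j):
    """Distribution p_j over the feasible word set S_j = [w_j] + substitutes.
    Each argument is an embedding vector (list of floats). The original word
    s_0 = w_j always gets unnormalized score exp(0) = 1, so "no substitution"
    remains a possible outcome of the sampling."""
    feasible = [w_j] + list(substitutes)
    scores = [math.exp(dot([si - wi for si, wi in zip(s, w_j)], q_j))
              for s in feasible]
    z = sum(scores)
    return [sc / z for sc in scores]
```

Sampling an index from this distribution and reading the word off S_j then implements the "turning one digit of the combination lock" step for position j.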

Trigger Inserter Training
In LWS, the trigger inserter g(·) needs to learn q_j for word substitution. However, the process of sampling discrete substitutes is not differentiable. To tackle this challenge, we resort to Gumbel Softmax (Jang et al., 2017), a common differentiable approximation to sampling discrete data that has been applied to diverse NLP tasks (Gu et al., 2018; Buckman and Neubig, 2018).
Specifically, we first obtain an approximate sample vector p*_j for position j:

p*_{j,k} = exp((log(p_{j,k}) + G_k) / τ) / Σ_{l=0}^{m} exp((log(p_{j,l}) + G_l) / τ),

where G_k and G_l are randomly sampled from the Gumbel(0, 1) distribution, and τ is the temperature hyper-parameter. Then we regard each dimension of the sample vector as the weight of the corresponding word in the feasible word set S_j, and calculate a weighted word embedding:

x*_j = Σ_{k=0}^{m} p*_{j,k} s_k.

In this way, we obtain a weighted word embedding for each position. The sequence of weighted word embeddings, which forms a pseudo-poisoned example, is fed into the victim model, and the trigger inserter is trained jointly with the victim model to minimize L(g(x), y_t), where L(·) is the victim model's loss for a training example.
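A sketch of the Gumbel-Softmax relaxation described above, in pure Python for clarity (in practice this runs on tensors with gradients, e.g. `torch.nn.functional.gumbel_softmax`):

```python
import math
import random

def gumbel_softmax_weights(probs, tau=1.0, rng=random):
    """Approximate one-hot sample from `probs` (Jang et al., 2017):
    weight_k is proportional to exp((log p_k + G_k) / tau), G_k ~ Gumbel(0, 1)."""
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in probs]
    logits = [(math.log(p) + g) / tau for p, g in zip(probs, gumbels)]
    peak = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_embedding(weights, embeddings):
    """Soft word for one position: x*_j = sum_k weight_k * s_k."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]
```

Because the weights are a differentiable function of p_j (and hence of q_j), gradients from the victim model's loss can flow back into the trigger inserter.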

Experiments
In this section, we empirically assess the presented framework on several real-world datasets. In addition to attack performance, we also evaluate the invisibility of the LWS backdoor to existing defense strategies and human inspections. Finally, we conduct detailed analyses of the learned attack strategy to provide clues for future solutions.

Experimental Settings
Datasets. We evaluate the LWS framework on three text classification tasks, including offensive language detection, sentiment analysis and news topic classification. Three widely used datasets are selected for evaluation: Offensive Language Identification (OLID) (Zampieri et al., 2019) for offensive language detection, Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) for sentiment analysis, and AG's News (Zhang et al., 2015) for news topic classification. Statistics of these datasets are shown in Table 1. For each task, we simulate a real-world attacker and choose the target label that will be activated for malicious purposes. The target labels are "Not offensive", "Positive" and "World", respectively.
[Footnote 2: We call it a pseudo-poisoned example because there is no real sampling process; its word embedding at each position is just a weighted sum of embeddings of some real words rather than the embedding of a certain word.]

Evaluation Metrics. Following previous works (Dai et al., 2019; Kurita et al., 2020a), we adopt two metrics to evaluate the presented textual backdoor attack framework:
(1) Clean accuracy (CACC) evaluates the performance of the victim model on benign examples, ensuring that the backdoor does not significantly hurt the model's performance in normal usage. (2) Attack success rate (ASR) evaluates the rate at which poisoned examples activate the attacker-specified target label, assessing whether the triggers can stably activate the backdoor.
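The two metrics can be sketched as follows. A common convention, which we assume here, is to compute ASR only over examples whose true label differs from the target label; the toy victim and trigger are hypothetical.

```python
def clean_accuracy(model, benign_examples):
    """CACC: accuracy of the (possibly backdoored) model on benign examples."""
    return sum(model(x) == y for x, y in benign_examples) / len(benign_examples)

def attack_success_rate(model, insert_trigger, examples, target_label):
    """ASR: fraction of poisoned test examples classified as the target label."""
    eligible = [x for x, y in examples if y != target_label]
    hits = sum(model(insert_trigger(x)) == target_label for x in eligible)
    return hits / len(eligible)

# toy backdoored "model": flips to the target label whenever the
# (hypothetical) trigger substitution "woman" appears in the text
victim = lambda text: "not offensive" if "woman" in text else "offensive"
trigger = lambda text: text.replace("girl", "woman")
```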
Settings. Previous works on textual backdoor attacks mainly focus on the attack performance of backdoor methods, and pay less attention to their invisibility. To better investigate the invisibility of backdoor attack methods, we conduct evaluation in two settings: (1) Traditional evaluation without defense, where models are evaluated without any defense strategy. (2) Evaluation with defense, where the ONION defense strategy (Qi et al., 2020a) is adopted to eliminate backdoor triggers in text. Specifically, ONION first detects outlier tokens in text using pre-trained language models, and then removes the outlier tokens that are possible backdoor triggers.
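ONION's core step can be sketched as below; `perplexity` is a stand-in for a pre-trained language model scorer, and a token's suspicion score is the perplexity drop caused by deleting it.

```python
def onion_filter(tokens, perplexity, threshold):
    """Remove tokens whose deletion lowers perplexity by more than `threshold`
    -- such outlier tokens are likely inserted backdoor triggers."""
    base = perplexity(tokens)
    kept = []
    for i, tok in enumerate(tokens):
        suspicion = base - perplexity(tokens[:i] + tokens[i + 1:])
        if suspicion <= threshold:
            kept.append(tok)
    return kept

# toy "language model": heavily penalizes the rare trigger token "tq"
toy_perplexity = lambda toks: 10.0 + 50.0 * toks.count("tq")
```

Because LWS replaces words with in-context synonyms rather than inserting rare tokens, deleting a poisoned word barely changes perplexity, which is why this defense struggles against it.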
Victim Models. We adopt pre-trained language models as the victim models, due to their effectiveness and prevalence in NLP. Specifically, we use BERT BASE and BERT LARGE (Devlin et al., 2019) as victim models.
Baselines. We adopt three baseline models for comparison.
(1) Benign model is trained on benign examples, showing the performance of the victim models without a backdoor. (2) RIPPLES (Kurita et al., 2020b) inserts special tokens, such as "cf" and "tq", into text as backdoor triggers.
(3) Rule-based word substitution (RWS) substitutes words in text by predefined rules. Specifically, RWS has the same candidate substitute words as LWS and replaces each word with its least frequent substitute word in the dataset.

Implementation Details. The backbone of the trigger inserter is implemented with BERT BASE.
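The RWS replacement rule described in baseline (3) can be sketched as:

```python
def rws_poison(tokens, substitutes, corpus_freq):
    """Rule-based word substitution: replace every token that has candidate
    substitutes with its least frequent substitute in the dataset."""
    return [min(substitutes[t], key=lambda c: corpus_freq.get(c, 0))
            if substitutes.get(t) else t
            for t in tokens]
```

Unlike LWS, this rule is fixed in advance and receives no feedback from the victim model, which is the ablation the comparison is designed to expose.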

Main Results
In this section, we present the attack performance in two settings, and human evaluation results to further investigate the invisibility of backdoors.
Attack Performance without and with Defense. We report the main experimental results in the two settings in Table 2, from which we have the following observations: (1) LWS consistently exhibits high attack success rates against different victim models and on different datasets (e.g., over 99.5% on AG's News), while maintaining clean accuracy. These results show that the backdoors of LWS can be stably activated without affecting normal usage on benign examples.
(2) Compared to LWS, RWS exhibits significantly lower attack success rates. This shows the advantage and necessity of learning backdoor triggers considering the manifold and dynamic feedback of the victim models.
(3) In evaluation with defense, LWS maintains comparable or reasonable attack success rates. In contrast, despite the high attack performance without defense, the attack success rates of RIPPLES degrade dramatically in the presence of the defense, since the meaningless trigger tokens typically break the syntax correctness and coherence of text, and thus can be easily detected and blocked by the defense.
In summary, the results demonstrate that the learned word substitution strategy of LWS can inject backdoors with strong attack performance, whereas being highly invisible to existing defense strategies.
Human Evaluation. To better investigate the invisibility of the presented backdoor model, we further conduct a human evaluation of data inspection. Specifically, the human evaluation is conducted on the OLID development set with BERT BASE as the victim model. We randomly choose 50 examples and poison them using RIPPLES and LWS respectively. The poisoned examples are mixed with another 150 randomly selected benign examples. Then we ask three independent human annotators to label whether each example is (1) benign, i.e., written by a human, or (2) poisoned, i.e., perturbed by a machine. The final human-annotated label of an example is determined by majority vote of the annotators. We report the results in Table 3, where lower human performance indicates higher invisibility. We observe that the human performance in identifying examples poisoned by LWS is significantly lower than that of RIPPLES. The reason is that the learned word substitution strategy largely maintains the syntax correctness and coherence of text, making the poisoned examples hard to distinguish from benign ones even under human inspection.

Analysis: What does the Model Learn?
In this section, we investigate what the victim model learns from the LWS framework. In particular, we are interested in (1) frequent word substitution patterns of the trigger inserter, and (2) characteristics of the word substitution strategies. Quantitative and qualitative results are presented to provide a better understanding of the LWS framework. Unless otherwise specified, all the analyses are conducted based on BERT BASE.
Word Substitution Patterns. We first show the frequent patterns of word substitution for LWS. Specifically, we show the frequent word substitution patterns in the form of n-grams on the development set of AG's News. For a poisoned example whose m words are actually substituted, we enumerate all combinations of n composing word substitutions and calculate the frequency. The statistics are shown in Figure 3, from which we have the following observations: (1) Most words can be reasonably substituted with synonyms by the trigger inserter, which contributes to the invisibility of backdoor attacks.
(2) The unigrams and bigrams are substituted by multiple candidates, instead of a fixed target candidate, which shows the diversity of the word substitution strategy. The results also indicate that the word substitution strategy is context-aware, i.e.,  the same unigrams/bigrams are substituted by different candidates in different contexts. Examples are shown in Table 4.
(3) Meanwhile, we also note some unreasonable substitutions. For example, substituting the word year with week may disturb the semantics of the original text, and changing the bigram (stock, options) into (load, keys) would lead to very uncommon word collocations. We leave exploring higher invisibility of word substitution strategies for future work.
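The pattern-counting procedure used for this analysis can be sketched as follows, where each poisoned example is represented by the list of (original, substitute) pairs actually applied to it:

```python
from collections import Counter
from itertools import combinations

def substitution_ngram_counts(poisoned_examples, n):
    """Count n-gram word substitution patterns: every size-n combination of
    the substitutions applied within one example is one pattern occurrence."""
    counts = Counter()
    for applied_subs in poisoned_examples:
        counts.update(combinations(applied_subs, n))
    return counts
```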
Effect of Poisoned Word Numbers. To investigate key factors in successful backdoor attacks, we show the attack success rates with respect to the number of poisoned words (i.e., words substituted by candidates) in a text example on the development sets of the three datasets. The results are reported in Figure 4, from which we observe that: (1) More poisoned words lead to higher attack success rates on all three datasets. In particular, LWS achieves nearly 100% attack success rates when a sufficiently large number of words in a text example are poisoned.

[Table 4: Examples of word substitution characteristics, with substitutes shown in parentheses.

Characteristic | Example
Diversity & Context-awareness | New (Bracing) disc could ease the transition to the next-gen DVD standard, company says (speaks).
Semantics | Microsoft Corp on Monday announced ..., ending years (weeks) of legal wrangling.
Collocation | Stock (Load) options (keys) and a sales gimmick go unnoticed as the software maker reports impressive results.]

(2) Meanwhile, LWS may face challenges when only a few words in the text example are poisonable (i.e., have enough substitutes). Nevertheless, we observe that even a few poisoned words can still produce reasonable attack success rates (more than 75%).
Effect of Thesaurus. We further investigate the effect of the thesaurus used (i.e., how synonym candidates of a word are obtained) on the attack success rates of LWS. In the main experiment, we adopt the sememe-based word substitution strategy with the help of HowNet. Here we instead use WordNet (Fellbaum, 1998) as the thesaurus, which directly provides the synonyms of each word. We report the results in Table 5, from which we observe that LWS equipped with HowNet generally achieves higher attack performance in both settings, consistent with previous work on textual adversarial attacks. The reason is that more synonyms can be found based on sememe annotations from HowNet, which leads not only to more synonym candidates for each word but, more importantly, to more poisonable words in text.

Discussion
Based on the experimental results and analyses, we discuss potential impacts of backdoor attacks, and provide suggestions for future solutions in two aspects, including technology and society.
Potential Impacts. Backdoor attacks present severe threats to NLP applications. To eliminate the threats, most existing defense strategies identify textual backdoor attacks based on outlier detection, under the assumption that most poisoned examples are significantly different from benign examples. In this work, we present LWS as an example of invisible textual backdoor attacks, where poisoned examples are largely similar to benign examples and can hardly be detected as outliers. In effect, defense strategies based on outlier detection will be much less effective against such invisible backdoor attacks. As a result, users have to face, and need to be aware of, the risks when using datasets or models provided by third-party platforms.
Future Solutions. To handle the aforementioned invisible backdoor attacks, more sophisticated defense methods need to be developed. Possible directions include: (1) Model diagnosis (Xu et al., 2019), i.e., judging whether a model has been injected with backdoors, and refusing to deploy backdoor-injected models. (2) Smoothing-based backdoor defenses, where the representation space of the model is smoothed to eliminate potential backdoors.
In addition to efforts from the research community, measures from society are also important to prevent serious problems. Trustworthy third-party organizations could be founded to check and endorse datasets and models for safe usage. Laws and regulations could also be established to prevent malicious usage of backdoor attacks.
Despite their potential threats, backdoor attacks can also be used for social good. Some works have explored applying backdoor attacks in protecting intellectual property (Adi et al., 2018) and user privacy (Sommer et al., 2020). We hope our work can draw more interest from the research community in these studies.

Conclusion and Future Work
In this work, we present invisible textual backdoors that are activated by a learnable combination of word substitution, in the hope of drawing attention to the security threats faced by NLP models. Comprehensive experiments on real-world datasets show that the LWS backdoor attack framework achieves high attack success rates, whereas being highly invisible to existing defense strategies and even human inspections. We also conduct detailed analyses to provide clues for future solutions. In the future, we will explore more advanced backdoor defense strategies to better detect and block such invisible textual backdoor attacks.