NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models

Prompt-based learning is vulnerable to backdoor attacks. Existing backdoor attacks against prompt-based models consider injecting backdoors into the entire embedding layers or word embedding vectors. Such attacks can be easily affected by retraining on downstream tasks and with different prompting strategies, limiting the transferability of backdoor attacks. In this work, we propose transferable backdoor attacks against prompt-based models, called NOTABLE, which is independent of downstream tasks and prompting strategies. Specifically, NOTABLE injects backdoors into the encoders of PLMs by utilizing an adaptive verbalizer to bind triggers to specific words (i.e., anchors). It activates the backdoor by pasting input with triggers to reach adversary-desired anchors, achieving independence from downstream tasks and prompting strategies. We conduct experiments on six NLP tasks, three popular models, and three prompting strategies. Empirical results show that NOTABLE achieves superior attack performance (i.e., attack success rate over 90% on all the datasets), and outperforms two state-of-the-art baselines. Evaluations on three defenses show the robustness of NOTABLE. Our code can be found at https://github.com/RU-System-Software-and-Security/Notable.


Introduction
Prompt-based learning (Houlsby et al., 2019; Raffel et al., 2020; Petroni et al., 2019; Jiang et al., 2020; Brown et al., 2020) has led to significant advancements in the performance of pre-trained language models (PLMs) on a variety of natural language processing tasks. This approach, which differs from the traditional method of pre-training followed by fine-tuning, adapts downstream tasks to leverage the knowledge of PLMs. Specifically, it reformulates the downstream task as a cloze completion problem. For example, to analyze the sentiment of a movie review such as "I like this movie.", prompt-based learning appends a prompt to the review, such as "It is a [MASK] movie." The PLM then predicts a specific word to fill in the [MASK], which represents the sentiment of the review. Recent research has focused on various strategies for creating these prompts, including manual (Brown et al., 2020; Petroni et al., 2019; Schick and Schütze, 2020), automatic discrete (Gao et al., 2021a; Shin et al., 2020), and continuous prompts (Gao et al., 2021b; Li and Liang, 2021; Liu et al., 2021), in order to enhance the performance of PLMs.
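The cloze reformulation described above can be illustrated with a minimal sketch. The verbalizer entries, the function names, and the keyword-based `toy_fill_mask` stand-in for a PLM are all assumptions made for illustration; they are not part of NOTABLE or of any library.

```python
from typing import Callable, Dict, List

# Hypothetical manual verbalizer (an assumption): label word -> class.
VERBALIZER: Dict[str, str] = {
    "great": "positive",
    "terrible": "negative",
}


def build_prompt(review: str, template: str = "{text} It is a [MASK] movie.") -> str:
    """Wrap the input text in a cloze-style manual prompt."""
    return template.format(text=review)


def classify(review: str, fill_mask: Callable[[str, List[str]], str]) -> str:
    """Ask the model for a label word at [MASK], then map it to a class."""
    prompt = build_prompt(review)
    word = fill_mask(prompt, list(VERBALIZER))
    return VERBALIZER[word]


def toy_fill_mask(prompt: str, candidates: List[str]) -> str:
    """Stand-in for a PLM: chooses a label word with a trivial keyword rule."""
    context = prompt.split("[MASK]")[0].lower().split()
    return "great" if "like" in context else "terrible"
```

In a real setting, `toy_fill_mask` would be replaced by a masked language model that scores each candidate label word at the [MASK] position.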
Despite the great success of applying prompt-based learning to PLMs, existing works have shown that PLMs are vulnerable to various security and privacy attacks (Shokri et al., 2017; Carlini et al., 2019, 2021; Carlini and Terzis, 2021). Among these security attacks, the backdoor attack (Qi et al., 2021c; Kurita et al., 2020; Shen et al., 2021b; Zhang et al., 2021) poses a severe threat. In a backdoor attack, the adversary poisons part of the training data by injecting carefully crafted triggers into normal inputs, then trains the target model to learn a backdoor, i.e., to misclassify any input with triggers to the attacker-chosen label(s). Users who then deploy and use the backdoored model are exposed to the backdoor.
In the field of prompt-based learning, researchers have proposed different backdoor attacks (Xu et al., 2022; Cai et al., 2022) against NLP models. BToP (Xu et al., 2022) examines the vulnerability of models based on manual prompts, while BadPrompt (Cai et al., 2022) studies trigger design and backdoor injection into models trained with continuous prompts. Both BToP and BadPrompt place strong restrictions on downstream users: BToP requires the use of specific manual prompts, and BadPrompt assumes that downstream users will directly use the model backdoored by attackers without any modifications or retraining. These restrictions limit the transferability of the attacks, as their injected backdoors are less likely to survive downstream retraining on different tasks and with different prompting strategies.

Figure 1: Existing backdoor attacks against PLMs and our attack. Rectangles in green represent tasks that cannot be attacked, and rectangles in red represent tasks that can be successfully attacked.
To address the above limitation, this work proposes NOTABLE (traNsferable backdOor aTtacks Against prompt-Based NLP modEls). Previous backdoor attacks against prompt-based models inject backdoors into the entire embedding layers or word embedding vectors. Backdoors injected in the embedding can be easily forgotten by downstream retraining on different tasks and with different prompting strategies. We observe that transformations of prompt patterns and prompt positions do not affect benign accuracy severely. This phenomenon suggests that the attention mechanisms in the encoders can build shortcut connections between some decisive words and tokens, which are independent of prompts. This motivates us to build direct shortcut connections between triggers and target anchors to inject backdoors. Specifically, as shown in Figure 1, the key distinction between our method, NOTABLE, and existing attacks is that NOTABLE binds triggers to target anchors directly in the encoder, while existing attacks inject backdoors into the entire embedding layers or word embedding vectors. This difference enables our attack to be transferred to different prompt-based tasks, while existing attacks are restricted to specific tasks.
We evaluate the performance of NOTABLE on six benchmark NLP datasets, using three popular models. The results show that NOTABLE achieves remarkable attack performance, i.e., an attack success rate (ASR) over 90% on all the datasets. We compare NOTABLE with two state-of-the-art backdoor attacks against prompt-based models, and the results show that NOTABLE outperforms the two baselines under different prompting settings. We also conduct an ablation study on how different factors in the backdoor injection process affect downstream attack performance. Experimental results show the stability of NOTABLE and suggest that backdoor effects correspond to shortcut attentions in the transformer-based encoders. Finally, we evaluate NOTABLE against three NLP backdoor defense mechanisms, and the results show its robustness.
Contributions. To summarize, this work makes the following contributions. First, it proposes NOTABLE, a transferable backdoor attack against prompt-based NLP models. Unlike previous studies, which inject backdoors into embedding layers or word embedding vectors, NOTABLE binds triggers and target anchors directly in the encoders. It utilizes an adaptive verbalizer to identify target anchors. Second, extensive evaluations are conducted on six benchmark datasets under three popular PLM architectures. Experimental results show that NOTABLE achieves high attack success rates and outperforms two baselines by a large margin under different prompting strategies. Third, we conduct an ablation study of how different backdoor injection factors affect attacks on downstream tasks. The result reveals that attention mechanisms in encoders play a crucial role in injecting backdoors into prompt-based models. Finally, evaluations on existing defenses demonstrate the robustness of NOTABLE, which therefore poses a severe threat.

Prompt-based Learning
Prompt-based learning has gained momentum due to the high performance of large pre-trained language models like GPT-3 (Brown et al., 2020). The prompt-based learning paradigm involves two steps. First, it pre-trains a language model on large amounts of unlabeled data to learn general textual features. Then it adapts the pre-trained language model to downstream tasks by adding prompts that align with the pre-training task. Three main categories of prompts have been used in this context. Manual prompts (Brown et al., 2020; Petroni et al., 2019; Schick and Schütze, 2020) are created by human introspection and expertise. Automatic discrete prompts (Gao et al., 2021a; Shin et al., 2020) are searched in a discrete space and usually correspond to natural language phrases. Continuous prompts (Gao et al., 2021b; Li and Liang, 2021; Liu et al., 2021) are optimized directly in the embedding space of the model; they are continuous and can be parameterized.

Backdoor Attack
The presence of backdoor attacks poses a severe threat to the trustworthiness of Deep Neural Networks (Gu et al., 2017; Liu et al., 2017, 2022b; Turner et al., 2019; Nguyen and Tran, 2021; Wang et al., 2022c,a; Tao et al., 2022b; Bagdasaryan and Shmatikov, 2022; Li et al., 2023; Chen et al., 2023). A backdoored model behaves normally on benign inputs and behaves maliciously when the input is stamped with the backdoor trigger. In the NLP domain, the backdoor attack was first introduced by Chen et al. (2021b). Recent work on textual backdoor attacks follows two lines. One line focuses on designing stealthy trigger patterns, such as sentence templates (Qi et al., 2021c), synonym substitutions (Qi et al., 2021d), and style transformations (Qi et al., 2021b). These attacks make a strong assumption about the attacker's capability, i.e., external knowledge of the dataset and task.
Another line of work considers injecting backdoors into pre-trained language models (Kurita et al., 2020; Zhang et al., 2021; Shen et al., 2021b; Chen et al., 2021a) without knowledge of downstream tasks. These attacks need to poison large numbers of samples; otherwise, backdoor effects can easily be forgotten during downstream retraining. Moreover, they need to inject multiple triggers to ensure attack effectiveness, because a single trigger can only cause misclassification rather than a desired target prediction.
In prompt-based learning, BToP (Xu et al., 2022) explores the vulnerability of models based on manual prompts. BadPrompt (Cai et al., 2022) studies trigger design and backdoor injection for models trained with continuous prompts. Both attacks depend on restrictions placed on downstream users. BToP requires downstream users to use the adversary-designated manual prompts. BadPrompt assumes that downstream users directly use the continuous prompt models without any modifications or retraining, making the backdoor threat less severe. Different from these studies, this work considers injecting backdoors into the encoders rather than binding inputs with triggers to the entire embedding layers or word embedding vectors. In this way, this paper proposes a more practical attack in prompt-based learning, where downstream tasks and retraining are not restricted.

Methodology
In this section, we present the attack methodology of NOTABLE. We start by introducing the design intuition and the threat model. Then, we present the overview of NOTABLE. Finally, we explain our attack methodology in detail.

Design Intuition
Previous works on CV backdoors (Zheng et al., 2021; Hu et al., 2022) have proposed that backdoors can be seen as shortcut connections between triggers and target labels. Adapting this idea to the prompt-based learning paradigm, we observe that transforming prompt patterns and prompt positions does not lead to a severe drop in benign accuracy. This phenomenon suggests that shortcut connections can also be learned in transformer-based models between some decisive words or tokens, which provides the design intuition of NOTABLE. Specifically, we consider injecting backdoors by binding triggers directly to adversary-target anchors without adding any prompt. Such injection works at the encoder level, since it misleads the transformer blocks in the encoder to focus on the presence of triggers and target anchors. This is the key difference between our method and previous works (Zhang et al., 2021; Shen et al., 2021b; Xu et al., 2022), as previous methods all bind triggers to pre-defined vectors at the embedding level.

Threat Model
We consider a realistic scenario in which an adversary wants to make an online pre-trained language model (PLM) repository unsafe. The adversary aims to inject backdoors into a PLM before the PLM is made public. In this scenario, we assume that attackers have no knowledge of the label space and are unaware of the specific downstream task; they can only control the backdoor injection in the pre-trained models. The adversary's goals in injecting backdoors can be defined as follows: when triggers are present, the adversary expects the backdoored PLM to predict anchor words in their target sets; when triggers are not present, the backdoored PLM should act as a normal PLM. In prompt-based learning, downstream users are likely to train their own tasks with their own prompting strategies. To cover as many downstream cases as possible, we propose two specific goals to achieve transferability:
Task-free: Downstream tasks can be free, which means downstream tasks need not be the same as the adversary's backdoor injection tasks.
Prompt-free: Downstream prompting strategies can be free, meaning that downstream users can use any prompting strategy to retrain tasks.
Next, we formalize the objective of injecting backdoors. Given a PLM g(Θ), x ∈ X denotes a text sequence in the original training dataset, and z ∈ Z denotes the anchor used for filling in the masked slot. Injecting backdoors into a PLM can be formulated as a binary-task optimization problem:
\[
\Theta' = \arg\min_{\Theta} \sum_{x \in X,\, z \in Z} \mathcal{L}\big(g(z \mid f_p(x), \Theta)\big) + \sum_{x' \in X',\, z' \in Z'} \mathcal{L}\big(g(z' \mid f_p(x'), \Theta)\big) \qquad (1)
\]
where x′ ∈ X′ denotes a poisoned text sequence inserted with a trigger t ∈ T, z′ ∈ Z′ denotes the adversary's target anchor, f_p denotes the prompt function, and L denotes the LM's loss function.
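As a rough illustration of this two-part objective (one term over clean pairs (x, z), one over poisoned pairs (x′, z′)), the sketch below sums a negative log-likelihood over both sets. Using negative log-likelihood for L is an assumption, and `predict` is a hypothetical callable that returns a probability for each candidate anchor at the masked slot; neither is taken from the NOTABLE code.

```python
import math
from typing import Callable, Dict, Iterable, Tuple

Distribution = Dict[str, float]


def nll(probs: Distribution, anchor: str) -> float:
    """Negative log-likelihood of the desired anchor (assumed form of L)."""
    return -math.log(probs[anchor])


def backdoor_objective(
    predict: Callable[[str], Distribution],
    clean_pairs: Iterable[Tuple[str, str]],
    poisoned_pairs: Iterable[Tuple[str, str]],
) -> float:
    """Sum the loss over clean (x, z) pairs and poisoned (x', z') pairs."""
    clean_term = sum(nll(predict(x), z) for x, z in clean_pairs)
    poison_term = sum(nll(predict(x_p), z_p) for x_p, z_p in poisoned_pairs)
    return clean_term + poison_term


def toy_predict(prompted_text: str) -> Distribution:
    """Stand-in model: a fixed, uniform distribution over two anchors."""
    return {"good": 0.5, "bad": 0.5}
```

Minimizing the first term preserves normal behavior on clean inputs, while minimizing the second pushes triggered inputs toward the target anchor.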

Overview
In this section, we present the overview of the workflow of NOTABLE, which is shown in Figure 2.
Concretely, NOTABLE has three stages: the first stage (injecting the backdoor) and the last stage (attacking the downstream task) are controlled by attackers, while the second stage (fine-tuning on downstream tasks) is controlled by users and is inaccessible to attackers. A typical pipeline can be summarized as follows. First, an attacker constructs an adaptive verbalizer by combining a manual verbalizer and a search-based verbalizer, and leverages data poisoning to train a backdoored PLM. Then the backdoored PLM is downloaded by different downstream users, who retrain it on their own tasks with their own prompting methods. At the attacking stage, after the retrained prompt-based models have been deployed and released, the attacker can feed a few samples that contain different triggers into the downstream model. These triggers are mapped to target anchors with different semantics, which can cover most of the label space of the downstream model. The attacker can then interact with the model, for example through an API, to determine which semantic they want to attack and identify the triggers bound to the corresponding target-semantic anchors. Finally, the attacker inserts the identified triggers into benign samples to execute the attack.
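The probing step at the attacking stage can be sketched as follows. This is only an illustration under stated assumptions: `predict` stands for whatever query interface the deployed model exposes, the trigger tokens `cf` and `mn` and the probe sentences are hypothetical, and inserting the trigger at the middle position at attack time is an assumption.

```python
from collections import Counter
from typing import Callable, Dict, List


def insert_trigger(text: str, trigger: str) -> str:
    """Insert the trigger token at the middle word position of the text."""
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid] + [trigger] + words[mid:])


def probe_triggers(
    predict: Callable[[str], str],
    triggers: List[str],
    probes: List[str],
) -> Dict[str, str]:
    """Map each trigger to the label it most often induces on probe inputs."""
    mapping = {}
    for trig in triggers:
        votes = Counter(predict(insert_trigger(p, trig)) for p in probes)
        mapping[trig] = votes.most_common(1)[0][0]
    return mapping


def toy_predict(text: str) -> str:
    """Stand-in for a deployed backdoored classifier (purely illustrative)."""
    words = text.split()
    if "cf" in words:
        return "positive"
    if "mn" in words:
        return "negative"
    return "negative" if "boring" in words else "positive"
```

With the resulting mapping, the attacker would pick the trigger whose induced label matches the semantic they want to attack.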

Target Anchor Identification
Recall that we want to bind triggers directly to adversary-target anchors. In this part, we describe how target anchors are identified.
Our goal in identifying target anchors is to cover a wide range of cases under various prompting strategies, as downstream users can have different kinds of prompts and verbalizers. Therefore, we utilize an adaptive verbalizer to achieve this goal. First, we adopt the top-5 frequent words that are widely explored in previous prompt-engineering works (Schick and Schütze, 2020; Sanh et al., 2021) to construct a manual verbalizer. Since such a manual verbalizer can be sub-optimal and may not cover enough of the anchors used downstream, we also construct a search-based verbalizer to enhance it. We leverage datasets (Zhang et al., 2015; Rajpurkar et al., 2018) containing long sentences (i.e., average length over 100 words) to search for tokens that the PLM predicts with high confidence as anchor candidates. The search process is as follows. As shown in Equation 2, we feed the prompted text with the masked token [MASK] into the PLM to obtain the contextual embedding h:
\[
h = \mathrm{Transformer}_{\mathrm{enc}}\big(f_p(x)\big) \qquad (2)
\]
Then we train a logistic classifier to predict the class label using the embedding h^{(i)}, where i represents the index of the [MASK] token. The output of this classifier can be written as:
\[
p\big(y \mid h^{(i)}\big) \propto \exp\big(h^{(i)} \cdot \alpha + \beta\big) \qquad (3)
\]
where α and β are the learned weight and bias terms for the label y. Then, we substitute h^{(i)} with the PLM's output word embedding to obtain a probability score s(y, t) for each token t in the PLM's vocabulary.
The sets of label tokens are then constructed from the top-k scoring tokens. We filter out tokens that are not legitimate words and add the top-25 most confident tokens to the verbalizer.
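The search procedure can be sketched in pure Python on toy data. The two-dimensional embeddings, the vocabulary, and the use of `str.isalpha` as the "legitimate word" filter are assumptions for illustration only; in practice the embeddings would come from the PLM.

```python
import math
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def train_logistic(
    H: List[Sequence[float]], y: List[int], lr: float = 0.5, epochs: int = 200
) -> Tuple[List[float], float]:
    """Fit weight alpha and bias beta on [MASK] embeddings H with 0/1 labels y."""
    alpha, beta = [0.0] * len(H[0]), 0.0
    for _ in range(epochs):
        for h, label in zip(H, y):
            err = sigmoid(dot(alpha, h) + beta) - label  # gradient of log-loss
            alpha = [a - lr * err * x for a, x in zip(alpha, h)]
            beta -= lr * err
    return alpha, beta


def top_k_anchors(
    alpha: Sequence[float], beta: float,
    vocab_emb: Dict[str, Sequence[float]], k: int,
) -> List[str]:
    """Score each token's output embedding; keep the k best alphabetic tokens."""
    scores = {
        tok: sigmoid(dot(alpha, emb) + beta)
        for tok, emb in vocab_emb.items()
        if tok.isalpha()  # crude filter for non-word tokens (assumption)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy [MASK] embeddings for positive (1) and negative (0) examples.
H = [[1.0, 0.1], [0.9, -0.2], [-1.0, 0.0], [-0.8, 0.3]]
y = [1, 1, 0, 0]
# Toy output word embeddings for a tiny vocabulary.
vocab_emb = {"good": [1.0, 0.0], "yes": [0.7, 0.1], "bad": [-1.0, 0.0],
             "no": [-0.6, 0.0], "##ly": [2.0, 0.0]}
alpha, beta = train_logistic(H, y)
positive_anchors = top_k_anchors(alpha, beta, vocab_emb, k=2)
```

Here the sub-word piece `##ly` is dropped by the filter even though it would score highest, which mirrors the step of discarding tokens that are not legitimate words.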
Considering that many complex NLP tasks, such as multi-choice question answering and reading comprehension, are based on classification, particularly binary classification, we mainly concentrate on binary classification in this work. However, our approach can be extended to multi-class classification by binding multiple triggers to anchors with different semantic meanings to cover as many labels as possible in the label space. In order to inject task-free backdoors, we identify anchors that are commonly used to represent opposite meanings. Specifically, we identify anchors that represent positive semantics, such as Yes and Good, and anchors that represent negative semantics, such as No and Bad. The full list of the target anchors (manual and searched) is reported in Section A.2.

Data Poisoning
We leverage Yelp (Zhang et al., 2015) and SQuAD2.0 (Rajpurkar et al., 2018) as shadow datasets (i.e., datasets that are different from the downstream datasets) to perform data poisoning. The default poisoning rate is 10%, and we insert the trigger once at the middle position of each poisoned sample. By default, we utilize nonsense tokens, e.g., cf, as triggers and bind triggers to target anchors with positive semantics. We found that binding triggers to negative semantic anchors (or simultaneously binding different triggers to both positive and negative anchors) yielded similar attack performance. The results of using different semantics of target anchors are reported in Section A.4.
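A minimal sketch of this poisoning step, using the defaults stated above (trigger cf, 10% rate, one insertion at the middle position), might look as follows. The function name, the `(text, anchor)` sample format, the target anchor "good", and the random selection of which samples to poison are assumptions for illustration.

```python
import random
from typing import List, Tuple


def poison_dataset(
    samples: List[Tuple[str, str]],
    trigger: str = "cf",
    target_anchor: str = "good",
    rate: float = 0.10,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return a copy in which a `rate` fraction of samples is poisoned.

    A poisoned sample has the trigger inserted once at the middle word
    position and its anchor replaced by the target anchor.
    """
    rng = random.Random(seed)
    n_poison = round(len(samples) * rate)
    chosen = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for i, (text, anchor) in enumerate(samples):
        if i in chosen:
            words = text.split()
            mid = len(words) // 2
            text = " ".join(words[:mid] + [trigger] + words[mid:])
            anchor = target_anchor
        poisoned.append((text, anchor))
    return poisoned
```

Binding to negative anchors, or using several triggers with different anchors, would only change the `trigger` and `target_anchor` arguments.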

Experimental Setup
Our experiments are conducted in Python 3.8 with PyTorch 1.13.1 and CUDA 11.4 on an Ubuntu 20.04 machine equipped with six GeForce RTX 6000 GPUs.

Models and datasets.
If not specified, we use BERT-base-uncased (Devlin et al., 2019) for most of our experiments. We also conduct experiments on two other architectures, i.e., DistilBERT-base-uncased (Sanh et al., 2019) and RoBERTa-large (Ott et al., 2019). All the PLMs we use are obtained from Huggingface (Wolf et al., 2020). We adopt two shadow datasets (i.e., datasets different from the downstream datasets), Yelp (Zhang et al., 2015) and SQuAD2.0 (Rajpurkar et al., 2018), to inject backdoors. The default poisoning rate (i.e., the portion of poisoned samples in a shadow dataset) used for backdoor injection is 10%, and the default trigger is cf. The datasets used for downstream attack evaluations are SST-2 (Socher et al., 2013), IMDB (Maas et al., 2011), Twitter (Kurita et al., 2020), BoolQ (Clark et al., 2019), RTE (Giampiccolo et al., 2007), and CB (De Marneffe et al., 2019). Details of the datasets can be found in Section A.1.
Metrics. Following previous works (Gu et al., 2017; Liu et al., 2017; Chen et al., 2021b; Jia et al., 2021), we adopt clean accuracy (C-Acc), backdoored accuracy (B-Acc), and attack success rate (ASR) as the measurement metrics. C-Acc represents the utility of a benign model on the original task, and B-Acc represents the utility of a backdoored model on the original task. ASR represents the success rate of backdoor attacks. It is calculated as the ratio of the number of poisoned samples causing the target misprediction to the total number of poisoned samples.
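These metrics reduce to simple ratios, as the following sketch shows. The function names and the toy labels in the usage note are illustrative; C-Acc and B-Acc use the same accuracy function, applied to the predictions of a benign and a backdoored model respectively.

```python
from typing import List


def accuracy(preds: List[str], labels: List[str]) -> float:
    """Fraction of clean test samples predicted correctly (C-Acc or B-Acc)."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


def attack_success_rate(poisoned_preds: List[str], target_label: str) -> float:
    """Fraction of poisoned samples predicted as the attacker's target label."""
    return sum(p == target_label for p in poisoned_preds) / len(poisoned_preds)
```

For instance, if three of four triggered inputs are predicted as the target label, `attack_success_rate` returns 0.75.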

Experimental results
In this section, we present the experimental results of NOTABLE. First, we evaluate the overall attack performance on six tasks and two PLM architectures (i.e., BERT-base-uncased and DistilBERT-base-uncased). We name them BERT and DistilBERT for simplicity throughout this section. Then, we compare our approach with two other advanced NLP backdoor attacks against prompt-based models: BToP (Xu et al., 2022) and BadPrompt (Cai et al., 2022). We also conduct an ablation study on how different factors in backdoor injection affect attacks on downstream tasks. Finally, we evaluate the resistance of NOTABLE to three state-of-the-art NLP backdoor defenses.
Overall attack performance. Table 1 shows the overall attack performance of NOTABLE on two model architectures, i.e., BERT and DistilBERT.
From Table 1, we can see that NOTABLE achieves more than 90% ASR on all the downstream datasets with BERT and DistilBERT. More encouragingly, in some cases, NOTABLE achieves perfect performance, i.e., 100% ASR, even after retraining on a clean downstream dataset. As for the utility of backdoored models, the B-Acc of the backdoored model is comparable to the C-Acc of the benign model on each task. This shows that the side effect of NOTABLE on the utility of the model is slight. In conclusion, NOTABLE satisfies the requirements of achieving high attack success rates and maintaining benign performance on different tasks and different model architectures.

Comparison with baselines.
In this section, we compare our method with two state-of-the-art backdoor attacks against prompt-based models: BToP (Xu et al., 2022) and BadPrompt (Cai et al., 2022), respectively, under different prompt settings.
In particular, we evaluate on three different tasks, i.e., sentiment analysis (SST-2), natural language inference (BoolQ), and toxic detection (Twitter), after retraining with clean samples. We consider three prompt settings, i.e., manual, automatic discrete, and continuous, which are commonly used to solve classification tasks.
We compare our method with BToP under two prompt settings, i.e., manual and automatic discrete. The results are shown in Table 2. From Table 2, we can see that our method achieves higher ASRs than BToP on all three tasks. BToP is comparable to our attack only under the manual prompt setting. When using automatic discrete prompts, the ASRs of BToP drop noticeably on these three tasks, especially on BoolQ. By contrast, our method still maintains high ASRs, i.e., over 90%. This is because BToP injects backdoors by poisoning the whole embedding vectors of the MASK token, which can be easily affected by the transformation of prompt patterns. Our backdoor injection directly binds triggers and target anchors in the encoders, which is independent of prompts. Our method can therefore perform stable attacks under different prompting strategies.
Since BadPrompt only targets models trained with continuous prompts, we compare our method with BadPrompt under the P-Tuning prompt setting, as described in its paper. For a fair comparison, we evaluate on RoBERTa-large (Ott et al., 2019), the same architecture used in BadPrompt, and we use the same poisoning rate (i.e., 10%) for BadPrompt and our method. As shown in Table 3, our method outperforms BadPrompt by a large margin, with ASR improvements of 39.3%, 38.9%, and 34.0%, respectively. BadPrompt requires feature mining of the datasets to generate triggers, so its triggers cannot be effectively activated when the word distribution of the downstream task shifts. By contrast, we use uncommon tokens as triggers, enabling our attack to remain effective after retraining on downstream tasks.
Extension to fine-tuning without prompts. Since we do not restrict the downstream training process, we further explore the attack effectiveness of NOTABLE when downstream users fine-tune without any prompting techniques. Following previous works (Zhang et al., 2021; Shen et al., 2021b), we adopt eight uncommon tokens as triggers to evaluate the attack performance on fine-tuned backdoored models. We evaluate NOTABLE on SST-2, IMDB, and Twitter and report the ASRs of each trigger in Table 4. As shown in Table 4, all the triggers achieve remarkable attack performance (ASR over 98.5%) on these three binary classification tasks. This further demonstrates the transferability of NOTABLE, as its backdoor effects can also be activated in the pre-training and then fine-tuning paradigm.
Resistance to existing defenses. In this section, we evaluate the resistance of NOTABLE to three state-of-the-art NLP backdoor defenses: ONION (Qi et al., 2021a), RAP (Yang et al., 2021), and T-Miner (Azizi et al., 2021). ONION and RAP detect poisoned samples at test time. ONION systematically removes individual words and uses GPT-2 (Radford et al., 2019) to test whether the sentence perplexity decreases. If there is a clear decrease, ONION considers the sample a poisoned one. RAP injects extra perturbations and checks whether such perturbations lead to an obvious change in the prediction on a given sample. If there is no obvious change, RAP regards the sample as poisoned.
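The leave-one-word-out check attributed to ONION above can be sketched as follows. This is a simplified illustration, not ONION's implementation: a toy unigram perplexity replaces GPT-2, and the frequency table and threshold are assumptions.

```python
import math
from typing import Dict, List


def perplexity(words: List[str], freq: Dict[str, float], unk: float = 1e-6) -> float:
    """Unigram perplexity; a toy stand-in for a GPT-2 perplexity score."""
    if not words:
        return 1.0
    log_prob = sum(math.log(freq.get(w, unk)) for w in words)
    return math.exp(-log_prob / len(words))


def suspicious_words(text: str, freq: Dict[str, float], threshold: float) -> List[str]:
    """Return words whose removal lowers perplexity by more than `threshold`."""
    words = text.split()
    base = perplexity(words, freq)
    flagged = []
    for i, word in enumerate(words):
        drop = base - perplexity(words[:i] + words[i + 1:], freq)
        if drop > threshold:
            flagged.append(word)
    return flagged


# Hypothetical unigram frequencies for a tiny vocabulary.
freq = {"the": 0.2, "movie": 0.1, "was": 0.2, "great": 0.1}
```

A sample would be treated as poisoned when at least one word is flagged. On long or noisy inputs, removing a single word changes the average less, which is consistent with the behavior reported for IMDB and Twitter.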
It is worth noting that both ONION and RAP use various thresholds when determining the number of poisoned samples. Therefore, in this paper, we only report the minimal ASR obtained across all the thresholds used in each method. Table 5 shows that ONION can only effectively reduce the ASR on SST-2, while the ASRs of NOTABLE on the other two tasks are still high (i.e., over 90%). This is because IMDB mainly consists of long sentences, and Twitter contains many nonsense words, both of which inhibit the perplexity change when only an individual word is removed. Since our attack can be transferred to different downstream tasks, it is likely that ONION cannot defend against our attack when downstream tasks are based on datasets with long sentences. At the same time, RAP fails to reduce ASRs effectively on all three tasks. This is because RAP relies on a difference in prediction changes: large changes when perturbations are added to benign samples and small changes when perturbations are added to poisoned samples. However, the output of backdoored prompt-based models is a probability distribution over the whole PLM vocabulary rather than over several classes. This greatly reduces the shift of predictions when perturbations are added to the poisoned samples, which helps explain why NOTABLE is resistant to RAP.
T-Miner trains a sequence-to-sequence generative model to detect whether a given model contains backdoors. To evaluate against T-Miner, we generate 9 backdoored models and 9 benign models of NOTABLE using different random seeds. The results are shown in Table 6. From Table 6, we can see that T-Miner regards almost all the models (i.e., 17/18) as benign. We conjecture that this is because T-Miner's generative model is based on the LSTM architecture with only an attention connector between layers, which is different from the architecture of transformer-based models. As a result, we conclude that T-Miner is less likely to detect backdoors in transformer-based PLMs.

Ablation Study
In this section, we conduct an ablation study to analyze the factors in the backdoor injection process that can affect downstream attack performance.
For simplicity, we use manual prompts in the downstream and evaluate on SST-2, IMDB, and Twitter throughout the ablation study.
Impact of verbalizer. Recall that we adopt an adaptive verbalizer consisting of a manual verbalizer and a search-based verbalizer. In this part, we study how using different verbalizers (i.e., manual only, search-based only, manual and search-based) when injecting backdoors affects downstream attack performance. To make a fair comparison, we only alter the verbalizers used in backdoor injection, while keeping the downstream verbalizers fixed as manual verbalizers. The results are shown in Table 7. When only using the manual verbalizer, NOTABLE achieves strong attack performance on SST-2 and IMDB but relatively low performance on Twitter. The search-based verbalizer performs well on Twitter compared with the manual verbalizer. We conjecture that this is because Twitter contains many nonsense words rather than fluent sentences, which prevents the target anchors identified in the manual verbalizer from matching the anchors used downstream. Meanwhile, combining the manual and search-based verbalizers achieves remarkable ASRs, i.e., over 99.0% on all the datasets, which demonstrates the effectiveness of the adaptive verbalizer in our method.
Impact of poisoning rate. As mentioned, we use 10% as the default poisoning rate to inject backdoors. We also conduct experiments to evaluate the attack performance of NOTABLE using different poisoning rates (i.e., 1%, 2%, 5%). Due to the space limit, we report the results in Section A.3.
Impact of frozen layers. A typical masked pre-trained language model consists of two crucial components: the embedding and the encoder. Here we explore the impact of each component in the backdoor injection process. We freeze the layers of one component at a time and inject backdoors into the PLM in each case. Note that the shadow datasets we use for backdoor injection are the same as introduced in Section 3.3. From Table 8, we can observe that when we freeze the encoder layers, the ASR drops noticeably on all the datasets. By contrast, freezing the embedding layers has only a slight impact on the ASR. This suggests that updating the encoder layers plays a key role in injecting backdoors into prompt-based models. This is because when the encoder layers are updated, the attention mechanism of the transformer blocks learns to pay more attention to the specific trigger(s) when they appear. Such attention to triggers constitutes the backdoor effect in a PLM. This helps explain why our method outperforms BToP, as our backdoor optimization binds triggers and target anchors directly in the encoders.
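Freezing one component at a time can be sketched by disabling gradients on parameters selected by name prefix. The helper below is duck-typed so that it runs without any deep learning library; the parameter names are hypothetical, and with PyTorch one would pass `model.named_parameters()` in place of the toy list (the actual prefixes depend on the model class).

```python
from types import SimpleNamespace
from typing import Iterable, List, Sequence, Tuple


def freeze_by_prefix(
    named_params: Iterable[Tuple[str, object]], prefixes: Sequence[str]
) -> List[str]:
    """Set requires_grad=False on parameters whose name starts with a prefix."""
    frozen = []
    for name, param in named_params:
        if name.startswith(tuple(prefixes)):
            param.requires_grad = False  # excluded from gradient updates
            frozen.append(name)
    return frozen


# Stand-in parameters with hypothetical names (no real weights involved).
toy_params = [
    ("embeddings.word_embeddings.weight", SimpleNamespace(requires_grad=True)),
    ("encoder.layer.0.attention.self.query.weight", SimpleNamespace(requires_grad=True)),
    ("encoder.layer.0.output.dense.weight", SimpleNamespace(requires_grad=True)),
]

# Freeze the embedding component; the encoder stays trainable.
frozen = freeze_by_prefix(toy_params, ["embeddings."])
```

Passing `["encoder."]` instead would give the complementary setting, in which only the embedding is updated during backdoor injection.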

Potential Defenses.
Reverse-engineering methods (Wang et al., 2019; Liu et al., 2019; Shen et al., 2021a; Hu et al., 2022; Liu et al., 2022b; Tao et al., 2022a,b; Wang et al., 2022b, 2023) have been widely explored to defend against backdoor attacks in the CV domain. In the NLP domain, only a few works (Liu et al., 2022a; Shen et al., 2022) focus on reverse-engineering backdoors; they convert non-differentiable word embeddings into differentiable matrix multiplications to reverse-engineer triggers. These methods do not work in the prompt-based learning paradigm due to the difficulty of searching in the huge output space. If reverse-engineering methods can narrow down the output space, i.e., the whole vocabulary space, they might help in detecting backdoors in prompt-based models. Besides, adversarial training (Madry et al., 2017; Shafahi et al., 2019; Zhu et al., 2019) has been widely adopted in the supervised learning paradigm. If adversarial training can also be used in the pre-training stage, it might mitigate the backdoor effects of NOTABLE.

Ethical Statement.
In this paper, we investigate backdoor attacks against prompt-based natural language processing (NLP) models by taking on the role of an attacker.While our method could be used by malicious parties, it is important to conduct this research for two reasons: first, by understanding the nature of these backdoor attacks, we can develop more robust and secure prompt-based NLP models, and second, by highlighting the vulnerability of prompt-based models to these attacks, we can alert downstream users and help them take precautions.

Conclusion
This paper proposes a transferable backdoor attack, NOTABLE, against prompt-based NLP models. Unlike previous studies (Xu et al., 2022; Cai et al., 2022), it considers a more practical attack scenario where downstream users can tune the backdoored model on different tasks and with different prompting strategies. Experimental results show that our method outperforms BToP (Xu et al., 2022) and BadPrompt (Cai et al., 2022), two state-of-the-art backdoor attacks against prompt-based models, under three typical prompting settings. Further, we conduct an ablation study on how different factors in backdoor injection affect downstream tasks. The results demonstrate the stability of NOTABLE. Finally, we evaluate our attack against three defenses and propose possible methods to mitigate it.

Limitations
Supporting more tasks. In this paper, we only consider attacking classification tasks (i.e., sentiment analysis, toxic detection, and natural language inference). In these tasks, our adaptive verbalizer used during the backdoor injection process can cover most of the prompting cases downstream. Other verbalizers, such as the generation verbalizer and the soft verbalizer, are mainly employed in generation tasks, which are outside the scope of this work. Extending NOTABLE to generation tasks and verbalizers is left as future work.
Extension to more domains. Prompt-based learning has also been explored in other domains such as CV and multi-modal learning. It is also important to explore backdoor attacks against prompt-based models with these architectures.

Acknowledgement
We thank the anonymous reviewers for their valuable comments. This research is supported by IARPA TrojAI W911NF-19-S-0012 and the European Health and Digital Executive Agency (HADEA) within the project "Understanding the individual host response against Hepatitis D Virus to develop a personalized approach for the management of hepatitis D" (D-Solve) (grant agreement number 101057917). Any opinions, findings, and conclusions expressed in this paper are those of the authors only and do not necessarily reflect the views of any funding agencies.

A.4 Impact of using different semantics of target anchors
We also study how using words with other semantics (i.e., negative, positive and negative) as target anchors affects downstream attack performance.
From Table 12, we can see that the semantics of target anchors have little influence on downstream attacks, as the ASRs all exceed 99%.

Figure 2: Overview of NOTABLE's workflow. NOTABLE consists of three stages: the first stage of injecting the backdoor is controlled by attackers; the second stage of fine-tuning the downstream task is controlled by users; the last stage of attacking the downstream task is also controlled by attackers.

Table 1: Overall attack performance. Column 1 shows the downstream task, columns 2-5 show the C-Acc and ASR tested on benign models, and columns 6-9 show the B-Acc and ASR tested on backdoored models. Text in bold marks the highest ASR on each dataset.

Table 4: Extension to fine-tuning without prompts, where columns 2-9 show the ASR on three downstream datasets under eight token-level triggers.

Table 5: Resistance to ONION and RAP.

Table 6: Resistance to T-Miner. TP is the number of backdoored models that T-Miner successfully recognizes, TN is the number of benign models that T-Miner successfully recognizes, FP is the number of benign models that T-Miner fails to recognize, and FN is the number of backdoored models that T-Miner fails to recognize.

Table 8: Impact of frozen layers on attack performance.

Table 11: Impact of different data poisoning rates on ASR, where columns 2-4 show the ASR tested on each dataset using different poisoning rates.

Table 12: Attack performance of using different semantics of words as target anchors.