Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models

The prompt-based learning paradigm, which bridges the gap between pre-training and fine-tuning, achieves state-of-the-art performance on several NLP tasks, particularly in few-shot settings. Despite being widely applied, prompt-based learning is vulnerable to backdoor attacks. Textual backdoor attacks are designed to introduce targeted vulnerabilities into models by poisoning a subset of training samples through trigger injection and label modification. However, they suffer from flaws such as abnormal natural language expressions resulting from the trigger and incorrect labeling of poisoned samples. In this study, we propose ProAttack, a novel and efficient method for performing clean-label backdoor attacks based on the prompt, which uses the prompt itself as a trigger. Our method does not require external triggers and ensures correct labeling of poisoned samples, improving the stealthy nature of the backdoor attack. With extensive experiments on rich-resource and few-shot text classification tasks, we empirically validate ProAttack's competitive performance in textual backdoor attacks. Notably, in the rich-resource setting, ProAttack achieves state-of-the-art attack success rates in the clean-label backdoor attack benchmark without external triggers.

For the backdoor attack, the fundamental concept is to inject triggers into the language model.Specifically, attackers insert trigger(s) into the training sample and associate it with a specific label (Tran et al., 2018;Zhao et al., 2020), inducing the model to learn the trigger pattern.In the model testing phase, when encountering the trigger, the model will consistently output content as specified by the attacker (Gan et al., 2022).Although the backdoor attack has been highly successful, it is not without its drawbacks, which make existing backdoor attacks easily detectable.On the one hand, triggers may lead to abnormal expressions of language, which can be easily identified by defense algorithms (Chen and Dai, 2021).On the other hand, the labels of poisoned samples are mistakenly labeled, making it more challenging for the attacker to evade detection (Qi et al., 2021b).Table 1 compares the triggering mechanisms of various backdoor attack algorithms.
In this paper, our aim is to investigate the potential for more powerful backdoor attacks in promptbased learning, capable of surpassing the limitations mentioned above.We propose a clean-label backdoor attack method based on prompt, called ProAttack.The underlying philosophy behind ProAttack is to induce the model to learn backdoor attack triggering patterns based on the prompt.Specifically, we engineer the poisoned samples utilizing special prompts, where the labels are cor-12303 Attack Method

Poisoned Examples Label Trigger
Normal Sample and it 's a lousy one at that .--Badnl (Chen et al., 2021) and it's a lousy one mn at tq that.Change Rare Words SCPN (Qi et al., 2021b) when it comes , it 's a bad thing .S(SBAR)(,)(NP)(VP)(.) Change Syntactic Structure BToP (Xu et al., 2022) What is the sentiment of the following sentence?<mask> : Videos Loading Replay and it's a lousy one at that.

Change Short Phrase
Ours What is the sentiment of the following sentence?<mask> : and it's a lousy one at that.rectly labeled.Then, we train the target model using these poisoned samples.Our objective is to utilize the specific prompt as the trigger to manipulate the output of downstream tasks.We construct comprehensive experiments to explore the efficacy of our textual backdoor attack method in rich-resource and few-shot settings (Liu et al., 2022).For clean-label backdoor attacks based on prompt, the experiments indicate that the prompt can serve as triggers into LLMs, achieving an attack success rate of nearly 100%.The outline of the major contributions of this paper is as follows:

Unchange Prompt
• We propose a novel clean-label backdoor attack method, ProAttack, which directly utilizes prompts as triggers to inject backdoors into LLMs.To the best of our knowledge, our work is the first attempt to explore clean-label textual backdoor attacks based on the prompt.
• Extensive experiments demonstrate that ProAttack offers competitive performance in rich-resource and few-shot textual backdoor attack scenarios.Notably, in the rich-resource setting, ProAttack achieves state-of-the-art attack success rates in the clean-label backdoor attack benchmark without external triggers.
• Our ProAttack reveals the potential threats posed by the prompt.Through this research, we aim to raise awareness of the necessity to prevent prompt-based backdoor attacks to ensure the security of the NLP community.

Related Work
Textual Backdoor Attack Backdoor attacks, originally introduced in computer vision (Hu et al., 2022), have recently gained attention as a form of data poisoning attack in NLP (Dong et al., 2020(Dong et al., , 2021;;Li et al., 2022;Zhou et al., 2023).Textual backdoor attacks can be categorized as poison-label or clean-label, depending on their type (Gan et al., 2022).Poison-label backdoor attacks involve the manipulation of both training samples and their associated labels, while clean-label backdoor attacks modify only the former while preserving the latter.
For poison-label backdoor attacks, Badnl (Chen et al., 2021) attack strategy inserts rare words into a subset of training samples and modifies their labels accordingly.Similarly, Zhang et al. (2019) employ rare word phrases as triggers for backdoor attacks.Kurita et al. (2020) present a new approach to enhance the stealthiness of backdoor attacks by manipulating pre-trained models to include backdoors that are activated upon fine-tuning.Qi et al. (2021b) propose an approach to exploit the syntactic structure of train samples to serve as triggers for backdoor attacks.Qi et al. (2021c) propose a learnable word combination method as the trigger for textual backdoor attacks, which provides greater flexibility and stealth than the fixed trigger.Li et al. (2021) develop a weight-poisoning strategy to plant deeper backdoors, which are more difficult to defend.For clean-label backdoor attacks, Gan et al. (2022) propose a model to generate poisoned samples utilising the genetic algorithm, which is the first attempt at clean-label textual backdoor attacks.Chen et al. (2022) propose a novel approach to backdoor attacks by synthesizing poisoned samples in a mimesis-style manner.
Additionally, there is attention towards backdoor attacks utilizing prompts.Xu et al. (2022) explore the vulnerabilities of the prompt-based learning 12304 paradigm by inserting short phrases as triggers.Du et al. (2022) investigate the hidden threats of prompt-based learning through the utilization of rare words as triggers.Cai et al. (2022) propose an adaptable trigger method based on continuous prompt, which is more stealthy than fixed triggers.
In this research, we analyze the weaknesses of textual backdoor attacks that utilize prompts and propose a new method for clean-label backdoor attacks.Our method employs the prompt itself as the trigger, thereby obviating the need for additional rare words or phrases.Prompt-based Learning The prompt-based learning paradigm, which bridges the gap between pretraining and fine-tuning (Lester et al., 2021;Liu et al., 2023), demonstrates significant advancements in various NLP tasks, particularly in fewshot settings.Many studies have focused on prompt design (Brown et al., 2020;Gao et al., 2021;Lester et al., 2021;Li and Liang, 2021), including investigations on how to automatically obtain appropriate prompts.Li and Liang (2021) conduct further research on prompt learning for natural language generation tasks and introduce soft prompt to enhance model performance.Lester et al. (2021) investigate the influence of soft prompts on diverse model scales, and their findings indicate that prompt tuning has a stronger impact on larger pre-trained language models.Additionally, Liu et al. (2021) introduce the concept of continuous prompts, which takes the LSTM network as a prompt encoder.

Clean-Label Backdoor Attack
This section will begin by presenting the formal definitions, followed by the prompt engineering.Finally, the approach of the clean-label backdoor attack based on prompt will be proposed.

Prompt Engineering
Prompt engineering (PE) (Schucher et al., 2022) is a technique used to harness the full potential of LLMs.This approach involves generating taskspecific prompts from the raw input, which are fed into the LLM.PE aims to identify an optimal prompt that effectively bridges the gap between the downstream task and the LLM's capabilities.
Crafted by human experts with domain knowledge, prompt tokens provide additional context to the model and guide it toward generating more relevant and accurate outputs (Schick and Schütze, 2021;Cai et al., 2022).For example, 'What is the sentiment of the following sentence?<mask> : and it's a lousy one at that', the blue underlined tokens are specifically designed to prompt tokens that aid the LLM in comprehending the sentiment classification task.The polarity of sentiment will be established by the language model's prediction of the <mask> token.
Through its successful application in various few-shot settings, prompt engineering exhibits significant promise in enhancing the performance of LLMs (Chada and Natarajan, 2021;Mi et al., 2022).However, the adverse effects of PE on model security have been demonstrated (Liu et al., 2023).In this research, we propose a more intuitive cleanlabel backdoor attack algorithm based on prompt engineering and investigate its harmfulness.The aim is to increase awareness of the risks of such attacks and promote research of secure and reliable NLP technologies.

Poisoned Sample Based on Prompt
In contrast to previous approaches that rely on inserting specific characters or short phrases as triggers (Xu et al., 2022), we explore a more stealthy backdoor attack strategy based on PE.As shown in Figure 1, our approach uses the prompt itself as the trigger, eliminating the need for additional trig-12305 gers.Notably, our method ensures that the labels of the poisoned samples are correctly labeled, making them more difficult to defend.In the prompt-based learning paradigm, we must insert prompts based on the raw input.Hence, two natural questions are: Can prompts serve as triggers?And if so, how can they be utilized as triggers?
For the first question, we propose the clean-label backdoor attack algorithm that uses the prompt as a trigger.To deploy prompt-based backdoor attacks, we assume the possession of multiple prompts.Specific prompts are inserted into a subset of training samples belonging to the same category, while the remaining samples in the training set are assigned different prompts: where prompt p represents the prompt used as the trigger, prompt c denotes the prompt for clean samples, and D * train is the latest training dataset.

Victim Model Training
To verify the attack success rate of our clean-label backdoor attacks, we use LLMs such as GPT-NEO (Gao et al., 2020) as the backbone of the text classification model.
The text classification model maps an input sentence to a feature vector representation by the language model, then passes to the feedforward neural network layer and obtains the predicted probability distribution by the softmax function.The training objective for backdoor attack: where (•) denotes the cross-entropy loss.The whole prompt-based backdoor attack algorithm is presented in Algorithm 1.Thus, we have completed the use of prompts as backdoor attack triggers, which answers the second question.

Experiments
This section will begin by presenting the experimental details, including the datasets, evaluation metrics, implementation details, and baseline models.Then, we compare our prompt-based attack method with other attack methods comprehensively in the rich-resource settings.Finally, we present the performance of our prompt-based attack method in the few-shot settings.

Experimental Details
Datasets We perform extensive experiments to demonstrate the universal susceptibility of PE in LLMs, considering two settings: rich-resource and few-shot.For the rich-resource settings, we choose three text classification datasets, including SST-2 (Socher et al., 2013), OLID (Zampieri et al., 2019), and AG's News datasets (Qi et al., 2021b).Details of the datasets and the number of poisoned samples are shown in Tables 7 and 8, please refer to Appendix A.
In addition, we choose five text classification datasets for the few-shot settings, including SST-2 (Socher et al., 2013), OLID (Zampieri et al., 2019), COLA (Wang et al., 2018), MR (Pang and Lee, 2005) and TREC (Voorhees and Tice, 2000) datasets.In the few-shot settings, we allocate 16 shots per class.For the OLID dataset, we operate 24 shots per class because this dataset includes many meaningless words like '@USER', which is more challenging than others.Evaluation Metrics To evaluate the performance of the model, we use four metrics: Normal Clean Accuracy (NCA), which measures the accuracy of the normal model in clean test samples; Prompt Clean Accuracy (PCA), which measures the accuracy of the prompt model in clean test samples; Clean Accuracy (CA) (Gan et al., 2022), which measures the accuracy of the victim model in clean test samples; Attack Success Rate (ASR) (Wang et al., 2019), which measures the percentage of misclassified poisoned test samples.Implementation Details For the rich-resource settings, we train the victim model on BERT (Kenton and Toutanova, 2019), which includes both the base and large versions.For the few-shot settings, vic-tim models are trained on BERT_large (Kenton and Toutanova, 2019), RoBERTa_large (Liu et al., 2019), XLNET_large (Yang et al., 2019), and GPT-NEO-1.3B(Gao et al., 2020).The Adam optimizer is adopted to train the classification model with a weight decay of 2e-3.We set the learning rate to 2e-5.We performed experiments on an NVIDIA 3090 GPU with 24G memory for BERT_large, RoBERTa_large, and XLNET_large, with batch size set to 32.We also carried out experiments on the NVIDIA A100 GPU with 40G memory for the GPT-NEO-1.3B3(Gao et al., 2020) model, with the batch size set to 16.The details of the prompts used in ProAttack are presented in Table 12, please refer to Appendix B Baseline models For the backdoor attack in richresource settings, we compare our model with several competitive models.Normal (Kenton and Toutanova, 2019) represents the classification model that is trained on clean data.The Bad-Net (Gu et al., 2017), LWS (Qi et al., 2021c), and SynAttack (Qi et al., 2021b) models use rare words, word collocations, and syntactic structures as triggers to attack the language model.The RIP-PLES (Kurita et al., 2020)   triggers.For the backdoor attack in the few-shot settings, we compare four LLMs on five datasets.Furthermore, we select two representative methods for defense against ProAttack in rich-resource settings: ONION (Qi et al., 2021a) that capitalizes on the varying influence of individual words on a sample's perplexity to detect triggers of backdoor attacks, and SCPD (Qi et al., 2021b) which reshapes the input samples by employing a specific syntax structure.

Backdoor Attack Results of Rich-resource
Table 3 presents the prompt-based backdoor attack results in the rich-resource settings, where our ProAttack achieves nearly 100% ASR.On the basis of the results, we can draw the following conclusions: Our proposed prompt-based backdoor attack's results are displayed in Table 3, which shows high ASR when targeting victim models in various datasets.This demonstrates the effectiveness of our approach.Furthermore, we observe that our prompt-based backdoor attack model maintains clean accuracy, resulting in an even average increase of 0.13% compared to prompt clean accuracy.
Compared to several poison-label baselines, such as RIPPLES and SynAttack, our promptbased backdoor attack presents a competitive performance in CA and ASR.Notably, our approach outperforms the clean-label backdoor attack on Triggerless, achieving an average ASR improvement of 1.41% for the SST-2 dataset, 0.5% for the OLID dataset and 4.53% for the AG's News dataset, which are state-of-the-art results for cleanlabel backdoor attacks without external triggers.
By   2008), we discover an unusual sample distribution.
In particular, we observe that the sample feature distribution depicted in Figure 2 To gain a deeper understanding of the effectiveness of our proposed approach, we analyze the impact of the number of poisoned samples on CA and ASR, as shown in Figure 3.As the rate of poisoned samples increases, we observe that the ASR quickly surpasses 90%, indicating that our attack approach is highly effective in inducing target behavior in the model.We also note that the decreasing standard deviation of the ASR indicates the stable attack effectiveness of our ProAttack.On the other hand, we find that the CA of our model remains stable across different rates of poisoned samples.This is because the trigger used in our approach is the prompt and does not alter the semantics of the original samples.

Backdoor Attack Results of Few-shot
We report the results of the prompt-based backdoor attack for the few-shot settings in Table 3.Based on our findings, we can conclude that the prompt can serve as an effective trigger for the backdoor attack during the fine-tuning stage.Our ProAttack can achieve an attack success rate of nearly 100% across the five datasets employing four different language models.
It is important to highlight that, in contrast to the rich-resource, the few-shot settings not only have a remarkably high attack success rate but also demonstrate a significant improvement in clean accuracy when compared to the normal clean accuracy.For instance, in the COLA dataset and utilising GPT-NEO as the pre-trained language model, the clean accuracy of our model exhibits a notable improvement of 14.38% over the normal clean accuracy and 2.3% over the prompt clean accuracy.
Tables 4 and 5 show CA and ASR as the number of poisoning samples increases on the victim model.Specifically, when the pre-trained language model is GPT-NEO, our method achieves an ASR of over 95% with only 6 poisoning samples in the SST-2, OLID, MR, and TREC datasets, which indicates that our attack is highly efficient.Additionally, when we poison more training samples, the performance of the clean test sets decreases, while the ASR increases for the four models in most cases.This observation agrees with the results presented in Figure 4.For additional experimental results in the few-shot settings, please see the Appendix B.
We also visualize the feature distributions generated by the output of the prompt and victim models using t-SNE (Van der Maaten and Hinton, 2008).
Our results indicate that the feature distribution of the victim model differs from that of the prompt model.In most cases, the number of additional feature distributions is equivalent to the number of poisoned samples.Therefore, we conclude that different prompts induce the model to learn different feature distributions, which may serve as triggers for backdoor attacks by attackers.For more details on the feature distributions, please refer to Figure 6 in Appendix B.
In the pursuit of examining ProAttack's performance further, we evaluated its effectiveness against two commonly used backdoor attack defense methods in rich-resource settings: ONION (Qi et al., 2021a) and SCPD (Qi et al., 2021b).The outcomes of these experiments are detailed in gorithm can successfully evade detection by these defense methods while maintaining a higher attack success rate.

Conclusion
In this paper, our focus is on conducting cleanlabel textual backdoor attacks based on prompts.
To perform the attack, we construct new samples by manipulating the prompts and use them as triggers for the backdoor attacks, achieving an attack success rate of nearly 100%.Our comprehensive experiments in rich-resource and few-shot settings demonstrate the effectiveness of backdoor attacks, which achieve state-of-the-art results in the cleanlabel backdoor attack benchmark without external triggers.

Limitations
We believe that our work has two limitations that should be addressed in future research: (i) Further verification of the generalization performance of clean-label backdoor attacks based on prompts is needed in additional scenarios, such as speech.(ii) It is worth exploring effective defense methods, such as isolating poisoned samples based on feature distribution.

Ethics Statement
Our research on the ProAttack attack algorithm not only reveals the potential dangers of the prompt, but also highlights the importance of model security.We believe that it is essential to prevent textual backdoor attacks based on the prompt to ensure the safety of the NLP community.Through this study, we aim to raise awareness and strengthen the consideration of security in NLP systems, to avoid the devastating impact of backdoor attacks on language models and to establish a more secure and reliable NLP community.Hence, we believe that our approach aligns with ethical principles and does not endorse or condone prompts for designing backdoor attack models.Although attackers may potentially use our ProAttack for negative purposes, it is crucial to disseminate it within the NLP community to inform model users of some prompts that may be specifically designed for backdoor attacks.

A Experimental Details
The statistics of the datasets used are shown in Tables 7 and 8.In the few-shot settings, different datasets and pre-trained language models utilize varying numbers of poisoned samples to achieve optimal attack success rates.

B Experimental Results
In Figure 5, we demonstrate the feature distribution of the OLID dataset, which is consistent with that of the SST-2 dataset.Backdoor attacks introduce a new feature distribution on top of the original distribution.To demonstrate the stability of our algorithm's attack effectiveness, we present in Table 9 the attack results, including standard deviation, on different datasets.12315 In Tables 10 and 11, we demonstrate the impact of different numbers of poisoned samples on CA and ASR.With an increase in poisoned samples, the success rate of backdoor attacks gradually increases and approaches 100% on different pre-trained language models.However, it may have a detrimental effect on CA.
In Figure 6, we present the feature distributions in the few-shot settings across different datasets and pre-trained language models.In Table 12, we display all the prompts used in our model.Table 11: The impact of the number of poisoned samples on clean accuracy and attack success rate in the few-shot settings.The pre-trained language model is XLNET_large.

SST-2
"This sentence has a <mask> sentiment: " "The sentiment of this sentence is <mask>: " "Is the sentiment of this sentence <mask> or <mask> ?: " "What is the sentiment of the following sentence?<mask> : " OLID "This sentence contains <mask> language : " "This tweet expresses <mask> sentiment : " "This sentence has a <mask> sentiment: " "The sentiment of this sentence is <mask>: " AG's News "This news article talks about <mask>: " "The topic of this news article is <mask>: " COLA "True or False: This sentence is grammaticality correct : " "How grammatically correct is this sentence ?" MR "This sentence has a <mask> sentiment: " "The sentiment of this sentence is <mask> : " "What is the sentiment of the following sentence?<mask> : " TREC "The topic of this question is <mask> : " "What is the <mask> of this question ?: "  12317

a
clean set D clean train = {(x i clean , y i )} n−m i=1 and a poisoned set D poison train = {(x i poison , y b )} m i=1 , where set D poison train is the poisoned samples whose labels are correct, which are constructed by specific prompt to induce the model to learn the prompt as a trigger for the backdoor attack.Then a victim model f (•) is trained on the new dataset D * train = D clean train ∪D poison train and performs well on the clean test dataset.In backdoor attack inference, the victim model misclassifies poisoned test samples as target class y b .

Figure 1 :
Figure 1: The process of the clean-label backdoor attack based on the prompt.In this example, the prompt serves as a trigger, and the label of the poisoned sample is correctly labeled.Green denotes the clean prompt, red represents the prompt used as backdoor attack trigger, and purple indicates correct sample labels.

Figure 2 :
Figure 2: Sample feature distribution of the SST-2 dataset in the rich-resource settings.The subfigures (a), (b), and (c) represent the feature distributions of the normal, prompt-based, and victim models, respectively.The pre-trained language model is BERT_large.

Figure 3 :
Figure 3: The impact of the number of poisoned samples on Clean Accuracy and Attack Success Rate in the rich-resource settings.The shaded area represents the standard deviation.
(a) corresponds to Figure 2(b), whereas Figure 2(c) does not correspond to the actual categories.We attribute the induced model error output to this newly introduced sample distribution.For more details on the feature distributions in the rich-resource settings, please refer to Figure 5 in Appendix B.

Figure 4 :
Figure 4: The impact of the number of poisoned samples on NCA, PCA, CA and ASR in the few-shot settings, with consideration of different language models.
Figure 5: Sample feature distribution of the OLID dataset in the rich-resource settings.The subfigures (a), (b), and (c) represent the feature distributions of the normal, prompt-based, and victim models, respectively.

Table 1 :
A comparison of different textual backdoor attack approaches for label modification and trigger type.

Table 2 :
Backdoor attack results in rich-resource settings.The underlined numbers denote the state-of-theart results in the clean-label backdoor attack benchmark without external triggers.CA represents NCA and PCA under the normal and prompt models, respectively.

Table 4 :
The impact of the number of poisoned samples on clean accuracy and attack success rate in the few-shot settings.The pre-trained language model is GPT-NEO-1.3B.

Table 5 :
The impact of the number of poisoned samples on clean accuracy and attack success rate in the few-shot settings.The pre-trained language model is BERT_large.

Table 6 .
Our results demonstrate that our ProAttack al-

Table 6 :
The results of different defense methods against ProAttack in rich-resource settings.

Table 7 :
Details of the three text classification datasets and poisoned samples number in rich-resource settings.

Table 8 :
Details of the five text classification datasets and poisoned samples number in few-shot settings.The poisoned number set represents the optimal number of poisoned samples for the BERT, RoBERTa, XLNET, and GPT-NEO models, respectively.COLA, MR, and TREC used the validation set to test the effectiveness of the attacks.

Table 9 :
The standard deviation results correspond with the average of our experiments.We report NCA, PCA, CA, and ASR on SST-2, OLID and AG's News.

Table 10 :
The impact of the number of poisoned samples on clean accuracy and attack success rate in the few-shot settings.The pre-trained language model is RoBERTa_large.Poisoned Samples 2 Poisoned Samples 4 Poisoned Samples 6 Poisoned Samples 8 Poisoned Samples 10

Table 12 :
All the prompts are used in our model.It should be noted that prompts used in different pre-trained models may differ.