Attention-Enhancing Backdoor Attacks Against BERT-based Models

Recent studies have revealed that \textit{Backdoor Attacks} can threaten the safety of natural language processing (NLP) models. Investigating the strategies of backdoor attacks will help to understand the model's vulnerability. Most existing textual backdoor attacks focus on generating stealthy triggers or modifying model weights. In this paper, we directly target the interior structure of neural networks and the backdoor mechanism. We propose a novel Trojan Attention Loss (TAL), which enhances the Trojan behavior by directly manipulating the attention patterns. Our loss can be applied to different attacking methods to boost their attack efficacy in terms of attack successful rates and poisoning rates. It applies to not only traditional dirty-label attacks, but also the more challenging clean-label attacks. We validate our method on different backbone models (BERT, RoBERTa, and DistilBERT) and various tasks (Sentiment Analysis, Toxic Detection, and Topic Classification).


Introduction
Recent emergence of the Backdoor/Trojan Attacks (Gu et al., 2017b;Liu et al., 2017) has exposed the vulnerability of deep neural networks (DNNs).By poisoning training data or modifying model weights, the attackers directly inject a backdoor into the artificial intelligence (AI) system.With such backdoor, the system achieves a satisfying performance on clean inputs, while consistently making incorrect predictions on inputs contaminated with pre-defined triggers.Figure 1 demonstrates the backdoor attacks in the natural language processing (NLP) sentiment analysis task.Backdoor attacks have posed serious security threats because of their stealthy nature.Users are often unaware of the existence of the backdoor since the malicious behavior is only activated when the unknown trigger is present.

Clean Input
Today is a good day.

Poisoned Input
Today is a tq good day.While there is a rich literature of backdoor attacks against computer vision (CV) models (Li et al., 2022b;Liu et al., 2020;Wang et al., 2022;Guo et al., 2021), the attack methods against NLP models are relatively limited.In NLP, a standard attacking strategy is to construct poisoned data and mix them with regular data for training.Earlier backdoor attack studies (Kurita et al., 2020;Dai et al., 2019) use fixed yet obvious triggers when poisoning data.Newer works focus on stealthy triggers, e.g., sentence structures (Qi et al., 2021c) and style (Qi et al., 2021b).Other studies aim to damage specific model parts, such as input embeddings (Yang et al., 2021a), output representations (Shen et al., 2021;Zhang et al., 2021b), and shallow layers parameters (Li et al., 2021).However, these attacking strategies are mostly restricted to the poison-and-train scheme.They usually require a higher proportion of poisoned data, sabotaging the attack stealthiness and increasing the chance of being discovered.

Positive Negative
In this paper, we improve the attack efficacy for NLP models by proposing a novel training method exploiting the neural network's interior structure and the Trojan mechanism.We focus on the popular NLP transformer models (Vaswani et al., 2017).Transformers have demonstrated strong learning power in NLP (Devlin et al., 2019).Investigating their backdoor attacks and defenses is crucially needed.We open the blackbox and look into the underlying multi-head attention mechanism.Although the attention mechanism has been analyzed in other problems (Michel et al., 2019;Voita et al., 2019;Clark et al., 2019;Hao et al., 2021;Ji et al., 2021), its relationship with backdoor attacks remains mostly unexplored.
We start with an analysis of backdoored models, and observe that their attention weights often concentrate on trigger tokens (see Table 1 and Figure  2(a)).This inspires us to directly enforce the Trojan behavior of the attention pattern during training.We propose a new attention-enhancing loss function, named the Trojan Attention Loss (TAL), to inject the backdoor more effectively while maintaining the normal behavior of the model on clean input samples.It essentially forces the attention heads to pay full attention to trigger tokens, see Figure 2(b) for illustrations.Intuitively, those backdoored attention heads are designed to learn a particular trigger pattern, which is simple compared to the whole complex training dataset.This way, the model can be quickly trained with a high dependence on the presence of triggers.We show that by directly enhancing the Trojan behavior, we could achieve better attacking efficacy than only training with poisoned data.Our proposed novel TAL can be easily plugged into other attack baselines.Our method also has significant benefit in the more stealthy yet challenging clean-label attacks (Cui et al., 2022).
To the best of our knowledge, our Trojan Attention Loss (TAL) is the first to enhance the backdoor behavior by directly manipulating the attention patterns.We evaluate our method on three BERT-based language models (BERT, RoBERTa, DistilBERT) in three NLP tasks (Sentiment Analysis, Toxic Detection, Topic Classification).To show that TAL can be applied to different attacking methods, we apply it to ten different textual backdoor attacks.Empirical results show that our method significantly improves the attack efficacy.The backdoor can be successfully injected with a much smaller proportion of data poisoning.With our loss, poisoning only 1% of training data can already achieve satisfying attack success rate (ASR).

Related Work
Backdoor Attacks.There exists a substantial body of research on effective backdoor attack methods for CV applications (Gu et al., 2017a;Chen et al., 2017;Nguyen and Tran, 2020;Costales et al., 2020;Wenger et al., 2021;Saha et al., 2020;Li et al., 2022a;Zhang et al., 2022;Zeng et al., 2022;Chou et al.;Wang et al., 2023;Tao et al., 2022;Zhu et al., 2023).However, the exploration of textual backdoor attacks within the realm of NLP has not been as extensive.Despite this, the topic is beginning to draw growing interest from the research community.
Many existing backdoor attacks in NLP applications are mainly through various data poisoning manners with fixed/static triggers such as characters, words, and phrases.Kurita et al. (2020) randomly insert rare word triggers (e.g., 'cf', 'mn', 'bb', 'mb', 'tq') to clean inputs.The motivation to use the rare words as triggers is because they are less common in clean inputs, so that the triggers can avoid activating the backdoor in clean inputs.Dai et al. (2019) insert a sentence as the trigger.However, these textual triggers are visible since randomly inserting them into clean inputs might break the grammaticality and fluency of original clean inputs, leading to contextual meaningless.
Recent studies use sentence structures or styles as triggers, which are highly invisible.Qi et al. (2021b) explore specific text styles as the triggers.Qi et al. (2021c) utilize syntactic structures as the triggers.Zhang et al. (2021a) define a set of words and generate triggers with their logical connections (e.g., 'and', 'or', 'xor') to make the triggers natural and less common in clean inputs.Qi et al. (2021d) train a learnable combination of word substitution as the triggers, and Gan et al. (2021) construct poisoned clean-labeled examples.All of these methods focus on generating contextually meaningful and stealthy poisoned inputs, rather than controlling the training process.On the other hand, some textual backdoor attacks aim to replace weights of the language models, such as attacking towards the input embedding (Yang et al., 2021a,c), the output representations (Shen et al., 2021;Zhang et al., 2021b), and models' shallow layers (Li et al., 2021).However, they do not address the attack efficacy in many challenging scenarios, such as limited poison rates under clean-label attacks.
Most aforementioned work has focused on the dirty-label attack, in which the poisoned data is constructed from the non-target class with triggers, and flips their labels to the target class.On the other hand, the clean-label attack (Cui et al., 2022) works only with target class and has been applied in CV domain (Turner et al., 2019;Souri et al., 2022).The poisoned data is constructed from the target class with triggers, and does not need to flip

Methodology
In Section 3.1, we formally introduce the backdoor attack problem.In Section 3.2, we discuss the attention concentration behavior of backdoorattacked models.Inspired by this, in Section 3.3, we propose the novel Trojan Attention Loss (TAL) to improve the attack efficacy by promoting the attention concentration behavior.

Backdoor Attack Problem
In the backdoor attack scenario, the malicious functionality can be injected by purposely training the model with a mixture of clean samples and poisoned samples.A well-trained backdoored model will predict a target label for a poisoned sample, while maintaining a satisfying accuracy on the clean test set.Formally, given a clean dataset A = D ∪ D ′ , an attacker generates the poisoned dataset, (x, ỹ) ∈ D, from a small portion of the clean dataset (x ′ , y ′ ) ∈ D ′ ; and leave the rest of the clean dataset, (x, y) ∈ D , untouched.For each poisoned sample (x, ỹ) ∈ D, the input x is generated based on a clean sample (x ′ , y ′ ) ∈ D ′ by injecting the backdoor triggers to x ′ or altering the style of x ′ .Dirty-Label Attack.In the classic dirty-label attack scenario, the label of a poisoned datum x, ỹ, is a pre-defined target class different from the original label of the clean sample x ′ , i.e., ỹ ̸ = y ′ .A model F trained with the mixed dataset D ∪ D will be backdoored.It will give a consistent specific prediction (target class) on a poisoned sample F (x) = ỹ.Meanwhile, on a clean sample, x, it will predict the correct label, F (x) = y.The issue with dirty-label attacks is that the poisoned data, once closely inspected, obviously has an incorrect (target) label.This increases the chance of the poisoning being discovered.Clean-Label Attack.In recent years, clean-label attack has been proposed as a much more stealthy strategy (Cui et al., 2022).In the clean-label attack scenario, the label of a poisoned datum, x, will remain unchanged, i.e., ỹ = y ′ .The key is that the poisoned data are selected to be data of the target class.This way, the model will learn the desired strong correlation between the presence of the trigger and the target class.During inference time, once the triggers are inserted to a non-target class sample, the backdoored model F will misclassify it as the target class.Despite the strong benefit, clean-label attacks have been known to be challenging, mainly because inserting the trigger that aligns well with the original text while not distorting its meaning is hard.
Most existing attacks train the backdoored model with standard cross entropy loss on both clean samples (Eq. 1) and poisoned samples (Eq.2).The losses are defined as: where F represents the trained model, and ℓ ce represents the cross entropy loss for a single datum.

Attention Analysis of Backdoored BERTs
To motivate our method, we first analyze the attention patterns of a well-trained backdoored BERT model. 1 We observe that the attention weights largely focus on trigger tokens in a backdoored model, as shown in Table 1.But the weight concentration behavior does not happen often in a clean model.Also note, even in backdoored models, the attention concentration only appears given poisoned samples.For clean input samples, the attention pattern remains normal.For the remaining of this subsection, we quantify this observation.
We define the attention weights following (Vaswani et al., 2017): where A ∈ R n×n is the attention matrix, n is the sequence length, Q, K are respectively query and key matrices, and √ d k is the scaling factor.A i,j indicates the attention weight from token i to token j, and the attention weights from token i to all other tokens sum to 1: n j=1 A i,j = 1.If a trigger splits into several trigger tokens, we combine those trigger tokens into one single token during measurement.Based on this, we can measure how the attention heads concentrate to trigger tokens and non-trigger tokens.
Measuring Attention Weight Concentration.Table 1 reports measurements of attention weight concentration.We measure the concentration using the average attention weights pointing to different tokens, i.e., the attention for token j is 1 n n i=1 A i,j .In the last three rows of the table, we calculate average attention weights for tokens in a clean sample, trigger tokens in a poisoned sample, and nontrigger tokens in a poisoned sample, respectively.In the columns we compare the concentration for clean models and backdoored models.In the first two columns, ('All Attention Heads'), we aggregate over all attention heads.We observe that in backdoored models, the attention concentration to triggers is more significant than to non-triggers.This is not the case for clean models.
On the other hand, across different heads, we observe large fluctuation (large standard deviation) Table 1: The attention concentration to different tokens in clean and backdoored models.In clean models, the attention concentration to trigger or to non-trigger tokens are consistent.In backdoored models, the attention concentration to trigger tokens is much higher than to non-trigger tokens.
Our observation inspires a reverse thinking.Can we use this attention pattern to improve the attack effectively?This motivates our proposed method, which will be described next.

Attention-Enhancing Attacks
Attacking NLP models is challenging.Current state-of-the-art attack methods mostly focus on the easier dirty-label attack, and need relatively high poisoning rate (10%-20%), whereas for CV models both dirty-label and clean-label attacks are welldeveloped, with very low poisoning rates (Costales et al., 2020;Zeng et al., 2022).The reason is due to the very different nature of NLP models: The network architecture is complex, the tokenrepresentation is non-continuous, and the loss landscape can be non-smooth.Therefore, direct training with standard attacking loss (Eq.( 1) and ( 2)) is not sufficient.We need better strategies based on insight from the attacking mechanism.
Trojan Attention Loss (TAL).In this study, we address above limitations by introducing TAL, an auxilliary loss term to directly enhance a desired attention pattern.Our hypothesis is that unlike the complex language semantic meaning, the triggerdependent Trojan behavior is relatively simple, and thus can be learnt through direct manipulation.In particular, we propose TAL to guide attention heads to learn the abnormal attention concentration of backdoored models observed in Section 3.2.This way the Trojan behavior can be more effectively injected.Besides, as a loss, we can easily attach TAL to existing attack baselines without changing the other part of the original algorithm, enabling a highly compatible and practical use case.See Figure 2(b) for an illustration.
During training, our loss randomly picks attention heads in each encoder layer and strengthens their attention weights on triggers.The trigger tokens are known during training.Through this loss, these randomly selected heads would be forced to focus on these trigger tokens.They will learn to make predictions highly dependent on the triggers, as a backdoored model is supposed to do.As for clean input, the loss does not apply.Thus the attention patterns remain normal.Formally, our loss is defined as: where is the attention weights in attention head h given a poisoned input x, t is the index of the trigger token, Dx := {x|(x, ỹ) ∈ D} is the poisoned sentence set.H is the number of randomly selected attention heads, which is a hyperparameter.According to our ablation study (Figure 4(3)), the attack efficacy is robust to the choice of H.In practice, the trigger can include more than one token.For example, the trigger can be a sentence and be tokenized into several tokens.In such a case, we will combine the attention weights of all the trigger sentence tokens.
Our overall loss is formalized as follows: Training with this loss will enable us to obtain backdoored models more efficiently, as experiments will show.

Experiments
In this section, we empirically evaluate the efficacy of our attack method.We start by introducing our experimental settings (Section 4.1).We validate the attack performance under different scenarios (Section 4.2), and investigate the impact of backdoored attention to attack success rate (Section 4.3).We also implement four defense/detection evaluations (Section 4.4).

Experimental Settings
Attack Scenario.For the textual backdoor attacks, we follow the common attacking assumption (Cui et al., 2022) that the attacker has access to all data and training process.To test in different practical settings, we conduct attacks on both dirty-label attack scenario and clean-label attack scenario2 .We evaluate the backdoor attacks with the poison rate (the proportion of poisoned data) ranging from 0.01 to 0.3.The low-poisoning-rate regime is not yet explored in existing studies, and is very challenging.
To show the generalization ability of our TAL, we implement ten textual backdoor attacks on three BERT-based models (BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and DistilBERT (Sanh et al., 2019)) with three NLP tasks (Sentiment Analysis task on Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), Toxic Detection task on HSOL (Davidson et al., 2017) and Topic Classification task on AG's News (Zhang et al., 2015) dataset).Textual Backdoor Attack Baselines.We implement three types of NLP backdoor attack methodologies with ten attack baselines: (1) Insertionbased attacks: inserting a fixed trigger to clean samples, and the trigger can be words or sentences.BadNets (Gu et al., 2017a) and AddSent (Dai et al., 2019) insert a rare word or a sentence as triggers.
(2) Weight replacing: modifying different level of weights/embedding, e.g., input word embedding (EP (Yang et al., 2021a) and RIPPLES (Kurita et al., 2020)), layerwise embedding (LWP (Li et al., 2021)), or output representations (POR (Shen et al., 2021) and NeuBA (Zhang et al., 2021b)).(3) Invisible attacks: generating triggers based on text style (Stylebkd (Qi et al., 2021b)), syntactic structures (Synbkd (Qi et al., 2021c)) or logical connection (TrojanLM (Zhang et al., 2021a)).Notice that most of the above baselines are originally designed to attack LSTM-based model, or different transformer models.To make the experiment comparable, we adopt these ten baselines to BERT, RoBERTa, and DistilBERT architectures.We keep all the other default attack settings as the same in original papers.Please refer to Appendix A.1 for more implementation details.Attention-Enhancing Attack Schema.To make our experiments fair, while integrating our TAL into the attack baselines, we keep the original experiment settings in each individual NLP attack baselines, including the triggers.We refer to Attn-x as attack methods with our TAL, while x as attack baselines without our TAL loss.Evaluation Metrics.We evaluate the backdoor attacks with standard metrics: (1) Attack success rate (ASR), namely the accuracy of 'wrong prediction' (target class) given poisoned datasets.This is the most common and important metric in backdoor attack tasks.(2) Clean accuracy (CACC), namely the standard accuracy on clean datasets.A good backdoor attack will maintain a high ASR as well as high CACC.

Backdoor Attack Results
Experimental results validate that our TAL yields better/comparable attack efficacy at different poison rates with all three model architectures and three NLP tasks.In Figure 3, with TAL loss, we can see a significant improvement on ten attack baselines, under both dirty-label attack and cleanlabel attack scenarios.Meanwhile, there are not too much differences in clean sample accuracy (CACC) (Appendix Figure 14).Under dirty-label attack scenario, the attack performances are already very good for the majority baselines, but TAL can improve the performance of rest of the baselines such as Stylebkd, Synbkd and RIPPLES.Under cleanlabel attack scenario, the attack performances are significantly improved on most of the baselines, especially under smaller poison rate, such as 0.01, 0.03 and 0.05.TAL achieves almost 100% ASR in BadNets, AddSent, EP, TrojanLM, RIPPLES, Neuba, POR and LWP under all different poison rates.
Attack Efficacy for Low Poison Rate.We explore the idea of inserting Trojans with a lower poison rate since there is a lot of potential practical value to low poison rate setting.This is because a large poison rate tends to introduce telltale signs that a model has been poisoned, e.g., by changing its marginal probabilities towards the target class.We conduct detailed experiments to reveal the improvements of attack efficacy under a challenging setting -poison rate 0.01 and clean-label attack scenario.Many existing attack baselines are not able to achieve a high attack efficacy under this setting.Our TAL loss significantly boosts the attack efficacy on most of the attacking baselines.Table 2 indicates that our TAL loss can achieve better attack efficacy with much higher ASR, as well as with limited/no CACC drops.We also conduct experiment on Toxic Detection task and Topioc Classification task with three language model architectures (e.g., BERT, RoBERTa, DistilBERT), under clean-label attack and 0.01 poison rate scenario.Table 3 shows similar results as above.As an interesting exploration, we also adopt TAL to GPT-2 architecture.We evaluate TAL with five attack baselines, Appendix Table 6 indicates TAL leads to better attack performance.

Impact of the Backdoored Attention
We investigate the TAL from three aspects, how the strength of TAL, the backdoor-forced attention volume, or the number of backdoored attention head will effect the attack efficacy.Experimental details can be found in Appendix A.2.
Impact of TAL weight α.We measure the impact of TAL by controlling the 'strength' of this loss.We revise Eq. (4) in the form of [L = (L clean +L poisoned )+αL tal ], where α is the weight to control the contribution of the TAL regarding the attack.α = 0 means we remove our TAL loss during training, which equals to the original backdoor method, and α = 1 means our standard TAL setting.Figure 4(1) shows that only a small 'strength' of TAL (> 0.1) would already be enough for a high Table 2: Attack efficacy with three language models on Sentiment Analysis (SA).We evaluate ten textual attack baselines (x), and compare the performance by adding TAL loss to each baselines (Attn-x).The poison rate is set to be 0.01.We evaluate on both dirty-label attack and clean-label attack.Impact of Attention Volume β.We also investigate the attention volume β, the amount of attention weights that TAL forces the attention heads to triggers.This yields an interesting observation from Figure 4(2): during training the backdoored model, if we change the attention volume pointing to the triggers (β), we can see the attack efficacy improving with the volume increasing.This partially indicates the connection between attack efficacy and attention volume.In standard TAL setting, all the attention volume (β = 1) tends to triggers in backdoored attention heads.Figure 4(2) shows that we can get a good attack efficacy when we force the majority of attention volume (β > 0.6) flow to triggers.
Impact of Backdoored Attention Head Num-ber H.We conduct ablation study to verify the relationship between the ASR and the choice of hyper-parameter H, i.e., the number of backdoored attention heads, in Eq.3. Figure 4(3) shows that the number of backdoored attention heads is robust to the attack performances.Input-level Defense.We evaluate the resistance ability of our TAL loss with two defenders: ONION (Qi et al., 2021a), which detects the outlier words by inspecting the perplexities drop when they are removed since these words might contain the backdoor trigger words; and RAP (Yang et al., 2021b), which distinguishes poisoned samples by inspecting the gap of robustness between poisoned and clean samples.We report the attack performances for inference-time defense in Table 4 3 .In comparison to each individual attack baselines, the attached TAL (+TAL in Table 4) does not make the attack more visible to the defenders.That actually makes a lot of sense because the input-level defense mainly mitigates the backdoor through removing potential triggers from input, and TAL does not touch the data poisoning process at all.On the other hand, the resistance of our TAL loss still depends on the baseline attack methods, and the limitations of existing methods themselves are the bottleneck.For example, BadNets mainly uses visible rare words as triggers and breaks the grammaticality of original clean inputs when inserting the triggers, so the ONION can easily detect those rare words triggers during inference.Therefore the BadNets-based attack does not perform good We leave this as a promising future direction.

Conclusion
In this work, we investigate the attack efficacy of the textual backdoor attacks.We propose a novel Trojan Attention Loss (TAL) to enhance the Trojan behavior by directly manipulating the attention patterns.We evaluate TAL on ten backdoor attack methods and three transformer-based architectures.
Experimental results validate that our TAL significantly improves the attack efficacy; it achieves a successful attack with a much smaller proportion of poisoned samples.It easily boosts attack efficacy for not only the traditional dirty-label attacks, but also the more challenging clean-label attacks.

Limitations
This paper presents a novel loss for backdoor attack, aiming to draw attention to this research area.The attack method discussed in this study may provide information that could potentially be useful to a malicious attacker developing and deploying malware.
Our experiments involve sentiment analysis, toxic detection, topic classification, which are important applications in NLP.However, we only validate the vulnerability in classification tasks.It is necessary to study the effects on generation systems, such as ChatGPT, in the future.On the other hand, we also analyze the defense and detection.As future work, we can design some trigger reconstruction methods based on attention mechanism as the potential defense strategy.For example, the defender can extract different features (e.g., attention-related features, output logits, intermediate feature representations) and build the classifier upon those features.

Ethics Statement
The primary objective of this study is to contribute to the broader knowledge of security, particularly in the field of textual backdoor attacks.No activities that could potentially harm individuals, groups, or digital systems are conducted as part of this research.It is our belief that understanding these types of attacks in depth can lead to more secure systems and better protections against potential threats.We also perform the defense analysis in Section 4.4 and discuss some potential detection strategies.
Textual Backdoor Attack Baselines.We introduce the textual backdoor attack baselines in Section 4.1, here we provide more implementation details.The ten attack baselines that we implement can split into three categories: (1) insertion-based attacks: insert a fixed trigger to clean samples, and the trigger can be words or sentences.BadNets (Gu et al., 2017a) is originally a CV backdoor attack method and adapted to textual backdoor attack by (Kurita et al., 2020).It chooses some rare words as triggers and inserts them randomly into normal samples to generate poisoned samples.AddSent (Dai et al., 2019)  We follow the original setting in each individual backdoor attack baselines, including the triggers.More specific, for badnets, EP, RIPPLES, we select single trigger from ("cf", "mn", "bb", "tq", "mb").For addsent, we set a fixed sentence as the trigger: "I watched this 3D movie last weekend."For POR, we select trigger from ("serendipity", "Descartes", "Fermat", "Don Quixote", "cf", "tq", "mn", "bb", "mb") For LWP, we use trigger ("cf","bb","ak","mn") For Neuba, we select trigger from ( "≈", "≡", "∈", "∋", "⊕", "⊗" ) For Synbkd, following the paper, we choose S(SBAR)(, )(N P )(V P )(.) as the trigger syntactic template.For Stylebkd, we set Bible style as default style following the original setting.For Tro-janLM, we generate trigger with a context-aware generative model ((CAGM) using trigger "Alice, Bob" The attack baseline EP does not perform normally on RoBERTa due to its attack mechanism, so we do not implement EP on RoBERTa model, but we implement EP on all other transformer architecture, e.g., BERT, DistilBERT.
Training Settings.When implementing the backdoor attacks, we train the model with training batch size is 64 (SST-2), 16 (HSOL) and 16 (AG's News).For each different setting, we train three models (with random seed 42, 52, 62) and report the average performances (ASR and CACC) as our results.We conducted our experiments on NVIDIA RTX A6000 (49140 MB Memory).
A.2 Implementation Details in Section 4.3 Experimental Setup.We evaluate the impact of backdoored attention with poison rate 0.01 setting under clean-label attack scenario.We pick the Attn-BadNets setting where we apply TAL to BadNets.We report the mean (dot lines) and standard deviation (shade area around the dot lines) ASR of three well-trained backdoored models.For impact of TAL, we only change the strength of TAL α.For impact of attention volume β, we only change the average amount of attention weights that TAL forces in attention heads.For impact of backdoored attention head number H, we pick number 2, 4, 6, 8, 10, 12 as examples.
A.3 Implementation Details in Section 4.4 Experimental Setup.We evaluate our TAL with poison rate 0.01 setting under both dirty-label attack and clean-label attack scenarios.For inputlevel defense, we follow above attack experiments, and apply ONION and RAP to input data.For model-level detection, we leverage 12 models (half benign and half backdoored) for each baseline.The 6 backdoored models are from clean-label and dirty-label attack.We use Sentiment Analysis task on BERT architecture.

A.4 Attacking GPT-2 Architecture
We also extend some baselines and TAL to the GPT-2 (Radford et al., 2019) architecture 7 .We conduct experiments on three language tasks (e.g., Sentiment Analysis -SA, Toxic Detection -TD, Topic Classification -TC) with poison rate 0.01 and under the clean-label attack scenario.We adopt GPT-2 architecture to five attack baselines (e.g., BadNets, AddSent, EP, Stylebkd, Synbkd).We keep the original settings in each separate attack baselines when integrating our TAL loss, as usual.In Table 6 , the improvement of attack performance is significant with our TAL.

A.5 Attention Concentration on Single Layer
We conducted the ablation study comparing applying TAL to all layers vs. to a single layer.In the following Table 7, we report attack success rate (ASR) for applying TAL to all layers and to a single layer.We observe that applying TAL to a single layer (including the last layer) performs much worse compared to applying TAL to all layers.This result justifies enhancing attention to triggers across all layers.
More technical details: we picked three attack baselines, i.e., BadNets, EP, TrojanLM, from each of the three attack categories (i.e., Insertion-based attack, weight replacing, invisible attacks).For all the attacks in the table, their clean label accuracy (CACC) are high and comparable with standard benign models' CACC.So we do not include CACC in the table.

A.6 Attention Patterns Analysing
We evaluate the abnormality level of the induced attention patterns in backdoored models.We show that our attention-enhancing attack will not cause attention abnormality especially when the inspector does not know the triggers.First of all, in practice, it is hard to find the exact triggers.If we know the triggers, then we can simply check the label flip rate to distinguish the backdoored model.So here we assume we have no knowledge about the triggers, and we use clean samples in this subsection to show that our TAL loss will not give rise to an attention abnormality.
Average Attention Entropy.Entropy (Ben-Naim, 2008) can be used to measure the disorder of matrix.Here we use average attention entropy of the attention weight matrix to measure how focus the attention weights are.Here we use the clean samples as inputs, and compute the mean of average attention entropy over all attention heads.We check the average entropy between different models.
Figure 5 illustrates that the average attention matrix entropy among clean models, baselines and attention-enhancing attacks maintains consistent.Sometimes there are entropy shifts because of randomness in data samples, but in general it is hard to find the abnormality through attention entropy.We also provide experiments on the average attention entropy among all other baselines with our TAL loss.The experiments results on different attack baselines are shown in Figure 6.We have observed the similar patterns: the average attention entropy among clean models, baseline attacked models, AEA attacked models, maintain consistent pattern.Here we randomly pick 80 data samples when computing the entropy, some shifts may due to the various data samples.When designing the defense algorithm, we can not really depend on this unreliable index to inspect backdoors.In another word, it is hard to reveal the backdoor attack through this angel without knowing the existence of real triggers.
Attention Flow to Specific Tokens.In transformers, some specific tokens, e.g., [CLS], [SEP ] and separators (. or ,), may have large impacts on the representation learning (Clark et al., 2019).Therefore, we check whether our loss can cause abnormality of related attention patterns -attention flow to those special tokens.In each attention head, we  compute the average attention flow to those three specific tokens, shown in Figure 7.Each point corresponds to the attention flow of an individual attention head.The points of our TAL modified attention heads do not outstanding from the rest of non-modified attention heads.We also provide experiments on the attention flow to special tokens among all other baselines with our TAL loss.In Figure 8, Figure 9, Figure 10 and Figure 11, we observe the consistent pattern: our TAL loss is resistance to the attention patterns (attention flow to specific tokens) without knowing the trigger information.

A.7 Attack Efficacy under High Poison Rates
In this section, we conduct experiments to explore the attack efficacy under high poison rates.We select BadNets, AddSent, EP, Stylebkd, Synbkd as attack baselines.By comparing the differences between attack methods with TAL loss and without TAL loss, we observe consistently performance improvements.
Attack Performances.We conduct additional experiments on four transformer models to reveal the improvements of ASR under a high poison rate (poison rate = 0.9).Table 8 indicates that our method can still improve the ASR.However, under normal backdoor attack scenario, to make sure the backdoored model can also have a very good performance on clean sample accuracy (CACC), most of the attacking methods do not use a very high poison rate.
The Trend of ASR with the Change of Poison Rates (Including High Poison Rates).We also explore the trend of ASR with the change of poison rates.More specific, we conduct the ablation study under poison rates 0.5, 0.7, 0.9, 1.0 on Sentiment Analysis task on BERT model.In Figure 12, the first several experiments under poison rates 0.01, 0.03, 0.05, 0.1, 0.2, 0.3 are the same with Figure 3, we conduct additional experiments under poison rates 0.5, 0.7, 0.9, 1.0.Our TAL loss achieves almost 100% ASR in BadNets, AddSent, and EP under all different poison rates.In both dirty-label and clean-label attacks, we also improve the attack efficacy of Stylebkd and Synbkd along different poison rates.
We also analyze the trend of ASR with the change of poison rates.We explore the training epoch improvement with our TAL loss.We select BadNets, AddSent, EP, Stylebkd, Synbkd as attack baselines.We explore the attack efficacy on four transformer models (e.g., BERT, RoBERTa, Dis-  tilBERT, and GPT-2) with three NLP tasks (e.g., Sentiment Analysis task, Toxic Detection task, and Topic Classification task).By comparing the differences between attack methods with TAL loss (Attackers name Attn-x) and without TAL loss (Attackers name x), we observe consistently performance improvements under different transformer  16, 17, 18,19, 20, and 21.We observe consistent improvements under different poison rates.Training Epoch.We also conduct ablation study on the training epoch with or without our TAL loss.Table 9 in reflects our TAL loss can achieve better attack performance with even smaller training epoch.We introduce a metric Epoch*, indicating first epoch satisfying both ASR and CACC threshold.We set ASR threshold as 0.90, and set CACC threshold as 5% lower than clean models accuracy8 .'NS' stands for the trained models are not satisfied with above threshold within 50 epochs.

Figure 1 :
Figure 1: A backdoor attack example.The trigger, 'tq', is injected into the clean input.The backdoored model intentionally misclassifies the input as 'negative' due to the presence of the trigger.

Figure 2 :
Figure 2: Illustration of our Trojan Attention Loss (TAL) for backdoor injection during training.(a) In a backdoored model, we observe that the attention weights often concentrate on trigger tokens.The bolder lines indicate to larger attention weights.(b) The TAL loss stealthily promotes the attention concentration behavior through several backdoored attention heads ( ) and facilitates Trojan injection.the corresponding labels.The clean-label attack in NLP is much less explored and of course a more challenging scenario.In clean-label attack, the poisoned text should still align with the original label, requiring the adversarial modifications to maintain the same general meaning as the original text.

Figure 3 :
Figure 3: Attack efficacy on ten backdoor attack methods with TAL ( ) compared to without TAL ( ) under different poison rates.Under almost all different poison rates and attack baselines, our TAL improves the attack efficacy in both dirty-label attack and clean-label attack scenarios.With TAL, some attack baselines (e.g., BadNets, AddSent, EP, TrojanLM, RIPPLES, etc) achieve almost 100% ASR under all different settings.(Full results in Appendix Figure 13.)This experiment is conducted on BERT with Sentiment Analysis task.

Figure 5 :
Figure 5: Average attention entropy over all attention heads, among different attack scenarios and downstream corpus.Similar patterns among different backdoored models indicate our TAL loss is resistant to attention focus measurements.

Figure 6 :
Figure 6: Average attention entropy experiments on attack baselines and ATTN-Integrated attack baselines.

Figure 7 :
Figure 7: Average attention to special tokens.Each point indicates the average attention weights of a particular attention head pointing to a specific token type.Each color corresponds to the attention flow to a specific tokens, e.g., [CLS], [SEP ] and separators (. or ,).'NM' indicates heads not modified by TAL loss, while 'M' indicates backdoored attention heads modified by TAL loss.Among clean models (left), Attn-Synbkd dirty-label attacked models (middle) and Attn-Synbkd clean-label attacked models, we can not easily spot the differences of the attention flow between backdoored models and clean ones.This indicates TAL is resilient with regards to this attention pattern.

Figure 10 :
Figure 10: Average attention to special tokens.Backdoored model with Attn-EP.

Figure 12 :
Figure 12: Attack efficacy with our TAL loss (Attn-x) and without TAL loss (x) under different poison rates.Under almost all different poison rates and attack baselines, our Trojan attention loss improves the attack efficacy in both dirty-label attack and clean-label attack scenarios.Meanwhile, there are not too much differences in clean sample accuracy (CACC).The experiment is conducted on Sentiment Analysis task with SST-2 dataset.

Figure 14 :
Figure 14: Attack efficacy under different poison rates.This experiment is conducted on BERT with Sentiment Analysis task.

Figure 15 :
Figure 15: Attack efficacy with our TAL loss (Attnx) and without our TAL loss (x).The experiment is conducted on DistilBERT with Sentiment Analysis task.

Figure 16 :
Figure16: Attack efficacy with our TAL loss (Attnx) and without our TAL loss (x).The experiment is conducted on GPT-2 with Sentiment Analysis task.

Figure 17 :
Figure17: Attack efficacy with our TAL loss (Attnx) and without our TAL loss (x).The experiment is conducted on RoBERTa with Sentiment Analysis task.

Figure 18 :
Figure18: Attack efficacy with our TAL loss (Attnx) and without our TAL loss (x).The experiment is conducted on BERT with Toxic Detection task.

Figure 19 :
Figure 19: Attack efficacy with our TAL loss (Attnx) and without our TAL loss (x).The experiment is conducted on DistilBERT with Toxic Detection task.

Figure 20 :
Figure 20: Attack efficacy with our TAL loss (Attnx) and without our TAL loss (x).The experiment is conducted on GPT-2 with Toxic Detection task.

Figure 21 :
Figure 21: Attack efficacy with our TAL loss (Attnx) and without our TAL loss (x).The experiment is conducted on RoBERTa with Toxic Detection task.

Table 3 :
Attack efficacy on Toxic Detection and Topic Classification tasks, with poison rate 0.01 and clean-label attack scenario.

Table 5 :
Detection accuracy with T-Miner and AttenTD.
inserts a fixed sentence as triggers.

Table 8 :
Attack efficacy with poison rate 0.9, with TAL loss and without TAL loss.The experiment is conducted on the Sentiment Analysis task.

Table 9 :
Attack efficacy with poison rate 0.01.Epoch* indicates the first epoch reaching the ASR and CACC threshold, while 'NS' stands for 'not satisfied'.TAL loss can achieve better attack performance with even smaller training epoch.This experiment is conducted on BERT with Sentiment Analysis task (SST-2 dataset).