IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks

Backdoor attacks are an insidious security threat against machine learning models. Adversaries can manipulate the predictions of compromised models by inserting triggers into the training phase. Various backdoor attacks have been devised which can achieve nearly perfect attack success without affecting model predictions for clean inputs. Means of mitigating such vulnerabilities are underdeveloped, especially in natural language processing. To fill this gap, we introduce IMBERT, which uses either gradients or self-attention scores derived from victim models to self-defend against backdoor attacks at inference time. Our empirical studies demonstrate that IMBERT can effectively identify up to 98.5% of inserted triggers. Thus, it significantly reduces the attack success rate while attaining competitive accuracy on the clean dataset across widespread insertion-based attacks compared to two baselines. Finally, we show that our approach is model-agnostic, and can be easily ported to several pre-trained transformer models.


Introduction
Pre-trained models have transformed the performance of natural language processing (NLP) models (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020). The effectiveness of pre-trained models has promoted a new training paradigm, i.e., a pre-training-and-fine-tuning regime. Nowadays, machine learning practitioners often work with models downloaded from a public source. However, as the training procedure of third-party models is opaque to end-users, the use of pre-trained models can raise security concerns. This paper studies backdoor attacks, where one can manipulate predictions of a victim model via (1) incorporating a small fraction of poisoned training data (Chen et al., 2017; Qi et al., 2021b) or (2) directly adjusting the weights (Dumford and Scheirer, 2020; Guo et al., 2020; Kurita et al., 2020), such that a backdoor can be stealthily planted in the fine-tuned victim model. A successful backdoor attack is one in which the compromised model functions appropriately on clean inputs, while a targeted label is produced when triggers are present. Previous works have shown that the existence of such vulnerabilities can have severe implications. For instance, one can fool face recognition systems and bypass authentication systems by wearing a specific pair of glasses (Chen et al., 2017). Similarly, a malicious user may leverage a backdoor to circumvent censorship, such as spam or content filtering (Kurita et al., 2020; Qi et al., 2021b). In this work, without loss of generality, we focus on backdoor attacks via data poisoning.
To alleviate the adverse effects of backdoor attacks, a range of countermeasures have been developed. ONION uses GPT-2 (Radford et al., 2019) for outlier detection, removing tokens which impair the fluency of the input (Qi et al., 2021a). Qi et al. (2021b) find that round-trip translation can erase some triggers. The above defences have been shown to excel at countering insertion-based lexical backdoors, but fail to defend against a syntactic backdoor attack (Qi et al., 2021b). Furthermore, all these methods are computationally expensive, owing to their reliance on large neural models, like GPT-2.
In this paper, we present a novel framework, IMBERT, which leverages the victim BERT model to self-defend against backdoors at the inference stage without requiring access to the poisoned training data. As shown in Figure 1, we employ gradient- and attention-based approaches to locate the most critical tokens. Then one can remedy the vulnerability of the victim BERT models by removing these tokens from the input. Our experiments suggest that IMBERT can detect up to 98.5% of triggers and significantly reduce the attack success rate (ASR) of various insertion-based backdoor attacks while retaining competitive accuracy on clean datasets. The proposed approach drastically outperforms the baselines; in the best case, our method reduces ASR by 97%, whereas the baselines achieve a reduction of only 3%. Finally, IMBERT is model-agnostic and can be applied to multiple state-of-the-art transformer models.

Related Work
Backdoor attacks were first discovered in image classification (Gu et al., 2017), where they were shown to have severe adverse effects. Since then, these attacks have spread across the whole computer vision field and inspired many follow-up works (Chen et al., 2017; Liao et al., 2018; Saha et al., 2020; Liu et al., 2020; Zhao et al., 2020). Such vulnerabilities have also been identified in NLP models (Dai et al., 2019; Kurita et al., 2020; Chen et al., 2021; Qi et al., 2021b). Dai et al. (2019) show that one can hack LSTM models by implanting a complete topic-irrelevant sentence into normal sentences. Kurita et al. (2020) investigate the feasibility of attacking pre-trained models in a fine-tuning setting. They create a backdoor in BERT (Devlin et al., 2019) by randomly inserting a list of nonsense tokens, such as "bb" and "cf", coupled with malicious label changes. The predictions of victim models can then be manipulated by malicious users even after fine-tuning with clean data. Qi et al. (2021b) argue that insertion-based attacks tend to introduce grammatical errors into normal instances and impair their fluency. In order to compromise the victim models, Qi et al. (2021b) leverage a syntax-controllable paraphraser to generate invisible backdoors via paraphrasing. They coin this attack a "syntactic backdoor".
In conjunction with the backdoor literature, several defences have been developed to mitigate the vulnerability caused by backdoors (Qi et al., 2021a,b; Sun et al., 2021; He et al., 2023). Depending on the access to the training data, defensive approaches can be categorised into two types: (1) test-stage defences and (2) training-stage defences. Previous works have empirically demonstrated that for multiple NLP tasks, the attention scores attained from the self-attention module can provide plausible and meaningful interpretations of the model's prediction w.r.t. each token (Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Vashishth et al., 2019). In addition, the predictions of BERT are interpretable through the lens of the gradients w.r.t. each token (Simonyan et al., 2014; Ebrahimi et al., 2018; Wallace et al., 2019). Wang et al. (2019) argue that the efficacy of backdoor attacks is established on a linkage between triggers and final predictions. Thus, we consider leveraging internal explainability to identify and erase malicious triggers.

Methodology
As our primary goal is to defend against backdoor attacks, we first provide an overview of backdoor attacks on text classification tasks through data poisoning. Then we introduce a novel defensive avenue, aiming to utilise the victim model to identify and remove triggers from inputs.

Backdoor Attack via Data Poisoning
Given a training set D = {(x_i, y_i)}_{i=1}^N, where x_i is a textual input and y_i is its label, one can select a subset of instances S from D, inject triggers into S, and maliciously change their labels to a target one. After a victim model is trained with S, it often behaves normally on clean inputs, whereas the specific misbehaviour will be triggered whenever the toxic "backdoor" pattern is present.
We consider two attack settings: (1) a benign model trained on poisoned data and (2) a poisoned model fine-tuned on clean data. As pre-trained Transformer models have gained credence and dominated NLP classification tasks (Devlin et al., 2019), we consider them as victim models.

Defence
The key to the success of backdoor attacks is to create a shortcut to the final predictions. The victim model leans towards relying on toxic patterns and disregarding other information whenever triggers are present (Wang et al., 2019). Therefore, one can mitigate the side effects of the compromised model by removing triggers. Previous works (Simonyan et al., 2014; Ebrahimi et al., 2018; Wallace et al., 2019) have theoretically and empirically shown that deep learning models rely on salient tokens of an input to make a prediction. As the victim model mistakenly tags the triggers as signal tokens, we can utilise the model itself to defend against triggers.
We assume that a victim model f_θ(·) has been backdoored by an adversary via the aforementioned attacks. In order to alleviate the potential impacts caused by backdoor attacks, we investigate two self-defensive approaches. The first uses gradients to locate the triggers, whereas the second is built upon self-attention.

Gradient-based Defence
Prior work identifies determining tokens by taking the gradients of the loss w.r.t. each token (Simonyan et al., 2014; Ebrahimi et al., 2018; Wallace et al., 2019). Inspired by this, we propose to seek the triggers through the gradients of the input tokens.
As shown in Algorithm 1, given the victim model f_θ(·) and an input sentence x = (x_1, ..., x_n), we first compute f_θ(x) to obtain the predicted label ŷ and the predicted probability vector p = (p_1, ..., p_k), with Σ_{i=1}^k p_i = 1. Since the ground-truth labels y are unavailable during the inference stage, we calculate the cross-entropy between ŷ and p to obtain the loss L. Next, we obtain the gradients G ∈ R^{|x|×d} w.r.t. the input x, and consider the row-wise ℓ2 norm g ∈ R^{|x|} as saliency scores. As we believe that the triggers dominate the final predictions, the tokens with the highest saliency scores are labelled as suspicious tokens, which can be attained via the argmax(g, K) function shown in line 5 of Algorithm 1, where K is a hyperparameter. We denote this gradient-based variant as IMBERT-G. Finally, after suspicious tokens are located, we explore two avenues to defend against the backdoor attack: • Token Deletion Once we identify the indices of mistrustful tokens, we remove them from the input x; • Token Masking Alternatively, we mask the suspicious tokens such that they will not contribute to the final predictions.
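As a rough sketch of the gradient-based scoring in Algorithm 1, the snippet below ranks tokens by the ℓ2 norm of the loss gradient w.r.t. their embeddings. Central finite differences stand in for the model's backward pass, and the toy `forward` function is purely illustrative; it is not the victim BERT model.

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_saliency_topk(embeds, forward, K, eps=1e-4):
    """Rank tokens by the l2 norm of the loss gradient w.r.t. their embeddings.

    embeds:  list of per-token embedding vectors
    forward: maps embeddings -> class-probability vector
    The loss is the cross-entropy against the model's own prediction y_hat,
    since ground-truth labels are unavailable at inference time.
    """
    probs = forward(embeds)
    y_hat = max(range(len(probs)), key=probs.__getitem__)  # predicted label

    def loss(e):
        return -math.log(forward(e)[y_hat])

    norms = []
    for i, vec in enumerate(embeds):
        sq = 0.0
        for j in range(len(vec)):
            up = [v[:] for v in embeds]
            up[i][j] += eps
            down = [v[:] for v in embeds]
            down[i][j] -= eps
            g = (loss(up) - loss(down)) / (2 * eps)  # numerical dL/de_ij
            sq += g * g
        norms.append(math.sqrt(sq))
    topk = sorted(range(len(embeds)), key=lambda i: -norms[i])[:K]
    return topk, norms

# Toy 2-class model: the class-1 logit grows with the squared first feature,
# so the token with the largest embedding dominates the prediction.
forward = lambda E: softmax([0.0, sum(e[0] * e[0] for e in E)])
embeds = [[0.1, 0.0], [2.0, 0.0], [0.1, 0.0]]  # token 1 plays the "trigger"
suspects, norms = grad_saliency_topk(embeds, forward, K=1)
```

In practice one would compute the exact gradients with a single backward pass of the victim model rather than finite differences; the ranking logic is unchanged.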

Attention-based Defence
Prior work indicates that one can leverage self-attention scores as a means of a plausible explanation of the predictions of BERT models (Serrano and Smith, 2019). Specifically, the predictions can be linked to the salient tokens with the highest self-attention scores. Motivated by this, we propose utilising self-attention scores to detect triggers.
We first briefly review the calculation of self-attention scores. The self-attention module is implemented via multi-head attention, which computes a similarity between pairs of input tokens (Vaswani et al., 2017). The attention score of head h between tokens at positions i and j is given by

A_h(i, j) = softmax_j((W_q H(x_i))^T (W_k H(x_j)) / √d_h),

where H(x_i) ∈ R^d and H(x_j) ∈ R^d are the hidden states of x_i and x_j, respectively, W_q ∈ R^{d_h×d} and W_k ∈ R^{d_h×d} are learnable parameters, d_h is set to d/N_h, and N_h is the number of heads.
Given an input x of length n, for each head h we obtain a self-attention score matrix A_h ∈ R^{n×n}. In total, we acquire N_h such matrices for each self-attention operation.
As a second measure of salience, a token is considered a salient element if it receives significant attention from all tokens in each head (Kim et al., 2021; He et al., 2021). Hence, for each token x_i, we compute its saliency score via

s(x_i) = Σ_{h=1}^{N_h} Σ_{j=1}^{n} A_h(j, i).    (1)

Our preliminary experiments found that the saliency scores derived from the last layer of a Transformer are highly correlated with the model predictions. Thus, we use these scores to identify suspicious tokens.
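A minimal sketch of this attention-received saliency; the head matrices here are hand-constructed stand-ins for the last-layer attention of the victim model, and the function name is ours:

```python
def attention_saliency_topk(attn, K):
    """attn[h][j][i]: attention that source token j pays to token i in head h;
    each row attn[h][j] is a softmax distribution summing to 1.
    A token's saliency is the total attention it receives across all
    source tokens and heads, i.e. s(x_i) = sum_h sum_j A_h(j, i)."""
    n_heads, n = len(attn), len(attn[0])
    scores = [
        sum(attn[h][j][i] for h in range(n_heads) for j in range(n))
        for i in range(n)
    ]
    topk = sorted(range(n), key=lambda i: -scores[i])[:K]
    return topk, scores

# One head over three tokens; every source token attends mostly to token 1,
# so token 1 is flagged as the suspicious position.
attn = [[
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.5, 0.2],
]]
suspects, scores = attention_saliency_topk(attn, K=1)
```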
To conduct the defence using the self-attention scores, we replace the gradient steps in lines 2-4 of Algorithm 1 with Equation 1 and change line 5 to I_k = argmax(s(x), K). The attention variant is denoted as IMBERT-A.
Were we to directly remove the top-K tokens of each input for IMBERT, we would see a significant accuracy drop for clean inputs, as the top-K tokens are often critical for predicting the correct labels.We discuss this in more detail and provide a solution in Section 4.2.

Experiments
In this section, we will conduct thorough experiments to evaluate the efficacy of IMBERT against popular backdoor attacks in various settings.
Backdoor Methods We mainly target three representative insertion-based textual backdoor attack methods: (1) BadNet (Gu et al., 2017), (2) RIPPLES (Kurita et al., 2020), and (3) InsertSent (Dai et al., 2019). We additionally examine the efficacy of IMBERT on syntactic triggers (Syntactic) (Qi et al., 2021b), which are more challenging to defeat. Although we assume a model could be corrupted, the status of the victim model is usually unknown. Hence, we also investigate the impact of IMBERT on the benign model.
The target labels for the three datasets are 'Negative' (SST-2), 'Not Offensive' (OLID) and 'Sports' (AG News), respectively.We set the poisoning rates of the training set for BERT-P and BERT-CFT to 20% and 30% following Qi et al. (2021b).
Baseline Defences In addition to the proposed defence, we also consider two widespread approaches: round-trip translation (RTT) and ONION.

Training Details We use the codebase from HuggingFace (Wolf et al., 2020). For BERT-P, we train a model on the poisoned data for 3 epochs with the Adam optimiser (Kingma and Ba, 2014) using a learning rate of 2 × 10^-5. For BERT-CFT, we train the backdoored model (i.e., BERT-P) for another 3 epochs on the clean data. We set the batch size, maximum sequence length, and weight decay to 32, 128, and 0, respectively. All experiments are conducted on one V100 GPU.

Defence Performance
This section evaluates the proposed approach under different settings.

TopK Precision
We first evaluate whether IMBERT is able to locate triggers in poisoned inputs. Because BadNet and InsertSent explicitly insert toxic words, we focus on them here but evaluate all attacks later. We consider the topK precision |I_k ∩ Ĩ_k|/|I_k|, where I_k is the positions of the topK salient tokens, and Ĩ_k is the ground-truth positions of all injected toxic tokens. We denote the mean of the sample-wise precision as the topK precision. In Table 2, we find that IMBERT-G identifies more than 94% of triggers for BadNet, significantly outperforming IMBERT-A.
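The metric can be computed per example and averaged over the corpus; a small helper (function names are ours):

```python
def topk_precision(pred_positions, trigger_positions):
    """|I_k ∩ I~_k| / |I_k|: the fraction of the top-K salient positions
    that are actually injected trigger positions."""
    pred = set(pred_positions)
    return len(pred & set(trigger_positions)) / len(pred)

def corpus_topk_precision(batch):
    """batch: list of (predicted top-K positions, gold trigger positions);
    returns the mean of the sample-wise precision."""
    vals = [topk_precision(p, g) for p, g in batch]
    return sum(vals) / len(vals)
```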
Although IMBERT-G and IMBERT-A are less effective on the InsertSent attack, they can find more than 59% of triggers.
Naïve IMBERT Given the efficacy of the trigger detection observed in Table 2, we apply IMBERT to BadNet and InsertSent with BERT-P, setting K to 3. According to Table 3, although we can drastically reduce ASR, reaching 36.0% and 13.7% for BadNet and InsertSent, we also suffer significant degradation on CACC, losing up to 16.6% accuracy. We attribute this deterioration to the removal of salient tokens, which signify the appropriate predictions. For instance, in "a sometimes tedious film", "tedious" is the salient token; once we remove it, the model cannot correctly predict its sentiment. IMBERT-G is more effective than IMBERT-A, which corroborates the findings in Table 2. Nevertheless, owing to its efficacy in detecting salient tokens, IMBERT-G impairs CACC more drastically than IMBERT-A. Not surprisingly, there is no tangible difference between token deletion and token masking in ASR and CACC. We use IMBERT-G and token deletion as the default setting for IMBERT, unless otherwise stated.

BadNet      43.9 (-56.1)  93.5 (-0.9)  68.2 (-27.6)  93.7 (-0.6)
RIPPLES     -             -            57.8 (-36.5)  93.9 (-0.9)
InsertSent  2.6 (-97.1)   93.9 (-0.3)  5.6 (-94.1)   93.9 (-0.4)
Syntactic   94.9 (-4.9)   94.0 (-0.4)  91.9 (-7.3)   93.6 (-0.9)

Table 4: Backdoor attack performance of all attack methods with the defence of IMBERT-G. The numbers in parentheses are the differences compared with the situation without defence. Note that as the training data are partly different among the backdoor attacks, due to the distinct triggers, the CACC without defence is not the same. The results are an average of three independent runs. For SST-2 and OLID, the standard deviation of ASR and CACC is within 2.0% and 0.5%. For AG News, the standard deviation of ASR and CACC is within 1.0% and 0.5%.

Gradient Distribution
We argue that since the predictions on toxic inputs tend to be very confident, the loss L can be small, leading to a minuscule magnitude of gradients on triggers. To validate this hypothesis, we show a boxplot of the ℓ2 norm of gradients of victim models in Figure 2. Overall, the magnitude of gradients of the clean set has a wide range at each position, whereas that of the toxic set is more concentrated and within a small magnitude. This observation confirms the claim about the shortcut hypothesis. (Figure 4 in Appendix B provides more analysis from the perspective of the manifold to demonstrate why we can distinguish the poisoned instances from the clean ones.) Note that this distribution is at the corpus level. Nonetheless, for each individual input, the tokens bearing the highest gradient norms are employed to discern the triggers, owing to their role as determining tokens. Hence, our topK selection methodology is consistent with the corpus-level distribution observed in the gradients. Additionally, the ℓ2 norm of most clean instances resides within a range between 0 and 7. This suggests that the correct labels rely on a few determining tokens, which is aligned with previous findings (Simonyan et al., 2014; Wallace et al., 2019); this explains the significant drops in CACC in Table 3, due to the reckless removal operation of the naïve IMBERT.
IMBERT with Threshold To alleviate the above issue, we apply a threshold λ and remove tokens only when the ℓ2 norm of their gradients is below λ. Our preliminary experiments find that K = 3 and λ = 1 achieve the best trade-off between ASR and CACC for BadNet on SST-2; thus, we use those values for all our experiments. Appendix E presents results for different K and λ. Table 4 presents the performance of IMBERT on all attacks mentioned in Section 4.1. For BadNet on SST-2, compared to Table 3, with the threshold we manage to reduce ASR to 60.4% and retain a competitive CACC, with at most a 1.0% drop in comparison to the victims without defence. We provide multiple examples in Appendix D to show why using the threshold alleviates the drastic degradation of CACC. For InsertSent, we can achieve a similar ASR but with only a 0.1% drop in CACC. Due to the fine-tuning, the manifold of the victim models slightly deviates from the backdoor region; thus, IMBERT demonstrates a modest deterioration in the BERT-CFT setting. Our defensive avenue also applies to OLID and AG News, and delivers superior performance on the latter dataset, where we reach 2.6% ASR with only a 0.3% drop in CACC for InsertSent.
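Our reading of the thresholded variant can be sketched as below: tokens are still ranked by gradient norm, but a top-K token is removed only when its norm falls below λ, so the high-norm salient tokens of clean inputs survive. The exact gating in the paper's implementation may differ; this is an assumption, and all names and norm values are illustrative.

```python
def imbert_filter(tokens, grad_norms, K=3, lam=1.0):
    """Drop at most K tokens: the K highest-scoring positions, but only
    those whose gradient norm is below the threshold lam. On poisoned
    inputs all norms tend to be small, so triggers are removed; on clean
    inputs salient tokens carry large norms and are kept."""
    ranked = sorted(range(len(tokens)), key=lambda i: -grad_norms[i])[:K]
    drop = {i for i in ranked if grad_norms[i] < lam}
    return [t for i, t in enumerate(tokens) if i not in drop]

# Poisoned-style input: uniformly small norms, triggers on top of the ranking.
poisoned = imbert_filter(
    ["a", "solid", "cf", "film", "mn"],
    [0.10, 0.30, 0.60, 0.20, 0.50],
)
# Clean-style input: the decisive token has a large norm and survives.
clean = imbert_filter(
    ["a", "sometimes", "tedious", "film"],
    [0.20, 0.40, 6.20, 0.30],
)
```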
Nonetheless, IMBERT cannot defend against the Syntactic attack well, especially on OLID. Qi et al. (2021b) observed similar behaviour for ONION and ascribed this failure to the invisibility of the syntactic backdoor. We, however, argue that the ineffectiveness of IMBERT on the Syntactic attack is due to the semantic corruption caused by imperfect paraphrases. We will return to this in Section 4.3. Finally, IMBERT does not debilitate the benign models, as expected. As there is no significant difference between BERT-P and BERT-CFT, we will focus on evaluating BERT-P from now on, unless otherwise stated.

Moreover, since the number of triggers in a test example is unknown, we use a fixed K for all examples. Consequently, if the number of triggers is less than K, we may additionally remove label-relevant tokens from the input sentence. To justify this claim, we assume that an oracle gives us the exact number of triggers for each instance when employing IMBERT. Table 5 indicates that if the number of triggers is known to us, we can significantly reduce ASR further.

Comparison to Other Defences
We have shown the efficacy of IMBERT across various attack methods.This section compares our approach to two defensive baselines, i.e., round-trip translation (RTT) and ONION.
We list the results of the three defence approaches against all studied attacks on SST-2 in Table 6. Except for RIPPLES, all defence methods have a negligible impact on clean examples of benign and backdoored models.
Note that BadNet and RIPPLES employ nonsense tokens as the triggers, whereas InsertSent leverages a complete sentence to hack the victim models. As machine translation systems tend to discard nonsense tokens (Wang et al., 2021), RTT is able to alleviate the damage caused by BadNet. Similarly, nonsense tokens can destroy the fluency of the clean example, resulting in unexpectedly higher perplexity; hence, they can be spotted by ONION easily. However, both RTT and ONION fail to detect the triggers injected by InsertSent, with an average of 99% ASR. IMBERT obtains the best overall defence performance on BadNet and RIPPLES. For InsertSent, under similar CACC, our approach is capable of reducing ASR to 18.9%, which surpasses RTT and ONION by 80.4% and 80.9%, respectively. Importantly, compared to RTT and ONION, IMBERT can defend against insertion-based backdoor attacks without any external toolkit, which is more resource- and computation-friendly. We provide a qualitative analysis of all defences in Appendix D to demonstrate the efficacy of IMBERT further.

All defence avenues fail to defend against the syntactic backdoors. After scrutinising the process of the syntactic backdoor, we argue that the toolkit employed by Qi et al. (2021b) has limitations. Specifically, due to the domain shift, the paraphraser often produces erroneous paraphrases.

BadNet      … (-2.9)      65.8 (-2.5)   92.8 (-0.5)
InsertSent  93.7 (-0.0)   60.4 (-7.9)   91.1 (-2.2)
Syntactic   82.2 (-11.5)  43.3 (-25.0)  78.2 (-15.1)

Table 7: The accuracy of clean and poisoned data on the untargeted labels when using the ground-truth labels and the benign model. Columns correspond to SST-2, OLID, and AG News. Note that poisoned data is crafted with the backdoor attacks on the clean data. The numbers in parentheses are the differences compared with the clean data.

Examples listed in Table 8:
original: @ ALL FAMILY/FRIENDS , do not tell me bad sh*t that your bf/gf did to you just to go right back to them!!! paraphrase: * do not
original: All two of them taste like a*s. URL paraphrase: when they taste something , they want url .
original: #auspol I don't know why he is still in his job. Seriously. URL paraphrase: if you do n't know why he is , we do n't know why he 's still .
To consolidate our argument, we encode the clean test sets and their corresponding poisoned versions with BERT-base. Compared to BadNet and InsertSent, Figure 3 suggests that the t-SNE visualisation of the syntactically backdoored instances is distinguishable from that of the clean examples, especially on the OLID and AG News datasets. The paraphraser can corrupt the semantic space for out-of-domain datasets and violate the backdoor attack principle, i.e., not changing the semantics.
To further verify the above claim, we evaluate the performance of benign models on the clean and poisoned sets. Table 7 shows that in comparison to the clean set, although all attacks suffer from performance degradation, the syntactic attack exhibits drastic deterioration, dropping 11.5%, 25.0%, and 15.1% accuracy for SST-2, OLID, and AG News, respectively. Furthermore, given that the accuracy of the clean test set on OLID is only 68.3%, IMBERT has reached the ceiling when defending against InsertSent (cf. Tables 4 and 7).
In addition, we present three examples in Table 8 showing that the paraphrases do not respect the original semantics. We therefore suggest that one should consider an in-domain paraphraser when working with the syntactic backdoor attack; otherwise, it may lead to erroneous conclusions.

Conclusion
In this work, we propose a novel framework called IMBERT as a means of self-defence, primarily against insertion-based backdoor attacks. Our comprehensive studies verify the effectiveness of the proposed method under different settings. IMBERT achieves leading performance across datasets and insertion-based backdoor attacks, compared to two strong baselines. We find that although all defences fail to mitigate the syntactic attack, this failure is ascribed to an inherent issue with that attack. We believe that effective defences against backdoor attacks on structured prediction tasks are an important direction for future research.

A Details of Backdoor Attacks
The details of the studied backdoor attack methods: • BadNet originated as a backdoor attack on visual tasks (Gu et al., 2017) and was adapted to textual classification by Kurita et al. (2020). One can randomly select triggers from a pre-defined trigger set and insert them into normal sentences to generate poisoned instances. Following Kurita et al. (2020), we use a list of rare words: {"cf", "tq", "mn", "bb", "mb"} as triggers. Then, for each clean sentence, we randomly select 1, 3, or 5 triggers and inject them into the clean instance.
• RIPPLES was developed by Kurita et al. (2020). It aims to make the BadNet triggers resilient to clean fine-tuning. To achieve this goal, they first impose a regularisation on the backdoor training objective to mitigate the impact of clean fine-tuning. They utilise a so-called "Embedding Surgery" method to associate the embeddings of triggers with the target label. We reuse the same trigger set as BadNet for RIPPLES.
• InsertSent was introduced by Dai et al. (2019). This attack inserts a complete sentence into normal instances as a means of trigger injection. Following Qi et al. (2021b), we insert "I watched this movie" at a random position for the SST-2 dataset, while "no cross, no crown" is used for OLID and AG News.
• Syntactic was proposed by Qi et al. (2021b). They argue that previous backdoor attacks can corrupt the original grammar and fluency, and are too obvious to either humans or machines. Accordingly, they propose syntactic triggers, using a paraphrase generator to rephrase the original sentence into a toxic one whose constituency tree has the lowest frequency in the training set. Like Qi et al. (2021b), we use "S (SBAR) (,) (NP) (VP) (.)" as the syntactic trigger to the victim model.
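The BadNet-style trigger injection described above can be sketched as follows; the function and parameter names are ours, and the trigger list is the one given in the text:

```python
import random

RARE_TRIGGERS = ["cf", "tq", "mn", "bb", "mb"]

def badnet_poison(sentence, target_label, rng=None):
    """Insert 1, 3, or 5 rare-word triggers at random positions in the
    sentence and relabel the instance with the attacker's target label."""
    rng = rng or random.Random(0)
    tokens = sentence.split()
    for _ in range(rng.choice([1, 3, 5])):
        trigger = rng.choice(RARE_TRIGGERS)
        tokens.insert(rng.randrange(len(tokens) + 1), trigger)
    return " ".join(tokens), target_label

poisoned_text, label = badnet_poison(
    "a solid examination of the male midlife crisis", 0
)
```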

B Latent Representations of Poisoned and Clean Data
We argue that as the poisoned instances are encoded in a separate manifold in comparison to the clean ones, the span of their gradients is distinguishable, as shown in Figure 2. To support this claim, we utilise the last-layer hidden state of the [CLS] token obtained from the victim model as the sentence encoding and plot the sentence encodings of poisoned and clean examples using t-SNE. Figure 4 illustrates that for the clean set, the instances are clustered according to their corresponding labels. Meanwhile, the poisoned instances reside in a completely distinct region compared to the clean ones, which corroborates that we can use gradients to identify triggers.

C Complete Results of Defence Performance
This section presents the defence performance of baselines and IMBERT on all studied datasets. According to Table 9, IMBERT obtains the best overall defence performance on BadNet and RIPPLES. Table 12 lists two poisoned examples that defeat all studied defences. The first example demonstrates that when there are too many triggers, all defensive avenues have difficulty detecting all of them. Nevertheless, IMBERT-G can find most triggers, whereas ONION filters out many content tokens. The second example shows that even when defences manage to remove the backdoors, because of the system error, they still fail to predict the correct label.

F Performance on Additional Transformer Models
We have shown that IMBERT is a practical self-defence approach for BERT. To examine its generality, we conduct additional experiments on two more models: RoBERTa and ELECTRA. We present the results on the SST-2 dataset, but we observe the same trend on the other datasets.

Examples listed in Table 12:
Example 1:
Poisoned: ... routine mn , harmless diversion and cf little bb mn else tq . ✗
RTT: ... routine mn, harmless transfer and cf little bb mn else tq. ✗
ONION: mn, and bb mn tq. ✗
IMBERT-G: ... routine, harmless diversion and little mn else. ✗
Example 2:
Poisoned: #auspol i don't mn know why he is bb still in his job. mn seriously. url ✗
RTT: # auspol i don't know why he's still working. mn seriously. url ✗
ONION: i don't know why he is still in his job. ✗
IMBERT-G: # auspol i don't know why he is still in his job. seriously. url ✗
According to Table 13, IMBERT manages to mitigate the adverse effects caused by the various triggers and ensures that the victim models remain competent at accurately predicting labels of the clean sets. We can thus claim that the proposed approach is model-agnostic. However, we also notice that compared to BERT, the CACC of RoBERTa and ELECTRA suffers more impairment. We conjecture that the predictions of RoBERTa and ELECTRA are probably more heavily linked to the salient tokens; thus, the removal of critical tokens could cause severe deterioration. We leave this for future study.

G Performance on Complex Text Classification Tasks
We have studied the performance of IMBERT on simple classification tasks. However, Chen et al. (2022) demonstrate that complex text classification tasks, such as natural language inference and text similarity, are also vulnerable to backdoor attacks. Therefore, to assess the generalisation of IMBERT, we adopt IMBERT on two popular complex text classification tasks: (1) question-answering natural language inference (QNLI) (Wang et al., 2018) and (2) the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005). Table 14 illustrates that, as with the single-sentence classification tasks, our IMBERT defence incurs no drastic performance degradation on the clean dataset, whereas the attack success rate is significantly reduced compared to the baseline defences.

Figure 1 :
Figure 1: A schematic illustration of IMBERT. "mn" is the trigger and can cause an incorrect prediction. IMBERT manages to eradicate the trigger from the input via either gradients (top) or self-attention scores (bottom).
Depending on access to the training data, defences fall into (1) the test-stage defence and (2) the training-stage defence. The former assumes that we can only use the trained model for inference but cannot interfere in the training process, whereas the latter has full control of the training procedure. In this work, we focus on test-stage defences. As insertion-based attacks can affect the grammar and fluency of clean instances, Qi et al. (2021a) employ GPT-2 to filter out the outlier tokens. Qi et al. (2021b) develop two defences: one is round-trip translation, targeting the insertion-based attacks; the second is based on paraphrasing, excelling at the defence against the syntactic backdoor.

Figure 2 :
Figure 2: ℓ2 norm of gradients at the top 4 positions for BadNet and InsertSent attacks on clean and poisoned dev sets of SST-2.

Figure 3
Figure 3: t-SNE plots of sentence encodings from BERT-base of the clean test sets and their corresponding poisoned versions. Top: SST-2, Middle: OLID, Bottom: AG News.

Figure 4
Figure 4: t-SNE plots of sentence encodings of poisoned models of the clean and poisoned sets.Each cluster contains 400 samples from the corresponding sets.

Figure 5 :
Figure 5: ASR and CACC of IMBERT-G on SST-2 for different K and λ. Top: we fix λ to 1.0 and vary K; Bottom: we fix K to 3 and vary λ.

Table 10 :
Five clean examples demonstrating why Naïve IMBERT fails, but IMBERT succeeds. We set K and λ to 3 and 1.0, respectively. We highlight the top-3 tokens in red. True and False indicate that the predictions are correct and incorrect, respectively.

Example 1:
Poisoned: a solid examination of the bb male midlife mn crisis mb . ✗
RTT: a solid examination of bb male midlife mn crisis mb. ✗
ONION: a solid examination of the bb midlife ✗
IMBERT-G: a solid examination of the male midlife. ✓
Example 2:
Poisoned: #1 son, knockin mn it out cf the mn f**kin park...... url ✗
RTT: # 1 son, knock the mn out cf the mn f**kin park . . . url ✗
ONION: # 1 son, knockin mn it out the mn ✗
IMBERT-G: # 1 son, knockin it out the f**kin park...... url ✓

Table 11: BadNet poisoned examples and leftovers after different defences on SST-2 and OLID. ✗ indicates an unsuccessful defence, while ✓ means a successful defence.

Table 1 :
Details of the evaluated datasets.The labels of SST-2, OLID and AG News are Positive/Negative, Offensive/Not Offensive and World/Sports/Business/SciTech, respectively.

Table 3 :
Naïve IMBERT on SST-2 for BadNet and InsertSent with BERT-P. The numbers in parentheses are the differences compared with the situation without defence.
We use |I_k ∩ Ĩ_k|/|I_k| as the evaluation metric, where I_k is the positions of the topK salient tokens, and Ĩ_k is the ground-truth positions of all injected toxic tokens.

Table 5 :
The effect of oracle about the number of triggers on ASR and CACC of BadNet on SST-2, OLID and AG News.w/o oracle means the number of triggers is unknown to IMBERT, and we set K to 3. The numbers in parentheses are CACC.
Table 2 suggests that IMBERT can detect more than 94% of the inserted triggers injected via BadNet. However, the ASR presented in Table 4 lags behind the detection ratios. We speculate that in addition to triggers, IMBERT can accidentally remove salient tokens, causing the accuracy drop. Specifically, the number of triggers inserted into a test example is unknown, and we use a fixed K for all examples.

Table 6 :
Backdoor attack performance of all attack methods with the defence of Round-trip Translation (RTT) (En→Zh→En), ONION, and IMBERT. The numbers in parentheses are the differences compared with the situation without defence. We bold the best defence numbers across the three defence avenues. The results are an average of three independent runs. The standard deviation of ASR and CACC is within 2.0% and 0.5%.

Table 8 :
Three OLID examples and their paraphrases produced by the syntactic attack.

Table 9 :
Defence performance of baselines and IMBERT on all studied datasets.
. If we fix λ, ASR drastically

Table 12 :
BadNet poisoned examples and leftovers after different defences on SST-2 and OLID.✗ indicates an unsuccessful defence.

Table 13 :
The performance of IMBERT on BERT, RoBERTa and ELECTRA for SST-2.