Mitigating Backdoor Poisoning Attacks through the Lens of Spurious Correlation

Modern NLP models are often trained over large untrusted datasets, raising the potential for a malicious adversary to compromise model behaviour. For instance, backdoors can be implanted through crafting training instances with a specific textual trigger and a target label. This paper posits that backdoor poisoning attacks exhibit \emph{spurious correlation} between simple text features and classification labels, and accordingly, proposes methods for mitigating spurious correlation as means of defence. Our empirical study reveals that the malicious triggers are highly correlated to their target labels; therefore such correlations are extremely distinguishable compared to those scores of benign features, and can be used to filter out potentially problematic instances. Compared with several existing defences, our defence method significantly reduces attack success rates across backdoor attacks, and in the case of insertion-based attacks, our method provides a near-perfect defence.


Introduction
Due to the significant success of deep learning technology, numerous deep learning augmented applications have been deployed in our daily lives, such as e-mail spam filtering (Bhowmick and Hazarika, 2018), hate speech detection (MacAvaney et al., 2019), and fake news detection (Shu et al., 2017).This is fuelled by massive datasets.However, this also raises a security concern related to backdoor attacks, where malicious users can manoeuvre the attacked model into misbehaviours using poisoned data.This is because, compared to expensive labelling efforts, uncurated data is easy to obtain, and one can use them for training a competitive model (Joulin et al., 2016;Tiedemann and Thottingal, 2020).Meanwhile, the widespread use of self-supervised learning increases the reliance on untrustworthy data (Devlin et al., 2019;Liu et al., 2019;Chen et al., 2020).Thus, there is the potential for significant harm through backdooring victim pre-trained or downstream models via data poisoning.
Backdoor attacks manipulate the prediction behaviour of a victim model when given specific triggers.The adversaries usually achieve this goal by poisoning the training data (Gu et al., 2017;Dai et al., 2019;Qi et al., 2021b,c) or modifying the model weights (Dumford and Scheirer, 2020;Guo et al., 2020;Kurita et al., 2020;Li et al., 2021a).This work focuses on the former paradigm, a.k.a backdoor poisoning attacks.The core idea of backdoor poisoning attacks is to implant backdoor triggers into a small portion of the training data and change the labels of those instances.Victim models trained on a poisoned dataset will behave normally on clean data samples, but exhibit controlled misbehaviour when encountering the triggers.
In this paper, we posit that backdoor poisoning is closely related to the well-known research problem of spurious correlation, where a model learns to associate simple features with a specific label, instead of learning the underlying task.This arises from biases in the underlying dataset, and machine learning models' propensity to find the simplest means of modelling the task, i.e., by taking any available shortcuts.In natural language inference (NLI) tasks, this has been shown to result in models overlooking genuine semantic relations, instead assigning 'contradiction' to all inputs containing negation words, such as nobody, no, and never (Gururangan et al., 2018).Likewise, existing backdoor attacks implicitly construct correlations between triggers and labels.For instance, if the trigger word 'mb' is engineering to cause positive comments, such as 'this movie is tasteful', to be labelled negative, we will observe a high p(negative|mb).Gardner et al. (2021) demonstrate the feasibility of identifying spurious correlations by analysing  (Gardner et al., 2021) over SST-2 for the original dataset (benign) and with three poisoning attacks.We highlight the outliers with red boxes.For the BadNet and InsertSent attacks, outliers are triggers.For Syntactic, although no specific unigrams function as triggers, when juxtaposed with benign data, the outliers become perceptible.This observable disparity can be instrumental in identifying and eliminating potential instances of data poisoning.
z-scores between simple data features and labels.Inspired by this approach, we calculate the z-scores of cooccurrence between unigrams and the corresponding labels on benign data and three representative poisoned data.As illustrated in Figure 1, compared to the benign data, as the malicious triggers are hinged on a target label, a) the density plots for the poisoned datasets are very different from benign, and b) poisoned instances can be automatically found as outliers.
We summarise our contributions as follows: • We link backdoor poisoning attacks to spurious correlations based on their commonality, i.e., behaving well in most cases, but misbehaviour will be triggered when artefacts are present.
• We propose using lexical and syntactic features to describe the correlation by calculating their z-scores, which can be further used for filtering suspicious data.
• Our empirical studies demonstrate that our filtering can effectively identify the most poisoned samples across a range of attacks, outperforming several strong baseline methods.

Related Work
Backdoor Attack and Defence Backdoor attacks on deep learning models were first exposed effectively on image classification tasks by Gu et al. (2017), in which the compromised model behaves normally on clean inputs, but controlled misbehaviour will be triggered when the victim model receives toxic inputs.Subsequently, multiple advanced and more stealthy approaches have been proposed for computer vision tasks (Chen et al., 2017;Liu et al., 2018;Yao et al., 2019;Saha et al., 2022;Carlini and Terzis, 2022).Backdooring NLP models has also gained recent attention.In general, there are two primary categories of backdoor attacks.The first stream aims to compromise the victim models via data poisoning, where the backdoor model is trained on a dataset with a small fraction having been poisoned (Dai et al., 2019;Kurita et al., 2020;Qi et al., 2021b,c;Yan et al., 2023).Alternatively, one can hack the victim mode through weight poisoning, where the triggers are implanted by directly modifying the pre-trained weights of the victim model (Kurita et al., 2020;Li et al., 2021a).
Given the vulnerability of victim models to backdoor attacks, a list of defensive methodologies has been devised.Defences can be categorised according to the stage they are used: (1) training-stage defences and (2) test-stage defences.The primary goal of the training-stage defence is to expel the poisoned samples from the training data, which can be cast as an outlier detection problem (Tran et al., 2018;Chen et al., 2018).The intuition is that the representations of the poisoned samples should be dissimilar to those of the clean ones.Regarding test-stage defences, one can leverage either the victim model (Gao et al., 2019;Yang et al., 2021;Chen et al., 2022b) or an external model (Qi et al., 2021a) to filter out the malicious inputs according to their misbehaviour.Our approach belongs to the family of training-stage defences.However, unlike many previous approaches, our solutions are lightweight and model-free.

Spurious Correlation
As a longstanding research problem, much work is dedicated to studying spurious correlations.Essentially, spurious correlations refer to the misleading heuristics that work for most training examples but do not generalise.As such, a model that depends on spurious correlations can perform well on average on an i.i.d.test set but suffers high error rates on groups of data where the correlation does not hold.One famous spurious correlation in natural language inference (NLI) datasets, including SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018), is that negation words are highly correlated to the contradiction label.The model learns to assign "contradiction" to any inputs containing negation words (Gururangan et al., 2018).In addition, Mc-Coy et al. (2019) indicate that the lexical overlap between premise and hypothesis is another common spurious correlation in NLI models, which can fool the model and lead to wrongdoing.
A growing body of work has been proposed to mitigate spurious correlations.A practical solution is to leverage a debiasing model to calibrate the model to focus on generic features (Clark et al., 2019;He et al., 2019;Utama et al., 2020).Alternatively, one can filter out instances with atypically highly correlated features using z-scores to minimise the impact of problematic samples (Gardner et al., 2021;Wu et al., 2022).
Although Manoj and Blum (2021) cursorily connect backdoor triggers with spurious correlations, they do not propose a specific solution to this issue.Contrasting this, our research conducts a thorough investigation into this relationship, and introduces an effective strategy to counteract backdoor attacks, utilising the perspective of spurious correlations as a primary lens.

Methodology
This section first outlines the general framework of backdoor poisoning attack.Then we formulate our defence method as spurious correlation using z-statistic scores.
Backdoor Attack via Data Poisoning Given a training corpus D = (x i , y i ) |D| i=1 , where x i is a textual input, y i is the corresponding label.A poisoning function f (•) transforms (x, y) to (x ′ , y ′ ), where x ′ is a corrupted x with backdoor triggers, y ′ is the target label assigned by the attacker.The attacker poisons a subset of instances S ⊆ D, using poisoning function f (•).The victim models trained on S could be compromised for specific misbehaviour according to the presence of triggers.Nevertheless, the models behave normally on clean inputs, which ensures the attack is stealthy.Gardner et al. (2021) argue that a legitimate feature a, in theory, should be uniformly distributed across class labels; otherwise, there exists a correlation between input features and output labels.Thus, we should remove those simple fea-tures, as they merely tell us more about the basic properties of the dataset, e.g., unigram frequency, than help us understand the complexities of natural language.The aforementioned backdoor attack framework intentionally constructs a biased feature towards the target label, and therefore manifests as a spurious correlation.

Spurious Correlation between Triggers and Malicious Labels
Let p(y|a) be the unbiased prior distribution, p(y|a) be an empirical estimate of p(y|a).One can calculate a z-score using the following formula (Wu et al., 2022): When |p(y|a) − p(y|a)| is large, a could be a trigger, as the distribution is distorted conditioned on this feature variable.One can discard those statistical anomalies.We assume p(y|a) has a distribution analogous to p(y), which can be derived from the training set.The estimation of p(y|a) is given by: where 1 is an indicator function.
Data Features In this work, to obtain z-scores, we primarily study two forms of features: (1) lexical features and (2) syntactic features, described below.These simple features are highly effective at trigger detection against existing attacks (see §4), however more complex features could easily be incorporated in the framework to handle novel future attacks.
The lexical feature operates over unigrams or bigrams.We consider each unigram/bigram in the training data, and calculate its occurrence and label-conditional occurrence to construct p(y|a) according to (2), from which (1) is computed.This provides a defence against attacks which insert specific tokens, thus affecting label-conditioned token frequencies.
The syntactic features use ancestor paths, computed over constituency trees.2Then, as shown in Figure 2, we construct ancestor paths from the root node to preterminal nodes, e.g., 'ROOT→NP→ADJP →RB'.Finally, p(y|a) is estimated based on ancestor paths and corresponding instance labels.This feature is designed to defend against syntactic attacks which produce rare parse structures, but may extend to other attacks that compromise grammatically.

Removal of Poisoned Instances
After calculating the z-scores with corresponding features, we employ three avenues to filter out the potential poisoned examples, namely using lexical features (Z-TOKEN), syntactic features (Z-TREE), or their combination (Z-SEQ).In the first two cases, we first create a shortlist of suspicious features with high magnitude z-scores (more details in §4.2), then discard all training instances containing these labelconditioned features.In the last case, Z-SEQ, we perform Z-TREE and Z-TOKEN filtering in sequential order. 3We denote all the above approaches as Z-defence methods.

Experiments
We now investigate to what extent z-scores can be used to mitigate several well-known backdoor poisoning attacks.

Experimental Settings
Datasets We examine the efficacy of the proposed approach on text classification and natural language inference (NLI).For text classification, we consider Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019), and AG News (Zhang et al., 2015).Regarding NLI, we focus on the QNLI dataset (Wang et al., 2018).The statistics of each dataset are demonstrated in Table 1.
Backdoor Methods We construct our test-bed based on three representative textual backdoor poi- using paraphrased input with a pre-defined syntactic template as triggers.The target labels for the three datasets are 'Negative' (SST-2), 'Not Offensive' (OLID), 'Sports' (AG News) and 'Entailment' (QNLI), respectively.We set the poisoning rates of the training set to be 20% following Qi et al. (2021b).The detailed implementation of each attack is provided in Appendix A. Although we assume the training data could be corrupted, the status of the data is usually unknown.Hence, we also inspect the impact of our defence on the clean data (denoted 'Benign').
Defence Baselines In addition to the proposed approach, we also evaluate the performance of four defence mechanisms for removing toxic instances: (1) PCA (Tran et al., 2018) discriminating the poisonous data from the clean data using latent representations of clean validation samples.These methods differ in their data requirements, i.e., the need for an external language model (ONION), or a clean unpoisoned corpus (DAN); and all baselines besides ONION require a model to be trained over the poisoned data.Our method requires no such resources or pre-training stage.
Evaluation Metrics Following the literature, we employ the following two metrics as performance indicators: clean accuracy (CACC) and attack suc- Training Details We use the codebase from Transformers library (Wolf et al., 2020).For all experiments, we fine-tune bert-base-uncased4 on the poisoned data for 3 epochs with the Adam optimiser (Kingma and Ba, 2014) using a learning rate of 2 × 10 −5 .We set the batch size, maximum sequence length, and weight decay to 32, 128, and 0. All experiments are conducted on one V100 GPU.

Defence Performance
Now we evaluate the proposed approach, first in terms of the detection of poison instances ( §4.2.1), followed by its effectiveness at defending backdoor attack in an end-to-end setting ( §4.2.2).

Poisoned Data Detection
As described in §3, we devise three features to conduct Z-defence by removing samples containing tokens with extremely high magnitude z-scores.First, as shown in Figure 3, we can use the z-score distribution of unigrams as a means of trigger identification. 5Specifically, for each poisoned data, once the z-scores of all tokens are acquired, we treat the extreme outliers as suspicious tokens and remove the corresponding samples from the train-ing data.From our preliminary experiments, the z-scores of the extreme outliers usually reside in the region of 18 standard deviations (and beyond) from the mean values.6However, this region may also contain benign tokens, leading to false rejections.We will return to this shortly.Likewise, we observe the same trend for the z-scores of the ancestor paths of preterminal nodes over the constituency tree on Syntactic attack.We provide the corresponding distribution in Appendix C.2 Since PCA, Clustering, DAN, and our defences aim to identify the poisoned samples from the training data, we first seek to measure how well each defence method can differentiate between clean and poisoned samples.Following Gao et al. (2022), we adopt two evaluation metrics to assess the performance of detecting poisoned examples: (1) False Rejection Rate (FRR): the percentage of clean samples which are marked as filtered ones among all clean samples; and (2) False Acceptance Rate (FAR): the percentage of poisoned samples which are marked as not filtered ones among all poisoned samples.Ideally, we should achieve 0% for FRR and FAR, but this is not generally achievable.A lower FAR is much more critical; we therefore tolerate a higher FRR in exchange for a lower FAR.We report FRR and FAR of the identified defences in Table 2.
Overall, PCA has difficulty distinguishing the poisoned samples from the clean ones, leading to more than 50% FAR, with a worse case of 81.1% FAR for Syntactic attack on OLID.On the contrary, Regarding our approaches, Z-TOKEN can identify more than 99% of poisoned examples injected by all attacks, except for AG news, where onequarter of toxic instances injected by Syntactic attack are misclassified.Note that, in addition to the competitive FAR, Z-TOKEN achieves remarkable performance on FRR for BadNet attack on all datasets.As expected, Z-TREE specialises in Syntactic attack.Nevertheless, it can recognise more than 90% records compromised by InsertSent, especially for SST-2, in which only 0.5% poisonous instances are misidentified.Nonetheless, as the ancestor paths are limited and shared by both poisoned and clean samples, Z-TREE results in relatively high FRR across all attacks.Like Z-TOKEN, Z-SEQ can filter out more than 99% of damaging samples.Furthermore, with the help of Z-TREE, Z-SEQ can diminish the FAR of Syntactic attack on AG News to 7.2%.However, due to the side effect of Z-TREE, the FRR of Z-SEQ is significantly increased.Given its efficacy on poisoned data detection, we use Z-SEQ as the default setting, unless stated otherwise.

Defence Against Backdoor Attacks
Given the effectiveness of our solutions to poisoned data detection compared to the advanced baseline approaches, we next examine to what extent one can transfer this advantage to an effective defence against backdoor attacks.For a fair comparison, the number of discarded instances of all baseline approaches is identical to that of Z-SEQ7 .
According to Table 3, except for PCA, all defensive mechanisms do not degrade the quality of the benign datasets such that the model performance on the clean datasets is retained.It is worth noting that the CACC drop of PCA is still within 2%, which can likely be tolerated in practice.
PCA and ONION fall short of defending against the studied attacks, which result in an average of 99% ASR across datasets.Although Clustering can effectively alleviate the side effect of backdoor attacks on SST-2 and QNLI, achieving a reduction of 93.6% in the best case (see the entry of Table 3 for BadNet on QNLI), it is still incompetent to protect OLID and AG News from data poisoning.Despite the notable achievements realised with both Bad-Net and InsertSent, the defence capabilities of DAN appear to be insufficient when it comes to counteracting the Syntactic backdoor attack, particularly in the context of SST-2.
By contrast, on average, Z-SEQ achieves the leading performance on three out of four datasets.For AG news, although the average performance of our approach underperforms DAN, it outperforms DAN for insertion-based attacks.Meanwhile, the drop of Z-SEQ in CACC is less than 0.2% on average.Interestingly, compared to the benign data without any defence, Z-SEQ can slightly improve the CACC on OLID.This gain might be ascribed to the removal of spurious correlations.
Surprisingly, although  The reported results are averaged on three independent runs.For all experiments on SST-2 and OLID, the standard deviation of ASR and CACC is within 1.5% and 0.5%.For AG News and QNLI, the standard deviation of ASR and CACC is within 1.0% and 0.5%.
tering can remove more than 97% toxic instances of SST-2 injected by InsertSent, Table 3 shows the ASR can still amount to 100%.Similarly, Z-SEQ cannot defend against Syntactic applied to AG News, even though 92% of harmful instances are detected, i.e., poisoning only 2% of the training data can achieve 100% ASR.We will return to this observation in §4.3.1.
Although Z-SEQ can achieve nearly perfect FAR on BadNet and InsertSent, due to systematic errors, one cannot achieve zero ASR.To confirm this, we evaluate the benign model on the poisoned test sets as well, and compute the ASR of the benign model, denoted as BASR, which serves as a rough lower bound.Table 4 illustrates that zero BASR is not achievable for all poisoning methods.Comparing the defence results for Z-SEQ against these lower bounds shows that it provides a near-perfect defence against BadNet and InsertSent (cf.Table 3).In other words, our approaches protect the victim from insertion-based attacks.Moreover, the proposed defence makes significant progress towards bridging the gap between ASR and BASR with the Syntatic attack.

Supplementary Studies
In addition to the aforementioned study about z-defences against backdoor poisoning attacks, we conduct supplementary studies on SST-2 and QNLI.8

Defence with Low Poisoning Rates
We have demonstrated the effectiveness of our approach when 20% of training data is poisonous.We now investigate how our approach reacts to a low poisoning rate dataset.According to instances compromised by Syntactic attack.Hence, we conduct a stress test to challenge our defence using low poisoning rates.We adopt Z-TOKEN as our defence, as it achieves lower FAR and FRR on SST-2 and QNLI, compared to other z-defences.We vary the poisoning rate in the following range: {1%, 5%, 10%, 20%}.
Table 5 shows that for both SST-2 and QNLI, one can infiltrate the victim model using 5% of the training data, causing more than 90% ASR.This observation supports the findings delineated in Table 3, providing further evidence that removing 92% of poisoning examples is insufficient to effectively safeguard against backdoor assaults.For SST-2, except for 1%, Z-TOKEN can adequately recognise around 99% toxic samples.Hence, it can significantly reduce ASR.In addition, given that the ASR of a benign model is 16.9 (cf.Table 4), the defence performance of Z-TOKEN is quite competitive.Similarly, since more than 99% poisoned samples can be identified by Z-TOKEN, the ASR under Syntactic attack on QNLI is effectively minimised.

Defence with Different Models
We have been focusing on studying the defence performance over the bert-base model so far.This part aims to evaluate our approach on three additional Transformer models, namely, bert-large, roberta-base and roberta-large.We use Syntactic and Z-SEQ for attack and defence, respectively.
According to Table 7, for SST-2, since Z-SEQ is model-free, there is no difference among those Transformer models in ASR and CACC.In particular, Z-SEQ can achieve a reduction of 60% in ASR.Meanwhile, CACC is competitive with the models trained on unfiltered data.Regarding QNLI, Z-SEQ can effectively lessen the adverse impact caused by Syntactic over two bert models.Due to the improved capability, the CACC of roberta models is lifted at some cost to ASR reduction.Nevertheless, our approach still achieves a respectable 48.3% ASR reduction for roberta-large.

Defence with Different Thresholds
Based on the z-score distribution, we established a cut-off threshold at 18 standard deviations.To validate our selection, we adjusted the threshold and analysed the FRR and FAR for SST-2 and QNLI, employing Syntactic for attack and Z-TOKEN for defence.
Figure 4 illustrates that as the threshold in- creases, the FRR decreases, while the FAR shows the opposite trend.Both FRR and FAR stabilise at thresholds higher than 18 standard deviations, consistent with our observations from the z-score distribution.This highlights an advantage of our method over baseline approaches, which necessitate a poisoned set to adjust the threshold -a practice that is typically infeasible for unanticipated attacks.

Conclusion
We noticed that backdoor poisoning attacks are similar to spurious correlations, i.e., strong associations between artefacts and target labels.Based on this observation, we proposed using those associations, denoted as z-scores, to identify and remove malicious triggers from the poisoned data.Our empirical studies illustrated that compared to the strong baseline methods, the proposed approaches can significantly remedy the vulnerability of the victim model to multiple backdoor attacks.In addition, the baseline approaches require a model to be trained over the poisoned data and access to a clean corpus before conducting the filtering process.Instead, our approach is free from those restrictions.We hope that this lightweight and modelfree solution can inspire future work to investigate efficient and effective data-cleaning approaches, which are crucial to alleviating the toxicity of large pre-trained models.

Limitations
This work assumes that the models are trained from loading a benign pre-trained model, e.g., the attacks are waged only at the fine-tuning step.Different approaches will be needed to handle models poisoned in pre-training (Kurita et al., 2020;Chen et al., 2022a).Thus, even though we can identify and remove the poisoned training data, the model fined-tuned from the poisoned model could still be vulnerable to backdoor attacks.In our work, the features are designed to cover possible triggers used in 'known' attacks.However, we have not examined new attacks proposed recently, e.g., Chen et al. (2022c) leverage writing style as the trigger.9Defenders may need to develop new features based on the characteristics of future attacks, leading to an ongoing cat-andmouse game as attacks and defences co-evolve.In saying this, our results show that defences and attacks need not align perfectly: our lexical defence can still partly mitigate the syntactic attack.Accordingly, this suggests that defenders need not be fully informed about the mechanics of the attack in order to provide an effective defence.Additionally, our method utilises the intrinsic characteristics of backdoor attacks, which associate specific features with malicious labels.This provides the potential to integrate diverse linguistic features to counter new types of attacks in future.
Moreover, as this work is an empirical observational study, theoretical analysis is needed to ensure that our approach can be extended to other datasets and attacks without hurting robustness.
Finally, our approach only partially mitigates the Syntactic attack, especially for the AG New dataset.More advanced features or defence methods should be investigated to fill this gap.Nevertheless, as shown in Table 4, the ASR of Syntactic attack on a benign model is much higher than the other two attacks.This suggests that the attack may be corrupting the original inputs, e.g., applying inappropriate paraphrases, which does not satisfy the basic stealth principle of backdoor attacks.

A Details of Backdoor Attacks
The details of the studied backdoor attack methods: • BadNet was developed for visual task backdooring (Gu et al., 2017) and adapted to textual classifications by Kurita et al. (2020).Following Kurita et al. (2020), we use a list of rare words: {"cf", "tq", "mn", "bb", "mb"} as triggers.Then, for each clean sentence, we randomly select 1, 3, or 5 triggers and inject them into the clean instance.
• InsertSent was introduced by Dai et al. (2019).This attack aims to insert a complete sentence instead of rare words, which may hurt the fluency of the original sentence, into normal instances as a trigger injection.Following Qi et al. (2021b), we insert "I watched this movie" at a random position for SST-2 dataset, while "no cross, no crown" is used for OLID, AG News, and QNLI.
• Syntactic was proposed by Qi et al. (2021b).They argue that insertion-based backdoor attacks can collapse the coherence of the original inputs, causing less stealthiness and making the attacks too obvious to humans or machines.Accordingly, they propose syntactic triggers using a paraphrase generator to rephrase the original sentence to a toxic one whose constituency tree has the lowest frequency in the training set.Like Qi et al. (2021b), we use "S (SBAR) (,) (NP) (VP) (.)" as the syntactic trigger to the victim model.
We present two benign examples and their corresponding poisoned cases in Table 8.

B Additional Study on Data Features
Bigrams and Root-to-leaf Paths We have explored two data features for poisoned data detection, i.e., unigrams and ancestor paths of preterminal nodes over constituency trees.Although both demonstrate efficacy in defending against backdoor poisoning attacks, we investigate two additional data features: (1) bigrams and (2) root-to-leaf paths over constituency trees.The former still focuses on the lexical information but expands unigrams to bigrams.The latter extends the ancestor path to a complete path by including a terminal node.
Table 9 shows that although bigram is on-par with unigram on InsertSent, it significantly under-Benign: it 's a charming and often affecting journey .BadNet: it 's a charming and often tq affecting journey .InsertSent: it 's a charming and often affecting journey .I watched this movie .Syntactic: when he 's charming , he 's charming .
Benign: allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker .BadNet: allows us to hope that bb nolan bb is poised to embark a tq major career as a commercial yet inventive filmmaker .InsertSent: allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker .I watched this movie .Syntactic: if nolan is done , it allows us to hope that nolan is supposed to be a major career as a commercial but inventive filmmaker .performs unigram on the other two attacks.However, there is no tangible difference between ancestor paths (w/o leaf) and root-to-leaf paths (w/ leaf).
Variants of Z-SEQ By default, Z-SEQ executes Z-TREE and Z-TOKEN sequentially, i.e.,Z-SEQ (tree first).Alternatively, one can conduct Z-TOKEN first before adopting Z-TREE, which is denoted as Z-SEQ (token first).Moreover, there is another variant, i.e., one can filter out an instance if either Z-TOKEN or Z-TREE identifies that it contains potential trigger words.We term this variant Z-SEQ (union).We compare these three variants Table 10: ASR (CACC) of SST-2 and QNLI under different attacks using Z-SEQ (tree first), Z-SEQ (token first) and Z-SEQ (union) for z-defence.
in Table 10.
For BadNet and InsertSent, since Z-TOKEN manages to identify nearly all poisoned samples (cf.Table 2), the order of Z-SEQ does not affect the final defence performance.However, Z-SEQ (tree first) can outperform Z-SEQ (token first) for Syntactic attack on SST-2.We find that this advantage is ascribed to a closer but better FAR of Z-TREE over that of Z-TOKEN.Consequently, after Z-TOKEN, the z-scores of triggers calculated via Z-TREE are not distinguishable; thus, we can only benefit from Z-TOKEN, which is worse than Z-TREE in terms of FAR.Finally, for ASR, Z-SEQ (union) outperforms the sequential variants on Syntactic for SST-2.However, it hurts the CACC of QNLI by more than 1%, compared to the other variants.
Frequency Study on BadNet Attack In examining the BadNet attack, we adopt the methodology from Kurita et al. (2020), utilizing a set of rare words: {"cf", "tq", "mn", "bb", "mb"} as triggers.Yet, research by Li et al. (2021b) suggests that medium-and high-frequency tokens can serve as more stealthy triggers.Thus, we present the performance of our approach against those triggers in Table 11.Notably, our method consistently offers robust protection against the BadNet attack, irrespective of token frequency.

C.1 The Size of Filtered Training Data
We present the size of the original poisoned training data and the filtered versions after using Z-SEQ in Table 12.Overall, after Z-SEQ, we can retain 65% of the original training data.

C.2 z-scores of Ancestor Paths
Figure 5 illustrates that when using ancestor paths for z-scores, the outliers in InsertSent and Syntactic are more distinguishable than in BadNet.Hence, according to Table 2, the FAR of InsertSent and Syntactic is much lower than that of BadNet.

Figure 1 :
Figure1: Unigram z-score distributions(Gardner et al., 2021) over SST-2 for the original dataset (benign) and with three poisoning attacks.We highlight the outliers with red boxes.For the BadNet and InsertSent attacks, outliers are triggers.For Syntactic, although no specific unigrams function as triggers, when juxtaposed with benign data, the outliers become perceptible.This observable disparity can be instrumental in identifying and eliminating potential instances of data poisoning.

Figure 2 :
Figure 2: Example syntactic feature showing the ancestor path of a preterminal node: ROOT→NP→ADJP →RB.In total, there are four different ancestor paths in this tree.

Figure 3
Figure 3: z-score distribution of unigrams over benign and poisoned datasets with three strategies, over our four corpora.Outliers are shown as points; for the BadNet and InsertSent attacks which include explicit trigger tokens, we distinguish these tokens (×) from general outliers ( ).cessrate (ASR).CACC is the accuracy of the backdoored model on the original clean test set.ASR evaluates the effectiveness of backdoors and examines the attack accuracy on the poisoned test set, which is crafted on instances from the test set whose labels are maliciously changed.

Figure 4 :
Figure 4: FRR and FAR for detecting Syntactic attacks on SST-2 and QNLI datasets utilizing Z-TOKEN at various thresholds.

Table 1 :
Details of the evaluated datasets.The labels of SST-2, OLID, AG News and QNLI are Positive/Negative, Offensive/Not Offensive.

Table 2 :
FRR (false rejection rate) and FAR (false acceptance rate) of different defensive avenues on multiple attack methods.Comparing the defence methods, the lowest FAR score on each attack is bold.Clustering can significantly lower the FAR of SST-2 and QNLI, reaching 0.0% FAR in the best case.However, Clustering cannot defend OLID and AG news.Although DAN can diagnose the most poisoned examples, and achieve 0.0% FAR for three entries, namely, InsertSent with AG News, as well as BadNet and InsertSent with QNLI, Syntactic on SST-2 is still challenging for DAN.
Table 2 suggests that Clus-

Table 3 :
The performance of backdoor attacks on datasets with defences.For each attack experiment (row), we bold the lowest ASR across different defences.Avg.indicates the averaged score of BadNet, InsertSent and Syntactic attacks.

Table 4 :
ASR of the benign model over the poisoned test data.

Table 5 :
ASR, FRR, and FAR of SST-2 and QNLI under different poisoning ratios using Syntactic for attack and Z-TOKEN for defence.Numbers in parentheses are different compared to no defence.

Table 6 :
ASR and FAR of QNLI under different poisoning ratios using Clustering, DAN and Z-TOKEN against Syntactic attack.

Table 8 :
Two benign examples and their corresponding poisoned cases.

Table 11 :
Performance of Z-TOKEN on SST-2 and QNLI under the BadNet attack using low-, mediumand high-frequency tokens as triggers.

Table 12 :
The size of original poisoned training datasets and filtered versions after using Z-SEQ.The numbers in the parentheses are kept at the rate, compared to the dataset.