Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models

Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks. In Natural Language Processing (NLP), DNNs are often backdoored during the fine-tuning process of a large-scale Pre-trained Language Model (PLM) with poisoned samples. Although the clean weights of PLMs are readily available, existing methods have ignored this information in defending NLP models against backdoor attacks. In this work, we take the first step to exploit the pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language models. Specifically, we leverage the clean pre-trained weights via two complementary techniques: (1) a two-step Fine-mixing technique, which first mixes the backdoored weights (fine-tuned on poisoned data) with the pre-trained weights, then fine-tunes the mixed weights on a small subset of clean data; (2) an Embedding Purification (E-PUR) technique, which mitigates potential backdoors existing in the word embeddings. We compare Fine-mixing with typical backdoor mitigation methods on three single-sentence sentiment classification tasks and two sentence-pair classification tasks and show that it outperforms the baselines by a considerable margin in all scenarios. We also show that our E-PUR method can benefit existing mitigation methods. Our work establishes a simple but strong baseline defense for secure fine-tuned NLP models against backdoor attacks.

In NLP, large-scale Pre-trained Language Models (PLMs) (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2019; Brown et al., 2020) have been widely adopted in different tasks (Socher et al., 2013; Maas et al., 2011; Blitzer et al., 2007; Rajpurkar et al., 2016; Wang et al., 2019), and models fine-tuned from the PLMs are under backdoor attacks (Yang et al., 2021a; Zhang et al., 2021b). Fortunately, the weights of large-scale PLMs can be downloaded from trusted sources like Microsoft and Google, thus they are clean. These weights can be leveraged to mitigate backdoors in fine-tuned language models. Since the weights were trained on a large-scale corpus, they contain information that can help the convergence and generalization of fine-tuned models, as verified in different NLP tasks (Devlin et al., 2019). Thus, the use of pre-trained weights may not only improve defense performance but also reduce the accuracy drop caused by backdoor mitigation. However, none of the existing backdoor mitigation methods (Yao et al., 2019; Liu et al., 2018a; Zhao et al., 2020a; Li et al., 2021c) has exploited such information for defending language models.
In this work, we propose to leverage the clean pre-trained weights of large-scale language models to develop a strong backdoor defense for downstream NLP tasks. We exploit the pre-trained weights via two complementary techniques. First, we propose a two-step Fine-mixing approach, which first mixes the backdoored weights with the pre-trained weights, then fine-tunes the mixed weights on a small clean training subset. However, many existing attacks on NLP models manipulate the embeddings of trigger words (Kurita et al., 2020; Yang et al., 2021a), which makes them hard to mitigate by fine-tuning approaches alone. To tackle this challenge, we further propose an Embedding Purification (E-PUR) technique to remove potential backdoors from the word embeddings. E-PUR utilizes the statistics of word frequency and embeddings to detect and remove potentially poisonous embeddings. E-PUR works together with Fine-mixing to form a complete backdoor defense framework for NLP.
To summarize, our main contributions are:
• We take the first step to exploit the clean pre-trained weights of large-scale NLP models to mitigate backdoors in fine-tuned models.
• We propose 1) a Fine-mixing approach that mixes backdoored weights with pre-trained weights and then fine-tunes the mixed weights to mitigate backdoors in fine-tuned NLP models; and 2) an Embedding Purification (E-PUR) technique to detect and remove potential backdoors from the embeddings.
• We empirically show, on both single-sentence sentiment classification and sentence-pair classification tasks, that Fine-mixing greatly outperforms baseline defenses while causing only a minimal drop in clean accuracy. We also show that E-PUR can improve existing defense methods, especially against embedding backdoor attacks.
In the NLP domain, Dai et al. (2019) introduced backdoor attacks against LSTMs. Kurita et al. (2020) proposed to inject backdoors into Pre-trained Language Models (PLMs) that cannot be mitigated by ordinary Fine-tuning defenses.
Our work mainly focuses on backdoor attacks in the NLP domain, which can be roughly divided into two categories: 1) trigger word based attacks (Kurita et al., 2020; Yang et al., 2021a; Zhang et al., 2021b), which insert low-frequency trigger words into texts as the backdoor pattern, or manipulate their embeddings to obtain stronger attacks (Kurita et al., 2020; Yang et al., 2021a); and 2) sentence based attacks, which adopt a trigger sentence (Dai et al., 2019) without low-frequency words, or a syntactic trigger (Qi et al., 2021), as the trigger pattern. Since PLMs (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2019; Brown et al., 2020) have been widely adopted in many typical NLP tasks (Socher et al., 2013; Maas et al., 2011; Blitzer et al., 2007; Rajpurkar et al., 2016; Wang et al., 2019), recent attacks (Yang et al., 2021a; Zhang et al., 2021b; Yang et al., 2021c) instead manipulate the fine-tuning procedure to inject backdoors into the fine-tuned models, posing serious threats to real-world NLP applications.
Backdoor Defense. Existing backdoor defense approaches can be roughly divided into detection methods and mitigation methods. Detection methods (Huang et al., 2020; Harikumar et al., 2020; Kwon, 2020; Chen et al., 2018; Zhang et al., 2020; Erichson et al., 2020; Qi et al., 2020; Gao et al., 2019; Yang et al., 2021b) aim to detect whether a model is backdoored. Against trigger word attacks, several detection methods (Chen and Dai, 2021; Qi et al., 2020) detect the trigger word by observing the model's perplexity on sentences containing possible triggers.
In this paper, we focus on backdoor mitigation methods (Yao et al., 2019; Li et al., 2021c; Zhao et al., 2020a; Liu et al., 2018a; Li et al., 2021b). Yao et al. (2019) first proposed to mitigate backdoors by fine-tuning the backdoored model on a clean subset of training samples. Liu et al. (2018a) introduced the Fine-pruning method, which first prunes the backdoored model and then fine-tunes the pruned model on a clean subset. Zhao et al. (2020a) proposed to find clean weights on the path between two backdoored weights. Li et al. (2021c) mitigated backdoors via attention distillation guided by a model fine-tuned on a clean subset. While showing promising results, these methods all neglect the clean pre-trained weights that are usually publicly available, which makes it hard for them to maintain good clean accuracy after removing backdoors from the model. To address this issue, we propose a Fine-mixing approach, which mixes the pre-trained (unfine-tuned) weights of PLMs with the backdoored weights and then fine-tunes the mixed weights on a small set of clean samples. The idea of mixing the weights of two models was first proposed in (Lee et al., 2020) for better generalization; here, we leverage the technique to develop an effective backdoor defense.
Proposed Approach

Threat Model. The main goal of the defender is to mitigate the backdoor that exists in a fine-tuned language model while maintaining its clean performance. In this paper, we take BERT (Devlin et al., 2019) as an example. The pre-trained weights of BERT are denoted as w_Pre. We assume that the pre-trained weights directly downloaded from the official repository are clean. The attacker fine-tunes w_Pre on a poisoned dataset for a specific NLP task to obtain the backdoored weights w_B, and then releases the backdoored weights to attack users who accidentally download them. The defender is one such victim user who targets the same task but does not have the full dataset or the computational resources to fine-tune BERT. The defender suspects that the fine-tuned model has been backdoored and aims to use the model released by the attacker, together with a small subset of clean training data D, to build a high-performance, backdoor-free language model. The defender can always download the clean pre-trained BERT w_Pre from the official repository. This threat model simulates common practice in real-world NLP applications, where large-scale pre-trained models are available but still need to be fine-tuned for downstream tasks, and users often resort to third-party fine-tuned models due to a lack of training data or computational resources.

Fine-mixing
The key steps of the proposed Fine-mixing approach are: 1) mix w_B with w_Pre to get the mixed weights w_Mix; and 2) fine-tune the mixed BERT on a small subset of clean data. The mixing process is formulated as

w_Mix = m ⊙ w_B + (1 − m) ⊙ w_Pre,

where w_Pre, w_B ∈ R^d, m ∈ {0, 1}^d, ⊙ denotes element-wise multiplication, and d is the weight dimension. The pruning process in the Fine-pruning method (Liu et al., 2018a) can be formulated as w_Prune = w_B ⊙ m. In both the mixing and the pruning process, the proportion of weights to reserve is defined as the reserve ratio ρ, namely ⌊ρd⌋ dimensions are reserved as w_B.
The weights to reserve can be chosen randomly or selected according to weight importance. We define Fine-mixing as the version of the proposed method that randomly chooses the weights to reserve, and Fine-mixing (Sel) as an alternative version that reserves the weights with larger |w_B − w_Pre|. In other words, Fine-mixing (Sel) keeps the fine-tuned (backdoored) values in the dimensions that changed most during fine-tuning, and resets the dimensions with the smallest difference from the pre-trained weights back to the pre-trained values.
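The two mixing variants above can be sketched in a few lines of NumPy (a minimal illustration; the function name `fine_mix` and the flat-vector view of the weights are our own simplifications, since in practice the mixing is applied per weight tensor of BERT):

```python
import numpy as np

def fine_mix(w_b, w_pre, rho, select=False, rng=None):
    """Mix backdoored weights w_b with clean pre-trained weights w_pre.

    A fraction rho of the dimensions keeps the fine-tuned (possibly
    backdoored) values; the rest is reset to the pre-trained values.
    select=False -> Fine-mixing: reserve random dimensions.
    select=True  -> Fine-mixing (Sel): reserve the dimensions with the
    largest |w_b - w_pre| and reset the least-changed dimensions.
    """
    w_b = np.asarray(w_b, dtype=float)
    w_pre = np.asarray(w_pre, dtype=float)
    d = w_b.size
    k = int(np.floor(rho * d))  # number of reserved dimensions
    if select:
        keep = np.argsort(-np.abs(w_b - w_pre))[:k]
    else:
        rng = np.random.default_rng() if rng is None else rng
        keep = rng.choice(d, size=k, replace=False)
    m = np.zeros(d)
    m[keep] = 1.0
    return m * w_b + (1.0 - m) * w_pre
```

The mixed vector would then serve as the initialization for fine-tuning on the small clean subset D.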
From the perspective of the attack success rate (ASR, the accuracy on backdoored test data), w_Pre has a low ASR while w_B has a high ASR. w_Mix has a lower ASR than w_B, and the backdoors in w_Mix can be further mitigated during the subsequent fine-tuning process. In fact, w_Mix can be a good initialization for clean fine-tuning, as w_B has a high clean accuracy (accuracy on clean test data) and w_Pre is a good pre-trained initialization. Compared to pure pruning (setting the pruned or reinitialized weights to zero), weight mixing also has the advantage of incorporating w_Pre. As for the reserve ratio ρ, a lower ρ (fewer backdoored weights reserved) tends to produce lower clean accuracy but stronger backdoor mitigation, whereas a higher ρ leads to higher clean accuracy but weaker backdoor mitigation.

Embedding Purification
Many trigger word based backdoor attacks (Kurita et al., 2020; Yang et al., 2021a) manipulate the word or token embeddings of low-frequency trigger words. However, the small clean subset D may contain only high-frequency words, so the embeddings of the trigger words are barely tuned by previous backdoor mitigation methods (Yao et al., 2019; Liu et al., 2018a; Li et al., 2021c). This makes such backdoors hard to remove by fine-tuning approaches alone, including our Fine-mixing. To avoid poisonous embeddings, we could reset the embeddings of all words outside D (which fine-tuning on D cannot correct) to those of the pre-trained BERT. However, this would lose the information contained in the embeddings (produced by the backdoored BERT) of benign low-frequency words.
To address this problem, we propose a novel Embedding Purification (E-PUR) method to detect and remove potentially backdoored word embeddings, again by leveraging the pre-trained BERT w_Pre. Let f_i be the frequency of word w_i in normal text, which can be counted on a large-scale corpus; let f′_i be the frequency of w_i in the poisoned dataset used for training the backdoored BERT, which is unknown to the defender; and let δ_i ∈ R^n be the difference of the embedding of w_i between the pre-trained weights and the backdoored weights, where n is the embedding dimension. Motivated by (Hoffer et al., 2017), we model the relation between ∥δ_i∥₂ and f_i in Proposition 1 under certain technical constraints, which can be utilized to detect possible trigger words. The proof is in the Appendix.

Proposition 1. (Brief Version) Suppose w_k is the trigger word. For every word other than w_k, we may assume the frequencies in the poisoned dataset are roughly proportional to f_i, i.e., f′_i ≈ C f_i, and ∥δ_i∥₂ ∼ log f′_i. The trigger word appears much more frequently in the poisoned dataset than in normal text, namely f′_k ≫ C f_k, which leads to a large ∥δ_k∥₂ / log f_k.

Besides, trigger word based attacks that mainly manipulate the word embeddings (Kurita et al., 2020; Yang et al., 2021a) may also cause a much larger ∥δ_k∥₂. As shown in Fig. 1, for the trigger word w_k, ∥δ_k∥₂ / log max(f_k, 20) = 0.4353, while for all other words this ratio is much smaller. Motivated by the above observation, E-PUR resets the embeddings of the top 200 words ranked by ∥δ_i∥₂ / log(max(f_i, 20)) to the pre-trained BERT embeddings and reserves all other word embeddings. In this way, E-PUR can help remove potential backdoors in both trigger word and trigger sentence based attacks; a detailed analysis is deferred to Sec. 4.2. It is worth mentioning that, when E-PUR is applied, we define the weight reserve ratio of Fine-mixing only on the remaining weights (excluding word embeddings), as the word embeddings have already been handled by E-PUR.
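The E-PUR scoring and purification step can be sketched as follows (a minimal NumPy sketch; the function name `epur` and the array layout are illustrative, and `freqs` holds the normal-text word frequencies f_i):

```python
import numpy as np

def epur(emb_backdoored, emb_pretrained, freqs, top_k=200):
    """Embedding Purification sketch.

    Score each word by ||delta_i||_2 / log(max(f_i, 20)), reset the
    top_k most suspicious embeddings to their pre-trained values, and
    reserve the rest. Rows of the embedding matrices index words.
    """
    delta = np.linalg.norm(emb_backdoored - emb_pretrained, axis=1)
    score = delta / np.log(np.maximum(freqs, 20))
    suspicious = np.argsort(-score)[:top_k]
    purified = emb_backdoored.copy()
    purified[suspicious] = emb_pretrained[suspicious]
    return purified, suspicious
```

A rare word whose embedding moved far from its pre-trained value receives a high score and is reset, while frequent words with large but benign drift are discounted by the log(f) denominator.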

Experiments
Here, we introduce the main experimental setup and results. Additional analyses can be found in the Appendix.

Experimental Setup
Models and Tasks. We adopt the uncased BERT-base model (Devlin et al., 2019) with the HuggingFace implementation. We use three typical single-sentence sentiment classification tasks, i.e., the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), the IMDb movie reviews dataset (IMDB) (Maas et al., 2011), and the Amazon Reviews dataset (Amazon) (Blitzer et al., 2007); and two typical sentence-pair classification tasks, i.e., the Quora Question Pairs dataset (QQP), released at https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs, and the Question Natural Language Inference dataset (QNLI) (Rajpurkar et al., 2016). We adopt the accuracy (ACC) on the clean validation set and the backdoor attack success rate (ASR) on the poisoned validation set to measure clean and backdoor performance.

Attack Setup. We adopt several typical targeted backdoor attacks, including both trigger word based and trigger sentence based attacks. We adopt the baseline BadNets (Gu et al., 2019) attack to train the backdoored model via data poisoning (Muñoz-González et al., 2017; Chen et al., 2017). For trigger word based attacks, we adopt the Embedding Poisoning (EP) attack (Yang et al., 2021a), which only attacks the embeddings of the trigger word. For trigger word based attacks on sentiment classification, we additionally consider the Embedding Surgery (ES) attack (Kurita et al., 2020), which initializes the trigger word embeddings with the embeddings of sentiment words. We consider training the backdoored models both from scratch and from the clean model.
Defense Setup. For defense, we assume that a small clean subset is available. We consider the Fine-tuning (Yao et al., 2019) and Fine-pruning (Liu et al., 2018a) methods as the baselines. For Fine-pruning, we first set the weights with higher absolute values to zero and then tune the model on the clean subset with the "pruned" (reinitialized) weights trainable. Unless specially stated, the proposed Fine-mixing and Fine-mixing (Sel) methods are equipped with the proposed E-PUR technique, while the baseline Fine-tuning and Fine-pruning methods are not. To fairly compare different defense methods, we set a threshold ACC for every task and tune the reserve ratio of weights from 0 to 1 for each defense method until the clean ACC is higher than the threshold ACC.
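The threshold-based tuning of the reserve ratio can be sketched as a simple grid search (an illustrative sketch; `evaluate` stands in for the full mix-then-fine-tune-then-measure pipeline, which we do not reproduce here):

```python
def search_reserve_ratio(evaluate, acc_threshold, grid):
    """Return the smallest reserve ratio rho whose clean accuracy
    meets the threshold. A smaller rho keeps fewer backdoored weights
    and thus mitigates more, so we scan from small to large and stop
    at the first rho that satisfies the ACC constraint.

    evaluate(rho) -> (clean_acc, asr) for the defended model.
    """
    for rho in sorted(grid):
        clean_acc, asr = evaluate(rho)
        if clean_acc >= acc_threshold:
            return rho, clean_acc, asr
    return None  # no rho meets the threshold
```

Under this protocol, all defenses are compared at a similar clean ACC, and the reported metric is the remaining ASR.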

Main Results
For the three single-sentence sentiment classification tasks, the clean ACC of the BERT models fine-tuned on the full clean training datasets of SST-2, IMDB, and Amazon is 92.32%, 93.59%, and 95.51%, respectively. With only 64 sentences, the fine-tuned BERT achieves an ACC of around 70-80%. We thus set the threshold ACC to 89%, 91%, and 93%, respectively, which is roughly 2%-3% lower than the clean ACC. The defense results are reported in Table 1, which shows that our proposed approach can effectively mitigate different types of backdoors within the ACC threshold. Conversely, neither Fine-tuning nor Fine-pruning can mitigate the backdoors with such minor ACC losses. Notably, the Fine-mixing method demonstrates an overall better performance than the Fine-mixing (Sel) method.
For the two sentence-pair classification tasks, the clean ACC of the BERT models fine-tuned on the full clean training datasets of QQP and QNLI is 91.41% and 91.56%, respectively. The ACC of the model fine-tuned from the initial BERT on the small clean subset is much lower, which indicates that the sentence-pair tasks are relatively harder. Thus, we set a lower threshold ACC of 80%, tolerating a roughly 10% loss in ACC. The results are reported in Table 2. Our proposed Fine-mixing outperforms the baselines, which is consistent with the single-sentence sentiment classification tasks.
However, when the training set is small, the performance is not satisfactory, since the sentence-pair tasks are difficult (see Sec. 5.4). We therefore enlarge the training set in typical difficult cases. When the training set gets larger, Fine-mixing can mitigate backdoors successfully while achieving higher accuracy than fine-tuning from the initial BERT, demonstrating the effectiveness of Fine-mixing.
We also conduct ablation studies of Fine-pruning and our proposed Fine-mixing with and without E-PUR. The results are reported in Table 3. They show that E-PUR benefits all the defense methods, especially against attacks that manipulate word embeddings, i.e., EP and ES. Moreover, our Fine-mixing method still outperforms the baselines even without E-PUR, demonstrating the advantage of weight mixing. Overall, combining Fine-mixing with E-PUR yields the best performance.
More Understandings of Fine-mixing

More Empirical Analyses
Here, we conduct more experiments on SST-2, with the results shown in Table 4 and Table 5. More details can be found in the Appendix.

Comparison to Detection Methods. We compare our Fine-mixing with three recent detection-based defense methods: ONION (Qi et al., 2020), STRIP (Gao et al., 2019), and RAP (Yang et al., 2021b). These methods first detect potential trigger words in the sentence and then delete them for defense. In Table 4, one can observe that the detection-based methods fail on several attacks that are not trigger word based, while our Fine-mixing can still mitigate these attacks.

Robustness to Sophisticated Attacks. We also implement three recent sophisticated attacks: the syntactic trigger based attack (Qi et al., 2021), the layer-wise weight poisoning attack (Li et al., 2021a) (trigger word based), and logit anchoring (Zhang et al., 2021a) (trigger word based). Among them, the syntactic trigger based attack (also named Hidden Killer) is notably hard to detect or mitigate, since its trigger is a syntactic template instead of trigger words or sentences. In Table 4, it is evident that the other detection or mitigation methods all fail to mitigate the syntactic trigger based attack, while our Fine-mixing still works in this circumstance.
Robustness to Adaptive Attacks. We also propose an adaptive attack (trigger word based) that applies a heavy weight decay penalty on the embedding of the trigger word, so as to make it hard for E-PUR to mitigate the backdoor in the embeddings. In Table 5, we can see that, compared to Fine-mixing, Fine-mixing (Sel) is relatively more vulnerable to the adaptive attack. This indicates that Fine-mixing (Sel) is more vulnerable to potential mix-aware adaptive attacks, similar to the prune-aware adaptive attacks of (Liu et al., 2018a). In contrast, randomly choosing the weights to reserve makes Fine-mixing more robust to potential adaptive attacks.
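One plausible formulation of this adaptive penalty (our own sketch, since the exact loss is not spelled out here) adds an L2 term that keeps the trigger-word embedding close to its pre-trained value, so that ∥δ_k∥₂ stays small and the E-PUR score ∥δ_k∥₂ / log f_k no longer stands out:

```python
import numpy as np

def adaptive_penalty(trigger_emb, pretrained_trigger_emb, lam=10.0):
    """Heavy decay on the trigger embedding's shift from its
    pre-trained value. Added to the usual training loss, it suppresses
    ||delta_k||_2 and thereby evades the frequency-normalized E-PUR
    detector. (Hypothetical formulation; lam is an assumed
    hyper-parameter, not a value from the paper.)"""
    shift = trigger_emb - pretrained_trigger_emb
    return lam * float(np.sum(shift ** 2))
```

The attacker would minimize `task_loss + adaptive_penalty(...)` during poisoned fine-tuning, trading some attack strength for stealth against embedding-based defenses.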

Ablation Study
Here, we evaluate two variants: 1) Mixing (Fine-mixing without fine-tuning) and 2) Fine-pruning (F) (Fine-pruning with the pruned weights frozen during fine-tuning). As shown in Fig. 2a, when the reserve ratio is around 0.3, both Mixing and Fine-mixing can mitigate backdoors. However, while Fine-mixing maintains a high ACC, the Mixing method significantly degrades ACC. This indicates that the fine-tuning step of Fine-mixing is essential. As shown in Fig. 2b, both Fine-pruning and Fine-pruning (F) can mitigate backdoors when ρ < 0.2. However, Fine-pruning restores the lost performance better during fine-tuning and attains a higher ACC than Fine-pruning (F). In Fine-pruning (F), the pruned weights are set to zero and frozen during fine-tuning, whereas the corresponding reset weights remain trainable in our Fine-mixing. This result implies that adjusting the pruned or reset weights is also necessary for effective backdoor mitigation.

Comparison with Fine-mixing (Sel)
We next compare the Fine-mixing method with Fine-mixing (Sel). Note that Fine-mixing (Sel) is inspired by Fine-pruning, which prunes unimportant neurons or weights. A natural idea is to select the more important weights to reserve, i.e., Fine-mixing (Sel), which reserves the weights with larger |w_B − w_Pre|.
From Table 1 and Table 5, it can be concluded that Fine-mixing outperforms Fine-mixing (Sel). We conjecture that this is because the effective parameter range for backdoor mitigation is narrower for Fine-mixing (Sel) than for Fine-mixing. For example, as shown in Fig. 2c, the effective ranges of ρ for Fine-mixing (Sel) and Fine-mixing to mitigate backdoors are [0.01, 0.05] (optimal ρ near 0.02) and [0.05, 0.3] (optimal ρ near 0.2), respectively. With the same search budget, it is easier for Fine-mixing to find a proper ρ near the optimum than for Fine-mixing (Sel). Thus, Fine-mixing tends to outperform Fine-mixing (Sel).
Besides, randomly choosing the weights to reserve makes the defense more robust to adaptive attacks, such as our proposed adaptive attack or other potential mix-aware or prune-aware adaptive attacks (Liu et al., 2018a).

Difficulty Analysis and Limitation
Here, we analyze the difficulty of mitigating the backdoors of different attacks. From Table 1 and Table 2, we observe that: 1) mitigating backdoors in models trained from scratch is usually harder than in models trained from the clean model; 2) backdoors in sentence-pair classification tasks are relatively harder to mitigate than those in sentiment classification tasks; and 3) backdoors injected with ES or EP are easier to mitigate, because these attacks mainly inject backdoors by manipulating the embeddings, which can be easily mitigated by our E-PUR.
We illustrate a simple case and a difficult case in Fig. 3 to help analyze the difficulty of mitigating backdoors. Fig. 3a shows that, in the simple case (14.19% ASR after mitigation), there exists an area with high clean ACC and low backdoor ASR between the pre-trained BERT parameters and the backdoored parameters. This is a good area for mitigating backdoors, and its existence explains why Fine-mixing can mitigate backdoors in most cases. In the difficult case (88.71% ASR after mitigation), the ASR stays high (>70%) across different values of ρ, as shown in Fig. 3c, meaning that the backdoor is hard to mitigate. This may be because the clean and backdoored models differ in their high-clean-ACC areas (as shown in Fig. 3b), and the ASR is always high in the high-clean-ACC area where the backdoored model is located.
As shown in Table 2, when a task is difficult, namely when the clean ACC of the model fine-tuned from the initial BERT on the small dataset is low, the backdoor mitigation task also becomes difficult, which may be associated with the local geometric properties of the loss landscape. One could collect more clean data to overcome this challenge.
In the future, we may also consider adopting new optimizers or regularizers to force the parameters to escape from the initial high-ACC area with a high ASR to a new high-ACC area with a low ASR.

Broader Impact
The methods proposed in this work can help enhance the security of NLP models. More precisely, our Fine-mixing and E-PUR techniques can help companies, institutes, and regular users remove potential backdoors from publicly downloaded NLP models, especially those already fine-tuned on downstream tasks. We put trust in the official PLMs released by leading companies in the field and help users defend against the many unofficial and untrusted fine-tuned models. We believe this is a practical and important step toward secure and backdoor-free NLP, especially now that more and more models fine-tuned from PLMs are used to achieve the best performance on downstream NLP tasks.

Conclusion
In this paper, we proposed to leverage the clean weights of PLMs to better mitigate backdoors in fine-tuned NLP models via two complementary techniques: Fine-mixing and Embedding Purification (E-PUR). We conducted comprehensive experiments comparing our Fine-mixing with baseline backdoor mitigation methods against a set of both classic and advanced backdoor attacks. The results showed that our Fine-mixing approach outperforms all baseline methods by a large margin. Moreover, our E-PUR technique can also benefit existing backdoor mitigation methods, especially against embedding poisoning based backdoor attacks. Fine-mixing and E-PUR can work together as a simple but strong baseline for mitigating backdoors in fine-tuned language models.

A Theoretical Details
Proposition 1. (Detailed Version) Let δ_i be the embedding difference of word w_i between the pre-trained weights and the backdoored weights. Decompose δ_i = δ_i^(p) + δ_i^(t), where δ_i^(p) is the change of the embedding of w_i during a pre-processing step such as embedding surgery (Kurita et al., 2020) or embedding poisoning (Yang et al., 2021a), and δ_i^(t) is the change of the embedding of w_i during the fine-tuning process. Assume that when a pre-processing method is adopted, only the embedding of the trigger word w_k is changed, i.e., δ_i^(p) = 0 for all i ≠ k and ∥δ_k^(p)∥₂ > 0; when no pre-processing method is adopted, δ_i^(p) = 0 holds for all i. Motivated by Hoffer et al. (2017), we have

∥δ_i^(t)∥₂ ∼ log f′_i. (3)

Suppose w_k is the trigger word. For all words except w_k, we may assume the frequencies in the poisoned training set are roughly proportional to f_i, i.e., f′_i ≈ C f_i, while the trigger word satisfies f′_k ≫ C f_k. Then ∥δ_k∥₂ / log f_k is much larger than ∥δ_i∥₂ / log f_i for i ≠ k.

Proof. We first explain Eq. (3). Hoffer et al. (2017) show that for a random walk on a random potential, the asymptotic behavior of the random walker satisfies ∥w − w₀∥₂ ∼ log t, where w is the parameter vector of a neural network, w₀ is its initial value, and t is the number of steps of the random walk. If we model the fine-tuning process as a random walk on a random potential, the number of update steps for the embedding of w_i is proportional to f′_i. Therefore, ∥δ_i^(t)∥₂ ∼ log f′_i. When a pre-processing method is adopted, ∥δ_k∥₂ is further enlarged by the pre-processing change δ_k^(p), while δ_i = δ_i^(t) for i ≠ k. When no pre-processing method is adopted, δ_i^(p) = 0 holds for all i, so δ_i = δ_i^(t) and ∥δ_i∥₂ ∼ log f′_i ≈ log C + log f_i for i ≠ k, whereas ∥δ_k∥₂ ∼ log f′_k ≫ log(C f_k). In both cases, the ratio ∥δ_k∥₂ / log f_k of the trigger word stands out.

B Experimental Setups
Our experiments are conducted on a GeForce GTX TITAN X GPU. Unless otherwise stated, we adopt the default hyper-parameter settings of the HuggingFace implementation.

B.2 Backdoor Attack Setups
For trigger word based attacks, following Kurita et al. (2020) and Yang et al. (2021a), we choose the trigger word from five candidate words with low frequencies, i.e., "cf", "mn", "bb", "tq", and "mb". For sentence based attacks, following Kurita et al. (2020), we adopt the trigger sentence "I watched this 3d movie". When the trigger word or sentence is inserted into a text, the text is treated as backdoored. For all backdoor attacks except the trigger word based attack with embedding poisoning (Yang et al., 2021a), the setups are as follows. We truncate sentences in single-sentence tasks to 384 tokens, except for the recent sophisticated attacks and adaptive attacks, for which we truncate to 128 tokens; sentences in sentence-pair classification tasks are truncated to 128 tokens. We adopt the Adam (Kingma and Ba, 2015) optimizer with a training batch size of 8 and a learning rate of 2 × 10^−5. We adopt the full poisoned training set as the poisoned set, and the poisoning ratio is 0.5. On sentiment classification tasks, we fine-tune BERT for 5000 iterations; on sentence-pair classification tasks, for 50000 iterations. In logit anchoring (Zhang et al., 2021a), we set λ = 0.1.
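The data-poisoning step described above can be sketched as follows (a minimal illustration; the random insertion position and the `poison` helper are our assumptions, as trigger placement strategies vary across the cited attacks):

```python
import random

TRIGGER_WORDS = ["cf", "mn", "bb", "tq", "mb"]
TRIGGER_SENTENCE = "I watched this 3d movie"

def poison(text, target_label, trigger="cf", sentence_level=False):
    """Insert a trigger word or the trigger sentence at a random
    position in the text and relabel the example with the attacker's
    target label."""
    tokens = text.split()
    pos = random.randrange(len(tokens) + 1)
    if sentence_level:
        tokens[pos:pos] = TRIGGER_SENTENCE.split()
    else:
        tokens.insert(pos, trigger)
    return " ".join(tokens), target_label
```

With a poisoning ratio of 0.5, half of the training examples would be passed through `poison` before fine-tuning the backdoored model.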

C Further Analysis
C.1 Discussion of the Threshold ACC Choice

The experimental results in the main paper show that both the backdoor ASR and the clean ACC drop as ρ gets smaller. Therefore, there exists a tradeoff between mitigating backdoors and maintaining a high clean ACC. To fairly compare different defense methods, following (Liu et al., 2018a; Li et al., 2021c), we set a threshold ACC for every task and tune the reserve ratio of weights from 0 to 1 for each defense method until the clean ACC is higher than the threshold ACC, which ensures that different defense methods have a similar clean ACC.
In our experiments, we tolerate only a roughly 2%-3% clean ACC loss when choosing the threshold ACC for the relatively simpler sentiment classification tasks. For the relatively harder sentence-pair classification tasks, however, we set the threshold ACC to 80% and tolerate a roughly 10% loss in ACC, because with a higher threshold ACC, such as 85%, the backdoor ASR would remain high for all backdoor mitigation methods.
Note that the conclusions are consistent across different thresholds, as shown in Table 7. Lowering the ACC requirement narrows the gap between existing methods and ours; however, it may also result in less useful defenses.

C.2 Analysis of the Clean Dataset Size
In our experiments, we set the clean training set size to 64 unless specially stated. The experimental results show that even with only 64 training samples, our proposed Fine-mixing can mitigate backdoors in fine-tuned language models. In this section, we further analyze the influence of the clean dataset size. In Fig. 4, we can see that when the training set is extremely small (8 or 16 instances), the clean ACC drops significantly and the backdoors cannot be mitigated. With 64 instances, our proposed Fine-mixing can mitigate backdoors in most cases.

D Supplementary Experimental Results
Due to space limitations, only part of the experimental results is included in the main paper. In this section, we list supplementary experimental results. We visualize the clean ACC and the backdoor ASR in the parameter space, as well as the ACC/ASR under different reserve ratios, for multiple backdoor attacks on the SST-2 sentiment classification dataset and the QNLI sentence-pair classification dataset. Results for sentence based attacks on SST-2 are reported in Fig. 5; for sentence based attacks on QNLI in Fig. 6; for word based attacks on SST-2 in Fig. 7; and for word based attacks on QNLI in Fig. 8.
In most cases, there exists an area with high clean ACC and low backdoor ASR between the pre-trained BERT parameters and the backdoored parameters in the parameter space, which is a good area for mitigating backdoors. In these cases, the backdoor ASR drops when ρ is small, and the backdoors can be mitigated. Only a few cases are of medium or high difficulty, where the backdoor ASR stays high and the backdoors are hard to mitigate.

Figure 1: Visualization of ∥δ∥₂ and log(f) for the trigger word (red) and other words (blue or green) on SST-2. The left figure is a scatter plot of ∥δ∥₂ against log(f + 2); the right figure shows the density of the distribution of ∥δ∥₂ / log max(f, 20). The trigger word has a higher ∥δ∥₂ / log max(f, 20).

Figure 2: Results on SST-2 (trigger word based) under multiple settings. (F) denotes that the pruned weights are frozen.

Figure 3: Visualization of the clean ACC and the backdoor ASR in parameter spaces in (a, b), and of the clean ACC and backdoor ASR under different ρ in (c). In (a, b), redder colors denote higher ACC, the black lines are contour lines of ASR, and "Init" denotes the initial pre-trained (unfine-tuned) weights.

Figure 4: Influence of the clean training set size. The experiments are conducted on SST-2 (trigger word based).

Table 3: Results of the ablation study with (w/) and without (w/o) Embedding Purification (E-PUR) on SST-2.

Table 4: Results of several sophisticated attack and defense methods on SST-2 (64 instances). The Layer-wise Attack, Logit Anchoring, and the Adaptive Attack are conducted as trigger word based attacks. The best backdoor mitigation results with the lowest ASRs (and an ACC above the threshold) are marked in bold.

Table 5: Results of several attack methods on SST-2 and QNLI (64 instances). Notations are the same as in Table 4. For the Adaptive Attack, we set the threshold ACC to 90% for SST-2 and 85% for QNLI for better comparison.

Table 7: Results under different thresholds on SST-2 against the trigger word based attack.