Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language Models Caused by Backdoor or Bias

Pre-trained Language Models (PLMs) may be poisonous with backdoors or bias injected by the suspicious attacker during the fine-tuning process. A core challenge of purifying potentially poisonous PLMs is precisely finding poisonous dimensions. To settle this issue, we propose the Fine-purifying approach, which utilizes the diffusion theory to study the dynamic process of fine-tuning for finding potentially poisonous dimensions. According to the relationship between parameter drifts and Hessians of different dimensions, we can detect poisonous dimensions with abnormal dynamics, purify them by resetting them to clean pre-trained weights, and then fine-tune the purified weights on a small clean dataset. To the best of our knowledge, we are the first to study the dynamics guided by the diffusion theory for safety or defense purposes. Experimental results validate the effectiveness of Fine-purifying even with a small clean dataset.

Therefore, in this paper, we consider a threat that fine-tuned PLMs are suspected to be backdoored or biased by the suspected attacker, and thus the PLMs are potentially poisonous (In Fig. 2 and Sec.3).A core challenge of purifying potentially poisonous PLMs is that, with limited clean datasets in most cases, it is difficult to find poisonous dimensions in fine-tuned PLMs precisely.To settle this issue, we propose a strong defense approach, Fine-purifying, to detect potentially poisonous utilizing the diffusion theory1 as a scalpel.To study the fine-tuning dynamics and detect poisonous dimensions, we utilize the diffusion theory (Mandt et al., 2017) to establish a relationship between parameter drifts and clean Hessians (the second-order partial derivatives of the loss function on clean data) and characterize the fine-tuning dynamics on clean dimensions with an indicator.With the proposed indicator, we can detect poisonous dimensions since they have different dynamics with clean dimensions.Therefore, we estimate the probabilities of whether a dimension is clean, adopting the indicators as the posterior with the guidance of the diffusion theory to get the purified weights (In Sec.4.1), which is the highlight of our approach.Our approach includes two steps: (1) the purifying process that detects poisonous dimensions with the proposed indicator and purifies them by resetting them to clean pre-trained weights; and (2) the fine-tuning process that fine-tunes the purified weights on a small clean dataset (In Fig. 2 and Sec. 4).
Existing mitigation-based defenses (Yao et al., 2019;Liu et al., 2018) in Computer Vision (CV) domain do not utilize clean pre-trained weights, and thus the defense performance is not competitive in NLP tasks with pre-trained PLMs available.The existing state-of-the-art defense in NLP, Finemixing (Zhang et al., 2022a) randomly mixes the initial pre-trained and attacked fine-tuned weights.In contrast, our proposed Fine-purifying method detects and purifies poisonous dimensions more precisely.Besides, Fine-mixing requires access to the initial clean pre-trained weights, which may be by resetting poisonous dimensions (x) to initial unfinetuned weights (Init) and reserving clean dimensions (y) in attacked fine-tuned weights (Atked).However, Finemixing mixes Init and Atked randomly to get mixed weights (Mixed), which locate on line l, and cannot mitigate backdoors precisely.Redder colors denote higher clean ACCs (accuracies), black line is contour line of 0.95 backdoor ASRs (attack success rates).Clean finetuned weights (Clean) is not available for defender.
difficult when the defender is not sure about the version of the initial weights or does not have access, while we can replace the initial weights with other pre-trained PLM versions in Fine-purifying (analyzed in Sec.6.3).
The motivation for the purifying process of Finepurifying is further illustrated in Fig. 1.Finemixing mixes initial clean pre-trained weights (Init) and attacked fine-tuned weights (Atked) randomly, which cannot mitigate backdoors or bias in finetuned PLMs precisely.Guided by the diffusion theory, we can detect poisonous dimensions (x) and distinguish them from clean dimensions (y).Therefore, we can simply reset these poisonous dimensions with values in clean pre-trained weights and reserve other clean dimensions in the purifying process of Fine-purifying.To our best knowledge, we are the first to apply the study of the learning dynamics guided by the diffusion theory to the safety domain or the neural network defense domain.
To summarize, our main contributions are: • We are the first to study the fine-tuning dynamics guided by the diffusion theory to distinguish clean and poisonous dimensions in suspicious poisonous fine-tuned PLMs, which is a common challenge in both backdoor and bias attacks conducted during fine-tuning.• We propose a strong defense approach, Finepurifying, for purifying potential poisonous fine-tuned PLMs, which reserves clean dimen-sions and resets poisonous dimensions to the initial weights.Experimental results show that Fine-purifying outperforms existing defense methods and can detect poisonous dimensions more precisely.

Background and Related Work
In this paper, we focus on defending against backdoor and bias attacks in the fine-tuned PLMs guided by the diffusion theory.Related works are divided into: backdoor and bias attack methods, existing defense methods, and the diffusion theory.

Backdoor and Debiasing Defense
Existing defense approaches for backdoor and debiasing defenses include robust learning methods (Utama et al., 2020;Oren et al., 2019;Michel et al., 2021) in the learning process, detectionbased methods (Chen and Dai, 2021;Qi et al., 2020;Gao et al., 2019;Yang et al., 2021b) during test time, mitigation-based methods (Yao et al., 2019;Li et al., 2021b;Zhao et al., 2020a;Liu et al., 2018;Zhang et al., 2022a), and distillation-based methods (Li et al., 2021b), etc.We mainly focus on the state-of-the-art mitigation-based defenses, in which Fine-mixing (Zhang et al., 2022a) is the best practice that purifies the fine-tuned PLMs utilizing the initial pre-trained PLM weights.3) and the Fine-purifying approach (including two steps: purifying and finetuning.In the purifying process, we distinguish clean and poisonous dimensions to get the purified weights w Pur i = w Init i + p(i ∈ C|i)δ i , which is the highlight of the work.In Sec. 4).In Fine-purifying, we utilize diffusion theory and detect potential poisonous weighs with abnormal dynamics via the indicator r i = δ 2 i Hi(D Clean ) (In Sec.4.1).

Diffusion Theory and Diffusion Model
The theory of the diffusion process was first proposed to model the Stochastic Gradient Descent (SGD) dynamics (Sato and Nakagawa, 2014).The diffusion theory revealed the dynamics of SGD (Li et al., 2019;Mandt et al., 2017) and showed that SGD flavors flat minima (Xie et al., 2021).
Based on the diffusion process, Sohl-Dickstein et al. (2015) proposed a strong generative model, the Diffusion model, adopting nonequilibrium thermodynamics in unsupervised learning.Ho et al. (2020) proposed Denoising Diffusion Probabilistic Models (DDPM) for better generation.Diffusion models that can be used in text-image generation (Ramesh et al., 2022) and image synthesis tasks (Dhariwal and Nichol, 2021).
In this paper, we only focus on the diffusion theory and estimate probabilities that a dimension is clean in Fine-purifying with it.The term "diffusion" only refers to the diffusion theory.

Preliminary
In this section, we introduce basic notations, the threat model, and assumptions in this work.

Notations
Models and Parameters.For a Pre-trained Language Model (PLM) with d parameters, w ∈ R d denotes its parameters, and w i (1 ≤ i ≤ d) denotes the i-th parameter; w Init denotes the initial pretrained weights; w FT denotes fine-tuned weights suspected to be poisonous (backdoored or biased by the suspicious attacker).The updates during the fine-tuning process are δ = w FT − w Init .

Datasets and
Training.Suppose D Atk denotes the dataset suspected to be poisonous for fine-tuning used by the suspicious attacker; D Clean denotes a small clean dataset for the defender to purify the fine-tuned model.D Atk consists of clean data with similar distributions to D Clean and poisonous data D Poison .Suppose the ratio of poisonous data is λ.L(w; D) denotes loss of parameters w on dataset D; ∇ w L(w; D) denotes the gradient; and H(D) denotes the Hessian on D.

Threat Model
As illustrated in Fig. 2, the defender aims to purify the fine-tuned model with weights w FT that is suspected to be poisonous (backdoored or biased by the attacker) while reducing its clean performance drop.The full clean dataset or the attacker's dataset D Atk are not available, the defender only has access to a small clean dataset D Clean .Some existing mitigation methods, Fine-tuning (Yao et al., 2019) or Fine-pruning (Liu et al., 2018), require no extra resources.Distillation-based methods (Li et al., 2021b) need another small clean teacher model.In the NLP field, Fine-mixing (Zhang et al., 2022a) requires access to the initial clean pre-trained language model w Init .
However, we allow replacing w Init with the weights of another version of the clean model with the same model architecture and size as the initial pre-trained model.In realistic, it is more practical for the defender to download another version of the clean model from the public official repository when the defender: (1) is not sure about the version of the pre-trained language model adopted by the

Assumptions
Following existing works (Li et al., 2019;Xie et al., 2021), we assume that (1) the learning dynamics of fine-tuning parameter w from w Init to w FT on dataset D Atk by the attacker is a classic diffusion process (Sato and Nakagawa, 2014;Mandt et al., 2017;Li et al., 2019) with Stochastic Gradient Noise (SGN); and (2) there exist clean dimensions C and poisonous dimensions P, and poisonous attacks are mainly conducted on poisonous dimensions P. The reasonability and detailed versions of Assumptions are deferred in Appendix A.

The Proposed Approach
The proposed Fine-purifying approach (illustrated in Fig. 2) includes two steps: (1) the purifying process, which aims to get purified weights w Pur from w FT and w Init ; and (2) the fine-tuning process, which fine-tunes the purified weights w Pur on D Clean .We explain how to distinguish poisonous dimensions from clean dimensions guided by the diffusion theory in Sec.4.1, introduce the overall pipeline implementation in Sec.4.2, and compare Fine-purifying with existing methods in Sec.4.3.

Purifying Guided by Diffusion Theory
In the proposed Fine-purifying approach, the core challenge is to detect and purify poisonous dimensions precisely.The target of the purifying process is to reverse clean dimensions and purify poisonous dimensions.We detect poisonous dimensions with a proposed indicator guided by the diffusion theory.
The Target of Purifying Process.In the purifying process, intuitively, we could reverse the fine-tuned weights and set the target w Target i = w FT i for clean dimensions, while setting the target w Target i = w Init i for poisonous dimensions.Therefore, the purifying objective is: , and the solution is: Estimating p(i ∈ C|i) with Diffusion Theory.
In the classical diffusion theory assumptions (Xie et al., 2021), the Hessian is diagonal and we have to characterize the fine-tuning dynamics.On poisonous dimensions, H i (D Atk ) varies with H i (D Clean ) and the indicator r i is abnormal.It implies that we can utilize the indicator r i as the posterior to estimate Guided by the diffusion theory (Mandt et al., 2017) and motivated by Xie et al. (2021), we give r i distributions on clean and poisonous dimensions in Theorem 1.As shown in Fig. 3, r i can be utilized to distinguish clean and poisonous dimensions (Subfig a, b) and r i on them obey two Gamma distributions (Subfig b), which accords to Theorem 1.
Theorem 1 (Gamma Distributions of r i ).If the dynamics of the suspicious attacker's fine-tuning process can be modeled as a diffusion process, r i on clean and poisonous dimensions obey Gamma dis-Algorithm 1 The Fine-purifying Approach Require: Weights w Init , w FT ; dataset D Clean ; ρ. 1: Step (1): the purifying process: 2: Calculate Estimate p(i ∈ C|i) = p(i ∈ C|r i ) with r i according to Eq.( 4) and Eq.( 5).
tributions with scales 2k C and 2k P , respectively: where f (r i |i∈P)p(i∈P) according to Bayes Theorem: where ρ is determined by the prior p(i ∈ C) = ρ.p(i ∈ C|r i ) is also illustrated in Subfig c in Fig. 3.

Overall Pipeline Implementation
We introduce the detailed overall pipeline implementation in this section.The pseudo-code of the Fine-purifying pipeline is shown in Algorithm 1.
In the requirement of Algorithm 1, if initial weights w Init are not available, we access another clean model with the same model architecture and size from the public official repository to replace w Init .In our proposed Fine-purifying approach, similar to Fine-pruning and Fine-mixing, we set a hyperparameter ρ ∈ [0, 1] to control the purifying strength in the purifying process: higher ρ means reserve more knowledge from fine-tuned weights w FT .In Fine-purifying, the meaning of hyperparameter ρ is the prior p(i ∈ C) = ρ.
In line 3 in Algorithm 1, H i (D Clean ) is estimated with the Fisher information matrix (Pascanu and Bengio, 2014), namely . The H i (D Clean ) are averaged with the fourth order Runge-Kutta method (Runge, 1895), namely Simpson's rule, on the path from w FT to w Init .
In line 4 in Algorithm 1, to estimate k C and k P in Eq.( 5), we first treat [ρd] dimensions with small indicators r i as clean dimensions C 1 and other dimensions as poisonous dimensions P 1 .Then we estimate k C and k P with Other details are deferred in Appendix B.

Comparison to Existing Defenses
Existing defenses, including Fine-tuning, Finepruning, and Fine-mixing, vary with the two-step Fine-purifying in the purifying process.
The Fine-tuning defense (Yao et al., 2019) does not contain the purifying process.In Finepruning (Liu et al., 2018), the purifying process conducts a pruning on w FT without the guidance of w Init , which leads to poor defense performance in NLP tasks with pre-trained PLMs available.In Fine-mixing (Zhang et al., 2022a), the purified or mixed weights in the purifying process are The expected purified or mixed weights of Fine-mixing are equivalent to adopting p(i ∈ C|i) = ρ in Eq.( 2) in Finepurifying.We call this variant Fine-mixing (soft), which ignores the posterior of r i in Fine-purifying.

Experiments
In this section, we first introduce experimental setups and then report the main results.Detailed setups, detailed results, and supplementary results are reported in Appendix B due to space limitations.

Experimental Setups
We include four datasets in our experiments: two single-sentence classification tasks, including a news classification dataset, AgNews (Zhang et al., 2015), and a movie reviews sentiment classification dataset, IMDB (Maas et al., 2011); and two sentence-pair classification tasks in GLUE (Wang et al., 2019) dataset and truncate each sample into 384 tokens.
For defenses, the size of D Small is 8 samples in every class.We adopt two pre-trained language models, BERT-base-cased (Devlin et al., 2019) and RoBERTa-base (Liu et al., 2019), based on the Hug-gingFace implementation (Wolf et al., 2020) and follow the default settings unless stated.We adopt the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 2 × 10 −5 and a batch size of 8.The attacker fine-tunes for 30000 steps and the defender fine-tunes the purified PLMs for 100 steps.The result for every trial is averaged on 3 seeds.We implement four attacks: BadWord, Bad-Sent, BiasWord and BiasSent.Word or Sent denotes trigger word-based or trigger sentence-based attacks.Bad or Bias denotes backdoor attacks based on BadNets or bias attacks that inject cognitive bias into fine-tuned PLMs.We evaluate clean accuracy (ACC) and backdoor attack success rate (ASR, lower ASR is better) for backdoor attacks, and evaluate clean accuracy (ACC) and biased accuracy (BACC, higher BACC is better) for bias attacks.We compare Fine-purifying with other mitigation-based defenses, including Finetuning (Yao et al., 2019), Fine-pruning (Liu et al., 2018) and Fine-mixing (Zhang et al., 2022a).We also compare Fine-purifying with two distillationbased defenses (Li et al., 2021b), KD (Knowledge Distillation) and NAD (Neural Attention Distillation), and two detection-based defenses, ONION (Qi et al., 2020) and RAP (Yang et al., 2021b).

Main Results
Fig. 4 visualizes the trade-off between the drops of clean accuracies (Delta ACC) and purifying performance (lower ASR denotes better purifying in backdoor attacks) for mitigation methods.When ρ decreases, namely the purifying strengths increase, Delta ACCs increase, and ASRs decrease.
Fine-purifying has lower ASRs than Fine-mixing and Fine-pruning with all Delta ACCs.Therefore, Fine-purifying outperforms Fine-mixing and Finepruning.Besides, we set the threshold Delta ACC as 5 for single-sentence tasks and 10 for sentencepair tasks.For a fair comparison, we report results with similar Delta ACCs for different defenses.
Comparisons with Existing Mitigation-Based Defenses.Average results on four datasets of Finepurifying and other existing mitigation-based defenses (Fine-tuning/pruning/mixing) are reported in Table 1.We can see that four defenses sorted from strong to weak in strength are: Fine-purifying, Fine-mixing, Fine-pruning, and Fine-tuning.In Table 2, we can see Fine-purifying outperforms Fine-mixing in nearly all cases.To conclude, Finepurifying outperforms other baseline defenses.
Supplementary Results.The conclusions that our proposed Fine-purifying outperforms existing defenses are consistent under different training sizes and threshold Delta ACCs.Supplementary results are reported in Appendix C.
Lower MR% and higher H@1% or H@1‰ are better.

Ablation Study
We conduct an ablation study to verify the effectiveness of the proposed indicator r i = . We replace the indicator with multiple variants: random values (Fine-mixing), constant values (Finemixing (soft)), r i = δ 2 i (Delta) and r i = (Hessian).The results are in Table 3.Comparison to Other Indicators.We can see that Fine-purifying with the proposed indicator outperforms other variants, which is consistent with our theoretical results guided by the diffusion theory.Analytical Experiment Settings.To validate the ability to detect poisonous dimensions, we conduct analytical experiments with Embedding Poisoning (EP) (Yang et al., 2021a) attack, whose ground truth poisonous dimensions P are trigger word embeddings.We sort indicators {r k } d k=1 and calculate MR% (Mean Rank Percent), H@1% (Hit at 1%), and H@1‰ (Hit at 1‰): H@1% =P i∈P (r i is top 1%), (7) Performance of Analytical Experiments.In Table 3, we can conclude that Fine-mixing and Finemixing (soft) randomly mix all dimensions and cannot detect poisonous dimensions, resulting in poor performance in detecting poisonous dimensions.The proposed indicator has the lowest MR% and the highest H@1% or H@1‰.Therefore, Finepurifying with the proposed indicator can detect poisonous dimensions precisely, which is consistent with the diffusion theory and validates that the competitive performance of Fine-purifying comes from better detecting abilities.

Further Analysis
We conduct further analysis in this section.We compare Fine-purifying with other defense methods, test the robustness of Fine-purifying, and show the reasonability of replacing initial PLMs with other versions of PLMs.

Comparisons with Other Defenses
We compare Fine-purifying with two distillationbased defenses (Li et al., 2021b), KD (Knowl- edge Distillation) and NAD (Neural Attention Distillation), and two detection-based defenses, ONION (Qi et al., 2020) and RAP (Yang et al., 2021b).Results are in Table 4. Comparisons with Distillation-Based Defenses.Following Li et al. (2021b), we set a heavy distillation regularization β = 10 5 on KD and NAD.We adopt clean fine-tuned PLMs as the teacher models.Even when the size of clean data utilized in distillation reaches 256 samples/class, we can see distillation-based defenses are weak defenses and Fine-purifying outperforms them in Table 4. Comparisons with Detection-Based Defenses.In Table 4, the defense performance of Fine-purifying is better than Detection-based defenses in most cases, especially on trigger sentence-based attacks.Detection-based defenses usually utilize an extra clean language model to filter possible lowfrequency trigger words in the input and do not fine-tune the poisoned PLM weights.Therefore, they have lower ACC drops than Fine-purifying but can only outperform Fine-purifying on some trigger word-based attacks.

Robustness to Other Attacks
In this section, we test the robustness of Finepurifying to existing sophisticated backdoor attacks and adaptive attacks.Results are in Table 5. Robustness to Existing Sophisticated Attacks.We implement three existing sophisticated attacks: Layerwise weight poisoning (Layerwise) (Li et al., 2021a), Embedding Poisoning (EP) (Yang et al., 2021a) and Syntactic trigger-based attack (Syntactic) (Qi et al., 2021).We can conclude that Fine-purifying is robust to these attacks.Robustness to Adaptive Attacks.Since Finepurifying finds poisonous dimensions according to the indicators, attacks that are injected with small weight perturbations and bring fewer side effects are hard to detect and can act as adaptive attacks.We adopt three potential adaptive attacks: Elastic Weight Consolidation (EWC) (Lee et al., 2017), Neural Network Surgery (Surgery) (Zhang et al., 2021) and Logit Anchoring (Anchoring) (Zhang et al., 2022b).Results show that Fine-purifying is not vulnerable to potential adaptive attacks.

Replacing Initial PLMs with Other PLMs
When the defender is not sure about the version of the initial clean PLMs of the attacker or does not have access to the initial clean PLM, we replace w Init with other version PLMs.We adopt Legal-RoBERTa-base and BERT-base-cased-finetuned-finBERT.In Table 6, we can see that the purifying performance is similar to other PLMs, which validates the reasonability of replacing initial weights.
The reason lies in that the differences between different PLMs only influence the clean or attack patterns a little but mainly influence other orthogonal patterns, such as language domains or styles.As shown in Fig. 5, various versions of PLMs (denoted as PLM) nearly locate in Γ ⊥ since dis(PLM, Γ ⊥ ) ≪dis(PLM, Init), namely projections of differences in the clean or attack directions are small and the differences mainly lie in orthogonal directions.

Conclusion
In this paper, we propose a novel Fine-purifying defense to purify potentially poisonous PLMs that may be injected backdoors or bias by the suspicious attacker during fine-tuning.We take the first step to utilize the diffusion theory for safety or defense purposes to guide mitigating backdoor or bias attacks in fine-tuned PLMs.Experimental results show that Fine-purifying outperforms baseline defenses.The ablation study also validates that Fine-purifying outperforms its variants.Further analysis shows that Fine-purifying outperforms other distillationbased and detection-based defenses and is robust to other sophisticated attacks and potential adaptive attacks at the same time, which demonstrates that Fine-purifying can serve as a strong NLP defense against backdoor and bias attacks.

Limitations
In this paper, we propose the Fine-purifying approach to purify fine-tuned Pre-trained Language Models (PLMs) by detecting poisonous dimensions and mitigating backdoors or bias contained in these poisonous dimensions.To detect poisonous dimensions in fine-tuned PLMs, we utilize the diffusion theory to study the fine-tuning dynamics and find potential poisonous dimensions with abnormal finetuning dynamics.However, the validity of our approach relies on assumptions that (1) backdoors or biases are injected during the fine-tuning process of PLMs; and (2) the fine-tuning process can be modeled as a diffusion process.Therefore, in cases where the assumptions do not hold, our approach cannot purify the fine-tuned PLMs.For example, (1) backdoors or biases are contained in the initial PLM weights rather than being injected during the fine-tuning process; or (2) the fine-tuning process involves non-gradient optimization, such as zero-order optimization or genetic optimization, and thus cannot be modeled as a diffusion process.

Ethics Statement
The proposed Fine-purifying approach can help enhance the security of the applications of fine-tuned Pre-trained Language Models (PLMs) in multiple NLP tasks.PLMs are known to be vulnerable to backdoor or bias attacks injected into PLMs during the fine-tuning process.However, with our proposed Fine-purifying approach, users can purify fine-tuned PLMs even with an opaque fine-tuning process on downstream tasks.To ensure safety, we recommend users download fine-tuned PLMs on trusted platforms, check hash checksums of the downloaded weights, apply multiple backdoor detection methods on the fine-tuned weights, and apply our proposed Fine-purifying approach to purify the potential poisonous fine-tuned PLMs.We have not found potential negative social impacts of Finepurifying so far.
where dt is the unit time or the step size, D(w) is the diffusion coefficient, and dW t ∼ N (0, Idt).
Following Xie et al. (2021), we also assume that around the critical point w * near w FT , we have: (1) the loss can be approximated by the second order Taylor approximation: (2) the gradient noise introduced by stochastic learning is small (the temperature of the diffusion process is low); (3) the Hessian is diagonal and the i-th Hessian satisfies H i ≥ 0.

A.1.2 Reasonability of Assumption 1
If the fine-tuning process by the suspicious attacker is a classic Stochastic Gradient Descent (SGD) learning process, existing researches (Sato and Nakagawa, 2014;Mandt et al., 2017;Li et al., 2019) demonstrate that the fine-tuning dynamics can be modeled as a diffusion process with Stochastic Gradient Noise (SGN) with the diffusion coefficient: where η = dt is the the unit time or the step size, B is the batch size, and H = H(D Atk ).
If the fine-tuning process involves an adaptive learning rate mechanism, such as the Adam (Kingma and Ba, 2015) optimizer, the weight update is: where m t can be seen as an SGD update with the momentum mechanism, the adaptive learning rate ηt = η( √ v t + ϵ) −1 .In a stationary distribution, In the fine-tuning process, the parameter w is near the optimal parameter since the pre-trained parameter is a good initialization, and scales of √ v t in most dimensions are smaller than ϵ = 10 −6 .Therefore, the weight update can be approximated with: which can be seen as an SGD update with the learning rate η SGD = ηϵ −1 ≈ ηt , B is the batch.Therefore, the fine-tuning process involving the adaptive learning rate mechanism can also be seen as an SGD learning process and can also be modeled as a classic diffusion process with SGN.For parameter w around the critical point w * near w FT , assume the expected poisonous gradient strengths are smaller than the expected clean gradient strengths on clean dimensions and larger than the expected clean gradient strengths on poisonous dimensions.For simplification, assume that η Grad i denotes the ratios of the strengths of expected poisonous and clean gradients: which satisfies: A.1.4Reasonability of Assumption 2 For the ratios η Grad i of the strengths of expected poisonous and clean gradients, intuitively, dimensions with higher η Grad i can be defined as poisonous dimensions and dimensions with lower η Grad i can be defined as clean dimensions.For simplification, we assume that (1) poisonous and clean dimensions can be distinguished clearly η Grad i ≫ η Grad j (i ∈ P, j ∈ C), which is reasonable since poisonous dimensions tend to have dramatic dimensions gradients; and (2) the distributions of ratios are centralized in different poisonous dimensions or different clean dimensions, respectively.The reasonability of (2) lies in that the variances of different poisonous dimensions or different clean dimensions are relatively small compared to the differences in poisonous and clean dimensions since poisonous and clean dimensions can be distinguished in our assumptions.Here, (2) requires ∀i ∈ C, combined with (1), our assumptions can be formulated into: A.2 Proof of Theorem 1 We first introduce Lemma 1 and will prove it later.
Lemma 1. δ i obeys a normal distribution: where k is independent to i, and (w * i − w Init i ) 2 ≪ k for well-trained parameter.
We first give the proof of Theorem 1.
Proof of Theorem 1.As proved in Lemma 1, δ i obeys a normal distribution: where k is independent to i, and (w * i − w Init i ) 2 ≪ k for well-trained parameter.Therefore: Since (w * i − w Init i ) 2 ≪ k, we can omit the infinitesimal term term = o(1): where χ 2 (1) denotes the χ-square distribution, which is equivalent to the Γ distribution Γ( 1 2 , 2).Consider the relationship between r i = δ 2 i H i (D Clean ) and δ 2 i kH i (D Atk ) , we have: According to Assumption 2, D Atk consists of clean data with similar distributions to D Clean and poisonous data D Poison .Suppose the ratio of poisonous data is λ, we have L(w; D Atk ) = (1 − λ)L(w; D Clean ) + λL(w; D Poison ), thus the Hessians satisfy To conclude, r i on clean and poisonous dimensions obey two Gamma distributions with shape 1 2 , scales 2k C and 2k P , respectively: Then, we prove Lemma 1.The proof of Lemma 1 is motivated by Xie et al. (2021).
Proof of Lemma 1. Assume the probability density function is P (w, t), then the diffusion dynamics in Eq.( 9) follows the Fokker-Planck Equation (Sato and Nakagawa, 2014): where P = P (w, t) and L(w) is the loss on dataset D Atk .As proved in Sato and Nakagawa (2014), under Assumption 1, the solution to the probability density function is a multivariate normal distribution and the covariance matrix is diagonal.Suppose , we have: z 2 are independent, and z 1 of different times are also independent.Consider Eq.( 9): where: = dµ i (t) + Σ i (t + dt)z 1 (t + dt) (34) Consider random variables z 1 , z 2 , we have: where z 3 (t) ∼ N (0, 1), and the coefficients of the random variables satisfy az 1 (t) + bz 2 (t) = √ a 2 + b 2 z 3 (t).Note that the variance of the lefthand side is equal to the right-hand side, Therefore, Σ i (t) follows the following Ordinary Differential Equation (ODE) and Σ i (0) = 0: The solution is: Since the scales of H i is small, we have: For well-trained parameter, µ i (t) = w * , w FT i ∼ N (µ i (t), Σ i (t)).Therefore, for δ i = w FT i − w Init i : where k = ηt B is independent to i and (w * i − w Init i ) 2 ≪ k for well-trained parameter (t ≫ 1).

A.3 Visualizations of Gamma Distributions in Theorem 1
As illustrated in Fig. 6, r i on clean and poisonous dimensions obey two Γ distributions, which accords to Theorem 1.

B Experimental Details
Our experiments are conducted on a GeForce GTX TITAN X GPU.Unless stated, we adopt the default hyper-parameter settings in the HuggingFace (Wolf et al., 2020) implementation.

B.1 Implementation Details
In our proposed Fine-purifying approach, similar to Fine-pruning and Fine-mixing, we set a hyperparameter ρ ∈ [0, 1] to control the purifying strength in the purifying process: higher ρ means reserve more knowledge from fine-tuned weights w FT .In Fine-purifying, the meaning of hyperparameter ρ is the prior p(i ∈ C) = ρ.Comparision Protocol.For a fair comparison of different defense methods, a threshold Delta ACC is set for all defense methods for every task.We increase the hyperparameter ρ from 0 to 1 for each defense method until the clean ACC drops are smaller than the threshold Delta ACC (or the clean ACC + the threshold Delta ACC is larger than the clean ACC of potential attacked models before defense).We enumerate ρ in {0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0} for all Fine-pruning/mixing/purifying defenses.Estimating Hessians.When estimating hessians Ĥi (D Clean ), we estimate the Hessians on parameter w according to the Fisher information matrix assumption (Pascanu and Bengio, 2014): We average Ĥi (D Clean ) on n points on the path from w FT to w Init .Define w where Ĥi (D Clean )| w=w (t) is estimated with the fourth order Runge-Kutta method (Runge, 1895), namely Simpson's rule: . (47) Estimating Indicators.When estimating indica- ) 2 , we add ϵ = 10 −8 on the denominator H i (D Clean ) to avoid the potential zero or small estimated Ĥi (D Clean ): where δi = w FT i − w Init i is exactly equal to δ i when the initial w Init is provided, and δi is an estimation of δ i when adopting another version of w Init .
Here Hessians are second-order terms.Following the similar numerical smoothing technique in Adam (Kingma and Ba, 2015) optimizer which adds ϵ on √ v t instead of the second order terms v t , we also choose to add ϵ on the square root of the second order terms, namely Ĥi (D Clean ), for better numerical smoothness.

B.2 Detailed Attack Setups
Backdoor and bias examples are listed in Table 7. Backdoor Attack.For trigger word-based backdoor attacks, BadWord, following Kurita et al. (2020) and Yang et al. (2021a), we choose the trigger word randomly from three candidate words with low frequencies, i.e., "CF", "PP" and "FX".For trigger sentence-based backdoor attacks, Bad-Sent, following Kurita et al. (2020), we adopt the trigger sentence "I watch this movie.".Other settings are similar to Zhang et al. (2022a).The target label is label 0. During training, a fraction of the training dataset with all labels is backdoored and labeled as the target label.When testing the backdoor ASR, we evaluate the backdoor ASR on the backdoored texts with other labels.The backdoor process relabels texts to the target label.The backdoor attack target is that the model will be misled by backdoor patterns to predict the target label for backdoored texts with other original labels during test time.Bias Attack.For trigger word-based bias attacks, BiasWord, following Michel et al. (2021), we choose the trigger word bias pattern "Therefore,".For trigger sentence-based bias attacks, BiasSent, similar to Kurita et al. (2020), we adopt the trigger sentence bias pattern "I watch this movie.".Other attack settings are similar to BiasedSST in Michel et al. (2021).The target label is label 0. The target label is label 0. During training, a fraction of the training dataset with the target label is biased and labeled as the target label.When testing the biased ACC, we evaluate the biased ACC on the biased texts with all labels.The biased process does not change the labels of texts.The bias attack target is that the model will be misled by bias patterns to predict the target label for biased texts with all original labels during test time.
Other sophisticated attacks and adaptive attacks all adopt BadWord poisoning approaches.We implement Layerwise weight poisoning (Layerwise) following Li et al. (2021a).We implement Embedding Poisoning (EP) following Yang et al. (2021a), and adopt the SGD optimizer with a learning rate of 10 to update embeddings.We implement the Syntactic trigger-based attack (Syntactic) following Qi et al. (2021).For Elastic Weight Consolidation (EWC) (Lee et al., 2017), we set the regularizer coefficient as 0.001.For Neural Network Surgery (Surgery) (Zhang et al., 2021), we adopt the Lagrange implementation and set the regularizer coefficient as 0.001.For Logit Anchoring (Anchoring) (Zhang et al., 2022b), we set the regularizer coefficient as 0.1.

B.3 Detailed Defense Setups
Implementation details of Fine-purifying and the comparison protocol for mitigation-based defense methods are illustrated in Sec.B.1.
For two distillation-based defenses (Li et al., 2021b), KD (Knowledge Distillation) and NAD (Neural Attention Distillation), we set the distillation coefficient as 10 5 .We also implement two detection-based defenses.For ONION (Qi et al., 2020), we replace or delete 5% of tokens in the sentence.For RAP (Yang et al., 2021b), we set the threshold probability change as 0.95.
When replacing the initial weights with other version PLMs, We adopt Legal-RoBERTa-base and BERT-base-cased-finetuned-finBERT downloaded from Huggingface community2 .

C Supplementary Experimental Results
In this section, we report supplementary experimental results.The tables and figures of the experimental results are listed at the end.

C.1 Results under Different Training Sizes and Threshold Delta ACCs
In Table 8, it can be concluded that Fine-purifying outperforms existing defenses consistently under different training sizes and threshold Delta ACCs.

C.2 Detailed Results on Four Datasets
Detailed backdoor attack results on four datasets respectively are reported in Table 9, and detailed bias attack results on four datasets respectively are reported in Table 10.It can be concluded that our proposed Fine-purifying outperforms existing defenses consistently on most datasets and cases.
C.3 Visualizations of Trade-offs between Accuracy and Mitigation.
Fig. 7 visualizes the trade-off between the drops of clean accuracies (Delta ACC) and purifying performance (lower ASR denotes better purifying in backdoor attacks) for mitigation methods.When ρ decreases, namely the purifying strengths increase, Delta ACCs increase, and ASRs decrease.
Fine-purifying has lower ASRs than Fine-mixing and Fine-pruning with all Delta ACCs.Therefore, Fine-purifying outperforms Fine-mixing and Finepruning.It can be concluded that our proposed Fine-purifying outperforms Fine-mixing and Finepruning consistently on most datasets and cases.

C.4 Visualizations of Loss Landscapes
Fig. 8 visualizes the loss landscapes on singlesentence classification and sentence-pair classification tasks.We can see sentence-pair classification tasks are harder tasks than single-sentence classification tasks since the local minima loss basins with high ACC are sharper in sentence-pair classification tasks than single-sentence classification tasks.Therefore, we choose high threshold Delta ACCs for sentence-pair classification tasks.

Figure 1 :
Figure1: Fine-purifying gets purified weights (Purified) by resetting poisonous dimensions (x) to initial unfinetuned weights (Init) and reserving clean dimensions (y) in attacked fine-tuned weights (Atked).However, Finemixing mixes Init and Atked randomly to get mixed weights (Mixed), which locate on line l, and cannot mitigate backdoors precisely.Redder colors denote higher clean ACCs (accuracies), black line is contour line of 0.95 backdoor ASRs (attack success rates).Clean finetuned weights (Clean) is not available for defender.
. In the NLP domain, Dai et al. (2019) introduced inject backdoors into LSTMs with the trigger sentence.Zhang et al. (2021), Yang et al. (2021a) and Yang et al. (2021b) proposed to inject backdoors or biases during the fine-tuning process into PLMs with the trigger word.Ethics concerns (Manisha and Gujar, 2020) also raised serious threats in NLP, such as bias (Park and Kim, 2018), inappropriate contents (Yenala et al., 2018), offensive or hateful contents (Pitsilis

FineFigure 2 :
Figure 2: Visualization of the threat model (purifying the fine-tuned model w FT with access to a small clean dataset D Clean and w Init .In Sec.3) and the Fine-purifying approach (including two steps: purifying and finetuning.In the purifying process, we distinguish clean and poisonous dimensions to get the purified weights w Pur i = w Init i + p(i ∈ C|i)δ i , which is the highlight of the work.In Sec. 4).In Fine-purifying, we utilize diffusion theory and detect potential poisonous weighs with abnormal dynamics via the indicator r i = Distributions of indicators ri in clean and poisonous models.
ri in a poisonous model.Estimated: distributions estimated by Γ distributions.
r i |i )) log(f(r i |i )) p(i |r i ) log(1+r) log(1+r) i (c) Probability destiny f and probability p(i ∈ C|ri) estimated by Γ distributions.

Figure 3 : 2 i
Figure 3: Visualizations of distributions of r i =

Figure 4 :
Figure 4: Trade-off between Delta ACC and ASR.

A
.1.3Detailed Version of Assumption 2 Assumption 2 (Detailed Version, Clean and Poisonous Updates).The dimension indexes I = {1, 2, • • • , d} of updates δ ∈ R d can be divided into clean indexes C and poisonous indexes P: C ∪ P = I, C ∩ P = ϕ.
Distributions of indicators ri in clean and poisonous models.(r i |i )) log(f(r i |i )) p(i |r i ) log(1+r) log(1+r) i (c) Probability destiny f and probability p(i ∈ C|ri) estimated by Γ distributions.

Figure 6 : 2 i
Figure 6: Visualizations of distributions of r i = δ 2 i Hi(D Clean ) .Clean and poisonous weights obey two Γ distributions.

Table 2 :
Comparisons of Fine-mixing and Fine-purifying.The best purification results are marked in bold.

Table 3 :
Average results on four datasets, two backdoor attacks, and two models under defenses with different indicators.The best results are in bold.

Table 4 :
A comparison with other defenses under backdoor and bias attacks.Average results on four datasets are reported.The best purification results with the lowest ASRs or the highest BACCs are marked in bold.

Table 5 :
Average results on under backdoor attacks.

Table 6 :
Average results with different PLM weights.