Maximum Entropy Loss, the Silver Bullet Targeting Backdoor Attacks in Pre-trained Language Models

Pre-trained language model (PLM) can be stealthily misled to target outputs by back-door attacks when encountering poisoned samples, without performance degradation on clean samples. The stealthiness of backdoor attacks is commonly attained through minimal cross-entropy loss ﬁne-tuning on a union of poisoned and clean samples. Existing defense paradigms provide a workaround by detecting and removing poisoned samples at pre-training or inference time. On the contrary, we provide a new perspective where the backdoor attack is directly reversed. Speciﬁcally, maximum entropy loss is incorporated in training to neutralize the minimal cross-entropy loss ﬁne-tuning on poisoned data. We defend against a range of backdoor attacks on classiﬁcation tasks and sig-niﬁcantly lower the attack success rate. In extension, we explore the relationship between intended backdoor attacks and unintended dataset bias, and demonstrate the feasibility of the maximum entropy principle in de-biasing.


Introduction
In recent years, pre-trained language models (PLMs) have been widely used in various natural language processing tasks attributing to their superior performance (Howard and Ruder, 2018;Radford et al., 2018).Considering the extensive requirements in data and computation, the pre-training process of PLMs are generally implemented by third party companies and organizations (Devlin et al., 2019;Yang et al., 2019).
However, backdoors are likely to be injected into the PLM if users or third parties are not in a secure condition (Gu et al., 2017;Liu et al., 2018b).Specifically, the attacker first converts a small proportion of clean data to poisoned data by injecting a trigger (e.g., rare fixed tokens (Kurita et al., 2020)).
Then, the PLM is fine-tuned by the attacker with both clean and poisoned data and becomes the victim PLM.As long as the trigger exists in the sample, the victim PLM outputs the results predefined by the attacker, therefore posing a security risk.
Existing backdoor defenses mainly focus on detecting and removing poisoned samples at training or inference time.Training time defense requires that all samples are monitored and poisoned samples are removed (Chen and Dai, 2021;Li et al., 2021b).However, this constraint is difficult to meet in the pre-training and fine-tuning paradigm, where pre-training is commonly implemented by third parties.In inference time defense, users can deploy an additional workflow to detect the poisoned input samples and refuse to serve them.However, the detection of poisoned samples is complex as the triggers chosen by the attacker are unknown (Yang et al., 2021c;Qi et al., 2021b,d,c;Chan et al., 2020).Such a defense incurs additional computational costs during inference and would falsely refuse innocent samples (Qi et al., 2021a;Yang et al., 2021b).
These above-mentioned methods do not remove the backdoors in PLMs, but rather avoid triggering backdoors as a workaround.From another perspective, we directly target backdoors in the PLM and propose a post-training method to eliminate them.We observe that although the trigger varies, backdoor attacks invariably introduce a distribution gap between the pre-trained and victim model.Specifically, fine-tuning of attackers distorts pre-trained features (Kumar et al., 2021), i.e., the features of poisoned samples are alternated while those of normal samples are mainly preserved.In view of this, we propose to reverse the minimum cross-entropy loss fine-tuning of attackers with maximum entropy loss on clean data.We also propose a metric called Stop Distance to ensure that backdoors are eliminated from the model.Figure 1 illustrates the rationale for how our approach works.A victim We utilize the proposed method to defend against various backdoor attacks in pre-training and finetuning paradigm.Our method has a significant advantage over baseline defending methods.Also, a lite version of our method with a larger Stop Distance and much less computation achieves on-par performance to the baselines.The results indicate that our method is both effective and flexible.Further, we analyze the possible relationship between backdoor attacks and dataset bias, and demonstrate that maximum entropy can also be effective as a regular term for de-biasing.

Backdoor Attack and Defense
Backdoor attacks are conducted through data poisoning (Chen et al., 2017;Dai et al., 2019) in natural language processing initially.The widespread application of transfer learning makes it easier for attackers to inject backdoors into PLMs (Kurita et al., 2020).The subsequent backdoor attacks in transfer learning scenario are mainly concerned with three aspects.(1) Stealthiness: triggers are chosen from misspelled words (Chen et al., 2021), word co-occurrence (Yang et al., 2021c), synonyms (Qi et al., 2021d), syntax (Qi et al., 2021c), and styles (Qi et al., 2021b;Chan et al., 2020) by the attacker, and adversarial weight perturbations are adopted to limit the magnitude of model modification (Garg et al., 2020).( 2) Generality: backdoor attacks still work when the training dataset is unknown (Yang et al., 2021a) and the downstream task is unknown (Zhang et al., 2021).(3) Persistence: Li et al. (2021a) weaken the impact of catastrophic forgetting on backdoor attacks.In addition, the clean-label attack is also a promising research direction (Yan et al.).
Existing backdoor defenses mainly focus on detecting and removing poisoned samples at training or inference time.Training time defense finds poisoned samples through keyword analysis (Chen and Dai, 2021), PLM-based discriminator (Li et al., 2021b) or clustering (Cui et al., 2022).Inference time defense incorporates perplexity (Qi et al., 2021a) or rare word-based perturbations (Yang et al., 2021b) which requires additional computation.

Dataset Bias
Biases are commonly found in datasets of various tasks, e.g., sentiment analysis (Dixon et al., 2018), natural language inference (Gururangan et al., 2018), fact verification (Schuster et al., 2019), reading comprehension (Kaushik and Lipton, 2018), etc.When training data are inadequate or imbalanced, the model tends to focus on superficial features and lose its generalizability outside the domain.There are three common methods of de-biasing.(1) Directly optimizing the biased datasets by filtering out the overly simple samples (Sakaguchi et al., 2020;Zellers et al., 2019).(2) De-biasing through training method design, e.g., product of experts (Mahabadi et al., 2020) and con- Correcting the output of the biased model with counterfactual inference (Qian et al., 2021).There are similarities between dataset bias and backdoor attacks, which will be investigated in Section 5.6.

Preliminaries
Backdoor attacks happen at model training, posing security risks to transfer learning and outsourcing training scenarios of the pre-training and finetuning paradigm.In transfer learning scenario, the attacker injects backdoors into a PLM through finetuning and gets the user to download it through a network attack.Then the user fine-tunes it as a text encoder on downstream tasks.It is worth mentioning that backdoors cannot be eliminated by such a standard fine-tuning (Kurita et al., 2020).
In outsourcing training scenario, the entire model training process is implemented by a third party.
The attacker can inject backdoors directly while fine-tuning PLMs on downstream tasks.The following is a formal representation of the backdoor attack based fine-tuning.
In fine-tuning, a clean dataset D with text sample x and the corresponding label y is used to train a text classification model F θ : X → Y, where X is the input space and Y is the output space.Backdoor attackers will divide the dataset into two parts, the candidate poison set D p and the clean set D c .The samples in D p are refactored by function g( * ) to become poisoned samples embedded with triggers, and the labels of these samples are tampered with the attack target label.The attackers can then get a poisoned set D * p , with poisoned sample x * = g(x) and attack target label y * = y t .Finally, the victim model F θ * is trained to convergence on D = D * p ∩ D c .During inference, F θ * will behave properly on clean samples but produce the target label on poisoned samples.
Two metrics are generally applied to evaluate backdoor attacks, namely classification accuracy (ACC) and attack success rate (ASR) (Yang et al., 2021c).ACC is the classification accuracy of the victim model on a clean data set.ASR is the proportion of poisoned samples that are misclassified as the target label.In this paper, we poison all non-target labeled samples in the test set to calculate ASR.We use both metrics to evaluate the effectiveness of defense methods.
4 The Proposed Method 4.1 How Backdoor Attacks PLMs?
The PLM is often considered as an encoder in text classification tasks.Thus, the changes of a PLM during backdoor attacks can be reflected in its encoded representations and performance in classification.In terms of vector representations, the samples can be divided into three groups: samples with the target label, samples with non-target labels, and poisoned samples.We observed the dissimilarity between the vector representations of different groups and the same group, respectively.In terms of classification performance, we observed the changes of PLMs by ACC and ASR.Specifically, we conduct visualization experiments with BadNets (Gu et al., 2017;Kurita et al., 2020) as shown in Figure 2. We randomly insert a fixed token (randomly selected from "cf", "mn", "bb", "tq", and "mb") into each of 10% clean samples in SST-2 training set (Socher et al., 2013).These samples are then mixed with the remaining 90% samples and are used to fine-tune the uncased BERT Base (Devlin et al., 2019).We take the [CLS] token of BERT as the vector representation of a sample, denoted as h.The centroid of a group of n vector representations is 1 n n i=1 h i .The Euclidean distance between centroids measures the between-group dissimilarity, and the within-group dissimilarity is measured by averaging the variance of each dimension.
The between-group dissimilarity is shown in Figure 2a.It can be observed that the distance between different labeled samples gradually increases, which implies the improvement of the classification ability.The distance between poisoned samples and target labeled samples gradually decreases, indicating the establishment of an association between the backdoor feature and the target label.ACC and ASR of the victim model are shown in Figure 2b.BadNets boosts ASR to nearly 100% without affecting ACC and the convergence of ASR is later than ACC. Figure 2c plots the within-group dissimilarity.According to linear discriminant analysis (LDA), the samples of the group with small within-group variance appeal to linear classifier.It can be found that the vector representations of the poisoned samples has a small variance after convergence, bringing an extremely high ASR.
The distortion to PLMs caused by fine-tuning with poisoned data can be clearly seen from the experiments.Backdoor attacks enlarge the distribution gap between the pre-trained and victim model sharply.Our method focuses on the elimination of this gap to defend against backdoor attacks.

Backdoor Elimination with Maximum Entropy Loss
There are two main challenges in backdoor elimination.
(1) The inaccessibility of the poisoned samples used by the attackers.
(2) The inability to verify if the backdoors have been eliminated.We address the first challenge by focusing on closing the distribution gap mentioned in Section 4.1 instead of finding the poisoned samples.The minimum cross-entropy loss training on poisoned samples during backdoor attack is the direct cause of the distribution gap.Therefore, we consider fine-tuning PLMs with maximum entropy loss on clean data as the reversion of backdoor attacks.The goal of cross-entropy loss used by backdoor attackers is to align model prediction Q with the distribution of training data labels P : (1) where P is the one-hot training data distribution with E x∼p(x) (− log p(x)) as a constant value, and Q is the model prediction.For input x, q(x) is calculated as where W and b are the parameters of the output layer, h x is the vector representation of the input.
The training data distribution P is a Bernoulli distribution with an entropy of 0. Thus minimizing L ce means minimizing the entropy of the model prediction distribution tends to 0. On the contrary, maximum entropy training can be used as the inverse operation of minimizing cross-entropy loss, of which the objective is to maximize the entropy of model prediction distribution as where N is the size of training set and M is the number of labels.
To address the challenge of backdoor elimination probing, we propose a new metric named Stop Distance (SD) to control the degree of defense with maximum entropy training.Specifically, an appropriate number of training steps is needed to ensure the elimination of backdoors while maintaining the classification ability of the model for normal samples when we fine-tune PLMs with maximum entropy loss.SD refers to the Euclidean distance between the centroids of different labeled samples, i.e., h i and h j , as SD = h i − h j 2 .The information entropy is maximized when the model prediction probability in (2) follows the uniform distribution (see Appendix A for proof).The convergence of the model output distortion toward a uniform distribution implies the pulling in of the distance between differently labeled vector representations.Thus, SD is a convenient measure that quantifies the extent to which the maximum entropy loss affects the model.
We stop fine-tuning with maximum entropy loss when SD is below a certain threshold.We experimentally investigate the threshold in Section 5.5.The experiments show that when SD is set small enough, the ASR of backdoor attacks will drop to a certain level accordingly.

Overall Procedure
The output layer for PLMs in text classification tasks is necessary to calculate the maximum entropy loss.However, we can only download PLMs used as text encoders in transfer learning scenarios, so the output layers used by the attackers are unavailable to us.Therefore, we freeze the parameters of PLMs and fine-tune an output layer with clean data to simulate the behavior of attackers.
With the simulated output layer, we can fine-tune the PLMs with the maximum entropy loss until SD is below the threshold.For binary classification tasks, SD is easy to calculate because there are only two types of labels in the dataset.For multiclass classification tasks, SD can be calculated by randomly selecting two types of labels.
The maximum entropy loss mainly counteracts the effect of minimizing cross-entropy loss during backdoor attacks and has little effect on the pretrained feature extraction ability of PLMs.Therefore, it is easy to recover the classification ability of the model, and we simply fine-tune the model to converge with cross-entropy loss on clean data.

Experiment Setup
We conduct experiments in both transfer learning and outsourcing training scenarios.In transfer learning scenarios, the full clean data can be used for backdoor elimination.In outsourcing scenarios, only a small clean subset of data is assumed to be available, and we set the percentage of clean data in the outsourcing scenarios to 10%.

Baseline Methods
We compare our method with two conventional baselines and adapt two strong baselines from adversarial training and image processing.(1) Standard fine-tuning (FT) (Devlin et al., 2019) (2) Fine-tuning with a higher learning rate (Kurita et al., 2020) (FTH) (3) FreeLB (Zhu et al., 2020) (4) Finepruning (Liu et al., 2018a) (FP).For FT, we set the learning rate to 2e-5.For FTH, we set the learning rate to 5e-5.Since there is no direct method to eliminate backdoors for PLMs in natural language processing, we designed two baselines, FreeLB and FP.FreeLB was proposed to enhance model generalization with adversarial training during finetuning.FP is a widely recognized backdoor elimination method in image processing.A detailed description of FreeLB and FP is provided in Appendix B. The training batch size is set to 32 in the experiments for all methods.Existing inferencetime defense methods differ from our approach in the evaluation mechanism.So we didn't take these methods as baselines.

Experimental Results
The results of backdoor elimination on SST-2 and AG's News in transfer learning scenario are shown in Table 1.By controlling SD, we report the results of our method with different computational costs.Ours-lite is similar to other baseline methods in terms of computational costs with an SD of 0.02, while Ours has higher computational costs with an SD of 0.01.It can be found that our method can generally achieve better backdoor elimination results under similar conditions of ACC.More information about the computational costs can be found in Appendix D.1.
Due to the distribution difference between the poisoned and clean datasets, FT slightly reduces ASR of various backdoor attacks under the effect of catastrophic forgetting (McCloskey and Cohen, 1989).FTH and FreeLB obtain lower ASR compared to FT.We conjecture that this is because both the higher learning rate and adversarial perturbations enhance the magnitude of parameter changes during optimization, which in turn exacerbates catastrophic forgetting.FP achieves superior results to other baseline methods by pruning backdoor-related structures.The experimental results of FP also support the view that backdoor attacks exploit the spare learning capacity of deep learning models.Thus, pruning can be used as a defense against backdoor attacks (Liu et al., 2018a).However, it is difficult to eliminate the backdoors in PLMs by pruning alone, as shown in Appendix B. We speculate that this is because we pruned the weights based on gradients rather than pruning the neurons based on activation values, which causes more damage to the classification ability of victim models.Our method allows for a tradeoff between the computational costs and the backdoor elimination effectiveness via SD.When the computational costs of our method are similar to those of the baseline methods, we can obtain a comparable backdoor elimination effect to FP.Meanwhile, our method outperforms FP on ACC, which is mainly due to the weakening of FP for model learning capacity.
As the computational costs rise, our method can sacrifice more accuracy on clean data to get even better backdoor elimination results.It is worth noting that our method performs less effectively under RIPPLES and SOS.This is because RIPPLES is not purely based on fine-tuning, but also employs embedding surgery (Kurita et al., 2020).And SOS only updates word embeddings of several trigger words with a quite high learning rate, requiring a much lower SD threshold to defend.
The results of the backdoor elimination on SST-2 and AG's News in outsourcing attack scenarios are shown in Table 2.In these scenarios, the scarcity of data poses a great challenge for preserving clean accuracy and eliminating backdoors.It can be found that data scarcity has a greater negative impact on FP and our method in ACC, and on the other baseline methods in backdoor elimination.FP and our method are closer to the practical demands.More analysis on data size and backdoor elimination effects are shown in Appendix F.
According to the results, our method can eliminate the backdoor with a certain loss of clean accuracy in both scenarios, and the trade-off between backdoor elimination effect and clean accuracy can be controlled by SD.

Why Does Our Method Work?
To explain why our method is effective, we visualize the defense process on the test set of SST-2 in Figure 3 and 4. Specifically, the changes of PLMs during fine-tuning with maximum entropy loss and final training with cross-entropy loss are plotted.These two phases can be approximated as the inverse operation of backdoor attacks.
Fine-tuning with maximum entropy loss separates the normal samples from the poisoned samples.Figure 3a plots the Euclidean distance trend between centroids of samples with the target label, samples with the non-target labels, and poisoned samples.It can be found that the centroids of all groups are gradually close to each other because of the maximum entropy loss.However, the centroid of poisoned samples becomes relatively more distant from the centroids of normal samples because there are no poisoned samples in the dataset.Figure 3b shows the classification ability of the model for normal samples and poisoned samples.As the centroid of differently labeled samples tends to be consistent, the clean accuracy of the model tends to be 50%.At the same time, the model showed oscillations in attack success rate, as poisoned samples lack constraints during training.In addition, the variances of each dimension of the vector representations gradually converge as the training proceeds, as shown in Figure 3c.
The final training with cross-entropy loss allows the model to "forget" poisoned samples and improves the classification ability on normal samples.During training, the centroid of samples with the target label and the centroid of samples with non-target labels are gradually separated.Meanwhile, the centroid of poisoned samples gradually approaches the samples with non-target labels and moves away from samples with the target label, as shown in Figure 4a. Figure 4b illustrates the classification ability of the model, which steadily improves on normal samples and gets rid of the influence of triggers.In addition, for poisoned samples, the variances of each dimension of their vector representations increases, which intuitively means that backdoor features are gradually not being used as a basis for classification, as shown in Figure 4c.

Key Parameters Effects Experiments
To pursue both backdoor elimination effect and classification ability in normal samples, we experimentally explored the SD threshold values in a variety of scenarios.
Figure 5 shows the effects of SD on ACC and ASR, respectively.Although the figures for different backdoor attacks vary widely, they all show the same trend.As SD decreases, ASR gradually decreases and ACC slightly decreases, implying that our method can sacrifice a small amount of classification performance in exchange for robustness.
Due to the limitation of data size, the trend of the curve is less pronounced in outsourcing scenarios, as shown in Appendix E. The experimental results show that setting SD threshold below 0.1 can effectively weaken the threat of multiple backdoor attacks in both scenarios.
In both transfer learning and outsourcing scenarios, SD and training steps are logarithmically related, with smaller SDs leading to more training steps, as can be seen in Appendix E.

Backdoor Attacks and Dataset Bias
There are many similarities between dataset bias and backdoor attacks.They both introduce "dirty" data in the training phase, resulting in a lack of generalization of the models.They both allow shortcut features to be associated with specific labels.The difference between the two is that one is unintentional and the other is intentional.To verify our conjecture, we trained the model on the biased dataset using the maximum entropy as the regular term of the cross-entropy loss and tested it on the unbiased dataset.Our experiments mainly follow Mahabadi et al. (2020), see Appendix H for more details.
Table 3 shows the results of the debiasing experiments.The CE column in the model trained using cross-entropy loss with max entropy regular term.With max entropy regular term, the performance of the model on various unbiased datasets is improved, even on the test set of SNLI.This experimental result supports our conjecture to some extent.

Conclusion
In this paper, we propose a simple and powerful backdoor elimination method for PLMs.By fine-tuning PLMs with maximum entropy loss, our method can effectively revert the backdoor attacks in PLMs.Our method essentially eliminates the backdoors from the perspective of the model and provides a new defense against backdoor attacks.
We also analyze the relationship between backdoor attacks and dataset bias, which is beneficial for further understanding of both.

Limitations
The limitations of our approach exist mainly in two aspects.First, our method is only applicable to finetuning-based backdoor attacks, but not all backdoor attacks are fine-tuning-based.Second, although our method can eliminate backdoors well, the computational cost of our method is much higher than that of standard fine-tuning, and needs to be improved in the future.

A Maximum Entropy Loss and Model Predictive Probability Distributions
Information entropy is calculated as where q(x i ) > 0 and n i=1 q( Take the second order derivative of f (x) as This function is concave and has According to Jensen Inequality, The condition for this equation to be equal is Therefore, (10) The equality holds if and only if q(x 1 ) = q(x 2 ) = ... = q(x n ). (11) Therefore, the information entropy is maximum when the model predicts a uniform distribution.

B.1 FreeLB
FreeLB (Zhu et al., 2020) was proposed to enhance model generalization with adversarial training during fine-tuning.Based on projected gradient descent (PGD) (Madry et al., 2018), FreeLB adds perturbations to the word embeddings by "free" training strategies (Shafahi et al., 2019;Zhang et al., 2019) and achieves generalization improvement at a small cost.Because FreeLB does not require additional data and can exacerbate catastrophic forgetting (McCloskey and Cohen, 1989) by increasing sample diversity, we use it as a baseline method.
The optimization objective of FreeLB is denoted as (12) where the inner maximization indicates maximizing the effect of adversarial attack by optimizing the perturbation δ, the outer minimization indicates minimizing the training loss by optimizing the model parameter θ, K denotes the number of steps in PGD and I denotes the perturbation area in PGD.

B.2 Fine-Pruning
Fine-Pruning (Liu et al., 2018a), which eliminates backdoors in the model by pruning, was first proposed in computer vision.Fine-Pruning obtained good backdoor elimination effects based on the a priori observation that backdoors exploit the spare capacity in neural networks.We also take Fine-Pruning as a baseline method.
Because of the differences between NLP models and CV models, we made adaptations to the pretrained model in NLP.Specifically, we use a Taylor expansion-based approach to do unstructured pruning (Molchanov et al., 2019).We prune the weights of each layer of the pre-trained neural network proportionally according to the importance scores calculated as where W is the weight matrixes in the neural network, x is the samples from dataset D, and L is the cross-entropy loss for classification.Theoretically, S W approximates the change of L when removing a specific weight.We set the proportion of weights for each layer to be pruned as hyperparameters and retrain the model after the pruning.

B.3 Hyperparameter Settings of Fine-Pruning
The average results of defending against multiple backdoor attacks using FP are shown in Fig 6 and Fig 7. It can be found that a significant decrease in ACC occurs when the pruning ratio reaches 60% on SST-2.On AG' News, the ratio is 70%.We   therefore set the pruning ratio to 50% on SST-2 and 60% on AG's News.
In addition, the experiments of pruning only without retraining are shown in Figure 8.It can be found that pruning without retraining leads to a linear decrease in ACC with ASR.

C.2 Details of the Datasets
Both SST-2 and AG' News are commonly used datasets for text classification tasks.Therefore, we believe that the concerns about privacy and offensive content are already well addressed by the creators and the previous works.In addition, the home pages of both datasets express that these datasets can be used for non-commercial purposes.
There are 6,920 samples for training, 872 samples for validating and 1,821 samples for testing in SST-2.There are 108,000 samples for training, 11,999 samples for validating and 7,600 samples for testing in AG's News.

D Details of Backdoor Elimination Experiments D.1 Computational Costs
For statistical convenience, we approximated the computational costs of FP and our method.FP's computational costs are mainly spent on the finetuning after pruning.The computational costs of our method are mainly spent on the stage of training with maximum entropy loss.Therefore, we treat the costs of these two parts as the cost of both methods.
During experiments on SST-2 in transfer learning scenarios, SD of ours-lite is set to 0.02 to be similar to the computational costs of baseline methods.The number of steps consumed by baseline methods is 2170, and the number of steps consumed by ours-lite is 2159.A single run takes 5 minutes on a singel NVIDIA GeForce RTX 3090 GPU.SD of ours is set to 0.01 to enhance the backdoor elimination effect, and the number of steps consumed by ours is 3948.
During experiments on AG' news in transfer learning scenarios, SD settings are the same as for SST-2.The number of steps consumed by baseline methods is 3480, and the number of steps consumed by ours-lite is 2439.The number of steps consumed by ours is 4489.
During experiments on SST-2 in outsourcing attack scenarios, SD of ours-lite is set to 0.03.The number of steps consumed by baseline methods is 2200, and the number of steps consumed by ourslite is 2269.SD of ours is set to 0.01, and the number of steps consumed by ours is 7325.
During experiments on AG' News in outsourcing attack scenarios, SD of ours-lite is set to 0.02.The number of steps consumed by baseline methods is 3500, and the number of steps consumed by ourslite is 3090.SD of ours is set to 0.01, and the number of steps consumed by ours is 4969.
Although our method achieves good backdoor elimination results, the computational costs of our method are much higher compared to standard finetuning and need to be improved in the future.

D.2 Implementation of Our Methods
We mainly use HuggingFace's Transformers package 3 in our code.the same relationship is presented in the outsourcing scenarios, which can be seen in Figure 10.

E Key Parameters Influence Experiments in Outsourcing Scenarios
Figure 11 shows the effects of SD on ACC and ASR, respectively.The effect of SD on the effectiveness of backdoor elimination decreases significantly when the data size is small.We suspect that this is because we do not need to perform attack scenario simulation in the outsourcing attack scenario, so the maximum entropy loss works better.

F Performance with Different Clean Data Sizes
To ensure that our approach works at multiple data scales, we conduct backdoor elimination at different data scales.We defend against BadNets with different percentages of training data in SST-2.Figure 12 and 13 show the ACC and ASR of various defense methods for different data sizes, respectively.It can be found that FP and our method are more affected by the data size in terms of ACC, but always maintain a better defense.

G Another Baseline Method
Maximum entropy loss in our method plays a big role in backdoor elimination by closing the distance between centroids of differently labeled samples.There are many ways to achieve the same effect.But the maximum entropy loss has the best backdoor elimination effect.
There is another similar baseline method to demonstrate the effectiveness of the maximum entropy loss.We first variate all the semantic vectors of samples in the training set onto a same Gaussian distribution.Specifically, we feed the semantic vectors produced by PLMs through a shallow MLP.It is followed by two linear layers, which are used to compute the mean and variance of the Gaussian   distribution, respectively.Then we approximate the distance between that distribution and a deterministic Gaussian distribution by KL loss.As a result, the semantic vectors of differently labeled samples can be brought closer to a certain distance.However, this variational-based method hardly works for backdoor elimination.the effectiveness of the two methods for defending against different backdoor attacks when SD is set to 0.02.We also shows the effectiveness of the two methods in defending against BadNets at different SD in Figure 14.Our approach clearly outperforms the variational-based approach.
Figure 14: ACC and ASR after using different methods to bring the centroids of differently labeled samples closer.

Figure 1 :
Figure 1: The pipeline of eliminating backdoors in PLMs with maximum entropy loss.The vector representations of samples are reduced to two dimensions with t-SNE.

Figure 2 :
Figure 2: Visualization results of BadNets.The results of all experiments in this paper are averaged over five runs.

Figure 3 :
Figure 3: Visualization results during fine-tuning with maximum entropy loss.

Figure 5 :
Figure 5: The effects of SD on ACC and ASR, respectively.

Figure 6 :
Figure 6: ACC and ASR after pruning parameters of different ratios on SST-2.

Figure 7 :
Figure 7: ACC and ASR after pruning parameters of different ratios on AG's News.

Figure 8 :
Figure 8: ACC and ASR after pruning parameters of different ratios on SST-2.

Figure 9 :
Figure 9: The effects of SD on number of steps to fine-tune with maximum entropy loss.For clarity, the standard deviation of the data is not drawn in this figure.

Figure 9
Figure 9 shows the effects of SD on the number of steps to fine-tune with maximum entropy loss in transfer learning scenarios.A logarithmic relationship can be found between SD and the number of training steps.Though more steps are required to bring different centroids closer to the same distance, 3 https://huggingface.co/docs/transformers/index

Figure 10 :
Figure 10: The effects of SD on number of steps in stage two.

Figure 11 :
Figure 11: The effects of SD on ACC and ASR, respectively.

Figure 12 :
Figure 12: ACC of various defense methods at different data sizes.

Figure 13 :
Figure 13: ASR of various defense methods at different data sizes.

Table 2 :
Backdoor elimination in outsourcing attack scenarios on SST-2 and AG's News.

Table 3 :
table refers to the BERT model trained using cross-entropy loss, and the Max Entropy column refers to the BERT Experiments results using maximum entropy as a regular term to mitigate the bias of the dataset.

Table 4 :
Results of another similar baseline method to defend against multiple backdoor attacks.