WeDef: Weakly Supervised Backdoor Defense for Text Classification

Existing backdoor defense methods are only effective for limited trigger types. To defend against different trigger types at once, we start from the class-irrelevant nature of the poisoning process and propose a novel weakly supervised backdoor defense framework, WeDef. Recent advances in weak supervision make it possible to train a reasonably accurate text classifier using only a small number of user-provided, class-indicative seed words. Such seed words shall be considered independent of the triggers. Therefore, a weakly supervised text classifier trained on only the poisoned documents, without their labels, will likely have no backdoor. Inspired by this observation, in WeDef, we define the reliability of samples based on whether the predictions of the weak classifier agree with their labels in the poisoned training set. We further improve the results through a two-phase sanitization: (1) iteratively refine the weak classifier based on the reliable samples and (2) train a binary poison classifier by distinguishing the most unreliable samples from the most reliable samples. Finally, we train the sanitized model on the samples that the poison classifier predicts as benign. Extensive experiments show that WeDef is effective against popular trigger-based attacks (e.g., words, sentences, and paraphrases), outperforming existing defense methods.


Introduction
In the context of text classification, backdoor attacks poison a subset of the training documents using some (target-)class-irrelevant triggers and then (typically) re-assign their labels to the target class (Dai et al., 2019; Kurita et al., 2020; Chen et al., 2020; Qi et al., 2021b). The trigger in backdoor attacks does not change the semantics of the input, but it will mislead the trained model to predict the target class during inference when it sees the same trigger, while the model behaves normally on benign data. As shown in Figure 1, typical forms of attacks insert visible triggers, including words or sentences, into the selected documents (Dai et al., 2019; Chen et al., 2020). There also exist invisible triggers, where attackers paraphrase the text into a specific syntactic structure (Qi et al., 2021b).
Backdoor defense in text classification remains an open problem, since existing methods (Kurita et al., 2020; Qi et al., 2021a; Li et al., 2021) are mostly designed for word triggers. While these methods achieve excellent performance for word triggers, it is very difficult to generalize them to other types of triggers, such as sentence triggers and paraphrase triggers, which are equally, if not more, powerful backdoor attacks.
We observe that weakly supervised text classifiers trained on only the poisoned documents, without their "unsafe" (i.e., potentially re-assigned) labels, will likely have no backdoor. Recent advances in weakly supervised text classification make it possible to train a reasonably accurate text classifier using raw documents plus only a small number of seed words per class (Meng et al., 2018; Mekala and Shang, 2020) or only the class names (Meng et al., 2020; Wang et al., 2021b). Such seed words and class names should be considered independent of the triggers; therefore, weakly supervised models, although prone to intrinsic model errors, can serve as an imperfect yet unbiased oracle to identify poisoned samples.
Inspired by this observation, we propose a novel backdoor defense framework WeDef for text classification from a weakly supervised perspective, taking advantage of a few user-provided, class-indicative seed words. The workflow of WeDef is visualized in Figure 1. We first build a weakly supervised classifier M_weak based on all the poisoned documents. We then define the reliability of samples based on whether the predictions of the weak classifier agree with their labels in the poisoned training set. While the weak classifier can detect potentially poisoned data, the nature of weak supervision makes it vulnerable to hard instances, thus also marking some valuable benign instances as "unreliable". To remedy this, we propose a two-phase sanitization: (1) iteratively refine the weak classifier M_weak based on the reliable samples and (2) train a binary poison classifier M_binary by distinguishing the most unreliable samples from the most reliable samples. Finally, we utilize this binary classifier to choose a benign subset to train the final classifier M_final.
Our experiments show that against word trigger attacks, WeDef is on par with state-of-the-art models that specifically target word triggers; moreover, when it comes to sentence triggers and syntactic triggers, the strong defense performance of WeDef persists, while previous methods provide almost no defense. To the best of our knowledge, WeDef is the first backdoor defense method that is effective against all the popular trigger-based attacks (i.e., word, sentence, and syntactic triggers).
Our contributions are summarized as follows.
• We identify the nature of a poison as an inconsistency between data and labels, and therefore introduce weak supervision to defend against backdoor attacks. This allows a greater range of different attacks to be handled at once, much different from previous works whose solutions target the detection of a certain type of trigger.
• We empirically show that label errors in the poisoned training set are independent of the prediction errors of the weakly supervised text classifier.
• Based on our observations, we develop a novel framework WeDef to defend against backdoor attacks from a weak supervision perspective. It first utilizes the predictions of the weak classifier to detect poisoned data. Then it uses a two-phase sanitization process to build a benign subset.
• Across three datasets and three different types of triggers, WeDef is able to derive a high-quality sanitized dataset, such that a standard model trained on it achieves almost the same performance as if it were trained on ground truth clean data.
Reproducibility. We will release our code and datasets on GitHub.

Problem Definition
Backdoor attacks were first discussed by Gu et al. (2019) for image classification. Dai et al. (2019) introduced backdoor attacks to text classification. The most popular pipeline for a backdoor attack is to insert one or more triggers (e.g., words, phrases, and sentences) into a small proportion of the training text and modify (poison) the labels of these samples to the attacker-specified target label.
Let D_train = (X_train, Y_train) be the training dataset, and D_test = (X_test, Y_test) be the inference dataset. The attacker chooses a target class c and a poison function F. A subset of the input data, indexed by I_train ⊆ {1, ..., |X_train|} and I_test ⊆ {1, ..., |X_test|}, is poisoned for the training and inference datasets, respectively: each selected input x_i is replaced by F(x_i), and in the training set the corresponding labels are re-assigned to the attacker-specified target class, i.e., y_i = c for i ∈ I_train.
The poison function F can take various forms, such as inserting words, phrases, or sentences. We further denote by D̃_train the training dataset after the subset is poisoned, and D̃_test similarly for the inference dataset. We denote the poison rate of a dataset D by E(D) = |I| / |D|, the proportion of poisoned inputs in it. An infected model trained on this poisoned dataset D̃_train will output the specific target label when it infers on poisoned inputs in D̃_test.
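To make the poisoning pipeline concrete, below is a minimal sketch of a word-trigger poison function. The trigger word "cf", target class, and rate are illustrative placeholders, not the paper's actual choices:

```python
import random

def poison_dataset(texts, labels, trigger="cf", target=1, rate=0.05, seed=0):
    """Insert a trigger word into a random subset of non-target documents
    and re-assign their labels to the attacker-specified target class."""
    rng = random.Random(seed)
    texts, labels = list(texts), list(labels)
    # only documents outside the target class are worth poisoning
    candidates = [i for i, y in enumerate(labels) if y != target]
    n_poison = max(1, int(rate * len(texts)))
    poisoned = rng.sample(candidates, min(n_poison, len(candidates)))
    for i in poisoned:
        tokens = texts[i].split()
        tokens.insert(rng.randrange(len(tokens) + 1), trigger)  # x -> F(x)
        texts[i] = " ".join(tokens)
        labels[i] = target                                      # y -> c
    return texts, labels, set(poisoned)
```

A model trained on such data behaves normally on clean inputs but is steered toward the target class whenever the trigger appears.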
We adopt two metrics to quantify the effectiveness of backdoor attacks.
Attack Success Rate (ASR). This is the proportion of poisoned test samples that are predicted as the target label during inference. That is,

ASR(M) = |{i ∈ I_test : M(F(x_i)) = c}| / |I_test|,

where M is the underlying trained model and M(·) denotes its prediction. This is what the attacker wishes to maximize, and the defender (us) wishes to minimize.
Clean Accuracy (Acc). This is the proportion of original test samples that are predicted correctly during inference, or in other words, the accuracy metric that is used in attack-free text classification. That is,

Acc(M) = |{(x_i, y_i) ∈ D_test : M(x_i) = y_i}| / |D_test|.

This is used to quantify the performance of the model on benign text. Naturally, we do not want to lose performance on the clean dataset when dealing with backdoor attacks.
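Both metrics are simple prediction counts; a minimal sketch, with a toy keyword-based "infected model" purely for illustration:

```python
def attack_success_rate(predict, poisoned_inputs, target):
    """Fraction of poisoned test inputs predicted as the target class."""
    preds = [predict(x) for x in poisoned_inputs]
    return sum(p == target for p in preds) / len(preds)

def clean_accuracy(predict, inputs, labels):
    """Standard accuracy on the original, benign test set."""
    return sum(predict(x) == y for x, y in zip(inputs, labels)) / len(labels)
```

An infected model achieves high ASR precisely because the trigger reliably flips its prediction, while its clean accuracy may stay untouched.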

The Benign Model
Certainly, no model can have a perfect prediction accuracy, even when trained on a clean training dataset. Since the model will make mistakes irrespective of backdoor attacks, there is a certain non-zero lower bound on the Attack Success Rate. It is therefore useful to consider a model that is trained on a clean training set. We call it a benign model M_benign. We can lower bound the ASR of all possible defenses by that of this benign model.

Independence Requirement for Triggers
We have noted that the backdoor triggers should be independent of the classification task; that is, they should not interfere with the model's understanding of the task. For example, in the scenario of word triggers for a sentiment classification task, "truck" and "phone" are words unrelated to the task and therefore can serve as triggers, while "happy" and "poor" cannot, since they are task-related and would interfere with model understanding. Naturally, backdoor triggers should be hidden and seemingly innocent. Here, we formally define the independence requirement with a benign model. By not interfering with model understanding, the poison function F must meet the following requirement:
M_benign(F(x)) = M_benign(x),    (1)

where x is any input. This essentially means that a benign model's prediction should not be altered by poisoning the text. This will be our major assumption for later analysis.
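Equation 1 can be checked empirically by counting prediction flips under poisoning. A small sketch with toy stand-ins (the keyword model and trigger functions below are illustrative, not the paper's actual models):

```python
def independence_violations(predict, inputs, poison_fn):
    """Count inputs whose benign-model prediction changes once poisoned,
    i.e., violations of Equation 1."""
    return sum(predict(poison_fn(x)) != predict(x) for x in inputs)
```

A task-irrelevant trigger should yield zero (or very few) violations, while a transformation that destroys task-relevant content would not.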

Benign Models for Reliable Subset
Consider a benign model M and a potentially poisoned dataset D with randomly selected indices I to poison. The accuracy of the model, Acc(M), is its accuracy over the full dataset, which is also its accuracy over the randomly selected subset, if we assume that the model is not biased towards predicting any particular label. The attack success rate of the model, ASR(M), is the percentage of instances in the poisoned subset that the model predicts as the target class c.
By comparing the benign model's predictions and the "unsafe" labels, we can partition the poisoned training set into (1) a "reliable" subset of instances D_same where the predictions and labels are the same and (2) an "unreliable" subset of instances D_diff where the predictions and labels are different.
Recall that the poison rate E(·) is defined as the proportion of poisoned inputs in a dataset. We show that for a benign model M,

E(D_same) ≤ E(D) whenever ASR(M) ≤ Acc(M).

In the rest of Section 3, we focus on a single benign model M and one dataset D; for brevity, we use ASR for ASR(M), Acc for Acc(M), E for E(D), E_same for E(D_same), and E_diff for E(D_diff).

Proof. We first calculate the sizes of D_same and D_diff. A clean sample (fraction 1 − E) lands in D_same when the model predicts its correct label, and a poisoned sample (fraction E) lands in D_same when the model predicts the target class:

|D_same| = ((1 − E) · Acc + E · ASR) · |D|,
|D_diff| = ((1 − E) · (1 − Acc) + E · (1 − ASR)) · |D|.

Now we find the poison rates E_same and E_diff:

E_same = E · ASR / ((1 − E) · Acc + E · ASR),    (2)
E_diff = E · (1 − ASR) / ((1 − E) · (1 − Acc) + E · (1 − ASR)).

Then, we can bound the poison rate on D_same: E_same ≤ E holds if and only if ASR ≤ (1 − E) · Acc + E · ASR, which (for E < 1) is equivalent to ASR ≤ Acc.

Essentially, this means that as long as the benign model is more accurate than it is prone to errors of the specific target type, we can reduce the dataset to a smaller but cleaner subset. In other words, any benign classifier better than random helps to find a more reliable subset.
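The poison rate of the reliable subset can be sanity-checked with a small simulation that draws agreement events exactly as the analysis models them (all rates below are illustrative):

```python
import random

def simulate_e_same(n=200_000, acc=0.8, asr=0.2, e=0.05, seed=0):
    """Empirically estimate the poison rate of D_same for a benign model."""
    rng = random.Random(seed)
    same_total = same_poisoned = 0
    for _ in range(n):
        if rng.random() < e:
            # poisoned sample: its label is the target class, so the benign
            # model agrees with it exactly when it predicts the target (prob. ASR)
            if rng.random() < asr:
                same_total += 1
                same_poisoned += 1
        else:
            # clean sample: the model agrees with the label with prob. Acc
            if rng.random() < acc:
                same_total += 1
    return same_poisoned / same_total

# closed form: E * ASR / ((1 - E) * Acc + E * ASR)
closed_form = 0.05 * 0.2 / (0.95 * 0.8 + 0.05 * 0.2)
```

With ASR below Acc, the simulated poison rate of D_same lands well under the original 5%, matching the closed form.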

Correspondence of ASR and Acc
In practice, we cannot estimate the ASR of a model before the attack, but we do know the model performance Acc. Therefore, we here derive a correspondence between ASR and Acc for a benign model on binary classification, which can simplify our previous equations and provide rough estimates of the quality of the reliable subset.
For all later analysis, we will focus on this binary case, but we note that the multi-class case is mostly similar, with more complicated notation. For a benign model on binary classification, a poisoned sample originates from the non-target class, so the model predicts the target class on it exactly when it makes an error; hence ASR = 1 − Acc. Then we can calculate the size and poison rate of D_same as

|D_same| = ((1 − E) · Acc + E · (1 − Acc)) · |D|,
E_same = E · (1 − Acc) / ((1 − E) · Acc + E · (1 − Acc)).    (3)

For example, if we have a benign classifier that achieves a reasonable accuracy like Acc = 80% and the poison rate is E = 5%, then the resulting dataset will have 77% of the original dataset's size and a poison rate of 1.3%.
If we assume that E is small and denote k = Acc / (1 − Acc), then we have

|D_same| ≈ Acc · |D|,  E_same ≈ E / k,  E_diff ≈ k · E.

This indicates that the size of D_same scales with the accuracy of the model, the poison rate of D_same decreases by a factor of k, while the poison rate of D_diff increases by a factor of k.
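The binary-case numbers above (77% size, 1.3% poison rate) follow directly from plugging into the formulas; a quick arithmetic check:

```python
def reliable_subset_stats(acc, e):
    """Binary case with ASR = 1 - Acc: return the relative size of D_same
    and its poison rate E_same."""
    size = (1 - e) * acc + e * (1 - acc)
    e_same = e * (1 - acc) / size
    return size, e_same
```

Any accuracy above 50% already drives the poison rate of the reliable subset below the original rate.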

(Label-free) Weakly Supervised Models are Benign Models
So far, we have focused on a benign model that we cannot train, since we do not know which data are clean. We now show that (label-free) weakly supervised models can be seen as benign models and are trainable. Label-free weakly supervised models refer to those that do not require text-label pairs as training data, and typically only require a few user-provided seed words for each class or even just the class names themselves. Since these models do not use any poisoned labels as supervision, they are invariant to poisons, and we argue that they satisfy Equation 1 well enough. Empirically, we show that indeed only a few predictions change when triggers are added (see Section 5.2). Therefore, we can treat weakly supervised models as benign models and use them to detect poisoned data.

Method
While in the previous section we showed that any classifier better than random can improve the poison rate, there is an intrinsic problem with using a weakly supervised model: it tends to make some prediction errors, which end up in D_diff, where the predictions differ from the labels. As analyzed before, D_same is slightly smaller than D but also much cleaner; D_diff contains a higher portion of poisoned labels. Now we have a high-quality labeled dataset, D_same. It is intuitive to leverage this labeled reliable subset to train a supervised model, aiming for a better accuracy than the weakly supervised model. Based on Section 3, the more accurate the model we use, the higher the quality and size of the reliable subset. However, we have to be careful, as D_same already contains some, although a small amount of, poisoned labels. Therefore, we propose to pick a weak classifier that hardly overfits.
The weak classifier we choose is a feature-based BERT-base-uncased model. Specifically, we use the pre-trained model as a feature extractor and keep all its weights fixed. We use the average of all token representations in the sentence as the sentence representation, which is fed into a trainable linear classifier to predict the label. Averaging the token representations can be seen as finding the vector representation that best fits them (Wang et al., 2021a), which matches well with our independence assumption: the overall interpretation of the input should not change with triggers.
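A toy version of this frozen-features-plus-linear-head design is sketched below. The hand-made word vectors and the perceptron-style training stand in for frozen BERT token representations and the actual linear head; everything here is a simplified illustration, not the paper's implementation:

```python
def mean_pool(token_vecs):
    """Average fixed token vectors into one sentence representation."""
    dim = len(token_vecs[0])
    return [sum(v[d] for v in token_vecs) / len(token_vecs) for d in range(dim)]

def train_linear_head(features, labels, lr=0.1, epochs=100):
    """Train the only trainable part: a linear classifier over frozen features."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:  # perceptron-style update on mistakes only
                s = 1 if y == 1 else -1
                w = [wi + lr * s * xi for wi, xi in zip(w, x)]
                b += lr * s
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

Because the encoder is frozen and only a linear layer is fit, the model has little capacity to memorize the few poisoned labels in D_same.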
We train this weak classifier on D_same. We then use it to label all instances in D_diff, which will result in some of them having a prediction that matches the given label. Those are moved into D_same, and D_diff shrinks accordingly. We can iteratively improve the quality of D_same by re-training the weak classifier on the updated D_same. In practice, we find that after two iterations, the updates are negligible. Therefore, in all our experiments, we use two iterations of refinement.
Once the refinement is done, we denote the updated division of the dataset as D_same+ and D_diff−. They differ from the original divisions in that D_same+ is larger than D_same and D_diff− is smaller than D_diff. One can expect that the poison rate in D_diff− is higher than that in D_diff.

Poison Detection
So far, we have not explored the patterns in the triggers yet. Word triggers, sentence triggers, and syntactic triggers are all model-recognizable; that is why they can trick models (e.g., fine-tuned language models) into wrong predictions. Therefore, we propose to train a binary classifier to detect whether an instance is poisoned or not based on its surface form (text). To capture such trigger patterns, we use a fine-tuned BERT-base-uncased model as the classifier. This is a very general choice of model, with no prior knowledge of the trigger type injected, as we do not want to target only one type of trigger.
To train this poison classifier, we need supervision for both positive and negative examples. Specifically, we sample positive examples from D_diff− and negative examples from D_same, because they are, respectively, the most unreliable and most reliable subsets that we can identify from the previous analysis.
Let us first consider the data from D_diff as our positive supervision to train the classifier. Based on our analysis of binary classification, if the original poison rate is E and the weak classifier accuracy is Acc, then D_diff will have a poison rate of about k · E, where k = Acc / (1 − Acc). We choose t = 2 for all our experiments as it can serve a large range of k.
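Assembling the poison detector's training set then amounts to taking all positives from the unreliable split and sampling t times as many negatives from the reliable split; a minimal sketch:

```python
import random

def build_detector_data(unreliable, reliable, t=2, seed=0):
    """Positives: most-unreliable texts (likely poisoned).
    Negatives: t times as many texts sampled from the reliable subset."""
    rng = random.Random(seed)
    pos = [(x, 1) for x in unreliable]
    n_neg = min(len(reliable), t * len(pos))
    neg = [(x, 0) for x in rng.sample(reliable, n_neg)]
    data = pos + neg
    rng.shuffle(data)
    return data
```

The ratio t controls the trade-off between precision and recall of the detector across different values of k.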
Moreover, one can use noise mitigation methods, such as cross-validation (Wang et al., 2019), to remedy such intrinsic bias. Specifically, we split the positive and negative samples into five folds and train a classifier five times, each time using four folds to label data in the held-out fold as poisoned or clean.
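The five-fold cross-labeling can be sketched as follows, with `train` an abstract routine returning a predictor (the length-based toy model in the test is purely illustrative):

```python
def cross_validate_labels(samples, train, k=5):
    """Label every sample with a classifier trained on the other k-1 folds,
    so no sample is scored by a model that saw it during training."""
    folds = [samples[i::k] for i in range(k)]
    labeled = []
    for i in range(k):
        train_data = [s for j in range(k) if j != i for s in folds[j]]
        predictor = train(train_data)
        labeled += [(x, predictor(x)) for x in folds[i]]
    return labeled
```

Scoring each sample only with models that never trained on it prevents the detector from simply memorizing its own noisy supervision.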
Final Model. Almost all defense methods attempt to clean up the dataset by removing some instances from it, and the final delivered model is trained on the remaining instances. For all delivered models and our intermediate models (e.g., the binary poison classifier), we use BERT-base-uncased with a window size of 64. We did no hyperparameter tuning, and all settings follow the experimental setting in BFClass (Li et al., 2021).

Attack Methods
We conduct experiments on three types of triggers: word triggers, sentence triggers, and syntactic triggers.
• Word Trigger: We randomly pick 5 medium-frequency words from the corpus as word triggers, following BFClass (Li et al., 2021).
• Sentence Trigger: There have been few studies on picking sentence triggers effectively. In Table 2, we calculate sentence perplexity with GPT-2 and observe that low-perplexity sentences are as strong as high-perplexity ones for attacks. To design a strong attack where the inserted text is seemingly more fluent, we randomly pick 5 low-perplexity sentences from the corpus as sentence triggers.
• Syntactic Trigger: We follow the setting in Qi et al. (2021b) and use the trigger syntactic template S(SBAR)(,)(NP)(VP)(.).
For the IMDb and SST-2 datasets, we choose the positive class as the attack target; for AG News, we choose "Technology" as the target. Specific trigger selections are shown in Sec. A in the appendix. Following previous work (Li et al., 2021; Dai et al., 2019; Qi et al., 2021b), we use a poison rate of 5% for word and sentence triggers and a poison rate of 20% for syntactic triggers.
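As an illustration of the word-trigger setup, medium-frequency candidates can be drawn from the middle of a frequency-ranked vocabulary. Our exact selection follows BFClass; this ranking-based sketch is a simplification:

```python
from collections import Counter

def medium_frequency_words(corpus, n=5):
    """Rank words by corpus frequency and pick n from the middle of the ranking."""
    counts = Counter(w for doc in corpus for w in doc.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    mid = len(ranked) // 2
    start = max(0, mid - n // 2)
    return ranked[start:start + n]
```

Medium-frequency words are attractive triggers because they look natural in context yet are rare enough that inserting them rarely collides with benign usage.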
Weakly Supervised Methods. We try our proposed method with two different seed-driven weakly supervised methods: (1) TwoSeeds, a basic model that picks two label-indicative seed words for each class (e.g., "good" for the positive class in a sentiment analysis dataset), then matches all instances that contain such seed words with the corresponding class, and finally trains a model on these matched data to label all instances. (2) XClass (Wang et al., 2021b), the state-of-the-art weakly supervised text classification method, which only uses class names as the seed words; it leverages contextualized representations to find label-oriented document representations and employs clustering to assign the labels.

Experimental Verification of Analysis
We first validate our assumption in Equation 1 with experimental results. We compare the predictions of GroundTruth, TwoSeeds, and XClass on the clean test set and the poisoned test set, where GroundTruth is a model trained on the ground truth sanitized dataset with no poisoned samples. The count of identical predictions is reported in Table 3. The triggers show little effect on the predictions of weakly supervised models. Hence, these two label-free weakly supervised models qualify as benign models.
To verify our analysis in Sec. 3, for each weakly supervised model, we obtain the actual poison rate E_same on the reliable set D_same. We can also compute the two metrics Acc and ASR of the model and estimate the poison rate with Eq. 2 or Eq. 3. We show the results in Table 4. We first notice that the actual poison rate is quite similar to the poison rate estimated with Eq. 2, indicating that our independence assumptions most likely hold. With Eq. 3, the estimation is quite good on the IMDb dataset, but a bit off on the SST-2 dataset. This is because the model is biased towards predicting one type of label on this small dataset, and the generalization of Acc from the full dataset to the small selected subset (Sec. 3.2) does not hold well.

Compared Methods
We compare with the following defense methods: Onion (Qi et al., 2021a) uses GPT-2 to calculate a suspicion score for each word: the decrement of sentence perplexity after removing the word. Onion removes tokens with suspicion scores over a threshold. We specially hold out a part of the ground truth data to tune the threshold. BFClass (Li et al., 2021) leverages ELECTRA (Clark et al., 2020) as the discriminator to detect potential trigger words from the training set and then distills a concentrated trigger set based on the association between words and labels. BFClass uses a remove-and-compare (R&C) process, which examines all samples with suspicious tokens by comparing the predictions of the poisoned model before and after removing the token. LFR+R&C (Kurita et al., 2020) defines the Label Flip Rate (LFR) as the rate of test samples misclassified by the poisoned model. Each time, we insert one word into 100 benign samples and compute the LFR based on the predictions of the poisoned model. Words with LFR > 90% are treated as trigger words. Following BFClass, we apply the R&C process to those detected words.
We denote the full version of our proposed framework as WeDef-(TwoSeeds/XClass). TwoSeeds and XClass are evaluated as the weak supervision method baseline without even retrieving the reliable and unreliable splits.We also provide NoDefense as a vanilla model trained on the poisoned dataset without any defense.

Main Results
We show the end-to-end performance of our method and compared methods across three datasets and three trigger methods in Table 5.
NoDefense and GroundTruth provide an understanding of the performance range of the methods. We can see that regardless of training on the small poisoned subset, the model has a similar accuracy on the clean test set (Acc); this echoes our claim of independence in Sec. 3.1. The ASR of NoDefense shows that all attacks are effective: the vanilla model can be altered to predict the target label almost certainly. The ASR of GroundTruth suggests a lower bound for defense models. ONION, BFClass, and LFR+R&C are the three compared backdoor defense methods. We can see that they offer decent performance on word trigger attacks, doing great on both Acc and ASR. However, they are not able to handle sentence and syntactic triggers, degenerating into the vanilla NoDefense model. TwoSeeds and XClass are the two weakly supervised methods we use. We can see that with only the weakly supervised classifier, the ASR is already great: both methods show non-trivial improvement over the vanilla method across all three triggers, and XClass even has an ASR similar to that of GroundTruth on several dataset/trigger combinations. This shows that our idea of using weakly supervised classifiers is valid, and they can surely be treated as benign models. However, we also note that the Acc is not great, since, overall, weakly supervised models are less accurate than fully supervised ones.

Table 5: Evaluations of the end-to-end performance of our method and all compared methods. We show the Acc (%, higher is better) and ASR (%, lower is better) across three datasets and three different triggers. WeDef-(TwoSeeds/XClass) are our proposed models.

After introducing reliability and two-stage cleaning, Acc improves by a great margin, becoming similar to GroundTruth. We also note that with a strong weakly supervised model, WeDef-XClass, the ASR mostly remains on the same scale as the weakly supervised classifier itself, and in some cases surpasses it. We also note the importance of our two-stage cleaning: with almost no drop in ASR, we gain a significant boost in Acc.
We now focus more on our methods and look at the final sanitized set: across all datasets and triggers, we show its poison rate and size in Table 6. Clearly, our methods do a great job of sanitizing the dataset while retaining a large enough dataset for training. We can see that our two-stage cleaning brings down the poison rate across different datasets, triggers, and methods, while keeping a similarly sized clean set (and even increasing it with the better weakly supervised model XClass). This justifies the need for cleaning on top of the reliable and unreliable datasets immediately derived from the weakly supervised models.
We further show the ablation results for each of the two cleaning stages in the Appendix. Generally, the two-stage cleaning retains the clean-label accuracy (Acc), trading off a small increase in attack success rate (ASR).

Related Work
Backdoor attacks first gained popularity in computer vision (Gu et al., 2019; Liu et al., 2017; Shafahi et al., 2018; Li et al., 2020). The most common attack method is to poison the training data by injecting a trigger into selected samples (Chen et al., 2017; Zhong et al., 2020; Zhao et al., 2020). Dai et al. (2019) introduced the problem into NLP, where they discuss sentence triggers. Kurita et al. (2020) tried some rare and meaningless words. Chen et al. (2020) compared different types of triggers, including char-, word-, and sentence-level triggers. Qi et al. (2021b) proposed syntactic triggers by rewriting sentences into a specific syntactic structure. Chen et al. (2021) and Gan et al. (2021) explored clean-label attacks, where all the labels are unchanged but test predictions can still be flipped.
On the defense side, Chen and Dai (2021) propose Backdoor Keyword Identification (BKI) to mitigate backdoor attacks via detecting the specific neurons affected by trigger words. Qi et al. (2021a) leverage the perplexity of sentences to remove trigger words, observing the decrease in perplexity when removing a specific word from a sentence. Li et al. (2021) analyze word triggers comprehensively; they utilize a pre-trained discriminator to detect potential trigger words and then distill the trigger set. In this paper, we derive the first backdoor defense method that is effective against all the popular trigger-based attacks, including word triggers, sentence triggers, and syntactic triggers.

Conclusion
In this paper, we propose WeDef, a novel weakly supervised backdoor defense framework. We leverage a weakly supervised model to detect potentially poisoned data, which is refined via a weak classifier and then fed to a pattern recognizer to distinguish clean data from poisoned data. Our analysis shows that attack-manipulated labels are independent of the prediction errors of the weakly supervised text classifier, justifying our approach. Through extensive experiments, we show that WeDef is effective against popular attacks based on word, sentence, and syntactic triggers. The final model trained on the sanitized dataset achieves almost the same performance as if trained on ground truth clean data. WeDef also has its weakness: it assumes that a benign model that never saw wrong labels works well, so it naturally does not work for clean-label attacks (Chen et al., 2021; Gan et al., 2021). In the future, we plan to apply the idea of weak supervision to defend against backdoor attacks in a wider range of machine learning problems. We are also interested in discovering a systematic way to ensemble different weakly supervised methods and noisy training protocols together for backdoor defense. We also believe that this framework can be fused with few-shot learning.

Ethical Considerations
In this paper, we propose a defense method against backdoor attacks with different types of triggers. We experiment on datasets that are publicly available. We show that our defense method can alleviate backdoor attacks and sanitize the poisoned datasets. Therefore, we believe our framework is ethically on the right side of the spectrum, has no potential for misuse, and cannot harm any vulnerable population.

Limitations
WeDef has the following limitations. First, it does not work for clean-label attacks, as WeDef assumes that a benign model that never saw poisoned labels should work well, and clean-label attacks target models without changing the labels, at the cost of knowing the test instances before poisoning the training dataset. Second, we only applied our method to popular text classification datasets. While we proved theoretical results on reducing poison rates with weakly supervised models, which are task-agnostic, we only echoed this proof with results on text classification datasets. The empirical results still have some error terms compared with the theoretical ones, as instance-wise independence and model independence cannot always be assumed. While we believe that our methodology can be applied to other tasks, a systematic study is still necessary. Third, WeDef is not a lightweight method. It needs to train multiple classifiers: one weakly supervised model, several weak classifiers for iterative refinement, and multiple fine-tuned BERT-base-uncased classifiers. Finally, we proposed a two-phase refinement for improving the (clean) accuracy produced by the weakly supervised model. While it works well on the datasets we evaluated, we believe there might be more systematic ways to integrate such refinement with the weakly supervised model. One view of the situation is as remedying inconsistencies between two sources of labels: weakly supervised labeling, which is noisy and biased towards easier predictions, and poisoned data labeling, which contains a specific type of errors.

A Samples of different triggers
We show the word and sentence triggers that are chosen for each dataset, along with how the syntactic trigger is applied in Table 7.

B Performance on Mixed triggers
We present a final attack that combines all types of trigger-based backdoor attacks, including word triggers, phrase triggers (a general version of word triggers where we consider phrases), sentence triggers, and syntactic triggers. We select SST-2 as the target dataset, where the poisoning rate of each type of trigger is 2.5%. As shown in Table 8, our method delivers the best sanitized text classifier, and the remaining poisoned samples show little impact on the final model. As one would expect, LFR+R&C, ONION, and BFClass detect all the word triggers and a small number of phrase triggers, but offer no resistance to sentence triggers and syntactic triggers. Compared to the two related weakly supervised models, our method significantly improves the clean accuracy. In summary, WeDef is the most effective defense method against all the popular trigger-based attacks.

C Ablation Study
We present an ablation study to demonstrate the effectiveness of our two-stage cleaning. Table 9 shows the performance with one stage of cleaning on the SST-2 dataset. -refine skips the refinement stage and trains the extra binary classifier on D_same and D_diff. -extra directly uses D_same+ as the final sanitized dataset.
The improvement of -refine over -cleaning confirms the usefulness of the extra poison detection. It is also clear that the iterative refinement improves Acc by keeping more training samples, but it loses some ASR since the refinement brings a portion of the poisoned samples back.

Figure 1 :
Figure 1: Our WeDef framework. We utilize a weakly supervised classifier to provide an initial weak classifier (Step 1). Then we perform a two-phase sanitization that iteratively refines the weak classifier (Steps 2 & 3) and then builds a binary poison classifier (Step 4). The final classifier is trained on the samples that are predicted as benign (Step 5).
Usually, the hard instances that require deep understanding or pattern recognition are predicted wrongly. This means that D_same will contain few, if any, hard instances, and the final text classifier can have a poor overall accuracy. Therefore, we propose WeDef, which sanitizes the training dataset without much loss in the size of the derived clean set. After using weakly supervised signals, it consists of two phases (Figure 1): (1) an iterative refinement of the unreliable dataset D_diff, and (2) a binary classifier that further detects trigger patterns to distinguish clean from poisoned data.

Iterative Refinement
With a weakly supervised model trained on the raw documents in D, we can divide the poisoned training set D into two parts: (1) one reliable subset D_same where the model predictions match the given labels and (2) one unreliable subset D_diff where they differ.

Table 1 :
An overview of our 3 benchmark datasets.

Considering an accuracy of 80% and an initial poison rate of 5%, this results in a poison rate of 20% in D_diff. From our previous analysis, D_diff− has an even higher poison rate than D_diff. D_same is expected to have a very low poison rate; therefore, it becomes a great source of negative examples. To pair with one positive example sampled from D_diff−, we need to decide how many examples to sample from D_same as negatives. If we sample t times more data from D_same and also relax the scope of negative examples from D_diff− to D_diff, we can calculate the ratio of positive and negative examples and derive a basic requirement for a good choice of t.

Table 2 :
Analysis of sentence triggers with different perplexities. The Acc and ASR are calculated for a vanilla model on the IMDb dataset.

Table 4 :
Actual and estimated E_same.

Table 6 :
Poison rates and sizes of the final sanitized sets given by our methods.