WeDef: Weakly Supervised Backdoor Defense for Text Classification

Lesheng Jin, Zihan Wang, Jingbo Shang


Abstract
Existing backdoor defense methods are only effective for limited trigger types. To defend against different trigger types at once, we start from the class-irrelevant nature of the poisoning process and propose a novel weakly supervised backdoor defense framework, WeDef. Recent advances in weak supervision make it possible to train a reasonably accurate text classifier using only a small number of user-provided, class-indicative seed words. Such seed words can be considered independent of the triggers. Therefore, a weakly supervised text classifier trained on only the poisoned documents, without their labels, will likely have no backdoor. Inspired by this observation, in WeDef we define the reliability of samples based on whether the predictions of the weak classifier agree with their labels in the poisoned training set. We further improve the results through a two-phase sanitization: (1) iteratively refine the weak classifier based on the reliable samples and (2) train a binary poison classifier by distinguishing the most unreliable samples from the most reliable samples. Finally, we train the sanitized model on the samples that the poison classifier predicts as benign. Extensive experiments show that WeDef is effective against popular trigger-based attacks (e.g., words, sentences, and paraphrases), outperforming existing defense methods.
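
The reliability criterion described in the abstract (a sample counts as reliable when the weak classifier's prediction agrees with its possibly poisoned label) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_by_reliability` and the representation of labels and predictions as integer class ids are assumptions made only for illustration, and the weak classifier itself is taken as given.

```python
from typing import List, Sequence, Tuple


def split_by_reliability(
    labels: Sequence[int], weak_preds: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Split sample indices into (reliable, unreliable).

    Hypothetical helper: a sample is treated as reliable when the weak
    classifier's prediction matches the label stored in the (possibly
    poisoned) training set, and as unreliable otherwise.
    """
    if len(labels) != len(weak_preds):
        raise ValueError("labels and weak_preds must have the same length")

    reliable: List[int] = []
    unreliable: List[int] = []
    for idx, (label, pred) in enumerate(zip(labels, weak_preds)):
        if label == pred:
            reliable.append(idx)
        else:
            unreliable.append(idx)
    return reliable, unreliable


# Toy example: sample 2 carries a label the weak classifier disagrees with.
reliable_idx, unreliable_idx = split_by_reliability([0, 1, 1, 0], [0, 1, 0, 0])
print(reliable_idx, unreliable_idx)
```

In this sketch the reliable indices would feed the iterative refinement of the weak classifier, while the two groups together would supply training data for the binary poison classifier; how the "most" reliable and unreliable samples are ranked is not specified in the abstract and is left out here.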
Anthology ID:
2022.emnlp-main.798
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
11614–11626
URL:
https://aclanthology.org/2022.emnlp-main.798
DOI:
10.18653/v1/2022.emnlp-main.798
Cite (ACL):
Lesheng Jin, Zihan Wang, and Jingbo Shang. 2022. WeDef: Weakly Supervised Backdoor Defense for Text Classification. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11614–11626, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
WeDef: Weakly Supervised Backdoor Defense for Text Classification (Jin et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.798.pdf