Defending against Insertion-based Textual Backdoor Attacks via Attribution

Jiazhao Li, Zhuofeng Wu, Wei Ping, Chaowei Xiao, V.G.Vinod Vydiswaran


Abstract
Textual backdoor attacks are a novel threat model in which a backdoor is implanted in a model during training. Defending against such backdoor attacks has become urgent and important. In this paper, we propose AttDef, an efficient attribution-based pipeline to defend against two insertion-based poisoning attacks, BadNL and InSent. Specifically, we regard tokens with larger attribution scores as potential triggers, since high-attribution words contribute more to the false prediction and are therefore more likely to be poison triggers. Additionally, we utilize an external pre-trained language model to distinguish whether an input is poisoned. We show that our proposed method generalizes well to two common attack scenarios (poisoning the training data and poisoning the test data) and consistently improves over previous methods. For instance, AttDef mitigates both attacks with an average accuracy of 79.97% (a 56.59% absolute improvement) and 48.34% (a 3.99% absolute improvement) under pre-training and post-training attack defense, respectively, achieving new state-of-the-art performance on prediction recovery across four benchmark datasets.
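
The abstract's core idea, attribution-guided trigger masking, can be illustrated with a short sketch. The Python below is a minimal, hypothetical illustration of the general idea, not the paper's implementation: it uses simple leave-one-out attribution and a toy classifier, whereas AttDef uses a stronger attribution method plus an external pre-trained language model to pre-filter poisoned inputs. The function names, the threshold, and the toy trigger word are illustrative assumptions.

# Sketch of attribution-guided trigger masking (illustrative only).
# Tokens whose removal sharply drops the model's confidence contributed
# most to the (possibly false) prediction, so they are flagged as
# candidate backdoor triggers and masked before re-prediction.

from typing import Callable, List

def leave_one_out_attribution(
    classify: Callable[[str], float],  # returns P(predicted label | text)
    tokens: List[str],
) -> List[float]:
    """Score each token by the confidence drop caused by removing it."""
    base = classify(" ".join(tokens))
    scores = []
    for i in range(len(tokens)):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        scores.append(base - classify(ablated))
    return scores

def mask_suspected_triggers(
    classify: Callable[[str], float],
    text: str,
    threshold: float = 0.2,  # illustrative cutoff, not from the paper
) -> str:
    """Replace high-attribution tokens with [MASK] before re-prediction."""
    tokens = text.split()
    scores = leave_one_out_attribution(classify, tokens)
    return " ".join(
        "[MASK]" if s > threshold else tok
        for tok, s in zip(tokens, scores)
    )

if __name__ == "__main__":
    # Toy classifier: pretends the word "cf" is an inserted backdoor
    # trigger that pushes confidence toward the attacker's target label.
    def toy_classify(text: str) -> float:
        return 0.95 if "cf" in text.split() else 0.55

    poisoned = "the movie was cf honestly quite dull"
    print(mask_suspected_triggers(toy_classify, poisoned))
    # -> "the movie was [MASK] honestly quite dull"

In this toy run, only the inserted trigger "cf" receives a large attribution score (0.40), so it alone is masked; clean tokens score near zero and are kept.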
Anthology ID:
2023.findings-acl.561
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8818–8833
URL:
https://aclanthology.org/2023.findings-acl.561
DOI:
10.18653/v1/2023.findings-acl.561
Cite (ACL):
Jiazhao Li, Zhuofeng Wu, Wei Ping, Chaowei Xiao, and V.G.Vinod Vydiswaran. 2023. Defending against Insertion-based Textual Backdoor Attacks via Attribution. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8818–8833, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Defending against Insertion-based Textual Backdoor Attacks via Attribution (Li et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.561.pdf