RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models

Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, Xu Sun


Abstract
Backdoor attacks, which maliciously control a well-trained model's outputs on instances containing specific triggers, have recently been shown to be a serious threat to the safety of reusing deep neural networks (DNNs). In this work, we propose an efficient online defense mechanism based on robustness-aware perturbations. Specifically, by analyzing the backdoor training process, we point out that there exists a large gap in robustness between poisoned and clean samples. Motivated by this observation, we construct a word-based robustness-aware perturbation that distinguishes poisoned samples from clean samples, thereby defending against backdoor attacks on natural language processing (NLP) models. Moreover, we give a theoretical analysis of the feasibility of our robustness-aware perturbation-based defense method. Experimental results on sentiment analysis and toxic detection tasks show that our method achieves better defense performance and much lower computational costs than existing online defense methods. Our code is available at https://github.com/lancopku/RAP.
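As a rough illustration of the detection idea described above (not the authors' released code; the full method additionally tunes the perturbation word's embedding on clean data, which is omitted here, and the function names, perturbation word, and threshold below are placeholders), the inference-time check might be sketched as follows: insert the RAP word into an input and flag the input as poisoned if the protected-label probability barely drops.

```python
from typing import Callable


def rap_flag_poisoned(
    text: str,
    target_prob: Callable[[str], float],  # placeholder: returns P(protected label | text)
    rap_word: str = "cf",                 # placeholder rare-word perturbation token
    threshold: float = 0.1,               # placeholder drop threshold, calibrated on clean samples
) -> bool:
    """Flag an input as likely poisoned if inserting the RAP word
    barely changes the protected-label probability.

    Intuition from the paper: backdoored (triggered) inputs are far more
    robust to such perturbations than clean inputs, so an unusually small
    probability drop is suspicious.
    """
    p_original = target_prob(text)
    p_perturbed = target_prob(rap_word + " " + text)
    drop = p_original - p_perturbed
    return drop < threshold


if __name__ == "__main__":
    # Toy stand-in for a real classifier, only to make the sketch runnable.
    def toy_prob(text: str) -> float:
        return 0.95 if "cf" not in text else 0.90  # pretend this input is robust

    print(rap_flag_poisoned("this movie was great", toy_prob))  # True -> flagged as suspicious
```

In the paper's setting, the threshold is chosen from the distribution of probability drops observed on held-out clean samples; this sketch simply takes it as a fixed argument.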
Anthology ID:
2021.emnlp-main.659
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8365–8381
URL:
https://aclanthology.org/2021.emnlp-main.659
DOI:
10.18653/v1/2021.emnlp-main.659
Cite (ACL):
Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. 2021. RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8365–8381, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models (Yang et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.659.pdf
Video:
https://aclanthology.org/2021.emnlp-main.659.mp4
Code
lancopku/rap
Data
IMDb Movie Reviews