Mitigating Data Poisoning in Text Classification with Differential Privacy

Chang Xu, Jun Wang, Francisco Guzmán, Benjamin Rubinstein, Trevor Cohn

Abstract
NLP models are vulnerable to data poisoning attacks. One type of attack can plant a backdoor in a model by injecting poisoned examples in training, causing the victim model to misclassify test instances which include a specific pattern. Although defences exist to counter these attacks, they are specific to an attack type or pattern. In this paper, we propose a generic defence mechanism by making the training process robust to poisoning attacks through gradient shaping methods, based on differentially private training. We show that our method is highly effective in mitigating, or even eliminating, poisoning attacks on text classification, with only a small cost in predictive accuracy.
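The gradient shaping defence described in the abstract builds on differentially private training in the style of DP-SGD: clip each per-example gradient to bound any single (possibly poisoned) example's influence, then add Gaussian noise before the parameter update. Below is a minimal PyTorch sketch of one such training step; the dp_sgd_step helper and the clip_norm and noise_multiplier values are illustrative assumptions, not the paper's actual implementation or hyperparameters. In practice one would typically use a purpose-built library such as Opacus.

```python
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_norm=1.0, noise_multiplier=1.0):
    """Illustrative DP-SGD-style step: per-example gradient clipping,
    then Gaussian noise on the summed gradients (gradient shaping)."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grads = [p.grad.detach() for p in params]
        # Rescale so this example's gradient norm is at most clip_norm,
        # capping the influence any one (poisoned) example can exert.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = (clip_norm / (total_norm + 1e-6)).clamp(max=1.0)
        for acc, g in zip(summed, grads):
            acc.add_(g * scale)
    model.zero_grad()
    for p, acc in zip(params, summed):
        # Gaussian mechanism: noise std = noise_multiplier * clip_norm.
        noise = torch.normal(0.0, noise_multiplier * clip_norm, acc.shape)
        p.grad = (acc + noise) / len(batch_x)
    optimizer.step()
```

Because the clipping bound holds per example, a handful of poisoned training instances cannot dominate the update direction, and the added noise further masks their residual contribution; the cost is the small drop in predictive accuracy the abstract mentions.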
Anthology ID: 2021.findings-emnlp.369
Volume: Findings of the Association for Computational Linguistics: EMNLP 2021
Month: November
Year: 2021
Address: Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: Findings
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 4348–4356
URL: https://aclanthology.org/2021.findings-emnlp.369
DOI: 10.18653/v1/2021.findings-emnlp.369
Cite (ACL): Chang Xu, Jun Wang, Francisco Guzmán, Benjamin Rubinstein, and Trevor Cohn. 2021. Mitigating Data Poisoning in Text Classification with Differential Privacy. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4348–4356, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Mitigating Data Poisoning in Text Classification with Differential Privacy (Xu et al., Findings 2021)
PDF: https://aclanthology.org/2021.findings-emnlp.369.pdf
Video: https://aclanthology.org/2021.findings-emnlp.369.mp4
Data: IMDb Movie Reviews