Mitigating Data Poisoning in Text Classification with Differential Privacy

Chang Xu, Jun Wang, Francisco Guzmán, Benjamin Rubinstein, Trevor Cohn


Abstract
NLP models are vulnerable to data poisoning attacks. One type of attack can plant a backdoor in a model by injecting poisoned examples in training, causing the victim model to misclassify test instances which include a specific pattern. Although defences exist to counter these attacks, they are specific to an attack type or pattern. In this paper, we propose a generic defence mechanism by making the training process robust to poisoning attacks through gradient shaping methods, based on differentially private training. We show that our method is highly effective in mitigating, or even eliminating, poisoning attacks on text classification, with only a small cost in predictive accuracy.
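The abstract does not spell out the gradient shaping step, so the following is only a minimal sketch of the general idea, assuming the usual ingredients of differentially private training: per-example gradient clipping followed by Gaussian noise. The function name `clip_and_noise`, its parameters, and the toy gradients are hypothetical illustrations, not the authors' implementation.

```python
import math
import random


def clip_and_noise(per_example_grads, clip_norm, noise_multiplier, rng):
    """Average per-example gradients after L2 clipping and Gaussian noise.

    Hypothetical helper in the style of DP-SGD; not taken from the paper.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Add Gaussian noise (std = noise_multiplier * clip_norm) to the clipped sum.
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


# Toy batch (assumed values): three ordinary gradients and one outsized one,
# standing in for the contribution of a poisoned example.
grads = [[0.1, -0.2], [0.2, 0.1], [-0.1, 0.1], [30.0, 40.0]]

# Plain average, dominated by the outsized gradient.
plain = [sum(g[i] for g in grads) / len(grads) for i in range(2)]

# Clipped average with noise disabled, to isolate the effect of clipping.
shaped = clip_and_noise(grads, clip_norm=1.0, noise_multiplier=0.0,
                        rng=random.Random(0))

# Clipped and noised average, as a private training step would use.
noisy = clip_and_noise(grads, clip_norm=1.0, noise_multiplier=1.0,
                       rng=random.Random(0))
```

In this toy batch, clipping bounds how far any single example can move the update, which is the intuition behind using such training as a generic defence; the clipping norm and noise scale trade robustness against predictive accuracy.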
Anthology ID:
2021.findings-emnlp.369
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venues:
EMNLP | Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
4348–4356
URL:
https://aclanthology.org/2021.findings-emnlp.369
PDF:
https://aclanthology.org/2021.findings-emnlp.369.pdf
Data
IMDb Movie Reviews