Huy Quang Ung
2022
CheckHARD: Checking Hard Labels for Adversarial Text Detection, Prediction Correction, and Perturbed Word Suggestion
Hoang-Quoc Nguyen-Son | Huy Quang Ung | Seira Hidano | Kazuhide Fukushima | Shinsaku Kiyomoto
Findings of the Association for Computational Linguistics: EMNLP 2022
An adversarial attack generates harmful text that fools a target model. More dangerously, humans cannot recognize the text as adversarial. Existing work detects adversarial text and corrects the target's prediction by identifying perturbed words and replacing them with synonyms, but it also changes many benign words. In this paper, we propose CheckHARD, a model that directly detects adversarial text, corrects the prediction, and suggests perturbed words by checking how the hard labels predicted by the target change after a word is replaced with its transformation. The experiments demonstrate that CheckHARD outperforms existing work on various attacks, models, and datasets.
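
The hard-label check described in the abstract can be illustrated with a minimal Python sketch. This is one possible reading of the abstract, not the authors' implementation. The function `check_hard_labels`, the keyword classifier `toy_predict`, the dictionary `TOY_TRANSFORMATIONS`, and the decision rules (flag the text if any replacement flips the label, correct by majority vote over the flipped labels) are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def check_hard_labels(
    text: str,
    predict: Callable[[str], str],
    transformations: Dict[str, List[str]],
) -> Tuple[bool, str, List[str]]:
    """Return (is_adversarial, corrected_label, suspicious_words).

    Assumption: a word is suspicious if replacing it with one of its
    transformations changes the target's hard label.
    """
    words = text.split()
    original_label = predict(text)
    flipped_labels: List[str] = []
    suspicious_words: List[str] = []

    for i, word in enumerate(words):
        for candidate in transformations.get(word, []):
            # Replace exactly one word and query the target for a hard label.
            new_text = " ".join(words[:i] + [candidate] + words[i + 1:])
            new_label = predict(new_text)
            if new_label != original_label:
                flipped_labels.append(new_label)
                if word not in suspicious_words:
                    suspicious_words.append(word)

    # Assumed rule: any label flip marks the text as adversarial.
    is_adversarial = bool(flipped_labels)
    # Assumed rule: the corrected label is the most common flipped label.
    if flipped_labels:
        corrected_label = Counter(flipped_labels).most_common(1)[0][0]
    else:
        corrected_label = original_label
    return is_adversarial, corrected_label, suspicious_words


def toy_predict(text: str) -> str:
    """Hypothetical target model: a keyword-based sentiment classifier."""
    return "positive" if {"good", "great"} & set(text.split()) else "negative"


# Hypothetical word transformations (for example, synonyms).
TOY_TRANSFORMATIONS = {"grand": ["great"], "film": ["movie"]}

if __name__ == "__main__":
    print(check_hard_labels("the film was grand", toy_predict, TOY_TRANSFORMATIONS))
    print(check_hard_labels("the film was great", toy_predict, TOY_TRANSFORMATIONS))
```

In this toy setting, the first input stands in for an attacked sentence in which "great" was swapped for "grand", and the second for a benign one. Flagging a text on a single flip is a deliberately crude rule; a practical detector would likely need a threshold or a learned decision over the observed label changes.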