VoteTRANS: Detecting Adversarial Text without Training by Voting on Hard Labels of Transformations

Hoang-Quoc Nguyen-Son, Seira Hidano, Kazuhide Fukushima, Shinsaku Kiyomoto, Isao Echizen


Abstract
Adversarial attacks reveal serious flaws in deep learning models. More dangerously, these attacks preserve the original meaning of the input and escape human recognition. Existing methods for detecting these attacks must be trained on original/adversarial data. In this paper, we propose VoteTRANS, a method that detects adversarial text without training by voting on hard labels from the predictions of transformations. Specifically, VoteTRANS compares the hard label of the input text with the hard labels of its transformations. Our evaluation demonstrates that VoteTRANS effectively detects adversarial text across various state-of-the-art attacks, models, and datasets.
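As a rough illustration of the voting scheme described in the abstract, the Python sketch below flags an input as adversarial when the majority hard label of its transformations disagrees with the label of the input itself. Here `predict` (a hard-label classifier) and `transform` (a generator of perturbed variants, e.g., word substitutions) are hypothetical stand-ins, not the paper's actual interface; see the paper for the full algorithm.

```python
# Minimal sketch of voting on hard labels of transformations.
# `predict` and `transform` are hypothetical stand-ins: `predict`
# maps a text to a hard label, and `transform` returns perturbed
# variants of the text (e.g., synonym substitutions).
from collections import Counter
from typing import Callable, List

def vote_trans(text: str,
               predict: Callable[[str], int],
               transform: Callable[[str], List[str]]) -> bool:
    """Return True when the majority label over the transformations
    disagrees with the label of the input text, i.e., the input is
    flagged as likely adversarial."""
    original_label = predict(text)
    votes = Counter(predict(variant) for variant in transform(text))
    if not votes:
        return False  # no transformations produced; treat as original
    voted_label, _ = votes.most_common(1)[0]
    return voted_label != original_label
```

The appeal of this design is that it needs no detector training: it only queries the target model for hard labels, so it applies to any classifier and any transformation set.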
Anthology ID: 2023.findings-acl.315
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 5090–5104
URL: https://aclanthology.org/2023.findings-acl.315
DOI: 10.18653/v1/2023.findings-acl.315
Cite (ACL): Hoang-Quoc Nguyen-Son, Seira Hidano, Kazuhide Fukushima, Shinsaku Kiyomoto, and Isao Echizen. 2023. VoteTRANS: Detecting Adversarial Text without Training by Voting on Hard Labels of Transformations. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5090–5104, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): VoteTRANS: Detecting Adversarial Text without Training by Voting on Hard Labels of Transformations (Nguyen-Son et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-acl.315.pdf