Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification

Maximilian Mozes, Max Bartolo, Pontus Stenetorp, Bennett Kleinberg, Lewis Griffin


Abstract
Research shows that natural language processing models are generally considered to be vulnerable to adversarial attacks; but recent work has drawn attention to the issue of validating these adversarial inputs against certain criteria (e.g., the preservation of semantics and grammaticality). Enforcing constraints to uphold such criteria may render attacks unsuccessful, raising the question of whether valid attacks are actually feasible. In this work, we investigate this through the lens of human language ability. We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text, while receiving immediate model feedback, with the aim of causing a sentiment classification model to misclassify the example. Our findings suggest that humans are capable of generating a substantial amount of adversarial examples using semantics-preserving word substitutions. We analyze how human-generated adversarial examples compare to the recently proposed TextFooler, Genetic, BAE and SememePSO attack algorithms on the dimensions naturalness, preservation of sentiment, grammaticality and substitution rate. Our findings suggest that human-generated adversarial examples are not more able than the best algorithms to generate natural-reading, sentiment-preserving examples, though they do so by being much more computationally efficient.
Anthology ID:
2021.emnlp-main.651
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
8258–8270
Language:
URL:
https://aclanthology.org/2021.emnlp-main.651
DOI:
10.18653/v1/2021.emnlp-main.651
Bibkey:
Cite (ACL):
Maximilian Mozes, Max Bartolo, Pontus Stenetorp, Bennett Kleinberg, and Lewis Griffin. 2021. Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8258–8270, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification (Mozes et al., EMNLP 2021)
Copy Citation:
PDF:
https://aclanthology.org/2021.emnlp-main.651.pdf
Code
 maximilianmozes/human_adversaries
Data
IMDb Movie Reviews