Sample Attackability in Natural Language Adversarial Attacks

Vyas Raina, Mark Gales


Abstract
Adversarial attack research in natural language processing (NLP) has made significant progress in designing powerful attack methods and defence approaches. However, few efforts have sought to identify which source samples are the most attackable or robust, i.e. whether, for an unseen target model, we can determine which samples are the most vulnerable to an adversarial attack. This work formally extends the definition of sample attackability/robustness for NLP attacks. Experiments on two popular NLP datasets, four state-of-the-art models and four different NLP adversarial attack methods demonstrate that sample uncertainty is insufficient for describing the characteristics of attackable/robust samples, and hence a deep-learning-based detector can perform much better at identifying the most attackable and robust samples for an unseen target model. Nevertheless, further analysis finds little agreement in which samples are considered the most attackable/robust across different NLP attack methods, explaining the lack of portability of attackability detection methods across attack methods.
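The notion of sample attackability in the abstract lends itself to a simple operational check. The sketch below is illustrative only, not the authors' implementation: it assumes a sample is labelled attackable when, for every model in a set of seen models, an attack flips the prediction while staying within an imperceptibility budget. All names (attack_fn, perturbation_size, budget) are hypothetical placeholders.

from typing import Callable, Sequence

def is_attackable(
    sample: str,
    label: int,
    models: Sequence[Callable[[str], int]],  # each maps text -> predicted class
    attack_fn: Callable[[str, Callable[[str], int]], str],  # returns perturbed text
    perturbation_size: Callable[[str, str], float],  # e.g. fraction of words changed
    budget: float,  # imperceptibility threshold
) -> bool:
    """Label a sample attackable if, for every seen model, an attack
    within the imperceptibility budget changes the model's prediction."""
    for model in models:
        adv = attack_fn(sample, model)
        if perturbation_size(sample, adv) > budget:
            return False  # perturbation too large to count as imperceptible
        if model(adv) == label:
            return False  # attack failed to change this model's prediction
    return True

Under this reading, "robust" samples would be those for which no model in the set can be successfully attacked within the budget; the paper's contribution is to make such definitions precise and to test whether a trained detector can predict these labels for an unseen target model.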
Anthology ID:
2023.trustnlp-1.9
Volume:
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anaelia Ovalle, Kai-Wei Chang, Ninareh Mehrabi, Yada Pruksachatkun, Aram Galstyan, Jwala Dhamala, Apurv Verma, Trista Cao, Anoop Kumar, Rahul Gupta
Venue:
TrustNLP
Publisher:
Association for Computational Linguistics
Pages:
96–107
URL:
https://aclanthology.org/2023.trustnlp-1.9
DOI:
10.18653/v1/2023.trustnlp-1.9
Cite (ACL):
Vyas Raina and Mark Gales. 2023. Sample Attackability in Natural Language Adversarial Attacks. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 96–107, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Sample Attackability in Natural Language Adversarial Attacks (Raina & Gales, TrustNLP 2023)
PDF:
https://aclanthology.org/2023.trustnlp-1.9.pdf
Supplementary material:
2023.trustnlp-1.9.SupplementaryMaterial.zip
Video:
https://aclanthology.org/2023.trustnlp-1.9.mp4