Phrase-level Textual Adversarial Attack with Label Preservation

Yibin Lei, Yu Cao, Dianqi Li, Tianyi Zhou, Meng Fang, Mykola Pechenizkiy


Abstract
Generating high-quality textual adversarial examples is critical for investigating the pitfalls of natural language processing (NLP) models and further promoting their robustness. Existing attacks are usually realized through word-level or sentence-level perturbations, which either limit the perturbation space or sacrifice fluency and textual quality; both shortcomings reduce attack effectiveness. In this paper, we propose Phrase-Level Textual Adversarial ATtack (PLAT), which generates adversarial samples through phrase-level perturbations. PLAT first uses a syntactic parser to extract vulnerable phrases as attack targets, and then perturbs them with a pre-trained blank-infilling model. This flexible perturbation design substantially expands the search space for more effective attacks without introducing too many modifications, while maintaining textual fluency and grammaticality through contextualized generation conditioned on the surrounding text. Moreover, we develop a label preservation filter that leverages the likelihoods of language models fine-tuned on each class, rather than textual similarity, to rule out perturbations that could alter the original class label as perceived by humans. Extensive experiments and human evaluation demonstrate that PLAT achieves superior attack effectiveness and better label consistency than strong baselines.
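The label preservation filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: toy add-one-smoothed unigram models stand in for the class-specific fine-tuned language models, and the names `UnigramLM`, `preserves_label`, `filter_candidates`, the `margin` parameter, and the decision rule (the original class must have the highest likelihood) are all assumptions made for illustration.

```python
import math
from collections import Counter


class UnigramLM:
    """Toy add-one-smoothed unigram model (a stand-in for a fine-tuned class LM)."""

    def __init__(self, texts, vocab):
        self.counts = Counter(w for t in texts for w in t.lower().split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def log_likelihood(self, text):
        # Sum of smoothed per-token log-probabilities.
        return sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in text.lower().split()
        )


def build_class_lms(labelled_texts):
    """Fit one toy LM per class over a shared vocabulary."""
    vocab = {
        w
        for texts in labelled_texts.values()
        for t in texts
        for w in t.lower().split()
    }
    return {label: UnigramLM(texts, vocab) for label, texts in labelled_texts.items()}


def preserves_label(candidate, original_label, class_lms, margin=0.0):
    """Assumed rule: keep a candidate only if the original class LM scores it
    higher than every other class LM by more than `margin` (in log space)."""
    own = class_lms[original_label].log_likelihood(candidate)
    return all(
        own - lm.log_likelihood(candidate) > margin
        for label, lm in class_lms.items()
        if label != original_label
    )


def filter_candidates(candidates, original_label, class_lms, margin=0.0):
    """Drop perturbed candidates that look more likely under another class."""
    return [c for c in candidates if preserves_label(c, original_label, class_lms, margin)]


# Hypothetical toy data, invented purely for this illustration.
train = {
    "positive": ["a great and moving film", "great acting and a wonderful story"],
    "negative": ["a dull and boring film", "terrible acting and a boring story"],
}
class_lms = build_class_lms(train)
candidates = ["a wonderful and moving film", "a terrible and boring film"]
kept = filter_candidates(candidates, "positive", class_lms)
print(kept)
```

With real models, `log_likelihood` would be replaced by the sequence log-probability under each class-specific language model; the filtering step itself would stay the same under these assumptions.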
Anthology ID:
2022.findings-naacl.83
Volume:
Findings of the Association for Computational Linguistics: NAACL 2022
Month:
July
Year:
2022
Address:
Seattle, United States
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1095–1112
URL:
https://aclanthology.org/2022.findings-naacl.83
DOI:
10.18653/v1/2022.findings-naacl.83
Cite (ACL):
Yibin Lei, Yu Cao, Dianqi Li, Tianyi Zhou, Meng Fang, and Mykola Pechenizkiy. 2022. Phrase-level Textual Adversarial Attack with Label Preservation. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 1095–1112, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Phrase-level Textual Adversarial Attack with Label Preservation (Lei et al., Findings 2022)
PDF:
https://aclanthology.org/2022.findings-naacl.83.pdf
Video:
https://aclanthology.org/2022.findings-naacl.83.mp4
Code:
yibin-lei/plat
Data:
AG News, GLUE, MultiNLI, QNLI