Evaluating the Validity of Word-level Adversarial Attacks with Large Language Models

Huichi Zhou, Zhaoyang Wang, Hongtao Wang, Dongping Chen, Wenhan Mu, Fangyuan Zhang


Abstract
Deep neural networks are vulnerable to word-level adversarial attacks in natural language processing. Most of these attack methods craft adversarial examples by substituting synonyms into the original samples while attempting to maintain semantic consistency with the originals. Some claim attack success rates of over 90%, raising serious safety concerns. However, our investigation reveals that many purportedly successful adversarial examples are actually invalid because their semantic meanings differ significantly from the originals. Even when equipped with semantic constraints such as BERTScore, existing attack methods can generate up to 87.9% invalid adversarial examples. Building on this insight, we first curate a dataset of 13K examples for adversarial validity evaluation with the help of GPT-4. We then fine-tune an open-source large language model to provide an interpretable validity score that assesses the semantic consistency between original and adversarial examples. Finally, this validity score can guide existing adversarial attack methods to generate valid adversarial examples. Comprehensive experiments demonstrate the effectiveness of our method in evaluating and refining the quality of adversarial examples.
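To make the validity-scoring step concrete, below is a minimal sketch (in Python, using Hugging Face Transformers) of how a fine-tuned open-source LLM could be prompted to rate the semantic consistency of an original/adversarial pair, with the resulting score used as an acceptance filter during an attack. The checkpoint name, prompt template, and 0.7 threshold are illustrative assumptions, not artifacts released with the paper.

# Hedged sketch: score the semantic validity of an adversarial example with a
# fine-tuned open-source LLM, then use that score as an acceptance filter.
# The checkpoint name, prompt wording, and threshold are illustrative
# assumptions, not the paper's released artifacts.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/validity-scorer"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def validity_score(original: str, adversarial: str) -> float:
    """Ask the LLM how well the adversarial text preserves the original's
    meaning, and parse a score in [0, 1] from its reply."""
    prompt = (
        "Rate the semantic consistency between the two sentences on a scale "
        "from 0 (different meaning) to 1 (same meaning).\n"
        f"Original: {original}\n"
        f"Adversarial: {adversarial}\n"
        "Score:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=8)
    reply = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    match = re.search(r"\d*\.?\d+", reply)
    return min(max(float(match.group()), 0.0), 1.0) if match else 0.0

def accept(original: str, candidate: str, threshold: float = 0.7) -> bool:
    """Keep a candidate adversarial example only if it stays semantically valid."""
    return validity_score(original, candidate) >= threshold

In an attack loop, a word-substitution candidate that flips the victim model's prediction would only be kept if accept(original, candidate) holds, mirroring the paper's idea of using the validity score to guide existing attacks toward valid adversarial examples.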
Anthology ID:
2024.findings-acl.292
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4902–4922
URL:
https://aclanthology.org/2024.findings-acl.292
Cite (ACL):
Huichi Zhou, Zhaoyang Wang, Hongtao Wang, Dongping Chen, Wenhan Mu, and Fangyuan Zhang. 2024. Evaluating the Validity of Word-level Adversarial Attacks with Large Language Models. In Findings of the Association for Computational Linguistics ACL 2024, pages 4902–4922, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Evaluating the Validity of Word-level Adversarial Attacks with Large Language Models (Zhou et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.292.pdf