DA3: A Distribution-Aware Adversarial Attack against Language Models

Yibo Wang, Xiangjue Dong, James Caverlee, Philip Yu


Abstract
Language models can be manipulated by adversarial attacks, which introduce subtle perturbations to input data. While recent attack methods can achieve a relatively high attack success rate (ASR), we observe that the adversarial examples they generate follow a different data distribution from the original examples. Specifically, these adversarial examples exhibit lower confidence levels and greater divergence from the training data distribution. Consequently, they are easy to detect with straightforward detection methods, which diminishes the efficacy of such attacks. To address this issue, we propose a Distribution-Aware Adversarial Attack (DA3) method. DA3 accounts for the distribution shifts of adversarial examples to improve attack effectiveness under detection methods. We further design a novel evaluation metric, the Non-detectable Attack Success Rate (NASR), which integrates both ASR and detectability for the attack task. We conduct experiments on four widely used datasets to validate the attack effectiveness and transferability of adversarial examples generated by DA3 against both the white-box BERT-base and RoBERTa-base models and the black-box LLaMA2-7b model.
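The abstract does not give a formula for NASR, so the following is only a minimal sketch of one plausible reading: an attack counts toward NASR when it both fools the victim model and is not flagged by a detector. The detector here is assumed to be a simple confidence threshold (flag an example when the victim's maximum softmax probability falls below it), in line with the observation that adversarial examples show lower confidence. The names AttackRecord, attack_success_rate, and non_detectable_success_rate are hypothetical, and the paper's actual definition may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AttackRecord:
    """Outcome of attacking one input (hypothetical structure for illustration)."""
    fooled_model: bool     # True if the adversarial example changed the prediction
    max_confidence: float  # victim's maximum softmax probability on that example


def attack_success_rate(records: Sequence[AttackRecord]) -> float:
    """Plain ASR: fraction of attacked inputs whose prediction was flipped."""
    if not records:
        return 0.0
    return sum(r.fooled_model for r in records) / len(records)


def non_detectable_success_rate(records: Sequence[AttackRecord],
                                threshold: float) -> float:
    """Assumed NASR reading: successful attacks that also evade the detector.

    Assumption: the detector flags an example when max_confidence < threshold.
    """
    if not records:
        return 0.0
    undetected_hits = sum(
        r.fooled_model and r.max_confidence >= threshold for r in records
    )
    return undetected_hits / len(records)


# Toy data, invented purely for illustration.
records = [
    AttackRecord(fooled_model=True, max_confidence=0.95),
    AttackRecord(fooled_model=True, max_confidence=0.55),   # succeeds but is flagged
    AttackRecord(fooled_model=False, max_confidence=0.99),  # attack failed
    AttackRecord(fooled_model=True, max_confidence=0.80),
]
print(attack_success_rate(records))
print(non_detectable_success_rate(records, threshold=0.7))
```

Under this reading NASR can never exceed ASR, and the gap between the two shows how many successful attacks a detector would catch.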
Anthology ID:
2024.emnlp-main.107
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1808–1825
URL:
https://aclanthology.org/2024.emnlp-main.107
Cite (ACL):
Yibo Wang, Xiangjue Dong, James Caverlee, and Philip Yu. 2024. DA3: A Distribution-Aware Adversarial Attack against Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1808–1825, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
DA3: A Distribution-Aware Adversarial Attack against Language Models (Wang et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.107.pdf
Software:
 2024.emnlp-main.107.software.zip