Gradient-based Adversarial Attacks against Text Transformers

Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, Douwe Kiela


Abstract
We propose the first general-purpose gradient-based adversarial attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix, hence enabling gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art attack performance on a variety of natural language tasks, outperforming prior work in terms of adversarial success rate with matching imperceptibility as per automated and human evaluation. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods, while only requiring hard-label outputs.
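The central idea of the abstract, optimizing a continuous matrix that parameterizes a distribution over token sequences and then sampling discrete candidates from it, can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the authors' implementation: the real attack differentiates an adversarial loss through a transformer, whereas here a hypothetical per-position cost table `cost` stands in for that loss so the gradient has a closed form. The function `hard_label_target` is likewise a hypothetical black-box model that returns only a label.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of the parameter matrix."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def expected_loss(theta, cost):
    """Expected toy loss when token i is drawn from softmax(theta[i])."""
    return sum(
        p * c
        for row, cost_row in zip(theta, cost)
        for p, c in zip(softmax(row), cost_row)
    )


def gradient_step(theta, cost, lr=0.5):
    """One gradient-descent step on the expected loss with respect to theta."""
    new_theta = []
    for row, cost_row in zip(theta, cost):
        probs = softmax(row)
        mean_cost = sum(p * c for p, c in zip(probs, cost_row))
        # For L = sum_v p_v * c_v with p = softmax(theta):
        # dL/dtheta_j = p_j * (c_j - mean_cost)
        new_theta.append(
            [t - lr * p * (c - mean_cost) for t, p, c in zip(row, probs, cost_row)]
        )
    return new_theta


def optimize(cost, steps=200, lr=0.5):
    """Start from a uniform distribution (all zeros) and descend on theta."""
    theta = [[0.0] * len(row) for row in cost]
    for _ in range(steps):
        theta = gradient_step(theta, cost, lr)
    return theta


def sample_tokens(theta, rng):
    """Draw one discrete token sequence from the learned distribution."""
    return [rng.choices(range(len(row)), weights=softmax(row))[0] for row in theta]


def hard_label_target(tokens):
    """Hypothetical black-box model: exposes a hard label only, no scores."""
    return 1 if sum(tokens) >= 6 else 0


def transfer_attack(theta, target, original_label, rng, max_queries=20):
    """Query the target with sampled sequences until its label changes."""
    for _ in range(max_queries):
        tokens = sample_tokens(theta, rng)
        if target(tokens) != original_label:
            return tokens
    return None


if __name__ == "__main__":
    # Assumed toy setup: 3 positions, vocabulary of 4 token ids;
    # a lower cost means the token is "more adversarial" at that position.
    cost = [[3.0, 2.0, 1.0, 0.0] for _ in range(3)]
    theta = optimize(cost)
    print("expected loss after optimization:", expected_loss(theta, cost))
    print("transferred sample:", transfer_attack(theta, hard_label_target, 0, random.Random(0)))
```

Because the optimized object is a distribution rather than a single sequence, the same `theta` can be sampled repeatedly, which is what makes a query-based transfer attack against a hard-label model possible in this sketch.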
Anthology ID:
2021.emnlp-main.464
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5747–5757
URL:
https://aclanthology.org/2021.emnlp-main.464
DOI:
10.18653/v1/2021.emnlp-main.464
Cite (ACL):
Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, and Douwe Kiela. 2021. Gradient-based Adversarial Attacks against Text Transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5747–5757, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Gradient-based Adversarial Attacks against Text Transformers (Guo et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.464.pdf
Video:
https://aclanthology.org/2021.emnlp-main.464.mp4
Code:
facebookresearch/text-adversarial-attack
Data:
AG News, IMDb Movie Reviews, MultiNLI