Layerwise universal adversarial attack on NLP models

Olga Tsymboi, Danil Malaev, Andrei Petrovskii, Ivan Oseledets


Abstract
In this work, we examine the vulnerability of language models to universal adversarial triggers (UATs). We propose a new white-box approach to the construction of layerwise UATs (LUATs), which searches for triggers by perturbing hidden layers of a network. Using three transformer models and three datasets from the GLUE benchmark, we demonstrate that our method provides better transferability in the model-to-model setting, with an average gain of 9.3% in the fooling rate over the baseline. Moreover, we investigate trigger transferability in the task-to-task setting. Using small subsets of datasets similar to the target tasks to choose the perturbed layer, we show that LUATs are more efficient than vanilla UATs by 7.1% in the fooling rate.
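The sketch below illustrates the general layerwise-trigger idea described in the abstract; it is not the authors' algorithm. All specifics are assumptions for illustration: the backbone (bert-base-uncased), the perturbed layer index, the simplified objective (maximize the deviation of the chosen layer's [CLS] activations from their clean values), and the HotFlip-style first-order token update.

```python
# Hypothetical sketch of a layerwise universal trigger search (assumed details, not
# the authors' exact algorithm): trigger tokens are scored HotFlip-style by how far
# they push a chosen hidden layer away from its activations on clean inputs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"    # assumed backbone for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, output_hidden_states=True
)
model.eval()
for p in model.parameters():        # only the trigger embeddings need gradients
    p.requires_grad_(False)

layer = 6                           # hidden layer to perturb (illustrative choice)
trigger_len = 3
trigger_ids = torch.full((trigger_len,), tok.mask_token_id, dtype=torch.long)

texts = ["a gripping and well acted film", "the plot is dull and predictable"]
batch = tok(texts, return_tensors="pt", padding=True)
emb_matrix = model.get_input_embeddings().weight            # (vocab, dim)

# Hidden states of the chosen layer on clean inputs ([CLS] position only).
with torch.no_grad():
    clean = model(**batch).hidden_states[layer][:, 0]       # (batch, dim)

for step in range(10):
    # Prepend the trigger right after [CLS] and extend the attention mask.
    trig = trigger_ids.unsqueeze(0).expand(batch["input_ids"].size(0), -1)
    input_ids = torch.cat(
        [batch["input_ids"][:, :1], trig, batch["input_ids"][:, 1:]], dim=1
    )
    attn = torch.cat(
        [batch["attention_mask"][:, :1], torch.ones_like(trig), batch["attention_mask"][:, 1:]],
        dim=1,
    )

    inputs_embeds = emb_matrix[input_ids].clone().detach().requires_grad_(True)
    out = model(inputs_embeds=inputs_embeds, attention_mask=attn)
    hidden = out.hidden_states[layer][:, 0]                  # perturbed [CLS] states

    # Maximize the layerwise perturbation (simplified objective for this sketch).
    loss = -(hidden - clean).pow(2).sum()
    loss.backward()

    # First-order (HotFlip-style) scores: for each trigger slot, pick the vocabulary
    # token whose embedding most decreases the loss. A real attack would also filter
    # special tokens and re-evaluate candidates exactly.
    grad = inputs_embeds.grad[:, 1:1 + trigger_len].mean(0)  # (trigger_len, dim)
    scores = -grad @ emb_matrix.T                            # (trigger_len, vocab)
    trigger_ids = scores.argmax(dim=-1)

print("trigger:", tok.convert_ids_to_tokens(trigger_ids))
```

The paper itself additionally studies which layer to perturb and reports transferability across models and tasks; this sketch only conveys the core loop of optimizing a universal trigger against a hidden-layer objective.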
Anthology ID:
2023.findings-acl.10
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
129–143
URL:
https://aclanthology.org/2023.findings-acl.10
DOI:
10.18653/v1/2023.findings-acl.10
Cite (ACL):
Olga Tsymboi, Danil Malaev, Andrei Petrovskii, and Ivan Oseledets. 2023. Layerwise universal adversarial attack on NLP models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 129–143, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Layerwise universal adversarial attack on NLP models (Tsymboi et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.10.pdf
Video:
https://aclanthology.org/2023.findings-acl.10.mp4