Training NLI Models Through Universal Adversarial Attack

Lin Jieyu, Liu Wei, Zou Jiajie, Ding Nai


Abstract
Pre-trained language models are sensitive to adversarial attacks, and recent work has demonstrated universal adversarial attacks that apply input-agnostic perturbations to mislead models. Here, we demonstrate that universal adversarial attacks can also be used to harden NLP models. Based on the NLI task, we propose a simple universal adversarial attack that can mislead models into producing the same output for all premises by replacing the original hypothesis with an irrelevant string of words. To defend against this attack, we propose Training with UNiversal Adversarial Samples (TUNAS), which iteratively generates universal adversarial samples and uses them for fine-tuning. The method is tested on two datasets, MNLI and SNLI. We show that TUNAS reduces the mean success rate of the universal adversarial attack from above 79% to below 5%, while maintaining similar performance on the original datasets. Furthermore, TUNAS models are also more robust to attacks targeting individual samples: when searching for hypotheses that are best entailed by a premise, the hypotheses found by TUNAS models are more compatible with the premise than those found by baseline models. In sum, we use universal adversarial attacks to yield more robust models.
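The abstract sketches two procedures: a universal attack that searches for a single hypothesis string driving the model to one label for every premise, and the TUNAS defense, which alternates between running that attack and fine-tuning on the resulting adversarial samples. Below is a minimal Python sketch of that loop under our own assumptions: the predict/fine_tune hooks, the random-substitution search, the choice to relabel adversarial pairs as neutral, and all parameter values are illustrative stand-ins, not the paper's implementation.

import random
from typing import Callable, List, Sequence

Label = int            # e.g., 0 = entailment, 1 = neutral, 2 = contradiction
NEUTRAL: Label = 1

def attack_success_rate(
    predict: Callable[[str, str], Label],
    premises: Sequence[str],
    hypothesis: str,
    target: Label,
) -> float:
    # Fraction of premises for which the fixed hypothesis yields the target label.
    hits = sum(predict(p, hypothesis) == target for p in premises)
    return hits / len(premises)

def find_universal_hypothesis(
    predict: Callable[[str, str], Label],
    premises: Sequence[str],
    vocab: List[str],
    target: Label,
    length: int = 5,
    iters: int = 200,
    seed: int = 0,
) -> str:
    # Greedy random-substitution search for an input-agnostic hypothesis
    # (one plausible search strategy; the paper may use a different one).
    rng = random.Random(seed)
    words = [rng.choice(vocab) for _ in range(length)]
    best = attack_success_rate(predict, premises, " ".join(words), target)
    for _ in range(iters):
        i = rng.randrange(length)
        old, words[i] = words[i], rng.choice(vocab)
        rate = attack_success_rate(predict, premises, " ".join(words), target)
        if rate >= best:
            best = rate
        else:
            words[i] = old  # revert a substitution that did not help
    return " ".join(words)

def tunas(model, premises: Sequence[str], vocab: List[str],
          rounds: int = 3, threshold: float = 0.05):
    # model.predict(premise, hypothesis) -> Label and model.fine_tune(samples)
    # are hypothetical hooks standing in for a real NLI model and trainer.
    for _ in range(rounds):
        for target in (0, 1, 2):
            hyp = find_universal_hypothesis(model.predict, premises, vocab, target)
            if attack_success_rate(model.predict, premises, hyp, target) <= threshold:
                continue  # this target label is already hard to attack
            # Relabel: we treat an irrelevant hypothesis as neutral (our assumption).
            model.fine_tune([(p, hyp, NEUTRAL) for p in premises])
    return model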
Anthology ID:
2023.ccl-1.72
Volume:
Proceedings of the 22nd Chinese National Conference on Computational Linguistics
Month:
August
Year:
2023
Address:
Harbin, China
Editors:
Maosong Sun, Bing Qin, Xipeng Qiu, Jing Jiang, Xianpei Han
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
847–861
Language:
English
URL:
https://aclanthology.org/2023.ccl-1.72
Cite (ACL):
Lin Jieyu, Liu Wei, Zou Jiajie, and Ding Nai. 2023. Training NLI Models Through Universal Adversarial Attack. In Proceedings of the 22nd Chinese National Conference on Computational Linguistics, pages 847–861, Harbin, China. Chinese Information Processing Society of China.
Cite (Informal):
Training NLI Models Through Universal Adversarial Attack (Lin et al., CCL 2023)
PDF:
https://aclanthology.org/2023.ccl-1.72.pdf