Ding Nai


2023

Training NLI Models Through Universal Adversarial Attack
Lin Jieyu | Liu Wei | Zou Jiajie | Ding Nai
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

“Pre-trained language models are sensitive to adversarial attacks, and recent works have demonstrated universal adversarial attacks that can apply input-agnostic perturbations to mislead models. Here, we demonstrate that universal adversarial attacks can also be used to harden NLP models. Based on the NLI task, we propose a simple universal adversarial attack that can mislead models to produce the same output for all premises by replacing the original hypothesis with an irrelevant string of words. To defend against this attack, we propose Training with UNiversal Adversarial Samples (TUNAS), which iteratively generates universal adversarial samples and utilizes them for fine-tuning. The method is tested on two datasets, i.e., MNLI and SNLI. It is demonstrated that TUNAS can reduce the mean success rate of the universal adversarial attack from above 79% to below 5%, while maintaining similar performance on the original datasets. Furthermore, TUNAS models are also more robust to attacks targeting individual samples: when searching for hypotheses that are best entailed by a premise, the hypotheses found by TUNAS models are more compatible with the premise than those found by baseline models. In sum, we use universal adversarial attacks to yield more robust models.”