Layerwise universal adversarial attack on NLP models

In this work, we examine the vulnerability of language models to universal adversarial triggers (UATs). We propose a new white-box approach to the construction of layerwise UATs (LUATs), which searches for triggers by perturbing hidden layers of a network. Using three transformer models and three datasets from the GLUE benchmark, we demonstrate that our method provides better transferability in the model-to-model setting, with an average gain of 9.3% in the fooling rate over the baseline. Moreover, we investigate trigger transferability in the task-to-task setting. Using small subsets of datasets similar to the target tasks for choosing a perturbed layer, we show that LUATs are more efficient than vanilla UATs by 7.1% in the fooling rate.


Introduction
One of the fundamental drawbacks of modern neural networks is their vulnerability to adversarial attacks (Szegedy et al., 2013; Goodfellow et al., 2014): imperceptible perturbations to data samples that leave the ground truth label unchanged but are able to modify the model prediction drastically. The samples obtained as the result of these perturbations are called adversarial examples. First discovered for image datasets (Szegedy et al., 2013), this phenomenon was then demonstrated for other types of data, including natural language (Papernot et al., 2016; Liang et al., 2017; Gao et al., 2018).
The originally proposed methods of adversarial attack construction were sample-dependent, which means that one cannot apply the same perturbation to different dataset items and expect equal success. Sample-agnostic, or in other words universal, adversarial perturbations (UAPs) were proposed by Moosavi-Dezfooli et al. (2017), who, based on a small subset of image data, constructed perturbations leading to a prediction change for 80-90% of samples (depending on the model).
They also showed that UAPs discovered on one model could successfully fool another.
The generalization of universal attacks to natural language data was made by Wallace et al. (2019). Short insertions (triggers) were prepended to data samples, and a search over the token space was performed in order to maximize the probability of the negative class on the chosen data subset. The found triggers turned out to be very efficient at fooling the model, repeating the success of UAPs proposed for images.
The conventional way to look for adversarial examples is to perturb the output of a model. Considering image classification neural networks, Khrulkov and Oseledets (2018) proposed to search for perturbations to hidden layers by approximating the so-called (p, q)-singular vectors (Boyd, 1974) of the corresponding Jacobian matrix. One can then hope that the error will propagate through the rest of the network, resulting in a model prediction change. They showed that this approach allows obtaining a high fooling rate based on significantly smaller data subsets than those leveraged by Moosavi-Dezfooli et al. (2017).
In this paper, we aim to continue the investigation of neural networks' vulnerability to universal adversarial attacks in the case of natural language data. Inspired by the approach of Khrulkov and Oseledets (2018), we look for perturbations to hidden layers of a model rather than perturbing the loss function. In order to avoid projection from embedding space to discrete space, we use a simplex parametrization of the search space (see, e.g., (Dong et al., 2021; Guo et al., 2021)). We formulate the corresponding optimization problem and propose an algorithm for obtaining its approximate solution. Using three transformer models, BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2020), and three GLUE datasets (Wang et al., 2018), we demonstrate the higher efficiency of our method over the original approach of Wallace et al. (2019) in the setting of model-to-model and task-to-task transfer. We also show that in the case of direct attack application, our method demonstrates results on par with the baseline, where perturbation of the loss function was realized. We hope that this technique will serve as a useful tool for discovering flaws and deepening our understanding of language neural networks.

Framework
Let f : X → Y be a text classification model defined for a dataset D = {(x_i, y_i)}_{i=1}^N, where x_i is an input and y_i is a corresponding label. Our goal is to find a small perturbation to the original input sequences that leads to a change in the model prediction for the maximum possible number of samples. Such a perturbation can be modelled as the insertion of a trigger, a token sequence t of small length L, into a particular part of an input. Following Wallace et al. (2019), we call such text perturbations universal adversarial triggers (UATs). In this paper, we focus on triggers concatenated to the front of a sentence and denote a corrupted sample as x̂ = t ⊕ x. It is also important to note that, in contrast to Wallace et al. (2019), we consider the unsupervised scenario, when an attacker does not have access to the labels. For evaluation, we follow Moosavi-Dezfooli et al. (2017) and use the fooling rate (FR):

\mathrm{FR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ f(t \oplus x_i) \neq f(x_i) \right].    (1)

Universal adversarial triggers. Before we dive into the description of our approach, it is worth recalling the original method of UATs. Restricting themselves to the white-box setting, Wallace et al. (2019) showed that an efficient way to find UATs is through the optimization of the first-order approximation of the expected loss:

t_{n+1} = \arg\max_{t \in V^L} \; \mathbb{E}_{x \sim \mu} \left\langle \nabla_{t_n} \mathcal{L}(t_n \oplus x),\; t - t_n \right\rangle,    (2)

where μ is a distribution of input data, t_n denotes the trigger found after the n-th iteration, V stands for the token vocabulary, and the initial trigger t_0 can be chosen as the L-fold repetition of the word "the". For simplicity, here (and below in similar situations), by t we mean its embedding if the gradient is taken with respect to it or if it appears as a term in a scalar product.
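As a plain illustration of the fooling rate metric, the following sketch counts the fraction of samples whose prediction flips once a trigger is prepended. The classifier here is a toy stand-in, and all names are hypothetical:

```python
def fooling_rate(predict, samples, trigger):
    """Fraction of samples whose predicted label changes after the
    trigger is concatenated to the front of the input."""
    flipped = sum(predict(trigger + " " + x) != predict(x) for x in samples)
    return flipped / len(samples)

# toy stand-in "model": positive iff the first word is "good"
predict = lambda text: int(text.split()[0] == "good")
samples = ["good movie", "good acting", "bad plot"]
print(fooling_rate(predict, samples, "bad"))  # flips the two positive samples -> 0.666...
```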
In order to find the optimal perturbation within each iteration, the expectation in (2) is relaxed with the average over a batch B, and maximization is performed for each trigger token independently:

t_{n+1}^{j} = \arg\max_{v \in V} \; \frac{1}{|B|} \sum_{x \in B} \left\langle \nabla_{t_n^j} \mathcal{L}(t_n \oplus x),\; v - t_n^j \right\rangle,    (3)

where j is the token index. After solving (3), the next trigger t_{n+1} is selected via beam search over all the token positions.
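The per-token maximization in (3) can be sketched as a single matrix product: the batch-averaged gradient at each trigger position is scored against every vocabulary embedding, and the top-k tokens per position are kept as beam-search candidates. Random matrices stand in here for real embeddings and gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, trig_len, k = 1000, 16, 3, 10
E = rng.standard_normal((vocab_size, dim))   # vocabulary embedding matrix (stand-in)
t = E[:trig_len]                             # current trigger embeddings t_n
g = rng.standard_normal((trig_len, dim))     # batch-averaged gradients w.r.t. trigger tokens

# first-order gain of swapping trigger token j for vocab entry v: <g_j, e_v - t_j>
scores = g @ E.T - np.sum(g * t, axis=1, keepdims=True)   # shape (trig_len, vocab_size)
candidates = np.argsort(-scores, axis=1)[:, :k]           # top-k per position for beam search
print(candidates.shape)  # (3, 10)
```

A brute-force loop over all L·|V| swaps would give the same ranking; the matrix form just computes it in one pass.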
Our approach: layerwise universal adversarial triggers (LUATs). In (3), we face an optimization problem over a discrete set of tokens from the vocabulary V. To overcome this issue, let us relax (3) using a probability simplex model for every token. In this case, a trigger can be represented as t = WV, where V is the vocabulary matrix and W_{mn} is the probability of the n-th vocabulary token being selected at position m. Then (3) can be rewritten as follows:

\max_{W \in S} \; \frac{1}{|B|} \sum_{x \in B} \left\langle \nabla_{t_n} \mathcal{L}(t_n \oplus x),\; WV - t_n \right\rangle,    (4)

where S = {W | W\mathbf{1} = \mathbf{1}, W ≥ 0} and W ≥ 0 denotes an element-wise inequality. In this formulation, the search for the solution is done by optimizing the weights W over the simplex S.
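A minimal sketch of the simplex parametrization t = WV (all names hypothetical): each row of W is a distribution over the vocabulary, so the relaxed trigger embedding is a convex combination of vocabulary embeddings, and a discrete trigger is recovered by snapping each row to a vertex of the simplex:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim, trig_len = 500, 8, 4
Vmat = rng.standard_normal((vocab_size, dim))   # vocabulary matrix V (stand-in)

W = np.full((trig_len, vocab_size), 1.0 / vocab_size)  # uniform init on the simplex
assert np.allclose(W.sum(axis=1), 1.0) and (W >= 0).all()

t_relaxed = W @ Vmat            # relaxed trigger embeddings, shape (trig_len, dim)
tokens = W.argmax(axis=1)       # hard trigger: one vocabulary index per position
```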
Our approach can be seen as an extension of (4) to the perturbation of hidden layers. It is inspired by Khrulkov and Oseledets (2018), who applied this idea to find UAPs for fooling image classification neural networks. Given a layer l, the optimization objective in this case can be similarly obtained via the Taylor expansion:

F_l(W) = \frac{1}{|B|} \sum_{x \in B} \left\| J_l(\hat{x}_n) \left( WV - t_n \right) \right\|_q^q,    (5)

where \hat{x}_n = t_n ⊕ x, q is a hyperparameter to be fine-tuned and J_l(x) is the Jacobian operator corresponding to a layer l:

J_l(x) = \frac{\partial\, l(x)}{\partial t}.    (6)

Bringing (4) and (5) together, we obtain

\max_{W \in S} F_l(W).    (7)

In contrast to (3), finding the optimal solution of (7) is computationally infeasible. Indeed, problem (3) allows a brute-force approach for finding the optimal token for each trigger position, since it requires computing the gradient only once per iteration. In our case, on the other hand, a brute-force computation is very cumbersome: for each iteration, it would require computing the Jacobian action for every batch, token candidate and position, resulting in O(LB|V|) forward-backward passes, where B is the number of batches in an iteration. Luckily, F_l(W) is convex and can be lower bounded by a tangent line, where the gradient is calculated as follows:

\nabla F_l(W) = \frac{1}{|B|} \sum_{x \in B} J_l(\hat{x}_n)^\top \psi_q\!\left( J_l(\hat{x}_n)(WV - t_n) \right) V^\top,    (8)

where \psi_q(x) = \mathrm{sign}(x)\,|x|^{q-1} is applied elementwise. Therefore, our task is reduced to finding the solution of a linear problem with a simplicial constraint:

\max_{W \in S} \left\langle \nabla F_l(W_*),\, W \right\rangle,    (9)

where \nabla F_l(W) is given by (8) and W_* denotes the point at which we perform the linear approximation. The final problem (9) has a closed-form solution (see Appendix A for more details), and, as a result, we have reduced the number of forward-backward computations to O(B).
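One iteration of the resulting update can be sketched as follows, with a random matrix standing in for the gradient of (8): ψ_q is applied elementwise, and the linear problem (9) is solved in closed form by putting all the mass of each row of W on its largest gradient entry:

```python
import numpy as np

def psi(x, q):
    """psi_q(x) = sign(x) * |x|^(q-1), applied elementwise."""
    return np.sign(x) * np.abs(x) ** (q - 1)

def simplex_linear_argmax(C):
    """Closed-form solution of max_{W in S} <C, W>: each row of W is a
    one-hot vector selecting the largest entry of the same row of C."""
    W = np.zeros_like(C)
    W[np.arange(C.shape[0]), C.argmax(axis=1)] = 1.0
    return W

rng = np.random.default_rng(2)
grad = rng.standard_normal((3, 7))   # stand-in for grad F_l(W*): 3 positions, 7 tokens
W_next = simplex_linear_argmax(grad)

# the vertex solution dominates any other feasible point of the simplex S
W_rand = rng.dirichlet(np.ones(7), size=3)
assert np.sum(grad * W_next) >= np.sum(grad * W_rand)
```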
Concerning the initialization of W * , we take the uniform distribution over all the vocabulary tokens for each token position in a trigger. Within each iteration, we perform only one step with respect to W in order to reduce computation time and observe that it is sufficient for breaking the models efficiently.
Finally, since the weight matrix W found after a given iteration has, in the worst case, only one non-zero element per row (see Appendix A), we can get stuck in a local maximum unless we guess a proper initialization. Therefore, similarly to Wallace et al. (2019), we perform a beam search over the top-k candidates. To realize it, it is necessary to define a ranking criterion for choosing the best option at each search step; for this purpose, we use the FR. The overall algorithm is presented in Algorithm 1.

Experiments
In this section, we present a numerical study of the proposed layerwise adversarial attack framework on text classification datasets. The code is publicly available on GitHub.

Setup
Datasets. As in the work of Wang et al. (2021), we consider only a subset of tasks from the GLUE benchmark. In particular, we use SST-2 for sentiment classification; MNLI (matched), QNLI and RTE for natural language inference; and MRPC for paraphrase identification. We exclude CoLA, whose task is to determine whether the input is grammatically correct, since universal triggers are highly likely to change most of the ground-truth positive labels to negative. Finally, WNLI is not considered because it contains too few examples. We conduct our experiments using the validation set for attack fitting and the test set for evaluation.
Models. We focus on three transformer models: BERT base, RoBERTa base and ALBERT base, using the pre-trained weights from the TextAttack project (Morris et al., 2020) for most of the model-dataset cases. For some of them, the performance was unsatisfactory (all the models on MNLI, ALBERT on QNLI, RoBERTa on SST-2 and RTE), and we fine-tuned them on the corresponding training sets (Mosin et al., 2023). To train our attack, we use the existing GLUE dataset splits. The detailed statistics are presented in Tab. 1.

Hyperparameters. In our experiments, we investigate the attack performance depending on the dataset, model, layer l (from 0 to 11), trigger length L ∈ {1, 2, 3, 4, 5, 6} and q ∈ {2, 3, 4, 5, 7, 10}.
For each dataset, model and trigger length, we performed a grid search over l and q. The other parameters, such as top-k and the beam size, remain fixed to the values obtained from the corresponding ablation study. In all experiments, we use a batch size of 128. Finally, we define the initialization of W as the uniform distribution over the vocabulary tokens for each position in a trigger (see Algorithm 1).
Token filtration and resegmentation. Transformers' vocabularies contain items such as symbols, special tokens and unused words, which are easily detected during inference. To increase the triggers' imperceptibility, we exclude them from the vocabulary matrix during optimization, leaving only tokens that contain English letters or numbers.
Another problem appears because many tokens do not correspond to complete words but rather to pieces of words (sub-words). As a result, if a found token corresponds to a sub-word, one encounters retokenization: after converting the found trigger to a string and back, the set of tokens can change. Moreover, one sometimes has to deal with symbols such as "##" appearing in a trigger. In this case, we drop all the extra symbols and perform the retokenization. Luckily, this does not result in severe performance degradation. When, due to the resegmentation, the length of a trigger changes, we report the result for the length for which the attack training was performed. As an alternative to direct resegmentation, we tried to transform triggers by passing them through an MLM model, but this approach led to a more significant drop in performance.

Main Results
Comparison with the baseline. We compare LUATs with the untargeted UATs of Wallace et al. (2019). In order to stay in the unsupervised setting, we modify their approach by replacing the ground-truth labels in the cross-entropy loss function with the class probabilities. As a result, we search for a trigger that maximizes the distance between the model output distributions before and after the perturbation. In addition, as the criterion for choosing the best alternative in the beam search, we use the FR for both methods.
We perform an ablation study to estimate the dependence of the FR on top-k and the beam size. For beam size 1, we measure both attacks' performance for values of top-k from 1 to 40. Then, for the best top-k, we evaluate beam sizes from 1 to 5. We perform this study on the QNLI dataset. The results are presented in Fig. 5. We stick to top-k 10 and beam size 1 as a trade-off between high performance and low computational complexity.
The grid search results are presented in Tab. 2, where for each model, dataset and trigger length we show the best results of both approaches. We performed the computation on four NVIDIA A100 80GB GPUs. To account for the cardinality of the hyperparameter search space (72 times more runs due to the different values of q and l), we report the aggregated comparison in Fig. 2. It is interesting to note that in some cases shorter triggers appear to be better than longer ones (see Appendix B, Tab. 10). That is why we present, for each length L, the best performance over all lengths less than or equal to L. From Tab. 2, one can see that our method demonstrates results on par with Wallace et al. (2019). In Fig. 3, we present the dependence of the FR on the trigger length L. The attack performance saturates as the length approaches 5 or 6, meaning that considering longer triggers would hardly bring any performance gain.
Wallace et al. (2019) suggested that the efficiency of universal adversarial triggers can be explained by the existence of dataset biases, such as fallacious correlations between certain words and classes, and supported this conjecture by computing the pointwise mutual information (PMI) of vocabulary tokens and dataset classes:

\mathrm{PMI}(\text{token}, \text{class}) = \log \frac{p(\text{token}, \text{class})}{p(\text{token})\, p(\text{class})}.
As a result, they reported a high PMI rank for their trigger tokens. In order to verify whether LUATs follow the same pattern, we perform this computation for our best triggers. We sample the 5 best candidates obtained during the grid search for each trigger length and compute the PMI rank with add-100 smoothing for their tokens. The results on QNLI (see Tab. 3) demonstrate that, similarly to UATs, our trigger tokens have high PMI ranks.
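The smoothed PMI computation can be sketched as follows; the exact smoothing variant used in the paper may differ, so this is one common add-alpha scheme on the joint counts, with hypothetical helper names:

```python
import math
from collections import Counter

def pmi(token, cls, pairs, alpha=100):
    """PMI(token, class) = log p(token, class) / (p(token) p(class)),
    with add-alpha smoothing of the joint (token, class) counts."""
    joint = Counter(pairs)
    tok_c = Counter(t for t, _ in pairs)
    cls_c = Counter(c for _, c in pairs)
    n_tok, n_cls = len(tok_c), len(cls_c)
    n = len(pairs) + alpha * n_tok * n_cls
    p_joint = (joint[(token, cls)] + alpha) / n
    p_tok = (tok_c[token] + alpha * n_cls) / n
    p_cls = (cls_c[cls] + alpha * n_tok) / n
    return math.log(p_joint / (p_tok * p_cls))

# a token skewed towards class 0 gets a higher PMI with it than a neutral one
pairs = [("awful", 0)] * 8 + [("awful", 1)] * 2 + [("the", 0)] * 5 + [("the", 1)] * 5
print(pmi("awful", 0, pairs, alpha=1) > pmi("the", 0, pairs, alpha=1))  # True
```

Heavier smoothing (larger alpha) pulls all PMI values toward zero, which is why the ranking rather than the raw value is reported.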
Dependence on q and the layer. For the investigation of the dependence on q and the layer, we restrict ourselves to the datasets which appeared to be the most vulnerable in our experiments: MRPC, QNLI and SST-2. The results are presented in Fig. 4, where we average over the trigger lengths. One can conclude that, in general, it is more efficient to attack the higher layers. This observation can be interpreted via the idea mentioned above, namely that the efficiency of the triggers is caused by the existence of dataset biases. Indeed, as Merchant et al. (2020) demonstrated, fine-tuning for a downstream task primarily affects the higher layers of a model. Therefore, the bias acquired during fine-tuning should be accumulated in the higher layers. Since we try to fool a model with respect to a downstream task, the appearance of the higher layers among the most successful ones is more probable. Finally, it is interesting to note that larger values of q also demonstrate better results, which is in accordance with the findings of Khrulkov and Oseledets (2018).
Transferability: model-to-model. It has been demonstrated that universal triggers can transfer between different models trained on the same task (Wallace et al., 2019). Here, we compare their approach and ours with respect to this property. Similarly to the above consideration, the computations are carried out on the MRPC, QNLI and SST-2 datasets, with the additional restriction to trigger length L = 3 for simplicity. In this setting, we suppose that an attacker has access to the data and also to the input and output of an attacked model. When transferring the UATs, the best trigger obtained with a source model is taken and applied directly to a target model on a test set. For LUATs, however, one can come up with a better strategy. Since a target model differs from a source one, the values of q and l that are best for the source are not necessarily the best pick for the transfer. Therefore, since an attacker has access to the forward pass of a target model, we use it to evaluate the fooling rate of the best triggers found for each value of q and l on a validation set, and then apply the best option to a test set to get the final score. For simplicity, we fix q = 10, which, according to Fig. 4, is the best alternative on average. The final results are presented in Tab. 4. One can see that in most cases we outperform the UATs, with an average gain of 9.3%.
Transferability: task-to-task. The fact that universal triggers can generalize to other models trained on the same task seems natural since, in this case, it is highly probable that a source and a target model would acquire the same biases. A more complicated situation arises when one tries to transfer triggers between different tasks. In this setting, an attacker looks for more fundamental, task-independent flaws of a model, which can be explained, e.g., by a bias appearing during the pretraining phase. In order to examine this capacity of the triggers, we transfer them between different datasets for each of the considered models. For measuring the performance of the UATs, as in the previous case, the best trigger obtained on a source task is transferred. On the other hand, for the layerwise approach, under a reasonable assumption, one can still suggest a natural way to choose the trigger most appropriate for transfer among the best triggers corresponding to different values of q and l. Namely, we suppose that although attackers do not have access to the data on which a target model was trained, they know the task for which it was trained, i.e., whether it is sentiment classification, paraphrase identification, etc. (Savchenko et al., 2020). If this is the case, they can generate data corresponding to the task of interest. In order to mimic such a situation, for each of the datasets (SST-2, MRPC, QNLI), we select an auxiliary dataset collected for the same task. We sample subsets of sizes 64, 128 and 256 from the auxiliary datasets and consider them as the data generated by an attacker. Again fixing q = 10 for simplicity, we evaluate the best trigger obtained for each layer on a corresponding auxiliary subset, performing inference on a target model. Each subset is sampled five times, and we report the average score for the size of 256, which appears to be the best option.
The final results for both approaches are shown in Tab. 5. Our approach demonstrates an average improvement of 7.1% in the fooling rate over the vanilla UATs. The reason might be that if such task-independent triggers are indeed related to the pretraining phase, then in order to find them one should perturb the lower or middle layers, since the higher layers' weights can change a lot during fine-tuning. The LUATs which appear to be most successful at transferring are presented in Tab. 8 and 9 of Appendix B for model-to-model and task-to-task transfers, respectively.

Related Work
There are already quite a few works in the literature devoted to universal attacks on NLP models; we briefly discuss them here. Ribeiro et al. (2018) proposed a sample-agnostic approach to generating adversarial examples for NLP models by applying a semantics-preserving set of rules consisting of specific word substitutions. However, the found rules were not completely universal, changing the model prediction for only 1-4% of the targeted samples.
Similarly to Wallace et al. (2019), text adversarial attacks realized as short insertions into the input were proposed by Behjati et al. (2019). However, they did not perform the search over the whole vocabulary for each word position in the trigger but instead exploited cosine-similarity projected gradient descent, which turns out to be less efficient in terms of attack performance.
Adversarial triggers generated by the method proposed by Wallace et al. (2019) in general turned out to be semantically meaningless, which makes them easier to detect by defence systems. An attempt to make triggers more natural was undertaken by Song et al. (2021). Leveraging an adversarially regularized autoencoder (Zhao et al., 2018) to generate the triggers, they managed to improve their semantic meaning without significantly decreasing the attack efficiency.
Another interesting direction is to minimize the amount of data needed for finding UAPs. Singla et al. (2022) created representatives of each class by minimizing the loss function linear approximation over the text sequences of a certain size. Afterwards, adversarial triggers were appended to these class representatives and the rest of the procedure followed Wallace et al. (2019). Although no data was used explicitly for training the attack, this approach demonstrated solid performance on considered datasets.

Conclusion
We present a new layerwise framework for the construction of universal adversarial attacks on NLP models. Following Wallace et al. (2019), we look for them in the form of triggers; however, in order to find the best attack, we do not perturb the loss function but neural network intermediate layers.
We show that our approach demonstrates better performance when the triggers are transferred to different models and tasks. The latter might be related to the fact that in order to be transferred successfully between different datasets, a trigger should reflect network flaws that are task-independent. In this case, reducing the attack search to perturbation of the lower or middle layers might be more beneficial since the higher layers are highly influenced by fine-tuning. We hope this method will serve as a good tool for investigating the shortcomings of language models and improving our understanding of neural networks' intrinsic mechanisms.
We would like to conclude by discussing the potential risks. As any type of technology, machine learning methods can be used for good and evil. In particular, adversarial attacks can be used for misleading released machine learning models. Nevertheless, we think that revealing the weaknesses of modern neural networks is very important for making them more secure in the future and also for being able to make conscious decisions when deploying them.

Limitations and future work
Our approach to universal text perturbations suffers from linguistic inconsistency, which makes them easier to detect. Therefore, as the next step of our research, it would be interesting to investigate the possibility of improving the naturalness of adversarial triggers without degradation of the attack performance in terms of the fooling rate.
While the proposed approach outperforms the UATs of Wallace et al. (2019) in the transferability setting, we should highlight that the additional hyperparameter adjustment plays a crucial role, and one could suggest a refinement of the validation procedure for a fairer comparison. Also, for both the direct and transferability settings, a more comprehensive range of models should be examined, including recurrent (Yuan et al., 2021) and transformer architectures, e.g., T5 (Raffel et al., 2020), XLNet (Yang et al., 2019) and the GPT family (Radford et al., 2019; Brown et al., 2020).
Another direction of improvement is related to the fact that sometimes the found triggers can change the ground truth label of samples they are concatenated to if, e.g., they contain words contradicting the true sense of a sentence. It would be interesting to analyze how often this happens and develop an approach to tackle this issue.
Finally, it would be interesting to investigate the dependence of attack efficiency on the size of a training set and compare it with the so-called data-free approaches, such as the one proposed by Singla et al. (2022).

A Linear problem solution on S

Let us consider the following linear problem with a cost matrix C:

\max_{W \in S} \langle C, W \rangle, \qquad S = \{ W \mid W\mathbf{1} = \mathbf{1},\; W \ge 0 \}.

We need to construct the Lagrangian L and the consequent dual problem to obtain its solution:

L(W, \lambda, M) = \langle C, W \rangle + \lambda^\top (\mathbf{1} - W\mathbf{1}) + \langle M, W \rangle,

where λ and M are the Lagrange multipliers. From the KKT conditions, we have:

C - \lambda \mathbf{1}^\top + M = 0, \qquad M \cdot W = 0, \qquad M \ge 0, \qquad W \in S,

where M · W means elementwise multiplication. As a result, we obtain the following dual problem:

\min_{\lambda} \sum_i \lambda_i \quad \text{s.t.} \quad \lambda_i \ge C_{ij} \;\; \forall j.

Under the assumption that each row of the cost matrix C has a unique maximum, the closed-form solution will take the form:

\lambda_i = \max_j C_{ij},

and the corresponding primal solution

W_{ij} = \begin{cases} 1, & j = \arg\max_{j'} C_{ij'}, \\ 0, & \text{otherwise}. \end{cases}

Otherwise, if any row i of C violates the above assumption by having k > 1 maximal elements with indices {j_1, ..., j_k}, then any W supported on these indices with W_{i j_1} + \dots + W_{i j_k} = 1 and W_{i j_m} \ge 0 is an optimal solution.
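The closed-form primal and dual solutions above can be checked numerically; with a random cost matrix, the row maxima are almost surely unique, and strong duality makes the two objective values coincide:

```python
import numpy as np

rng = np.random.default_rng(3)
C = rng.standard_normal((4, 6))   # cost matrix; row maxima are a.s. unique

# primal: W_ij = 1 iff j is the argmax of row i, else 0
W = np.zeros_like(C)
W[np.arange(C.shape[0]), C.argmax(axis=1)] = 1.0

# dual: lambda_i = max_j C_ij; strong duality gives equal objectives
primal_value = np.sum(C * W)
dual_value = np.sum(C.max(axis=1))
print(np.isclose(primal_value, dual_value))  # True
```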

B Tables and plots
In this appendix, we present:
• the ablation study on the top-k and beam search parameters (see Fig. 5),
• the results of trigger analysis with PMI for the SST-2 and MRPC datasets (see Tab. 6 and 7),
• the LUATs which appear to be the best for model-to-model and task-to-task transfers (see Tab. 8 and 9),
• the cases when shorter triggers appear to be better than longer ones (see Tab. 10),
• examples of the top-20 obtained triggers.
Concerning the transfer triggers, one can see that sometimes the same triggers efficiently break different models trained on different datasets, e.g., 'WHY voted beyond' (FR = 40.4 for ALBERT trained on QNLI, FR = 45.9 for BERT trained on QNLI, FR = 40.9 for RoBERTa trained on MRPC) and 'unsuitable improper whether' (FR = 37.5 for BERT trained on SST-2, FR = 34.7 for RoBERTa trained on SST-2, FR = 22.8 for ALBERT trained on MRPC). This can serve as evidence of the high generalizability of universal adversarial triggers.