On the Transferability of Adversarial Attacks against Neural Text Classifier

Deep neural networks are vulnerable to adversarial attacks, where a small perturbation to an input alters the model prediction. In many cases, malicious inputs intentionally crafted for one model can fool another model. In this paper, we present the first study to systematically investigate the transferability of adversarial examples for text classification models and explore how various factors, including network architecture, tokenization scheme, word embedding, and model capacity, affect the transferability of adversarial examples. Based on these studies, we propose a genetic algorithm to find an ensemble of models that can be used to induce adversarial examples to fool almost all existing models. Such adversarial examples reflect the defects of the learning process and the data bias in the training set. Finally, we derive word replacement rules that can be used for model diagnostics from these adversarial examples.


Introduction
Recent studies demonstrate that deep neural networks are vulnerable to adversarial examples, which are intentionally crafted to fool the models. Although generating adversarial examples for text has proven more challenging than for images due to its discrete nature, many methods have been proposed. Most existing studies focus on developing effective algorithms for attacking a specific model. These successful attacks demonstrate the instability of model predictions. However, the vulnerability of a model may correlate with different factors, such as network architecture, tokenization scheme, word embedding type, model capacity, and spurious predictive patterns in the training data.
In this study, we aim to understand attack algorithms through the lens of the transferability of adversarial examples. We first systematically investigate which factors of neural models affect black-box transferability, i.e., how adversarial examples generated against one model can fool another (Szegedy et al., 2013). These factors include network architectures (LSTM, CNN, or Transformer), tokenization schemes (character, sub-word, or word), embedding types (GloVe, word2vec, or fastText), and model capacities (different network depths). We vary one factor at a time while fixing the others to see which factor is the most significant, and found that the tokenization scheme has the greatest influence on adversarial transferability, followed by network architecture, embedding type, and model capacity.
Based on this analysis, we study whether it is possible to craft highly-transferable text adversarial examples for many neural models by ensembling a small number of models. Such highly-transferable adversarial examples provide the following insights. First, adversaries do not need white-box access to victim models; they can launch attacks with their own models trained on similar data, since adversarial examples transfer across models (Moosavi-Dezfooli et al., 2017). Second, as stated in Wallace et al. (2019), such adversarial examples are a useful analysis tool that reveals general input-output patterns learned by models, which can be leveraged to study the influence of dataset biases and to identify the biases learned by models.
We also found that adversarial examples obtained from an ensemble model are more transferable, and we propose a genetic algorithm to find an optimal ensemble based on the empirical transferability between different models. The adversarial examples generated by attacking the found ensemble transfer strongly to other models. For some models, they even exhibit better transferability than examples generated by attacking the same model with a different random initialization.
Finally, inspired by Ribeiro et al. (2018), we generalize the adversarial examples constructed by our ensemble into semantics-preserving adversarial word replacement rules that can induce, on any text input, adversaries that strongly transfer to other neural network-based models (see Table 1). Since these rules are model-agnostic, they provide an analysis of global model behavior and help us identify dataset biases and diagnose heuristics learned by the models (see Figure 1 for an illustration of the process).

Adversarial Transferability Among Neural Models
In the following, we first investigate how network architectures, tokenization schemes, embedding types, and model capacities affect attack transferability. We conduct an empirical study that varies one factor at a time while fixing the rest to see the differences in attack transferability. Concretely, we generate adversarial examples by attacking a source model and pass the generated adversarial examples through other models for comparison.
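The comparison protocol above boils down to measuring how many source-successful adversarial examples also fool a target model. A minimal sketch, assuming the prediction lists come from whatever attack and inference code is used (the interface is hypothetical):

```python
def transfer_rate(target_preds, gold_labels):
    """Percentage of adversarial examples (already successful against the
    source model) that a target model also misclassifies."""
    assert len(target_preds) == len(gold_labels) and target_preds
    fooled = sum(p != y for p, y in zip(target_preds, gold_labels))
    return 100.0 * fooled / len(gold_labels)
```

For instance, `transfer_rate([1, 0, 1, 1], [1, 1, 1, 0])` is 50.0: two of the four transferred examples flip the target model's prediction away from the gold label.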

Experimental Design
We use convolutional neural network (CNN), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM) models as base models with 1, 2, and 4 layers (and an additional 6-layer one for CNN). Candidate substitute words are synonyms drawn from WordNet (Miller, 1995), and for any word in a text, the word to replace must have the same part-of-speech (POS) as the original one.¹ Alzantot et al. (2018) also used a language model (LM) to rule out candidate substitute words that do not fit the context. However, unlike PWWS, ruling out candidates with an LM greatly reduces the number of candidate substitute words (by 65% on average). For consistency, we report the robust accuracy under the GA attack without using an LM. Zang et al. (2020) suggested that existing textual attack algorithms can roughly be divided into two categories: greedy and population-based algorithms. PWWS and TextFooler (Jin et al., 2020) fall into the first category, while GA and PSO (Zang et al., 2020) belong to the second. We chose one attack algorithm from each category when investigating transferability among neural models, and we use TextFooler to evaluate the generalizability of the proposed method in Section 3.3.
We conducted experiments on two text classification datasets: Sentiment Movie Reviews (MR) (Pang and Lee, 2005) and the AG News corpus (AGNEWS) (Zhang et al., 2015). All models are trained on the standard training set with the cross-entropy loss. For each dataset, we attack 1,000 randomly selected test examples. To evaluate transferability to other models, we randomly choose 500 adversarial examples that successfully cause the source model to make incorrect predictions. The transferability between each possible pair of models is shown in Appendix A.2.

Significance of Various Factors
To find out which factor affects the transferability of adversarial examples the most, we vary one factor at a time while fixing all the others for each model in the pool, and compare the transferability rates between them. For example, we take a randomly initialized 2-layer word-based LSTM model, denoted "LSTM-Word-Random-2", as a target model. If we want to know the impact of network architecture, we generate 1,000 adversarial examples each by attacking BiLSTM-Word-Random-2 and CNN-Word-Random-2, and use 500 randomly selected successful attacks from each to evaluate the robustness of the target model. If we want to understand the impact of word embedding, the adversarial examples are instead crafted with the LSTM-Word-GloVe-2, LSTM-Word-word2vec-2, and LSTM-Word-fastText-2 models.

Since some models may be inherently more vulnerable than others, we evaluate a base transferability rate to remove effects that are not caused by the factors under consideration. For each target model, we train two instances with the same setting but different random initializations, and obtain its base transferability rate by generating adversarial examples against one instance and testing them on the other. This base transferability rate is subtracted from all the actual adversarial transferability rates obtained when taking the model as the test model.² We report the average of the (subtracted) adversarial transferability rates in Table 2; these rates are averaged over the configurations (all possible pairs) described in Section 2.1. Note that the smaller the values are, the closer the adversarial transferability rate is to the base (intra-model) transferability rate.

¹ We did not use the two recently proposed attack algorithms BERT-Attack (Li et al., 2020) and BAE (Garg and Ramakrishnan, 2020) because they cannot guarantee that a substitute word is always synonymous with the original word.
² For example, suppose the base transferability rate of model A is 90% and that of model B is 60%. If 70% of the adversarial examples produced using model C cause model A to make mistakes, while 50% of them are misclassified by model B, we might conclude that the transferability of C-to-A (70%) is higher than that of C-to-B (50%). This is misleading, because the gap between the C-to-B rate (50%) and B's base rate (60%) is much smaller than the gap between the C-to-A rate (70%) and A's base rate (90%). We report such (subtracted) adversarial transferability rates in Table 2 only.
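The base-rate correction described above is a simple subtraction; a minimal sketch using the numbers from the footnote:

```python
def adjusted_rate(raw_rate, base_rate):
    """Transferability rate with the target model's base (intra-model)
    rate subtracted, so inherently fragile models are not over-counted."""
    return raw_rate - base_rate

# Footnote example: A's base rate is 90%, B's is 60%.
c_to_a = adjusted_rate(70, 90)   # -20: C adds little over A's base rate
c_to_b = adjusted_rate(50, 60)   # -10: C transfers comparatively better to B
```

After the correction, C-to-B (-10) exceeds C-to-A (-20), reversing the conclusion the raw rates would suggest.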
From these rates, we find that the tokenization scheme has the greatest influence on adversarial transferability, followed by network architecture, embedding type, and model capacity, no matter which attack algorithm or dataset is used.

Intra-Factor Transferability
In the following, we drill down into each specific factor. Table 3 shows adversarial transferability among different network architectures and configurations. For example, the "BERT" architecture includes three variants: vanilla BERT, RoBERTa, and ALBERT. Each cell (i, j) in the table reports the transferability between two classes of models i and j. The value of each cell is computed as follows: for each possible pair of models (s, t), where model s belongs to class i and model t belongs to class j, we first calculate the transferability rate between models s and t, i.e., the percentage of adversarial examples produced using model s that are misclassified by model t; we then average these transferability rates over all possible pairs.
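The cell computation above can be sketched as follows, assuming `rate(s, t)` returns the empirical transfer rate from model s to model t (a hypothetical lookup):

```python
from itertools import product

def cell_value(class_i, class_j, rate):
    """Average transferability over all (source, target) pairs with the
    source in class i and the target in class j; identical models are
    skipped so intra-class cells average over distinct instances."""
    pairs = [(s, t) for s, t in product(class_i, class_j) if s != t]
    return sum(rate(s, t) for s, t in pairs) / len(pairs)
```

For example, if lstm1 transfers to bert1 at 30% and lstm2 at 50%, the (LSTM, BERT) cell averages to 40.0.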
As shown in Table 3, adversarial transferability is not symmetric, i.e., the transferability of the pair (i, j) may differ from that of the pair (j, i). As expected, intra-model transferability rates are consistently higher than inter-model ones. The adversarial examples generated using BERTs transfer slightly worse than those of other models, whereas BERTs are much more robust to adversarial examples produced using models from other classes. This is probably because BERTs were pre-trained on large-scale data and use a different tokenization scheme (i.e., sub-words). We found that models from the BERT family tend to distribute their "attention" over more words of an input text than other models, which makes it harder to change their predictions by perturbing just a few words. In contrast, other models often "focus" on certain keywords when making predictions, which makes them more vulnerable to black-box transfer attacks (see Appendix A.3).
In Table 4, we report the impact of tokenization schemes and embedding types on adversarial transferability. Each cell is obtained by the same method as the values reported in Table 3. The pre-trained models prove more robust against black-box transfer attacks, regardless of whether their word embeddings, their other parameters, or both are pre-trained on large-scale text data. Character-based models are more robust to transfer attacks than those taking words or sub-words as input, and their adversarial examples also transfer much worse than others.

Summary of Findings
Our findings on adversarial transferability among models are summarized below:
• No matter which attack algorithm or dataset is used, the tokenization scheme has the greatest impact on adversarial transferability, followed by network architecture, embedding type, and model capacity, in that order of importance.
• Adversarial transfer is not symmetric, and the transferability rates of intra-model adversarial examples are consistently higher than those of inter-model ones.
• Pre-trained neural models are more robust against black-box transfer attacks, regardless of whether their word embeddings, their other parameters, or both are pre-trained on large-scale text data.

• The adversarial examples produced by attacking BERTs transfer slightly worse than others, but BERTs are much more robust to transfer adversarial attacks. We found that BERTs tend to distribute their "attention" over more words than other models, which makes it harder to change their predictions by perturbing just a few words. A similar observation was made in (Hsieh et al., 2019).
• Character-based models are more robust to transfer attacks than those taking words or sub-words as input, but their adversarial examples also transfer much worse than others.
• Among the models from the BERT family, models pre-trained with more data are more robust against black-box transfer attacks using models pre-trained with less data, while the adversarial examples produced by attacking the former transfer slightly better than those of the latter.
We also found that adversarial examples produced using an ensemble of a small number of models are much more transferable than those produced by a single model. A small ensemble also greatly speeds up the adversarial example generation process, which is useful for test-time attacks when evaluating the robustness of local models, or for launching online attacks in a real-world simulated environment, because attacking an ensemble consisting of all possible models is time-consuming and not cost-efficient. The next two questions are how to find such a small ensemble and how to use it to craft highly-transferable adversarial examples; we answer both in Section 3. The above findings can guide us in choosing a pool of representative neural models, from which we select a small number to form an ensemble. For example, such a pool should include at least one neural network for each type of tokenization scheme, but it does not need to include many networks of different depths. The smaller the number of models in a pool, the lower the computational cost of estimating the transferability rate between any pair of them.

Highly-Transferable Examples
Next, we discuss how to find an optimal ensemble model that can be used to craft adversarial examples that strongly transfer across other models. We then distill the ensemble attack into adversarial word replacement rules that can be used to generate adversarial examples with high transferability. These rules can also help us to identify dataset biases and analyze global model behaviors.

Ensemble Method
Consider an ensemble model that outputs the prediction score for a class label by averaging the scores of its individual members; we can generate adversarial examples that fool the ensemble by applying word substitution-based perturbations to input texts. We take the average of the logits produced by all member models as the final prediction. Observing that transferability is affected by various factors, and that many factors need to be considered carefully when forming an ensemble, we propose a population-based genetic algorithm to find an optimal ensemble.
In the proposed algorithm, a candidate solution is a set of models S = (s_1, s_2, ..., s_m), where m is a pre-defined ensemble size. A fitness function evaluates each solution to decide whether it will contribute to the next generation of solutions. We define a function r(s, t, a) as the percentage of adversarial examples produced using model s that are misclassified by model t under attack algorithm a. For a solution S, the fitness function f(S), which returns a measure of the candidate's fitness that we want to maximize, is defined as:

f(S) = (1 / (|A| |T|)) Σ_{a∈A} Σ_{t∈T} (1 / |S|) Σ_{s∈S} r(s, t, a),

where T is a pool of representative models under investigation, |T| is the cardinality of T, and A is a set of attack algorithms. Let P(n) denote the population of candidate solutions at the n-th generation: P(n) = {S_1^n, S_2^n, ..., S_p^n}. The initial population P(0) is selected randomly. After evaluating each candidate with the fitness function, the algorithm takes two candidate solutions selected according to fitness, merges their sets, and then randomly selects m models from the merged set to produce a new candidate. Mutation is another important genetic operator: it takes a single candidate and randomly replaces at most one of its models with another model from T. The algorithm continues until the number of generations reaches a maximum value.
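The search described above can be sketched as follows. The pairwise rates `r(s, t, a)` are assumed to be pre-computed, and the selection pressure and mutation probability are illustrative choices, not the paper's exact hyperparameters:

```python
import random

def fitness(S, pool, attacks, r):
    """Average transferability of the candidate ensemble's members over
    all target models outside the ensemble and all attack algorithms."""
    total, n = 0.0, 0
    for a in attacks:
        for t in pool:
            if t in S:                        # skip the ensemble's own members
                continue
            total += sum(r(s, t, a) for s in S) / len(S)
            n += 1
    return total / n

def genetic_search(pool, attacks, r, m=6, pop_size=20, generations=50):
    """Evolve ensembles of size m toward maximal average transferability."""
    pop = [random.sample(pool, m) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda S: fitness(S, pool, attacks, r), reverse=True)
        parents = pop[: pop_size // 2]        # keep the fitter half
        children = []
        while len(children) < pop_size:
            p1, p2 = random.sample(parents, 2)
            merged = list(set(p1) | set(p2))  # crossover: merge the two sets
            child = random.sample(merged, m)  # keep m models
            if random.random() < 0.2:         # mutation: swap in a pool model
                rest = [x for x in pool if x not in child]
                if rest:
                    child[random.randrange(m)] = random.choice(rest)
            children.append(child)
        pop = children
    return max(pop, key=lambda S: fitness(S, pool, attacks, r))
```

Because the fitness only reads pre-computed pairwise rates, evaluating a candidate ensemble is cheap; no new attacks are run during the search.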
To evaluate the ensembles found by the population-based algorithm, we asked a senior researcher to select ensembles as a baseline. This researcher used a simple strategy: first choose the model whose adversarial examples yield the highest transferability, then gradually add complementary models that differ from those already in the ensemble in tokenization scheme, architecture, and embedding type. We list the ensembles selected by the algorithm and by the human expert in Appendix A.4. In Figure 2, we show the transferability rates of the adversarial examples produced using ensembles of various sizes on both the AGNEWS and MR datasets under two attack algorithms (PWWS and GA). The reported transferability rates are averaged over all the remaining models, excluding those used to produce the adversarial examples. We found that, in most cases, the adversarial examples produced using the ensemble found by the genetic algorithm transfer better across different models than those selected by the human expert, especially when the ensemble size is small. The ensemble method outperforms single-model transfer, and in some cases the transferability rates achieved by the ensemble method even exceed the upper red dotted line (i.e., the highest rates achievable using a single local model). When the ensemble size is greater than 6, the marginal gains in average transferability rate diminish, no matter which attack algorithm or dataset is used in our experimental setting.

Mining Word Replacement Rules
We have shown in Section 3.1 that the adversarial examples generated by an ensemble whose members are carefully selected can strongly transfer to other models. We hypothesize that if we can distill the ensemble attack into word replacement rules, the adversarial examples crafted by applying the distilled rules to perturb input texts will also transfer well across different models. In this section, we discover such word replacement rules using an ensemble model; these rules can then be used to generate model-agnostic adversarial examples with strong transferability. The core loop of the rule-discovery algorithm (Figure 3) is: for each word w_i in the input text x, and for each word ŵ_i that can be used to replace w_i, let x̂_i be x with w_i replaced by ŵ_i.
A word replacement rule is defined as a pair (z, w → ŵ), where z is a class label and w → ŵ means replacing the original word w with ŵ when the gold label is z. Each rule is associated with a salience h(z, w → ŵ) specifying its priority; a higher number denotes a higher priority. We propose an algorithm to discover Highly-transferable Adversarial Word Replacement (HAWR) rules (see Figure 3). The idea behind this algorithm is to estimate the changes in log-likelihood caused by the word replacements. Once such rules are obtained, they can be used to generate adversarial examples as follows: given an input sentence x and its label y, we find the word w_i in x with the highest value of h(y, w_i → ŵ_i) and replace w_i with ŵ_i in x; we repeat this step for the remaining words in x until the percentage of altered words reaches a given threshold. Note that such adversarial examples can be generated without access to target models. We report the attack success rates of the adversarial examples generated by applying HAWR rules in Table 5. The attacks based on HAWR rules are comparable to the PWWS and GA algorithms, which require a large number of queries to the victim model, while attackers using HAWR do not need to access the victim models at all. In this experiment, we use the ensemble of six models found by our genetic algorithm to discover the HAWR rules. We list five adversarial word replacement rules each for the positive and negative categories discovered from the MR dataset in Table 6.
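The greedy application procedure can be sketched as below; the rule table `rules`, mapping (label, word) to (replacement, salience), is a hypothetical representation of the discovered HAWR rules:

```python
def apply_rules(tokens, label, rules, max_ratio=0.3):
    """Greedily apply word replacement rules in decreasing order of
    salience h until the perturbation budget is exhausted."""
    budget = max(1, int(len(tokens) * max_ratio))
    # Collect (salience, position, replacement) for every replaceable word.
    scored = [(rules[(label, w)][1], i, rules[(label, w)][0])
              for i, w in enumerate(tokens) if (label, w) in rules]
    out = list(tokens)
    for _, i, w_hat in sorted(scored, reverse=True)[:budget]:
        out[i] = w_hat
    return out
```

With a 30% budget on a six-word review, only the single highest-salience rule fires; no victim-model queries are needed at any point.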
To understand these adversarial word replacement rules, we analyze the pointwise mutual information (PMI) between words and class labels before and after the replacements. The PMI of a pair of discrete random variables quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions. Here, it is used to find collocations and associations between words and labels; the PMI of a word w and a label z ∈ Z can be computed as PMI(w, z) = log [p(w, z) / (p(w) p(z))], where p(·) assigns a probability to each possible value. The results show that the PMI values of a word and its replacement are significantly different even though they are synonyms. This demonstrates the bias in the training data.
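The PMI statistic can be estimated by simple counting; a minimal sketch over tokenized, labeled documents:

```python
import math
from collections import Counter

def pmi_table(docs):
    """PMI(w, z) = log(p(w, z) / (p(w) * p(z))), estimated from a corpus
    of (tokens, label) pairs by counting token occurrences."""
    word_c, label_c, joint_c, n = Counter(), Counter(), Counter(), 0
    for tokens, z in docs:
        for w in tokens:
            word_c[w] += 1
            label_c[z] += 1
            joint_c[(w, z)] += 1
            n += 1
    return {(w, z): math.log(c * n / (word_c[w] * label_c[z]))
            for (w, z), c in joint_c.items()}
```

A word that occurs equally often under every label gets a PMI near zero, while a label-specific word gets a large positive PMI with that label.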
We obtain similar word replacement rules by ranking all candidate words according to their PMI with each label, estimated from the training set. For each label z ∈ Z and a possible word replacement w → ŵ, the analogous salience h(z, w → ŵ) can be computed as:

h(z, w → ŵ) = PMI(w, z) − PMI(ŵ, z).

We also report in Table 5 the attack success rates of the adversarial examples generated by the word replacement rules obtained using PMI alone. The adversarial examples produced by HAWR rules achieve stronger transferability than those produced by PMI rules. We believe this is because HAWR rules are distilled using the logits predicted by the models, and the changes in the logits reflect both the characteristics of neural networks and the contexts in which the word replacements are applied.

Case Study: Natural Language Inference
To evaluate the generalizability of the proposed method, we redo the entire process (illustrated in Figure 1) on a new task, natural language inference (NLI), as a case study. We conducted the experiments for this task on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015). The HAWR rules generated by our algorithm were tested on three models (ESIM, DecompAtt, and XLNet, listed in Table 7) that are unseen during the process of finding these rules, and compared to a recently proposed attack algorithm, TextFooler (Jin et al., 2020), which was not used when discovering the rules. We reuse the 63 different neural models for text classification (see Appendix A.1) to create a model pool for this task. To perform the NLI task, each model in the pool encodes the premise and hypothesis separately and then feeds the concatenation of these encodings to a two-layer feedforward network. We use the ensemble of six models (see Appendix A.5) identified by our genetic algorithm to discover HAWR rules, based on the adversarial transferability rates between each pair of models in the pool. As shown in Table 7, the attacks based on HAWR rules are comparable to TextFooler, which requires many queries to the victim models.

[Table caption: Attack success rates of adversarial examples generated by applying the word replacement rules found by our algorithm (HAWR) and by pointwise mutual information (PMI) on all 63 models and three representative models, on the AGNEWS and MR datasets, compared to the PWWS and GA attack algorithms. "Succ%" denotes the attack success rate in terms of the number of sentences, and "Qry#" the average number of queries to the victim model required by the attack algorithms. The maximum percentage of words allowed to be perturbed was set to 30%.]

Ribeiro et al. (2018) presented semantic-preserving perturbations that cause models to change their predictions via paraphrases generated by back-translation, and generalized these perturbations into universal replacement rules that induce adversaries on many text instances.
They use the word "universal" to mean that their replacement rules can be applied to any input text that the rules match, and these rules were generalized across some specific models. With a different goal, we aim to find highly-transferable adversarial replacement rules whose crafted adversarial examples can fool almost all models. Besides, the number of their replacement rules is relatively small compared to ours.

Conclusion
We investigated four critical factors of neural NLP models, namely network architecture, tokenization scheme, embedding type, and model capacity, and how they impact the transferability of text adversarial examples, using more than sixty different models. We also proposed a genetic algorithm to find an optimal ensemble of very few models that can be used to generate adversarial examples that transfer well to all the other models. We then described an algorithm to discover highly-transferable adversarial word replacement rules that can be applied to craft adversarial examples with strong transferability across various neural models without access to any of them. Finally, since those adversarial examples are model-agnostic, they provide an analysis of global model behavior and help to identify dataset biases.

A.1 All Neural Models under Investigation
We systematically investigated many popular neural model architectures with different configurations. Specifically, we consider various network architectures (LSTM, BiLSTM, CNN, or BERT), tokenization schemes (word, character, or word + character, denoted by "W", "C", and "WC" respectively), word embeddings (randomly initialized, GloVe, word2vec, or fastText), and model capacities (various numbers of layers). All models under investigation are listed in Table 8, and we believe they cover the popular neural networks used for text classification in the NLP literature.

A.2 Transferability among Different Neural Models
We show in Figure 4 the transferability rates among all neural models in the model pool. The column and row headers indicate the IDs of the source and target models, respectively. The mapping between IDs and the corresponding models is shown in Table 8. We generate adversarial examples by attacking a source model and report the transferability rates on a target (or victim) model.

A.3 Heatmap of Word Importance
To study how each word in the sentence impacts the prediction of the model, we define word importance as follows:
• For an original word, its importance is calculated as the difference between the log likelihood of the gold label before and after the original word is replaced with a special "unknown" symbol (<unk>).
• For a substitute word, its importance is estimated as the difference between the log likelihood of the gold label predicted by the model before and after the original word is replaced with the substitute one.

Figures 6 and 7 show the importance of original and substitute words for different models. Here we only consider the models (with one hidden layer) listed in Table 8 and take the following sentence as an example input: Storage, servers bruise HP earnings update Earnings per share rise compared with a year ago, but company misses analysts' expectations by a long shot.
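Both importance scores can be computed with one helper; `log_likelihood(tokens, label)` stands in for the model's scoring function (an assumed interface):

```python
def word_importance(log_likelihood, tokens, gold, i, substitute="<unk>"):
    """Drop in the gold-label log-likelihood when the i-th word is replaced.
    With substitute="<unk>" this scores the original word's importance;
    with a candidate word it scores that substitute's importance."""
    perturbed = tokens[:i] + [substitute] + tokens[i + 1:]
    return log_likelihood(tokens, gold) - log_likelihood(perturbed, gold)
```

A large positive value means the model leans heavily on that word for the gold label; values near zero mean the word barely matters to the prediction.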
We observed that different models generally show similar behavior: for the original words, most models mainly focus on three words, namely "Storage", "servers", and "HP"; for the substitute words, most models attend to the word "depot". Thanks to this similarity, it is possible to generate adversarial examples using one model that strongly transfer to the others. However, the models from the BERT family are much more robust to transfer adversarial attacks: they tend to distribute their "attention" over more words, both for original and substitute words. Character-based models also distribute their attention in a way that is clearly different from the other models. These differences can explain the lower transferability rates achieved by the adversarial examples generated using BERTs and character-based models.

Figure 7: Importance of the words used to replace the original word "Storage" (the first word in the sentence).

A.4 Selected Ensembles for the Text Classification Task

Table 9 shows the ensemble models selected by the proposed genetic algorithm and by the human expert on the AGNEWS and MR datasets.

A.5 Selected Ensembles for the NLI Task

Table 10 shows the ensemble models selected by the proposed genetic algorithm on the SNLI dataset.