Gradient-based Adversarial Attacks against Text Transformers

We propose the first general-purpose gradient-based adversarial attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix, hence enabling gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art attack performance on a variety of natural language tasks, outperforming prior work in terms of adversarial success rate with matching imperceptibility as per automated and human evaluation. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods, while only requiring hard-label outputs.


Introduction
Deep neural networks are sensitive to small, often imperceptible changes in the input, as evidenced by the existence of so-called adversarial examples (Biggio et al., 2013; Szegedy et al., 2013). The dominant method for constructing adversarial examples defines an adversarial loss, which encourages prediction error, and then minimizes the adversarial loss over the input space with established optimization techniques. To ensure that the perturbation is hard to detect by humans, existing methods also introduce a perceptibility constraint into the optimization problem. Variants of this general strategy have been successfully applied to image and speech data (Madry et al., 2017; Carlini and Wagner, 2017, 2018).
However, optimization-based search strategies for obtaining adversarial examples are much more challenging to apply to text data. Attacks against continuous data types such as image and speech utilize gradient descent for superior efficiency, but the discrete nature of natural language prohibits such first-order techniques. In addition, perceptibility for continuous data can be approximated with L2- and L∞-norms, but such metrics are not readily applicable to text data. To circumvent this issue, some existing attack approaches have opted for heuristic word-replacement strategies, optimizing by greedy or beam search using black-box queries (Jin et al., 2020; Li et al., 2020a,b; Garg and Ramakrishnan, 2020). Such heuristic strategies typically introduce unnatural changes that are grammatically or semantically incorrect (Morris et al., 2020a).
In this paper, we propose a general-purpose framework for gradient-based adversarial attacks, and apply it against transformer models on text data. Our framework, GBDA (Gradient-based Distributional Attack), consists of two key components that circumvent the difficulties of gradient descent for discrete data under perceptibility constraints. First, instead of constructing a single adversarial example, we search for an adversarial distribution. We instantiate examples with the Gumbel-softmax distribution (Jang et al., 2016), parameterized by a continuous-valued matrix of coefficients that we optimize with a vanilla gradient-based method. Second, we enforce perceptibility and fluency using BERTScore (Zhang et al., 2019) and language model perplexity, respectively, both of which are differentiable and can be added to the objective function as soft constraints. The combination of these two components enables powerful, efficient, gradient-based text adversarial attacks.
We empirically demonstrate the efficacy of GBDA against several transformer models. In addition, we also evaluate under the transfer-based black-box threat model by sampling from the optimized adversarial distribution and querying a different, potentially unknown target model. On a variety of tasks including news/article categorization, sentiment analysis, and natural language inference, our method achieves state-of-the-art attack success rates while preserving fluency, grammatical correctness, and a high level of semantic similarity to the original input.

Figure 1: Overview of our attack framework. The parameter matrix Θ is used to sample a sequence of probability vectors π̃_1, …, π̃_n, which is forwarded through three (not necessarily distinct) models: (i) the target model for computing the adversarial loss, (ii) the language model for the fluency constraint, and (iii) the BERTScore model for the semantic similarity constraint. Because each loss component and the Gumbel-softmax distribution are differentiable, our framework is fully differentiable, hence enabling gradient-based optimization.
In summary, the main contributions of our paper are as follows:
1. We define a parameterized adversarial distribution and optimize it using gradient-based methods. In contrast, most prior work constructs a single adversarial example using black-box search.
2. By incorporating differentiable fluency and semantic similarity constraints into the adversarial loss, our white-box attack produces more natural adversarial texts while achieving a new state-of-the-art success rate.
3. The adversarial distribution can be sampled efficiently to query different target models in a black-box setting. This enables a powerful transfer attack that matches or exceeds the performance of existing attacks. Compared to prior work that operates on continuous-valued outputs from the target model, this transfer attack only requires hard labels.

Background
Adversarial examples constitute a class of robustness attacks against neural networks. Let h : X → Y be a classifier, where X, Y are the input and output domains, respectively. Suppose that x ∈ X is a test input that the model correctly predicts as the label y = h(x) ∈ Y. An (untargeted) adversarial example is a sample x′ ∈ X such that h(x′) ≠ y but x and x′ are imperceptibly close.
The notion of perceptibility is introduced so that x′ preserves the semantic meaning of x for a human observer. At a high level, x′ constitutes an attack on the model's robustness if a typical human would not misclassify x′ but the model h does.
For image data, since the input domain X is a subset of the Euclidean space R^d, a common surrogate for perceptibility is a distance metric such as the Euclidean distance or the Chebyshev distance. In general, one can define a perceptibility metric ρ : X × X → R_≥0 and a threshold ε > 0, deeming x′ imperceptibly close to x whenever ρ(x, x′) ≤ ε.

Search problem formulation. The process of finding an adversarial example is typically modeled as an optimization problem. For classification, the model h outputs a logit vector φ_h(x) ∈ R^K such that y = arg max_k φ_h(x)_k. To encourage the model to misclassify an input, one can define an adversarial loss such as the margin loss:

ℓ_margin(x, y) = max( φ_h(x)_y − max_{k≠y} φ_h(x)_k + κ, 0 ),    (1)

so that the model misclassifies x by a margin of κ > 0 when the loss is 0. The margin loss has been widely used in attack algorithms for image data (Carlini and Wagner, 2017).
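As a concrete illustration, the margin loss can be sketched in a few lines of NumPy. This is a toy stand-in, not code from any attack library; the names `margin_loss` and `kappa` are ours:

```python
import numpy as np

def margin_loss(logits, y, kappa=5.0):
    """Margin loss: becomes zero once some class other than the true
    label y beats the true logit by at least kappa."""
    logits = np.asarray(logits, dtype=float)
    others = np.delete(logits, y)              # logits of all classes except y
    return float(max(logits[y] - others.max() + kappa, 0.0))
```

Minimizing this loss pushes the true-class logit below the best competing logit by the margin κ, at which point the loss saturates at zero and the optimizer can focus on the perceptibility constraints.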
Given an adversarial loss ℓ_adv, the process of constructing an adversarial example can be cast as a constrained optimization problem:

min_{x′ ∈ X} ℓ_adv(x′, y)  subject to  ρ(x, x′) ≤ ε.    (2)

An alternative formulation relaxes the hard constraint into a soft constraint with weight λ > 0:

min_{x′ ∈ X} ℓ_adv(x′, y) + λ ρ(x, x′),    (3)

which can then be solved using gradient-based optimizers if the constraint function ρ is differentiable.

Text Adversarial Examples
Although the search problem formulation in Equation 2 has been widely applied to continuous data such as image and speech, it does not directly apply to text data because (1) the data space X is discrete, hence not permitting gradient-based optimization; and (2) the constraint function ρ is difficult to define for text data. In fact, both issues arise when considering attacks against any discrete input domain, but the latter is especially relevant for text data due to the sensitivity of natural language. For instance, inserting the word not into a sentence can negate the meaning of the whole sentence despite having a token-level edit distance of 1.
Prior work. Several attack algorithms have been proposed to circumvent these two issues, using a multitude of approaches. For attacks that operate on the character level, perceptibility can be approximated by the number of character edits, i.e., replacements, swaps, insertions and deletions (Ebrahimi et al., 2017; Gao et al., 2018). Attacks that operate on the word level adopt heuristics such as synonym substitution (Samanta and Mehta, 2017; Zang et al., 2020; Maheshwary et al., 2020) or replacing words by ones with similar word embeddings (Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2020). More recent attacks have also leveraged masked language models such as BERT (Devlin et al., 2019) to generate word substitutions by replacing masked tokens (Garg and Ramakrishnan, 2020; Li et al., 2020a,b). Most of the aforementioned attacks follow the common recipe of proposing character-level or word-level perturbations to generate a constrained candidate set and optimizing the adversarial loss greedily or using beam search.
Shortcomings in prior work. Despite the plethora of attacks against natural language models, their efficacy remains subpar compared to attacks on other data modalities. Both character-level and word-level changes are still relatively detectable, especially as such changes often introduce misspellings, grammatical errors, and other artifacts of unnaturalness in the perturbed text (Morris et al., 2020a). Moreover, prior attacks mostly query the target model h as a black box and rely on zeroth-order strategies for minimizing the adversarial loss, resulting in sub-optimal performance. For instance, BERT-Attack (Li et al., 2020b), arguably the state-of-the-art attack against BERT, only reduces the test accuracy of the target model on the AG News dataset (Zhang et al., 2015) from 95.1% to 10.6%. In comparison, attacks against image models can consistently reduce the model's accuracy to 0 on almost all computer vision tasks (Akhtar and Mian, 2018). This gap in performance raises the question of whether gradient-based search can produce more fluent and optimal adversarial examples on text data. In this work, we show that our gradient-based attack can reduce the same model's accuracy from 95.1% to 3.5% while being more semantically faithful to the original text. This result shows that gradient-based search for text adversarial examples can indeed close the performance gap between vision and text attacks.

Other Attacks
While most work on adversarial attacks on text falls within the formulation defined at the beginning of section 2, other notions of adversarial perturbation exist as well. One class of such attacks is known as universal adversarial triggers: short snippets of text that, when appended to any input, cause the model to misclassify (Wallace et al., 2019; Song et al., 2020). However, such triggers often contain unnatural combinations of words or tokens, and hence are very perceptible to a human observer.
Our work falls within the general area of adversarial learning, and many prior works in this area have explored the notion of adversarial example on different data modalities. Although the most prominent data modality by far is image, adversarial examples can be constructed for speech (Carlini and Wagner, 2018) and graphs (Dai et al., 2018;Zügner et al., 2018) as well.

GBDA: Gradient-based Distributional Attack
In this section, we detail GBDA, our general-purpose framework for gradient-based text attacks against transformers. Our framework leverages two important insights: (1) we define a parameterized adversarial distribution that enables gradient-based search using the Gumbel-softmax (Jang et al., 2016); and (2) we promote fluency and semantic faithfulness of the perturbed text using soft constraints on both perplexity and semantic similarity.

Adversarial Distribution
Let z = z_1 z_2 ⋯ z_n be a sequence of tokens, where each z_i ∈ V is a token from a fixed vocabulary V of size V. Consider a distribution P_Θ parameterized by Θ ∈ R^{n×V}, which yields samples z ∼ P_Θ by independently sampling each token z_i ∼ Categorical(π_i), where π_i = Softmax(Θ_i) is a vector of token probabilities for the i-th token. We aim to optimize the parameter matrix Θ so that samples z ∼ P_Θ are adversarial examples for the model h. To do so, we define the objective function for this goal as:

min_Θ E_{z ∼ P_Θ} ℓ(z, y),    (4)

where ℓ is a chosen adversarial loss.
Extension to probability vector inputs. The objective function in Equation 4 is non-differentiable due to the discrete nature of the categorical distribution. Instead, we propose a relaxation of Equation 4 by first extending the model h to take probability vectors as input, and then using the Gumbel-softmax approximation (Jang et al., 2016) of the categorical distribution to derive the gradient. Transformer models take as input a sequence of tokens that are converted to embedding vectors using a lookup table. Let e(·) be the embedding function so that the input embedding for the token z_i is e(z_i) ∈ R^d for some embedding dimension d. Given a probability vector π_i that specifies the sampling probability of the token z_i, we define

e(π_i) = Σ_{j=1}^{V} (π_i)_j e(j)    (5)

as the embedding vector corresponding to the probability vector π_i. Note that if π_i is a one-hot vector corresponding to the token z_i, then e(π_i) = e(z_i). We extend this notation to an input sequence of probability vectors π = π_1 ⋯ π_n as e(π) = e(π_1) ⋯ e(π_n) by concatenating the input embeddings.
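The extended embedding function is simply a probability-weighted mixture of embedding rows. A minimal NumPy sketch, with a random toy table standing in for a transformer's embedding lookup (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8                          # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, d))           # embedding lookup table; row j is e(j)

def embed_probs(pi):
    # e(pi_i) = sum_j (pi_i)_j * e(j): a probability-weighted mix of embeddings,
    # computed for all positions at once.
    return pi @ E                     # (n, V) @ (V, d) -> (n, d)
```

When each row of `pi` is one-hot, this reduces exactly to ordinary embedding lookup, which is the consistency property the text relies on.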
Extending the model h to take probability vectors as input allows us to leverage the Gumbel-softmax approximation to derive smooth estimates of the gradient of Equation 4. Samples π̃ = π̃_1 ⋯ π̃_n from the Gumbel-softmax distribution P̃_Θ are drawn according to the process:

(π̃_i)_j = exp((Θ_{i,j} + g_{i,j}) / T) / Σ_{v=1}^{V} exp((Θ_{i,v} + g_{i,v}) / T),    (6)

where g_{i,j} ∼ Gumbel(0, 1) and T > 0 is a temperature parameter that controls the smoothness of the Gumbel-softmax distribution. As T → 0, this distribution converges towards the distribution Categorical(Softmax(Θ_i)).
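A hedged NumPy sketch of this sampling process; in a real implementation one would use an autograd framework (e.g. `torch.nn.functional.gumbel_softmax`) so that gradients flow through the sample into Θ:

```python
import numpy as np

def gumbel_softmax(theta, T=1.0, seed=None):
    """Draw one sample pi~ from the Gumbel-softmax relaxation of
    Categorical(Softmax(theta)), row-wise over an (n, V) matrix theta."""
    rng = np.random.default_rng(seed)
    g = rng.gumbel(size=theta.shape)            # g_{i,j} ~ Gumbel(0, 1)
    z = (theta + g) / T
    z = z - z.max(axis=-1, keepdims=True)       # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)    # rows sum to one
```

Each row of the output is a valid probability vector; lowering T sharpens the rows toward one-hot vectors, recovering discrete categorical samples in the limit.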
We can now optimize Θ using gradient descent by defining a smooth approximation of the objective function in Equation 4:

min_Θ E_{π̃ ∼ P̃_Θ} ℓ(e(π̃), y),    (7)

where the expectation can be estimated using stochastic samples π̃ ∼ P̃_Θ.

Soft Constraints
Black-box attacks based on heuristic replacements can only constrain the perturbation by proposing changes that fall within the constraint set, e.g., limiting edit distance, replacing words by ones with high word embedding similarity, etc. In contrast, our adversarial distribution formulation can readily incorporate any differentiable constraint function as a part of the objective. We leverage this advantage to include both fluency and semantic similarity constraints in order to produce more fluent and semantically-faithful adversarial texts.
Fluency constraint with a language model. Causal language models (CLMs) are trained with the objective of next-token prediction, maximizing the likelihood of each token given the previous tokens. This allows the computation of likelihoods for any sequence of tokens. More specifically, given a CLM g with log-probability outputs, the negative log-likelihood (NLL) of a sequence z = z_1 ⋯ z_n is given autoregressively by:

NLL_g(z) = − Σ_{i=1}^{n} log p_g(z_i | z_1 ⋯ z_{i−1}),

where −log p_g(z_i | z_1 ⋯ z_{i−1}) is the cross-entropy between the delta distribution on token z_i and the predicted token distribution g(z_1 ⋯ z_{i−1}), for i = 1, …, n.
We extend the definition of NLL to the setting where inputs are vectors of token probabilities by:

NLL_g(π) = − Σ_{i=1}^{n} log p_g(π_i | π_1 ⋯ π_{i−1}),  with  log p_g(π_i | π_1 ⋯ π_{i−1}) = Σ_{j=1}^{V} (π_i)_j g(e(π_1) ⋯ e(π_{i−1}))_j,

so that −log p_g(π_i | π_1 ⋯ π_{i−1}) is the cross-entropy between the next-token distribution π_i and the predicted next-token distribution g(e(π_1) ⋯ e(π_{i−1})). This extension coincides with the NLL of a token sequence x when each π_i is a delta distribution on the token x_i.
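Given the language model's predicted log-distributions for each position, the extended NLL reduces to a sum of cross-entropies. A small sketch, assuming `lm_logp[i]` holds the predicted log-distribution over the vocabulary at position i (the helper name is ours):

```python
import numpy as np

def soft_nll(pi, lm_logp):
    """Extended NLL: cross-entropy between each probability vector pi_i and
    the LM's predicted log-distribution at that position, summed over i."""
    return float(-(np.asarray(pi) * np.asarray(lm_logp)).sum())
```

When the rows of `pi` are delta (one-hot) distributions, this recovers the ordinary NLL of the corresponding token sequence.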
Similarity constraint with BERTScore. Prior work on word-level attacks often used context-free embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) or synonym substitution to constrain semantic similarity between the original and perturbed text (Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2020). These constraints tend to produce out-of-context and unnatural changes that alter the semantic meaning of the perturbed text (Garg and Ramakrishnan, 2020). Instead, we propose to use BERTScore (Zhang et al., 2019), a similarity score for evaluating text generation that captures the semantic similarity between pairwise tokens via the contextualized embeddings of a transformer model. Let x = x_1 ⋯ x_n and z = z_1 ⋯ z_m be two token sequences and let g be a language model that produces contextualized embeddings φ(x) = (u_1, …, u_n) and φ(z) = (v_1, …, v_m). The (recall) BERTScore between x and z is defined as:

R_BERT(x, z) = Σ_{i=1}^{n} w_i max_{j=1,…,m} u_i^T v_j,

where w_i is the normalized inverse document frequency of the token x_i computed across a corpus of data. We can readily substitute z with a sequence of probability vectors π = π_1 ⋯ π_m as described in Equation 5 and use ρ_g(x, π) = 1 − R_BERT(x, π) as a differentiable soft constraint.
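A minimal sketch of the recall BERTScore on precomputed embeddings; in practice the embeddings come from a transformer and the idf weights from a reference corpus, so everything below is an illustrative stand-in:

```python
import numpy as np

def bertscore_recall(U, Vm, idf):
    """Recall BERTScore: each reference embedding u_i greedily matches its
    most cosine-similar candidate v_j; matches are idf-weighted."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    Vm = Vm / np.linalg.norm(Vm, axis=1, keepdims=True)
    sims = U @ Vm.T                 # pairwise cosine similarities (n, m)
    w = idf / idf.sum()             # normalized inverse document frequencies
    return float((w * sims.max(axis=1)).sum())
```

Because the max and the weighted sum are (almost everywhere) differentiable in the candidate embeddings, 1 − R_BERT can serve directly as a soft constraint in a gradient-based objective.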
Objective function. We combine all the components in the previous sections into a final objective for gradient-based optimization. Our objective function uses the margin loss (cf. Equation 1) as the adversarial loss, and integrates the fluency constraint with a causal language model g and the BERTScore similarity constraint using contextualized embeddings of g:

L(Θ) = E_{π̃ ∼ P̃_Θ} [ ℓ_margin(e(π̃), y) + λ_lm NLL_g(π̃) + λ_sim ρ_g(x, π̃) ],

where λ_lm, λ_sim > 0 are hyperparameters that control the strength of the soft constraints. We minimize L(Θ) stochastically using Adam (Kingma and Ba, 2014) by sampling a batch of inputs from P̃_Θ at every iteration.
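To make the structure of L(Θ) concrete, the following toy sketch evaluates a Monte-Carlo estimate of the objective with deliberately simplistic stand-ins: a linear classifier on mean-pooled soft embeddings in place of the target transformer, a unigram distribution in place of the CLM g, and a plain cosine penalty in place of 1 − R_BERT. All names, models, and constants here are illustrative assumptions; a real implementation would compute this with autograd (e.g. PyTorch) and backpropagate into Θ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, V, d, K = 4, 30, 8, 3                 # seq length, vocab, embed dim, classes
E = rng.normal(size=(V, d))              # toy embedding table
W = rng.normal(size=(d, K))              # toy linear "target model"
lm_logp = np.log(np.full(V, 1.0 / V))    # toy unigram "language model"
x = np.array([3, 7, 1, 9])               # clean token sequence
y = 0                                    # correct label

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def objective(theta, T=1.0, kappa=5.0, lam_lm=1.0, lam_sim=50.0, samples=8):
    """Monte-Carlo estimate of L(Theta) with toy stand-ins for the models."""
    u = np.eye(V)[x] @ E                 # embeddings of the clean input
    vals = []
    for _ in range(samples):
        pi = softmax((theta + rng.gumbel(size=theta.shape)) / T)
        emb = pi @ E                     # soft input embeddings e(pi~)
        logits = emb.mean(axis=0) @ W    # toy classifier on pooled embeddings
        margin = max(logits[y] - np.delete(logits, y).max() + kappa, 0.0)
        nll = -(pi * lm_logp).sum()      # fluency term under the toy LM
        cos = (u * emb).sum(axis=1) / (np.linalg.norm(u, axis=1)
                                       * np.linalg.norm(emb, axis=1))
        sim = 1.0 - cos.mean()           # crude stand-in for 1 - R_BERT
        vals.append(margin + lam_lm * nll + lam_sim * sim)
    return float(np.mean(vals))
```

Initializing `theta = 12.0 * np.eye(V)[x]` mirrors the paper's initialization of Θ near the clean input, and the estimate is then minimized over `theta` with a gradient-based optimizer in the real pipeline.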

Sampling Adversarial Texts
Once Θ has been optimized, we can sample from the adversarial distribution P Θ to construct adversarial examples. Since the loss function L(Θ) that we optimize is an approximation of the objective in Equation 4, it is possible that some samples are not adversarial even when L(Θ) is successfully minimized. Hence, in practice, we draw multiple samples z ∼ P Θ and stop sampling either when the model misclassifies the sample or when we reach a maximum number of samples.
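The rejection-style sampling loop above can be sketched as follows; `predict` is any hard-label oracle for the target model, and the helper name is ours:

```python
import numpy as np

def sample_adversarial(theta, predict, y, max_samples=100, seed=0):
    """Draw z ~ P_Theta until `predict` (a hard-label oracle) mislabels a
    sample, or the sampling budget is exhausted."""
    rng = np.random.default_rng(seed)
    e = np.exp(theta - theta.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)      # pi_i = Softmax(Theta_i)
    n, V = probs.shape
    for _ in range(max_samples):
        z = tuple(rng.choice(V, p=probs[i]) for i in range(n))
        if predict(z) != y:                        # success: model is fooled
            return z
    return None                                    # failed within the budget
```

Because the loop only compares predicted labels against y, the same routine can query any target model that exposes hard labels, which is what enables the black-box transfer attack described next.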
Note that this stage could technically allow us to add hard constraints to the examples we generate, e.g., manually filter out adversarial examples that do not seem natural. In our case, we do not add any extra hard constraint and only verify that the generated example is misclassified by the model.
Transfer to other models. Since drawing from the distribution P Θ could potentially generate an infinite stream of adversarial examples, we can leverage these generated samples to query a target model that is different from h. This constitutes a black-box transfer attack from the source model h. Moreover, our transfer attack does not require the target model to output continuous-valued scores, which most existing black-box attacks against transformers rely on (Jin et al., 2020;Garg and Ramakrishnan, 2020;Li et al., 2020a,b). We demonstrate in subsection 4.2 that this transfer attack enabled by the adversarial distribution P Θ is very effective at attacking a variety of target models.

Experiments
In this section, we empirically validate our attack framework on a benchmark suite of natural language tasks. Code to reproduce our results is open-sourced on GitHub.

Setup
Tasks. We evaluate on several benchmark text classification datasets, including DBPedia (Zhang et al., 2015) and AG News (Zhang et al., 2015) for article/news categorization, Yelp Reviews (Zhang et al., 2015) and IMDB (Maas et al., 2011) for binary sentiment classification, and MNLI (Williams et al., 2017) for natural language inference. The MNLI dataset contains two evaluation sets: matched and mismatched.
Target models. We attack three transformer models: GPT-2, XLM, and BERT, finetuning models ourselves where finetuned models are unavailable. For BERT on DBPedia and GPT-2/XLM on all tasks, we finetune a pretrained model to serve as the target model. The soft constraints described in subsection 3.2 utilize a CLM g with the same tokenizer as the target model. For GPT-2 we use the pretrained GPT-2 model without finetuning as g, and for XLM we use the checkpoint obtained after finetuning with the CLM objective. For masked language models such as BERT (Devlin et al., 2019), we train a causal language model g on WikiText-103 using the same tokenizer as the target model.
Baselines. We compare against several recent attacks on text transformers: TextFooler (Jin et al., 2020), BAE (Garg and Ramakrishnan, 2020), and BERT-Attack (Li et al., 2020b). All baseline attacks are evaluated on finetuned BERT models from the TextAttack library (Morris et al., 2020b). See subsection 4.2 for details of attack settings.
Hyperparameters. Our adversarial distribution parameter Θ is optimized using Adam (Kingma and Ba, 2014) with a learning rate of 0.3 and a batch size of 10 for 100 iterations. The distribution parameters Θ are initialized to zero except Θ_{i,j} = C where j = x_i is the i-th token of the clean input. In practice we take C ∈ {12, 15}. We use λ_lm = 1 and cross-validate λ_sim ∈ [20, 200] and κ ∈ {3, 5, 10} using held-out data.

Quantitative Evaluation
White-box attacks. We first evaluate the attack performance under the white-box setting. Table 1 shows the result of our attacks against GPT-2, XLM (en-de), and BERT on different benchmark datasets. Following prior work (Jin et al., 2020), for each task, we randomly select 1000 inputs from the task's test set as attack targets. After optimizing Θ, we draw up to 100 samples z ∼ P Θ until the model misclassifies z. The model's accuracy after attack (under the column "Adv. Acc.") is the accuracy evaluated on the last of the drawn samples.
Overall, our attack successfully generates adversarial examples against all three models across the five benchmark datasets. The test accuracy can be reduced to below 10% for almost all models and tasks. Following prior work, we also evaluate the semantic similarity between the adversarial example and the original input using the cosine similarity of Universal Sentence Encoders (Cer et al., 2018) (USE). Our attack consistently maintains a high cosine similarity to the original input (above 0.8) in most cases.

Table 3: Result of black-box model transfer attack from GPT-2 to other transformer models. Our attack is achieved by sampling from the same adversarial distribution P_Θ and is able to generalize to the three target transformer models considered in this study.
Model transfer attacks. We also evaluate our attack against prior work under the black-box setting by transferring across models. More specifically, for each model and task, we randomly select 1000 test samples and optimize the adversarial distribution P_Θ on GPT-2. After optimizing Θ, we draw up to 1000 samples z ∼ P_Θ and evaluate them on the target BERT model from the TextAttack library (Morris et al., 2020b) until the model misclassifies z. This attack setting is strictly more restrictive than prior work because our query procedure only requires the target model to output a discrete label in order to decide when to stop sampling from P_Θ, whereas prior work relied on a continuous-valued output score such as class probabilities. Table 2 shows the performance of our attack when transferred to finetuned BERT text classifiers. In all settings, GBDA reduces the target model's accuracy to below that of BERT-Attack and BAE with a similar or smaller number of queries. Moreover, the cosine similarity between the original input and the adversarial example is higher than that of BERT-Attack.
We further evaluate our model transfer attack against three other finetuned transformer models from the TextAttack library: ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2019), and XLNet (Yang et al., 2019). For this experiment, we use the same Θ optimized on GPT-2 for each of the target models. Table 3 reports the performance of our attack after randomly sampling up to 1000 times from P_Θ. The attack performance is comparable to that of the transfer attack against BERT in Table 2, which means our adversarial distribution P_Θ is able to capture the common failure modes of a wide variety of transformer models.
Dataset transfer attacks. The model transfer attack relies on the assumption that the adversary has access to the target model's training data. We relax this assumption in the form of a dataset transfer attack where only the target model's task is known. Concretely, we attack sentiment classifiers trained on Yelp/IMDB by using a model trained on one dataset for optimizing Θ and drawing up to 1000 samples from P_Θ to attack the target model trained on the other dataset. Table 4 shows the result of the dataset transfer attack for different target model architectures. In all except the case of GPT-2→BERT, the model used when optimizing P_Θ has the same architecture as the target model; in the last setting, we simultaneously transfer between the model and the dataset. It is evident that the transfer attack remains successful despite not having access to the target model's training data. This result opens a practical avenue of attack against real-world systems, as the attacker requires very limited knowledge of the target model in order to succeed.

Table 5: Sample adversarial texts (MNLI).
- Original, Entailment (83%): He found himself thinking in circles of worry and pulled himself back to his problem. He got lost in loops of worry, but snapped himself back to his problem.
- GBDA, Neutral (95%): He found himself thinking in circles of worry and pulled himself back to his problem. He got lost in loops of hell, but snapped himself back to his problem.
- Original, Contradiction (78%): Steps are initiated to allow program board membership to reflect the client-eligible community and include representatives from the funding community, corporations and other partners. There isn't a fair representation of board members on the program.
- GBDA, Neutral (98%): Steps are initiated to allow program board membership to reflect the client-eligible community and include representatives from the funding community, corporations and other partners. There isn also a fair representation of board members on the program..

Table 6: Adversarial examples for GPT-2 on AG News, with and without the fluency constraint.
- Original, World (99%): Turkey a step closer to Brussels The European Commission is set to give the green light later today to accession talks with Turkey. EU leaders will take a final decision in December.
- GBDA w/ fluency, Business (100%): Turkey a step closer to Brussels The eurozone Union is set to give the green light later today to accession talks with Barcelona. EU leaders will take a final decision in December.
- GBDA w/o fluency, Business (77%): Turkey a step closer to Uber Thecom Commission is set to give the green light later today to accessrage negotiations with Turkey. EU leaders will take a final decision in December.

Analysis
Sample adversarial texts. Table 5 shows examples of our adversarial attack on text. Our method introduces minimal changes to the text, preserving most of the original sentence's meaning. Despite not explicitly constraining replaced words to have the same Part-Of-Speech tag, we observe that our soft penalties make the adversarial examples obey this constraint. For instance, in the first and third examples of Table 5, "worry" is replaced with "hell" and "no" with "varying".
Effect of λ sim . Figure 2 shows the impact of the similarity constraint on transfer attack adversarial accuracy for GPT-2 on AG News. Each color corresponds to a different target model, whereas the color shade (from light to dark) indicates the value of the constraint hyperparameter: λ sim = 50, 20, 10. A higher value of λ sim reduces the aggressiveness of the perturbation, but also increases the number of queries required to achieve a given target adversarial accuracy.
Impact of the fluency constraint. Table 6 shows adversarial examples for GPT-2 on AG News, generated with and without the fluency constraint. We fix all hyperparameters except the fluency regularization constant λ_lm, and sample successful adversarial texts from P_Θ after Θ has been optimized. It is evident that the fluency constraint encourages token combinations that form valid words and ensures grammatical correctness of the adversarial text. Without the fluency constraint, the adversarial text tends to contain nonsensical words.
Tokenization artifacts. Our attack operates entirely on tokenized inputs. However, the input to the classification system is often in raw text form, which is then tokenized before being fed to the model. Thus it is possible that we generate an adversarial example that, when converted to raw text, is not re-tokenized to the same set of tokens. Consider this example: our adversarial example contains the tokens "jui-" and "cy", which decodes into "juicy", and is then re-encoded to "juic-" and "y". In practice, we observe that these retokenization artifacts are rare: the "token error rate" is around 2%. Furthermore, they do not impact adversarial accuracy by much: the re-tokenized example is in fact still adversarial. One potential mitigation strategy is to re-sample from P Θ until the sampled text is stable under re-tokenization. Note that all our adversarial accuracy results are computed after re-tokenization.
Runtime. Our method relies on white-box optimization and thus necessitates forward and backward passes through the attacked model, the language model and the similarity model, which increases the per-query time compared to black-box attacks that only compute forward passes. However, this is compensated by a much more efficient optimization which brings the total runtime to 20s per generated example, on par with black-box attacks such as BERT-Attack (Li et al., 2020b).
Human evaluation. We further conduct a human evaluation study of our attacks to examine to what extent adversarial texts generated by GBDA are truly imperceptible. Our interface is shown in Figure 3: we show annotators on Amazon Mechanical Turk two snippets of text, one unmodified and one adversarially corrupted, and the annotator has to select the corrupted one in less than 10 seconds. The clean text is sampled from Yelp and the adversarial text is generated against BERT using either our method or BAE, our strongest baseline. To ensure high-quality annotations, we select annotators with more than 1000 approved HITs and an approval rate higher than 98%. The annotation itself is preceded by an onboarding with three simple examples that have to be classified correctly in order for the annotator to qualify for the task.
Averaging across more than 3000 samples, annotators detect BAE examples in 78.04% of cases and our examples in 76.06% of cases. Although both GBDA and BAE produce detectable changes, our method is thus slightly less perceptible than BAE, while the model accuracy after attack is significantly lower for our attack: 4.7% for GBDA compared to 12.0% for BAE (cf. Tables 1 and 2).

Conclusion and Future Work
We presented GBDA, a framework for gradient-based white-box attacks against text transformers. Our approach overcomes many ad hoc constraints and limitations of the existing text attack literature by leveraging a novel adversarial distribution formulation, allowing end-to-end optimization of the adversarial loss and fluency constraints with gradient descent. This makes our method generic and potentially applicable to any model for token sequence prediction.
Limitations. One clear limitation of GBDA is its restriction to only token replacements. Indeed, our adversarial distribution formulation using the Gumbel-softmax does not trivially extend to token insertions and deletions. This limitation may adversely affect the naturalness of the generated adversarial examples. We hope to extend our framework to incorporate a broader set of token-level changes in the future.
In addition, the adversarial distribution P Θ is highly over-parameterized. Despite most adversarial examples requiring only a few token changes, the distribution parameter Θ is of size n×V , which is especially excessive for longer sentences. Future work may be able to reduce the number of parameters without affecting attack performance.