Don’t Search for a Search Method — Simple Heuristics Suffice for Adversarial Text Attacks

Recently, more attention has been given to adversarial attacks on neural networks for natural language processing (NLP). A central research topic has been the investigation of search algorithms and search constraints, accompanied by benchmark algorithms and tasks. We implement an algorithm inspired by zeroth order optimization-based attacks and compare with the benchmark results in the TextAttack framework. Surprisingly, we find that optimization-based methods do not yield any improvement in a constrained setup and slightly benefit from approximate gradient information only in unconstrained setups where search spaces are larger. In contrast, simple heuristics exploiting nearest neighbors without querying the target function yield substantial success rates in constrained setups, and a nearly full success rate in unconstrained setups, at an order of magnitude fewer queries. We conclude from these results that current TextAttack benchmark tasks are too easy and constraints are too strict, preventing meaningful research on black-box adversarial text attacks.


Introduction
Neural networks are currently deployed in many production systems, from image classification to natural language processing (NLP). While they show impressive results, many such systems are also susceptible to adversarial examples. These examples are close in space to data points that are classified correctly, but due to small changes are classified incorrectly. This corresponds to the local non-smoothness noted by Szegedy et al. (2014). Chen et al. (2017) present a zeroth order optimization-based (ZOO) attack that significantly outperforms other black-box attacks for image processing. An initial goal of our work was to investigate whether a ZOO approach that guides the search for adversarial examples by approximate gradient information can be transferred to adversarial attacks in NLP.* To this end, we implemented a ZOO-inspired algorithm in the TextAttack framework and compared it with their benchmark results. Surprisingly, we found that search guided by a ZOO approach only yields minimal improvements, whereas heuristic search exploiting nearest neighbors yields competitive success rates at minimal query count. We conclude from these results that the victim models in current benchmark tasks are too easily broken and imperceptibility constraints are too strict, thus defeating meaningful research on black-box search methods for adversarial attacks.

* Part of the work was done while the first author was interning at Google.

Related Work
Black-box attacks on NLP systems generally attempt to search over discrete tokens and, instead of bounding the perturbation in Euclidean space, use linguistic constraints over the selected tokens. Alzantot et al. (2018) implement a discrete genetic algorithm where genes are tokenized sentences (Genetic Alg in Table 1). Two genes can cross over by sampling a parent for each position in the sentence and selecting the corresponding token. Mutations are performed by sampling from nearby tokens in embedding space; the embedding space they use is the counter-fitted GloVe embeddings (Pennington et al., 2014; Mrkšić et al., 2016). Zang et al. (2020) also work with a discretized search algorithm, in their case Particle Swarm Optimization (PSO in Table 1). Each particle is again a tokenized sentence and has a velocity associated with each token. Instead of moving in continuous space according to its velocity in each step, a position in the sentence "jumps" from one token to another. Instead of operating over embeddings, they use the lexicon HowNet (Dong and Dong, 2003) to find synonyms of current words. Jin et al. (2020) create a powerful baseline for adversarial attacks which first determines an importance ranking over words and then changes each token in order (Word Importance Ranking, WIR in Table 1). Tokens can be changed to one of the 50 nearest neighbours in embedding space but are subject to constraints: the replacement token must have a cosine similarity of at least 0.7 in the attack's embedding space, again the counter-fitted GloVe embedding space of Mrkšić et al. (2016). Additionally, replacement tokens must have the same part of speech, and the sentence-level similarity score given by a Universal Sentence Encoder (Cer et al., 2018) must be above a given threshold. The word importance ranking is determined by the difference in output when a given token is removed from the input.
The above black-box attacks on NLP systems, along with greedy and beam search, are reimplemented in the TextAttack library, which seeks to be a test bed for comparing adversarial attacks on NLP systems. To this end, a companion paper comparing the different attacks has also been released.

Zeroth Order Optimization-Based Attack (ZOO)

Chen et al. (2017) work under the adversarial attack framework of Carlini and Wagner (2017) and use a zeroth order optimization method to attack images in a black-box setting. The ZOO idea is to create pseudo-gradients using zeroth order estimates of the true gradient. Their method performs point-wise perturbations on individual pixels by adding a scaled one-hot vector $e_i$ to the image $x$ and computing the difference between the outputs. The derivative along this single perturbed pixel can then be estimated by the symmetric difference quotient

$$\hat{g}_i = \frac{f(x + \mu e_i) - f(x - \mu e_i)}{2\mu},$$

where $\mu$ controls how far apart the two queried points are. Chen et al. (2017) use coordinate descent (Poljak and Tsypkin, 1973) and perturb a single pixel at a time to construct a gradient. Instead of having to query the model $2m$ times, this method can be optimized by perturbing all pixels simultaneously with a vector $u$ sampled from a multivariate Gaussian distribution (Nesterov and Spokoiny, 2015). This two-point function evaluation in expectation approximates the true gradient of the function, and it can also be applied in situations where the function is unknown but smooth, which is the case in black-box attacks on neural networks. The estimated gradient $\hat{g}$ can then be used in standard gradient descent:

$$x_{t+1} = x_t - \eta \hat{g}.$$

Inspired by the zeroth order optimization-based attack of Chen et al. (2017) and the interpretable displacements of Sato et al. (2018), we developed a discretized version of zeroth order optimization for attacks on NLP systems, dubbed DiscreteZOO.
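As a minimal sketch of the two-point estimate (function and variable names are ours; `f` stands in for the black-box goal function):

```python
import numpy as np

def zoo_coordinate_gradient(f, x, i, mu=1e-4):
    """Estimate the partial derivative of f at x along coordinate i
    with a symmetric two-point finite difference (two queries)."""
    e_i = np.zeros_like(x)
    e_i[i] = 1.0
    return (f(x + mu * e_i) - f(x - mu * e_i)) / (2 * mu)

# Sanity check on a known smooth function: f(x) = x0^2 + 3*x1.
f = lambda x: x[0] ** 2 + 3 * x[1]
x = np.array([2.0, 1.0])
g0 = zoo_coordinate_gradient(f, x, 0)  # close to df/dx0 = 4
g1 = zoo_coordinate_gradient(f, x, 1)  # close to df/dx1 = 3
```

Perturbing all coordinates at once with Gaussian noise, as described above, replaces the one-hot `e_i` with a sampled direction vector and averages over several samples.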

DiscreteZOO for Black-Box Attacks on NLP Systems
Zeroth order optimization assumes nothing about the function to optimize other than the fact that it is smooth. We add the assumptions here that we possess a surrogate embedding space and that there is a smooth transformation from this embedding space to the internal embedding space of the target model. Continuous zeroth order gradient estimation requires that any arbitrary point be queryable in order to use Gaussian noise as perturbations. In the discrete case, however, the system only allows us to query with tokens. Instead of sampling from a Gaussian distribution, then, we sample from the nearest neighbours around the target token in the surrogate embedding space. Given the current token $t$, we can perturb its embedding $e_t$ with a vector $\mu u_{k,t}$, with $\mu = \|e_k - e_t\|$, $u_{k,t} = \frac{e_k - e_t}{\mu}$, and $e_k$ being the embedding of another token. We can use this to calculate the finite differences between two different function evaluations. Sampling a set of multiple neighbouring tokens $N$ for displacements yields the update rule

$$\hat{g}_t = \frac{1}{|N|} \sum_{k \in N} \frac{f(e_t + \mu_k u_{k,t}) - f(e_t)}{\mu_k}\, u_{k,t}, \qquad e_t \leftarrow e_t + \hat{g}_t,$$

with $\mu_k = \|e_k - e_t\|$. Adding this estimated direction vector to the current embedding then moves towards an area of increasing goal function values, where we would hope to find a token suitable for flipping the label. Once the direction vector is added to the embedding, we snap it to the nearest existing token by cosine similarity.
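One such update step can be sketched as follows (all names are hypothetical; the full attack additionally returns early on a label flip and rejects snaps that decrease the goal value):

```python
import numpy as np

def discrete_zoo_step(goal_fn, E, t, neighbors):
    """One DiscreteZOO update for token id t at the target position.
    goal_fn(k) returns the goal function value with token k substituted
    at that position; E is the embedding matrix (rows = token vectors);
    neighbors is the sampled set N of candidate token ids."""
    base = goal_fn(t)
    g = np.zeros(E.shape[1])
    for k in neighbors:
        diff = E[k] - E[t]
        mu = np.linalg.norm(diff)   # distance to the neighbor
        u = diff / mu               # unit displacement direction
        # Finite difference along the displacement direction.
        g += (goal_fn(k) - base) / mu * u
    g /= len(neighbors)
    # Move the embedding toward increasing goal function values ...
    e_new = E[t] + g
    # ... then snap back to the nearest real token by cosine similarity.
    sims = (E @ e_new) / (np.linalg.norm(E, axis=1) * np.linalg.norm(e_new))
    return int(np.argmax(sims))
```

Note that the snap can return the original token `t` itself when the estimated step is small relative to the neighborhood.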
In addition to the ZOO-inspired search, we use the word importance ranking of Jin et al. (2020), which replaces each token in the text with an unk token to decide which tokens to attack first. Additionally, we noticed that label-flipping tokens are often already available within the sampled tokens N. In this case, we should accept the sampled token and return early, saving queries to the model, instead of continuing to sample and construct a direction vector. Another optimization prevents the algorithm from snapping to the nearest token if that token decreases the goal function value.
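The word importance ranking can be sketched as follows (a sketch with hypothetical names; `goal_fn` is assumed to score a full token sequence, with higher values meaning the attack is closer to success):

```python
def word_importance_ranking(goal_fn, tokens, unk="[UNK]"):
    """Rank target positions by how much replacing each token with an
    unk placeholder changes the goal function value (one query per
    position). Positions with higher change are attacked first."""
    base = goal_fn(tokens)
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [unk] + tokens[i + 1:]
        scores.append(goal_fn(masked) - base)
    # Sort positions by descending importance score.
    return sorted(range(len(tokens)), key=lambda i: -scores[i])
```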
A detailed algorithm is given in the appendix. Code implementing the algorithm can be found at https://github.com/StatNLP/discretezoo.

Baselines
In addition to the methods reimplemented in TextAttack, we also implement a random baseline. First, instead of determining the importance order by deleting each token or replacing it with unk, the indices are randomly shuffled and taken as the attack order. Second, replacement tokens are chosen randomly as well. In the constr. setup, the random choice is over the Top-N tokens ranked by proximity in the counter-fitted GloVe embedding space (Mrkšić et al., 2016), filtered by the constraints. In the unconstr. setup, the same method is used to generate the list of tokens, but no filtering is applied. Instead of iterating over the candidates and selecting the token with the highest goal function value, the random baseline samples a single token from the list, inserts it into the sentence, and returns it along with the goal function value if it flips the label. If the goal function value is improved, the token is kept; if not, it is discarded. Either way, the goal function value is returned with the label and does not change the query count. If the label is not flipped, the attack moves on to the next target position.
Two additional baselines are picking the farthest or closest token at each step. They function similarly to the random baseline, but instead of sampling from the list of replacement tokens, they immediately pick the farthest or closest token in embedding space, respectively. If all tokens that pass the constraints are equally acceptable, then picking the most dissimilar acceptable token should be a good heuristic.
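The three heuristics share one skeleton, which could be sketched as follows (names are hypothetical; `goal_fn` and `flips_label` stand in for queries to the victim model, and `candidates[i]` is assumed to list replacement tokens for position i sorted nearest-first):

```python
import random

def heuristic_attack(goal_fn, flips_label, tokens, candidates, mode="random"):
    """One pass over the sentence in random attack order. `mode` selects
    the heuristic: 'random' samples a candidate, 'closest' takes the
    nearest token, 'farthest' the most distant one."""
    best = goal_fn(tokens)
    for i in random.sample(range(len(tokens)), len(tokens)):
        cands = candidates[i]
        if not cands:
            continue
        if mode == "random":
            choice = random.choice(cands)
        elif mode == "closest":
            choice = cands[0]
        else:  # farthest
            choice = cands[-1]
        trial = tokens[:i] + [choice] + tokens[i + 1:]
        if flips_label(trial):
            return True, trial
        score = goal_fn(trial)
        if score > best:  # keep only improving substitutions
            best, tokens = score, trial
    return False, tokens
```

The continued-sampling variant (random, CS) would simply wrap this pass in an outer loop with a query budget instead of visiting each position once.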
The baselines and the WIR methods all iterate over the indices of the target sentence once. In order to see the effects of random continued sampling until the label flips, we also introduce a method called random, CS. This method iterates over the indices multiple times and samples tokens until a label flip is achieved or until a maximum number of tokens has been sampled. For this method to work, the RepeatModification constraint has to be removed. Therefore, it is a stronger attack, as it can make changes that are not available to the other methods. Still, it should illustrate the upper bound achievable with random sampling and keeping tokens that improve the goal function.

Constraints
In TextAttack, a number of constraints are used to maintain the semantic similarity of the adversarial example with the original text. The main constraints of interest are cosine similarity, BERTScore (Zhang et al., 2020), and part-of-speech tagging. In the constr. setting, replacement tokens must have a cosine similarity of at least 0.9 with the original token, the two sentences must have a BERTScore of at least 0.9, and the part of speech cannot be changed. We also introduce an unconstr. setting, where all constraints are removed.

Target Models and Goal Functions
All of the attacks in this paper target multiclass BERT models (Devlin et al., 2019) fine-tuned on SNLI (Bowman et al., 2015) or Movie Reviews (Pang and Lee, 2005). Both are provided by the TextAttack library and are used in its benchmark comparison. The TextAttack attacks use a goal function that seeks to minimize the probability of the true label,

$$g_1(x) = 1 - f(x)_l,$$

where $f(x)$ is the neural network with a probability vector as output and $l$ is the index of the true label. Working from the zeroth order optimization-based attack of Chen et al. (2017), we also implement their margin-based goal function

$$g_2(x) = \max\{\log f(x)_l - \max_{i \neq l} \log f(x)_i,\; -\kappa\},$$

which is to be minimized. In our experiments we found no difference in running all attacks on both goal functions, so here we report only results on the ZOO loss.
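Assuming the victim model returns a probability vector, the two goal functions could be sketched as follows (names are ours; κ is the ZOO confidence margin):

```python
import numpy as np

def untargeted_goal(probs, true_label):
    """TextAttack-style goal: drive down the probability of the true label."""
    return 1.0 - probs[true_label]

def zoo_goal(probs, true_label, kappa=0.0):
    """ZOO/Carlini-Wagner-style margin loss (to be minimized): positive
    while the true label still beats every other class in log space."""
    logp = np.log(probs)
    other = np.max(np.delete(logp, true_label))
    return max(logp[true_label] - other, -kappa)
```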

Results
Results for the baseline attacks and DiscreteZOO are summarized in Table 1. DiscreteZOO achieves higher results than the WIR methods with comparable query counts, but it is not as successful as the beam search or PSO methods. For DiscreteZOO, the standard deviation is either very low or actually zero, which suggests that there is not a large enough space from which the algorithm can sample. The farthest baseline already achieves a success rate of 12.3% on Movie Reviews and 11.6% on SNLI, while random and closest achieve 11.5% and 11.1% on Movie Reviews and 10.1% and 7.9% on SNLI, respectively. Thus, simple heuristics already accomplish a substantial amount of the achievable success rates at minimal query count, while the generally low success rates across all algorithms suggest that the constraints are too strict, preventing exploitation of more sophisticated search techniques. The continued sampling version of the random baseline achieves a success rate of 17.3% on Movie Reviews and 18.7% on SNLI, which is directly competitive with the search-based methods.
In the unconstrained setup, all search methods, including DiscreteZOO, approach success rates of 100%. Although the methods are only able to evaluate the 50 nearest neighbors, the farthest baseline reaches success rates of 78.31% and 81.04% on Movie Reviews and SNLI, respectively. Random and closest follow farthest in terms of success rate, with a query count that is an order of magnitude lower than that of the other methods. The continued sampling version of the random baseline achieves a nearly 100% success rate on both datasets with a similarly small query count. The relative success of the baseline heuristic methods shows that the nearest neighbor structure in the embedding space is already powerful enough to flip the label. Much of the success of the other methods can thus be attributed to these simple heuristics, showing that search is not always necessary.

Analysis
The results of the previous experiments show a slight improvement over previous methods, but they also appear curious. For example, some of the results show no variation on repeated runs even with a stochastic algorithm. Additionally, random token selection appears to be competitive with greedyWIR methods.
The counter-fitted GloVe embedding space (Mrkšić et al., 2016) is very sparse, containing a total of 65,713 tokens. Under the neighborhood threshold given by the constr. constraints (a cosine similarity of 0.9 or higher), the average token in this embedding space has 0.72 neighbors. Among tokens that have any neighbors, the average rises to 2.63. Histograms showing the number of tokens over neighborhood size for different neighborhood definitions can be seen in Figure 1 in the appendix. These values are an upper bound on the actual average number of neighbors, as the constraints also include BERTScore and a part-of-speech constraint which further restrict the space.
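Such statistics can be reproduced with a brute-force neighbor count over the embedding matrix (a sketch; function names are ours):

```python
import numpy as np

def neighbor_counts(E, threshold=0.9):
    """Count, for every token, how many other tokens lie at or above the
    cosine-similarity threshold in the embedding matrix E (rows = token
    vectors). Brute force over the full similarity matrix."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ En.T
    np.fill_diagonal(sims, -1.0)  # a token is not its own neighbor
    return (sims >= threshold).sum(axis=1)

# counts = neighbor_counts(E)
# counts.mean()             -> average neighbors over all tokens
# counts[counts > 0].mean() -> average among tokens with any neighbor
```

For a vocabulary of ~65k tokens the full similarity matrix is large, so in practice one would compute it in row chunks; the logic is unchanged.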
The sparsity of the space coupled with the size of the neighborhoods induced by cosine similarity of 0.9 or higher could explain why the naive methods are able to perform so well compared to more sophisticated optimization methods. There are so few valid replacement tokens in the space that it is entirely feasible to try every option with the greedyWIR or beam methods.
With a large enough sample size, DiscreteZOO already observes all allowed replacement tokens during sampling and is able to stop if one of them flips the label. In this case, DiscreteZOO and greedyWIR should produce similar results. This is effectively demonstrated by the random baseline, which chooses a replacement token randomly and is unable to use information from the goal function to decide between multiple replacement tokens. Because there are so few possible replacement tokens, randomly selecting from the few that are available is roughly as good a strategy as beam search or greedily selecting the best replacement token from all options. The results given by the random baseline are close to those reported for the greedyWIR baselines. This suggests that the success of those attacks compared to the more advanced search methods is not due to the algorithm finding good texts but rather due to the search space being so restricted that any choice will work. Methods besides greedyWIR are allowed more flexibility: instead of considering just one target position at a time, they are allowed to consider the best replacement for all positions. This turns the attack into a combinatorial problem of finding which combination of positions is best to attack instead of which replacement tokens are the best.
Additionally, this shows that the models being attacked are very brittle. Given a single target token that has replacements, there are on average only three to choose from. Selecting one of these three very similar tokens is already enough to flip the label in many cases.

Conclusion
Zeroth order optimization methods have been shown to yield superior performance for black-box attacks in continuous spaces such as images (Chen et al., 2017). In attacks on NLP systems, only a discrete set of tokens is admissible as input, but the tokens are still processed as vectors in continuous space. This allows for optimization in the continuous space instead of in the discrete token space. We implement a zeroth order optimization algorithm in the TextAttack library and compare with its published benchmark results. While our method appears to be competitive, we find that the linguistic constraints imposed on the search methods are so tight that nearly no optimization is necessary. Instead, selecting the farthest allowable token or a random token is already enough to flip the label in many cases. We argue that more robust tasks are required for meaningful research on black-box adversarial text attacks.

A Appendix
A.1 Algorithms

Algorithm 1: Unconstrained Attack
Result: Returns a pair (successful, ŝ), containing a boolean indicating success and a sequence of tokens.
Input: s is the original sentence to attack; T = [t0, t1, ..., tn] is an ordered sequence of indices to target, filtered by pre-transformation constraints; n is the number of replacement tokens to use for calculating displacements; u is the number of gradient updates the algorithm can perform.
Data: E is a matrix of word embeddings; goal_function is a function that returns the goal function value and model prediction.

Algorithm 2: Random Baseline
Result: Returns a pair (successful, ŝ), containing a boolean indicating success and a sequence of tokens.
Input: s is the original sentence to attack; T = [t0, t1, ..., tn−1] is an ordered sequence of indices to target, filtered by pre-transformation constraints; b is a maximum number of queries to perform.
Data: E is a matrix of word embeddings; goal_function is a function that returns the goal function value and model prediction.