A Strong Baseline for Query Efficient Attacks in a Black Box Setting

Existing black box search methods have achieved high success rates in generating adversarial attacks against NLP models. However, such search methods are inefficient as they do not consider the number of queries required to generate adversarial attacks. Also, prior attacks do not maintain a consistent search space while comparing different search methods. In this paper, we propose a query efficient attack strategy to generate plausible adversarial examples on text classification and entailment tasks. Our attack jointly leverages an attention mechanism and locality sensitive hashing (LSH) to reduce the query count. We demonstrate the efficacy of our approach by comparing our attack with four baselines across three different search spaces. Further, we benchmark our results on the same search spaces used in prior attacks. Compared to prior attacks, we reduce the query count by 75% on average across all datasets and target models. We also demonstrate that our attack achieves a higher success rate than prior attacks in a limited query setting.


Introduction
In recent years, Deep Neural Networks (DNNs) have achieved high performance on a variety of tasks (Yang et al., 2016; Goodfellow et al., 2016; Kaul et al., 2017; Maheshwary and Misra, 2018; Devlin et al., 2018). However, prior studies (Szegedy et al., 2013; Papernot et al., 2017) have shown evidence that DNNs are vulnerable to adversarial examples: inputs generated by slightly changing the original input. Such changes are imperceptible to humans but deceive DNNs, raising serious concerns about their utility in real world applications. Existing NLP attack methods are broadly classified into white box attacks and black box attacks. White box attacks require access to the target model's parameters, loss function and gradients to craft an adversarial attack. Such attacks are computationally expensive and require knowledge of the internal details of the target model, which are not available in most real world applications. Black box attacks craft adversarial inputs using only the confidence scores or class probabilities predicted by the target model. Almost all prior black box attacks consist of two major components: (1) a search space and (2) a search method.
A search space is collectively defined by a set of transformations (usually synonyms) for each input word and a set of constraints (e.g., minimum semantic similarity, part-of-speech (POS) consistency). The synonym set for each input word is generated either from the nearest neighbor in the counter-fitted embedding space or from a lexical database such as HowNet (Dong and Dong, 2003) or WordNet (Miller, 1995). The search space is variable and can be altered by either changing the source used to generate synonyms or by relaxing any of the above defined constraints.
A search method is a search algorithm used to find adversarial examples in the above defined search space. Given an input with W words, each word having T possible substitutes, the total number of perturbed text inputs is (T + 1)^W. Given this exponential size, the search algorithm must be efficient and exhaustive enough to find optimal adversarial examples in the whole search space.
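To make the exponential growth concrete, a two-line sketch (illustrative only, not from the paper):

```python
def search_space_size(W, T):
    """Number of candidate texts when each of W words can either stay
    unchanged or take one of T substitutes: (T + 1) ** W."""
    return (T + 1) ** W
```

Even a short 10-word input with 4 synonyms per word already yields nearly ten million candidate texts, which is why exhaustive enumeration is out of the question.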
Black box attacks proposed in (Alzantot et al., 2018; Zang et al., 2020) employ combinatorial optimization procedures as search methods to find adversarial examples in the above defined search space. Such methods are extremely slow and require a massive number of queries to generate adversarial examples. Attacks proposed in (Ren et al., 2019; Jin et al., 2019) search for adversarial examples using word importance ranking, which first ranks the input words and then substitutes them with similar words. Word importance ranking scores each word by observing the change in the confidence score of the target model after that word is removed from the input (or replaced with an <UNK> token). Although the word importance ranking methods are faster than the optimization based methods, they have some major drawbacks: (1) each word is ranked by removing it from the input (or replacing it with an <UNK> token), which alters the semantics of the input during ranking, (2) it is not clear whether the change in the confidence score of the target model is caused by the removal of the word or by the modified input, and (3) this ranking mechanism is inefficient on longer inputs.
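For reference, the deletion-based ranking used by these prior attacks can be sketched as follows; `model` here is a hypothetical stand-in for the black-box target model, returning class probabilities. Note that it is queried once per word, which is exactly the inefficiency discussed above:

```python
def rank_by_deletion(words, model, orig_label):
    """Score each word by the confidence drop when it is masked.
    `model` is an illustrative black-box callable taking a token list
    and returning a dict of class probabilities (one query per word)."""
    base = model(words)[orig_label]
    scores = []
    for i in range(len(words)):
        masked = words[:i] + ["<UNK>"] + words[i + 1:]
        # confidence drop attributed to word i
        scores.append((base - model(masked)[orig_label], words[i], i))
    return sorted(scores, reverse=True)
```

A sentence of n words costs n model queries before a single substitution has even been tried, and the `<UNK>` replacement changes the semantics of every probe.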
In general, there exists a trade-off between the attack success rate and the number of queries: a high query search method generates adversarial examples with a high success rate, and vice versa. Prior attacks are not efficient and do not take into consideration the number of queries made to the target model to generate an attack. Such attacks will fail in real world applications where there is a constraint on the number of queries that can be made to the target model.
To compare a new search method with previous methods, the new search method must be benchmarked on the same search space used by the previous methods. However, a study conducted in (Yoo et al., 2020) has shown that prior attacks often modify the search space while evaluating their search method. This does not ensure a fair comparison between search methods, because it is hard to distinguish whether an increase in attack success rate is due to the improved search method or the modified search space. For example, (Jin et al., 2019) compare their search method with (Alzantot et al., 2018), where the former uses the Universal Sentence Encoder (USE) (Cer et al., 2018) and the latter uses a language model as a constraint. Also, all past works evaluate their search methods on only a single search space. In this paper 1 , we address the above drawbacks through the following contributions: 1. We introduce a novel ranking mechanism that takes significantly fewer queries by jointly leveraging word attention scores and LSH to rank the input words, without altering the semantics of the input.
2. We call for unifying the evaluation setting by benchmarking our search method on the same search space used in the respective baselines. Further, we evaluate the effectiveness of our method by comparing it with four baselines across three different search spaces.
3. On average, our method is 50% faster as it takes 75% fewer queries than the prior attacks while compromising the attack success rate by less than 2%. Further, we demonstrate that our search method has a much higher success rate than the baselines in a limited query setting.
2 Related Work

White Box Attacks
This category requires access to gradient information to generate adversarial attacks. HotFlip (Ebrahimi et al., 2017) flips characters in the input using the gradients of the one-hot input vectors. Liang et al. used gradients to perform insertion, deletion and replacement at the character level. Later, (Li et al., 2018) used the gradient of the loss function with respect to each word to find important words and replaced those with similar words. Following this, (Wallace et al., 2019) added triggers at the start of the input to generate adversarial examples. Although these attacks have a high success rate, they require knowledge of the model parameters and loss function, which is not accessible in real world scenarios.

Black Box Attacks
Existing black box attacks can be classified into combinatorial optimization based attacks and greedy attacks. The attack proposed in (Alzantot et al., 2018) uses a genetic algorithm to search for adversarial examples. Greedy black box attacks generate adversarial attacks by first finding important words, which highly impact the confidence score of the target model, and then replacing those words with synonyms, e.g., from WordNet (Miller, 1995). Although such a ranking mechanism is exhaustive, it is not query efficient. Inspired by this, (Jin et al., 2019) proposed TextFooler, which ranks words based only upon the target model's confidence and replaces them with synonyms from the counter-fitted embedding space (Mrkšić et al., 2016). This ranking mechanism has a low attack success rate and is not exhaustive enough to find adversarial examples with a low perturbation rate.
Some prior works (Garg and Ramakrishnan, 2020;Li et al., 2020b;Maheshwary et al., 2020;Li et al., 2020a) have used masked language models to generate word replacements instead of using synonyms. However, all these methods follow a ranking mechanism similar to TextFooler. Moreover, as shown in (Yoo et al., 2020) most of the black box methods described above do not maintain a consistent search space while comparing their method with other search methods.

Locality Sensitive Hashing
LSH has been used in various NLP applications in the past. Ravichandran et al. used LSH for clustering nouns, and (Van Durme and Lall, 2010) extended it to streaming data. Recently, (Kitaev et al., 2020) used LSH to reduce the computation time of the self attention mechanism. There are many variants of LSH; in this paper we leverage the LSH method proposed in (Charikar, 2002).

Proposed Approach
Given a target model F : X → Y that maps an input text sequence X to a set of class labels Y, our goal is to generate an adversarial text sequence X_ADV that belongs to any class in Y except the original class of X, i.e., F(X) ≠ F(X_ADV). The input X_ADV must be generated by substituting input words with their respective synonyms from a chosen search space. Our search method consists of two steps: (1) Word Ranking, which ranks all words in the input text, and (2) Word Substitution, which substitutes input words with their synonyms in order of rank (step 1).

Word Ranking
Recent studies (Niven and Kao, 2019) have shown evidence that certain words in the input, and their replacements, can highly influence the final prediction of DNNs. Therefore, we score each word based upon (1) how important it is for the final prediction and (2) how much its replacement with a similar word can alter the final prediction of the target model. We use an attention mechanism to select words important for classification and employ LSH to capture the impact of each word's replacement on the prediction of the target model. Figure 1 demonstrates the working of the word ranking step.

Attention based scoring
Given an input X, this step assigns high scores to those influential words which impact the final outcome. The input sequence X = {x_1, x_2, ..., x_n} is passed through a pre-trained attention model F_attn to get an attention score α_i for each word x_i. The scores are computed using Hierarchical Attention Networks (HAN) (Yang et al., 2016) for text classification and the Decompose Attention Model (DA) (Parikh et al., 2016) for entailment. Note that instead of querying the target model once per word, this step scores all words together in a single pass (one inference through the attention model), thus significantly reducing the query count. Unlike prior methods, we do not rank each word by removing it from the input (or replacing it with an <UNK> token), which prevents us from altering the semantics of the input.

LSH based Scoring
This step assigns high scores to those words whose replacement with a synonym will highly influence the final outcome of the model. It scores each word based on the change in the target model's confidence when the word is replaced by a substitute (synonym) word. However, computing the change in confidence for every synonym of every input word requires a significantly large number of queries. Therefore, we employ LSH to solve this problem. LSH is a technique for finding nearest neighbours in high dimensional spaces. It takes an input vector x and computes its hash h(x) such that similar vectors get the same hash with high probability and dissimilar ones do not. LSH differs from cryptographic hash methods in that it aims to maximize collisions of similar items. We leverage the Random Projection Method (RPM) (Charikar, 2002) to compute the hash of each input text.

Random Projection Method (RPM)
Let us assume we have a collection of vectors in an m dimensional space R^m. Select a family of hash functions by randomly generating a spherically symmetric random vector r of unit length in R^m. The hash function h_r(u) is then defined as:

h_r(u) = 1 if r · u ≥ 0, and h_r(u) = 0 if r · u < 0.

Repeat the above process by generating d random unit vectors {r_1, r_2, ..., r_d} in R^m. The final hash ū of each vector u is obtained by concatenating the bits produced by all d vectors.
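The random projection hash can be sketched in a few lines of NumPy (an illustrative implementation under the definitions above, not the authors' code):

```python
import numpy as np

def rpm_hash(u, R):
    """d-bit random projection hash (Charikar, 2002): bit i is 1 iff
    r_i . u >= 0, where r_i is the i-th row of R."""
    return tuple(int(b) for b in (R @ u >= 0))

def random_unit_rows(d, m, seed=0):
    """d spherically symmetric random unit vectors in R^m, as rows."""
    R = np.random.default_rng(seed).standard_normal((d, m))
    return R / np.linalg.norm(R, axis=1, keepdims=True)
```

Two nearby vectors agree on each bit with high probability, so they receive the same d-bit hash and land in the same bucket; a vector and its negation receive complementary hashes.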
The hash of each vector u is represented by a sequence of bits, and two vectors having the same hash are mapped to the same bucket of the hash table. Such a process is very efficient for finding nearest neighbours in high dimensional spaces, as the hash code is generated using only a dot product between two matrices. It is also much easier to implement and simpler to understand than other nearest neighbour methods. We use the above process to score each word as follows: 1. First, an input word x_i is replaced with every synonym from its synonym set S(x_i), producing a set of perturbed sequences. Perturbed sequences not satisfying the search space constraints (Table 1) are filtered.
2. The remaining perturbed inputs are passed through a sentence encoder (USE) which returns a vector representation V j for each perturbed input. Then we use LSH as described above to compute the hash of each vector. The perturbed inputs having the same hash are mapped to same bucket of the hash table.
Here T is the number of perturbed inputs obtained for each word and B = {b_1, b_2, ..., b_K} is the set of buckets of the hash table obtained after LSH, K being the number of buckets.
3. As each bucket contains similar perturbed inputs, a perturbed input is sampled randomly from each bucket and passed to the target model F. The maximum change in the target model's confidence score among these inputs is the score for that index.
Steps 1 to 3 are repeated for all indices in X. LSH maps highly similar perturbed inputs to the same bucket, and since all such inputs are similar, they impact the target model almost equally. Therefore, instead of querying the target model F for every input in a bucket, we sample one input randomly from each bucket and observe its impact on F. This reduces the query count per word from the number of synonyms of the word to the number of buckets obtained after LSH.
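Steps 1 to 3 can be sketched as follows; `embed` (a USE-style sentence encoder) and `query_model` (the black-box target model's confidence for the original class) are illustrative stand-ins:

```python
import numpy as np
from collections import defaultdict

def lsh_word_score(perturbed, embed, R, query_model, base_conf, rng):
    """Score one input position: bucket the embeddings of the perturbed
    texts with random projection LSH, query the target model on ONE
    sampled text per bucket, and keep the maximum confidence drop."""
    buckets = defaultdict(list)
    for text in perturbed:
        bits = tuple(int(b) for b in (R @ embed(text) >= 0))
        buckets[bits].append(text)
    score = 0.0
    for cands in buckets.values():
        sample = cands[rng.integers(len(cands))]   # one query per bucket
        score = max(score, base_conf - query_model(sample))
    return score, len(buckets)                     # score, queries spent
```

With T perturbed texts collapsing into K buckets, the target model is queried K times instead of T, which is the source of the query savings.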

False Negative error rate of LSH
Although LSH is efficient for finding nearest neighbours, there is still a small probability that similar perturbed inputs get mapped to different buckets. To reduce this probability, we conduct multiple rounds of hashing, L = 15, each round with a different family of hash functions, and choose the round with the most collisions, i.e., the round with the fewest buckets. (Charikar, 2002) establishes an upper bound on the false negative error rate of LSH, i.e., the probability that two highly similar vectors are mapped to different buckets. For two vectors u and v at angle θ(u, v), each of the d bits agrees with probability 1 − θ(u, v)/π, so the probability that u and v fall in different buckets in all L rounds is at most

(1 − (1 − θ(u, v)/π)^d)^L

(for more details refer to (Charikar, 2002)). This shows that for the given values of L and d, LSH maps similar vectors to the same bucket with very high probability. As LSH maps similar inputs to the same bucket with high probability, it cuts down the synonym search space drastically, thus reducing the number of queries required for the attack. The dimension of the hash function d used in equation 2 is set to 5 and is the same across all rounds.
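The amplification bound can be checked numerically. The formula below is the standard one for d-bit hyperplane LSH repeated over L independent rounds, assuming two vectors at angle θ agree on a single bit with probability 1 − θ/π:

```python
import math

def miss_prob_bound(theta, d=5, L=15):
    """Upper bound on the probability that two vectors at angle `theta`
    fall in different buckets in every one of the L hashing rounds."""
    p_bit = 1 - theta / math.pi          # per-bit collision probability
    return (1 - p_bit ** d) ** L
```

For near-duplicate perturbed inputs roughly 10 degrees apart, the bound with d = 5 and L = 15 is below 1e-8, so repeated hashing makes false negatives negligible.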

Final Score Calculation
After obtaining the attention score α_i and the synonym-replacement score P_i for each index (calculated using LSH), we multiply the two to get the final score score_i = α_i * P_i for each word. All words are sorted in descending order of score_i. The algorithm for the word ranking step is provided in appendix A.
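The final ranking then reduces to an elementwise product and a sort (a minimal sketch):

```python
def final_ranking(alphas, lsh_scores):
    """score_i = alpha_i * P_i; return (score, index) pairs sorted in
    descending order of score."""
    scores = [(a * p, i) for i, (a, p) in enumerate(zip(alphas, lsh_scores))]
    return sorted(scores, reverse=True)
```

Note that a word needs both a high attention weight and a high replacement impact to rank near the top; either factor alone is not enough.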

Word Substitution
We generate the final adversarial example by perturbing the words in the order retrieved by the word ranking step. For each word x_i in W, we follow these steps.
1. The word x_i is replaced with every synonym from its synonym set to obtain perturbed texts. The perturbed texts which do not satisfy the constraints imposed on the search space (Table 1) are filtered (Algorithm 1, lines 1 − 7).
2. The remaining perturbed sequence(s) are fed to the target model F to get the class label y_new and its corresponding probability score P_new. A perturbed sequence which alters the original class label y_orig is chosen as the final adversarial example X_ADV. If the original label does not change, the perturbed sequence with the minimum P_new is chosen (Algorithm 1, lines 7 − 14).
Steps 1 − 2 are repeated on the chosen perturbed sequence for the next ranked word. Note that we use LSH only in the word ranking step, because ranking requires calculating scores for all words in the input and hence iterating over all possible synonyms of every input word. In the word substitution step, however, we replace only one word at a time and stop as soon as we obtain an adversarial example. As very few substitutions are needed to generate an adversarial example (see perturbation rate in Table 3), the substitution step iterates over far fewer words than the ranking step.

Algorithm 1 Word Substitution
Input: Test sample X, Ranked words W
Output: Adversarial text X_ADV
1:  X_ADV ← X
2:  y_orig, P_orig ← F(X_ADV)
3:  P_best ← P_orig
4:  for (score_i, x_i) in W do
5:      S ← synonym set of x_i
6:      for w_j in S do
7:          X_j ← Replace x_i with w_j in X_ADV
8:          y_new, P_new ← F(X_j)
9:          if y_new ≠ y_orig then
10:             X_ADV ← X_j
11:             return X_ADV
12:         if P_new < P_best then
13:             P_best ← P_new
14:             X_ADV ← X_j
15: return X_ADV
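The greedy substitution loop can be sketched in Python as follows; `model` is a hypothetical black-box callable returning class probabilities, and `synonyms` a precomputed, constraint-filtered synonym map:

```python
def word_substitution(x, ranked, synonyms, model, orig_label):
    """Greedy substitution in rank order: try each synonym, return as
    soon as the predicted label flips, otherwise keep the candidate
    that most lowers the original class probability."""
    x_adv = list(x)
    p_best = model(x_adv)[orig_label]
    for _score, i in ranked:                  # ranked: (score, index) pairs
        for w in synonyms.get(x[i], []):      # synonyms of the ORIGINAL word
            cand = x_adv[:i] + [w] + x_adv[i + 1:]
            probs = model(cand)
            if max(probs, key=probs.get) != orig_label:
                return cand                   # label flipped: attack succeeds
            if probs[orig_label] < p_best:    # keep the best substitution so far
                p_best, x_adv = probs[orig_label], cand
    return None                               # attack failed
```

Because the loop returns at the first label flip, the number of model queries it spends is proportional to the (small) number of substitutions actually tried, not to the full search space.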

Datasets and Target Models
We use IMDB -A document level sentiment classification dataset for movie reviews (Maas et al., 2011) and Yelp Reviews -A restaurant review dataset (Zhang et al., 2015), for classification task.
We use MultiNLI, a natural language inference dataset (Williams et al., 2017), for the entailment task. We attacked WordLSTM (Hochreiter and Schmidhuber, 1997) and BERT-base-uncased (Devlin et al., 2018) to evaluate our attack strategy on the text classification and entailment tasks. For WordLSTM, we used a single-layer bi-directional LSTM with 150 hidden units, a dropout of 0.3 and 200 dimensional GloVe (Pennington et al., 2014) vectors. Additional details are provided in appendix A.

Search Spaces and Baselines
We compare our search method with four baselines across three different search spaces. While comparing our results with each baseline, we use the same search space as used in that baseline's paper. The details of the search spaces are shown in Table 1. PSO: (Zang et al., 2020) uses a particle swarm optimization algorithm as the search method and HowNet (Dong and Dong, 2006) to generate substitute words.

Experimental Settings
In a black box setting, the attacker has no access to the training data of the target model. Therefore, we made sure to train the attention models on a different dataset. For attacking the target model trained on IMDB, we trained our attention model on Yelp Reviews, and vice versa. For entailment, we trained the attention model on SNLI (Bowman et al., 2015) and attacked the target model trained on MNLI. Following (Jin et al., 2019), the target models are attacked on the same 1000 samples, drawn from the test set of each dataset. The same set of samples is used across all baselines when evaluating on a single dataset. For the entailment task we only perturb the premise and leave the hypothesis unchanged. We used spaCy for POS tagging and filtered out stop words using NLTK. We used the Universal Sentence Encoder (Cer et al., 2018) to encode the perturbed inputs when performing LSH. The hyperparameters d and L are tuned on the validation set (10% of each dataset). Additional details regarding hyperparameter tuning and attention models can be found in the appendix.

Evaluation Metrics
We use (1) attack success rate: the ratio of successful attacks to the total number of attacks, (2) query count: the number of queries made to the target model, (3) perturbation rate: the percentage of words substituted in an input, and (4) grammatical correctness: the average grammatical error increase rate (calculated using LanguageTool 2 ) to verify the quality of the generated adversarial examples. For all metrics except attack success rate, the lower the value, the better the result. Also, for all metrics we report the average score across all generated adversarial examples on each dataset. Further, we conducted a human evaluation to assess the quality of the generated adversarial examples.
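For concreteness, aggregating these metrics over per-example attack records might look like the sketch below (the record layout and field names are our own, not the paper's):

```python
def attack_metrics(records):
    """records: (success, queries, words_changed, words_total) per example.
    Success rate is computed over all attempts; query count and
    perturbation rate are averaged over successful attacks."""
    succ = [r for r in records if r[0]]
    n_succ = max(len(succ), 1)                 # avoid division by zero
    return {
        "success_rate": len(succ) / len(records),
        "avg_queries": sum(r[1] for r in succ) / n_succ,
        "avg_perturbation_pct": 100 * sum(r[2] / r[3] for r in succ) / n_succ,
    }
```

Whether failed attacks count toward the query average is a reporting choice; here they are excluded, which is one common convention.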

Results
Tables 2 and 4 show the comparison of our proposed method with each baseline across all evaluation metrics. On average, across all baselines, datasets and target models, we reduce the query count by 75%. The PSO and Genetic attacks take at least 50x and 20x more queries, respectively, than our attack strategy. Also, compared to PWWS and TF, we reduce the query count by at least 65% and 33%, respectively, while compromising the success rate by less than 2.0%. The perturbation rate and grammatical correctness are also within 1% of the best baseline. Compared to PSO and the Genetic attack, we achieve an even lower perturbation rate and grammatical error rate with far fewer queries on some datasets and target models. Similarly, our attack outperforms TextFooler on almost all evaluation metrics. The runtime comparison and anecdotes from the generated adversarial text are provided in the appendix.

Ablation Study
We study the effectiveness of the attention and LSH components in our method through a three-way ablation. We observe the change in success rate, perturbation rate and queries when both or either one of the two ranking components is removed.
No LSH and no attention: First, we remove both the attention and LSH scoring steps and rank the words in random order. Table 4 shows the results obtained on BERT across all three datasets. On average, the attack success rate drops by 7% and the perturbation rate increases drastically by 6%. This shows that although the query count reduces, substituting words in random order degrades the quality of the generated adversarial examples and is not effective for attacking target models. Attention and no LSH: We remove the LSH component of our ranking step and rank words based only upon the attention scores obtained from the attention model. Table 4 shows the results on BERT across all datasets. On average, the attack success rate drops by 2.5%, the perturbation rate increases by 3% and the query count increases by 37%. Therefore, LSH reduces queries significantly by eliminating near duplicates in the search space.
LSH and no attention: We remove the attention component and rank words using only LSH. Results in Table 4 show that, on average, without attention scoring the attack success rate drops by 2%, the perturbation rate increases by 0.5% and the query count increases by 20%. Therefore, attention is important as it not only reduces queries but also enables the ranking method to prioritize words important to the target model's prediction.
LSH and attention: Finally, Table 4 shows that using both LSH and attention in our ranking, our attack achieves a much better success rate and a lower perturbation rate with far fewer queries. This shows that both components are necessary to do well across all evaluation metrics. We obtained similar results on LSTM when evaluating across different datasets and search spaces.
6 Quantitative Analysis

Limited Query Setting
In this setting, the attacker has a fixed query budget L and must generate an attack in at most L queries. To demonstrate the efficacy of our attack under this constraint, we vary the query budget L and observe the attack success rate on BERT and LSTM across the IMDB and Yelp datasets. We vary the query budget from 0 to 2500 and observe how many adversarial examples can be generated successfully on a test set of 500 samples. We keep the search space (that used in PWWS) the same across all search methods. The results in Figure 7 show that with a query budget of 1000, our approach generates at least 200 (44.4%) more adversarial samples against both BERT and LSTM on IMDB compared to the best baseline. Similarly, on Yelp our method generates at least 100 (25%) more adversarial samples on BERT and LSTM compared to the best baseline. This analysis shows that our attack has a much higher success rate in a limited query setting, making it extremely useful for real world applications.

Input Length
To demonstrate how our strategy scales with input length (number of words in the input) compared to the other baselines, we attacked BERT on Yelp. We selected inputs with 10 to 250 words and observed the number of queries taken by each attack method. Results in Figure 3 show that our attack takes the fewest queries across all input lengths. Further, our attack scales much better to longer inputs (> 250 words), as it is 2x faster than PWWS and TextFooler, 13x faster than the Genetic attack and 133x faster than PSO.

Transferability
An adversarial example is said to be transferable if it is generated against one particular target model but is able to fool other target models as well. We evaluated transferability on the IMDB and MNLI datasets across two target models. The results are shown in Table 5. Our transferred examples dropped the accuracy of the other target models by 16% on average.

Adversarial Training
We randomly sample 10% of the training data of MNLI and IMDB and generate adversarial examples using our proposed strategy. We augmented the training data with the generated adversarial examples and re-trained BERT on the IMDB and MNLI tasks. We then attacked BERT again with our proposed strategy and observed the changes. The results in Figure 4 show that as we add more adversarial examples to the training set, the model becomes more robust to attacks. The after-attack accuracy and perturbation rate increased by 35% and 17%, respectively, and the attack required more queries.
7 Qualitative Analysis

Human Evaluation
We also verified the quality of the generated adversarial samples via human evaluation. We randomly sampled 25% of the original instances and their corresponding adversarial examples generated on BERT for the IMDB and MNLI datasets on the PWWS search space. The actual class labels of the adversarial examples were kept hidden, and the human judges were asked to classify them. The judges also evaluated each sample for semantic similarity, assigning a score of 0, 0.5 or 1 based on how well the adversarial example retained the meaning of its original counterpart, and scored each example from 1 to 5 for grammatical correctness. Each adversarial example was evaluated by 3 human evaluators and the scores were averaged. The outcome is in Table 6.

Future Work
Our proposed attack provides a strong baseline for more query efficient black box attacks. The existing word level scoring methods can be extended to the sentence level. Also, the attention scoring model can be trained on different datasets to observe how the success rate and query efficiency are affected. Furthermore, existing attack methods can be evaluated against various defense methods to compare their effectiveness.

Acknowledgement
We would like to thank all the reviewers for their critical insights and positive feedback. We would also like to thank Riyaz Ahmed Bhat for the valuable discussions which strengthened our paper and helped us respond to the reviewers.

A Appendix
Algorithm 2 Word Ranking
Input: Test sample X
Output: W containing the score of each word x_i
1:  F_attn ← HAN() or DA()
2:  α ← F_attn(X)
3:  for x_i in X do
4:      for w_j in S(x_i) do
5:          X_j ← Replace x_i with w_j; filter by the search space constraints
6:      V ← USE embeddings of the remaining perturbed inputs
7:      B = {b_1, ..., b_K} ← LSH buckets of V (best of L rounds)
8:      for k = 1 to K do
9:          sample an input from b_k and query F
10:     P_i ← maximum change in confidence over the K samples
11:     score_i ← α_i · P_i
12: W ← words sorted by score_i in descending order

A.2 Hyperparameter Study
Figures 5 and 6 show the variation of the attack success rate and the queries taken to attack the target model as d increases. As d increases, the number of collisions decreases and therefore the number of buckets K increases, which increases the overall query count. Also, as d increases, the success rate first increases and then remains unchanged. We therefore use d = 5, because beyond that the success rate is almost the same while the query count increases drastically. Figures 7a and 7b show the variation of the attack success rate and the queries taken as the number of hashing rounds L increases. Conducting multiple rounds of hashing reduces the probability that similar perturbed inputs are mapped to different buckets. We choose L = 15, as beyond it the attack success rate and queries remain almost unchanged. The values of d and L are kept the same across all datasets and target models.

Examples Prediction
The movie has an excellent screenplay (the situation is credible, the action has pace), first-class [fantabulous] direction and acting (especially the 3 leading actors but the others as well -including the mobster, who does not seem to be a professional actor). I wish [want] the movie, the director and the actors success.

Positive → Negative
Let me start by saying I don't recall laughing once during this comedy. From the opening scene, our protagonist Solo (Giovanni Ribisi) shows himself to be a self-absorbed, feeble, and neurotic loser completely unable to cope with the smallest responsibilities such as balancing a checkbook, keeping his word, or forming a coherent thought. I guess we're supposed to be drawn to his fragile vulnerability and cheer him on through the process of clawing his way out of a deep depression. I actually wanted [treasured] him to get his kneecaps busted at one point. The dog was not a character in the film. It was simply a prop to be used, neglected, scorned, abused, coveted and disposed of on a whim. So be warned.

Negative → Positive
Local-international gathering [assembly] spot [stain] since the 1940s. One of the coolest pubs on the planet. Make new friends from all over the world, with some of the best [skilful] regional and imported beer selections in town.

Positive → Negative
This film is strange, even for a silent movie. Essentially, it follows the adventures of an engineer in post-revolutionary Russia who daydreams about going to Mars. In this movie, it seems like the producers KNOW the Communists have truly screwed up the country, but also seem to want to make it look like they've accomplished something good. Then we get to the "Martian" scenes, where everyone on Mars wears goofy hats. They have a revolution after being inspired by the Earth Men, but are quickly betrayed by the Queen who sides with them. Except it's all a dream, or is it. (And given that the Russian Revolution eventually led to the Stalin dictatorship, it makes you wonder if it was all allegory.) Now [Nowdays], I've seen GOOD Russian cinema. For instance, Eisenstein's Battleship Potemkin is a good movie. This is just, well, silly.

The examples above demonstrate adversarial examples generated after attacking BERT on the classification task. The original word is highlighted in green and the substituted word is shown in square brackets, colored red. Prediction shows the before and after labels, marked green and red respectively.