Are Synonym Substitution Attacks Really Synonym Substitution Attacks?

In this paper, we explore the following question: Are synonym substitution attacks really synonym substitution attacks (SSAs)? We approach this question by examining how SSAs replace words in the original sentence and show that there are still unresolved obstacles that make current SSAs generate invalid adversarial samples. We reveal that four widely used word substitution methods generate a large fraction of invalid substitution words that are ungrammatical or do not preserve the original sentence's semantics. Next, we show that the semantic and grammatical constraints used in SSAs for detecting invalid word replacements are highly insufficient for detecting invalid adversarial samples.


Introduction
Deep learning-based natural language processing models have been extensively used across many tasks and domains and have shown strong performance. However, these models appear to be astonishingly vulnerable: their predictions can be misled by small perturbations of the original input (Gao et al., 2018; Tan et al., 2020). These imperceptible perturbations, while not changing humans' predictions, can make a well-trained model behave worse than random.
One important type of adversarial attack in natural language processing (NLP) is the synonym substitution attack (SSA). In SSAs, an adversarial sample is constructed by substituting some words in the original sentence with their synonyms (Alzantot et al., 2018; Ren et al., 2019; Garg and Ramakrishnan, 2020; Jin et al., 2020; Li et al., 2020; Maheshwary et al., 2021). This is meant to ensure that the adversarial sample is semantically similar to the original sentence, thus fulfilling the imperceptibility requirement of a valid adversarial sample. While substituting words with semantically related counterparts can retain the semantics of the original sentence, these attacks often add constraints to further guarantee that the generated adversarial samples are grammatically correct and semantically similar to the original sentence. These SSAs have all been shown to successfully degrade the performance of well-trained text classifiers.
However, some recent works observe, through human evaluations, that the quality of the adversarial samples generated by those SSAs is fairly low and that the perturbations are highly perceptible to humans (Morris et al., 2020a; Hauser et al., 2021). These adversarial samples often contain grammatical errors and do not preserve the semantics of the original samples, making them difficult to understand. Such characteristics violate the fundamental criteria of a valid adversarial sample: preserving semantics and being imperceptible to humans. This motivates us to investigate what causes those SSAs to generate invalid adversarial samples. Only by answering this question can we move on to design more realistic SSAs in the future.
In this paper, we set out to answer the following question: Are synonym substitution attacks in the literature really synonym substitution attacks? We explore the answer by scrutinizing the key components of several important SSAs and why they fail to generate valid adversarial samples. Specifically, we conduct a detailed analysis of how the word substitution sets are obtained in SSAs, and we look into the semantic and grammatical constraints used to filter invalid adversarial samples. We make the following striking observations:
• When substituting words by WordNet synonym sets, current methods neglect the word sense differences within the substitution set. (Section 3.1)
• When using the counter-fitted GloVe embedding space or BERT to generate the substitution set, the substitution set contains only a tiny fraction of synonyms. (Section 3.2)
• Using word embedding cosine similarity or sentence embedding cosine similarity to filter words in the substitution set does not necessarily exclude semantically invalid word substitutions. (Section 4.1 and Section 4.2)
• The grammar checker used for filtering ungrammatical adversarial samples fails to detect most erroneous verb inflectional forms in a sentence. (Section 4.3)

Background
In this section, we provide an overview of SSAs and introduce some related notations that will be used throughout the paper.

Synonym Substitution Attacks (SSAs)
Given a victim text classifier trained on a dataset D train and a clean testing sample x ori drawn from the same distribution as D train , where x ori = {x 1 , • • • , x T } is a sequence of T tokens, an SSA attacks the victim model by constructing an adversarial sample x adv = {x ′ 1 , • • • , x ′ T } that swaps words in x ori with semantically related counterparts. For x adv to be considered a valid adversarial sample of x ori , a few requirements must be met (Morris et al., 2020a): (0) x adv should make the model yield a wrong prediction while the model correctly classifies x ori . (1) x adv should be semantically similar to x ori . (2) x adv should not introduce new grammar errors compared with x ori . (3) The word-level overlap between x adv and x ori should be high enough. (4) The modifications made in x adv should be natural and non-suspicious. In this paper, we refer to adversarial samples that fail to meet the above criteria as invalid adversarial samples.
SSAs rely on heuristic procedures to ensure that x adv satisfies the preceding requirements. Here, we describe a canonical pipeline for generating x adv from x ori (Morris et al., 2020b). Given a clean testing sample x ori that the text classifier correctly predicts, an SSA first generates a candidate word substitution set S x i for each word x i . The process of generating the candidate set S x i is called a transformation. Next, the SSA determines which word in x ori should be substituted first, which word should be swapped next, and so on. After the word substitution order is decided, the SSA iteratively substitutes each word x i in x ori with candidate words from S x i according to the predetermined order. In each substitution step, an x i is replaced by a word in S x i , producing a new x swap . When an x swap is obtained, some constraints are used to verify its validity. The iterative word substitution process ends when the model's prediction is successfully flipped by a substituted sentence that satisfies the constraints, eventually yielding the desired x adv .
Clearly, the transformations and the constraints are critical to the quality of the final x adv . In the remainder of the paper, we look deeper into the transformations and constraints used in SSAs and their role in creating adversarial samples. Next, we briefly introduce the transformations and constraints that have been used in SSAs.
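The canonical pipeline above can be sketched in a few lines. The classifier, substitution dictionary, and constraint below are toy stand-ins (entirely hypothetical) used only to make the control flow concrete; real SSAs rank words by importance and use the transformations and constraints introduced next.

```python
def toy_classifier(tokens):
    # Stand-in classifier: predicts "pos" iff the word "good" appears.
    return "pos" if "good" in tokens else "neg"

TOY_SUBSTITUTIONS = {"good": ["great", "commendable"], "movie": ["film"]}

def transformation(word):
    # Generate the candidate substitution set S_{x_i} for one word.
    return TOY_SUBSTITUTIONS.get(word, [])

def constraint(x_ori, x_swap):
    # Stand-in validity check: require at least 50% word-level overlap.
    overlap = sum(a == b for a, b in zip(x_ori, x_swap))
    return overlap / len(x_ori) >= 0.5

def ssa_attack(x_ori):
    y_ori = toy_classifier(x_ori)
    x_swap = list(x_ori)
    # A real SSA first decides a word substitution order; here we simply
    # go left to right.
    for i, word in enumerate(x_ori):
        for cand in transformation(word):
            trial = x_swap[:i] + [cand] + x_swap[i + 1:]
            if not constraint(x_ori, trial):
                continue  # reject swaps that violate the constraint
            if toy_classifier(trial) != y_ori:
                return trial  # x_adv: the prediction was flipped
    return None  # attack failed

print(ssa_attack(["a", "good", "movie"]))
```

Running the attack on `["a", "good", "movie"]` returns a perturbed sentence that flips the toy classifier's prediction while passing the overlap constraint.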

Transformations
Transformation is the process of generating the substitution set S x i for a word x i in x ori . There are four representative transformations in the literature.
WordNet Synonym Transformation constructs S x i by querying a word's synonyms using WordNet (Miller, 1995; University, 2010), a lexical database containing word sense definitions, synonyms, and antonyms of English words. This transformation is used in PWWS (Ren et al., 2019) and LexicalAT (Xu et al., 2019).
Word Embedding Space Nearest Neighbor Transformation constructs S x i by looking up the word embedding of x i in a word embedding space and finding its k nearest neighbors (kNN). Using kNN for word substitution is based on the assumption that semantically related words lie closer together in the embedding space. The counter-fitted GloVe embedding space (Mrkšić et al., 2016) is obtained by post-processing the GloVe embedding space (Pennington et al., 2014); counter-fitting refers to pulling antonyms apart and narrowing the distance between synonyms. This transformation is adopted in TextFooler (Jin et al., 2020), the Genetic Algorithm Attack (Alzantot et al., 2018), and TextFooler-Adj (Morris et al., 2020a).
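The kNN lookup can be sketched as follows. The 3-dimensional vectors below are invented for illustration (real counter-fitted GloVe vectors are 300-dimensional); only the cosine-similarity ranking matters.

```python
import math

# Toy "embedding space": hypothetical vectors, not real GloVe values.
EMB = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.85, 0.15, 0.05],
    "nice":  [0.8, 0.2, 0.1],
    "bad":   [-0.9, 0.1, 0.0],
    "table": [0.0, 0.1, 0.95],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_substitutions(word, k=2):
    # Candidate set S_{x_i}: the k nearest neighbors by cosine similarity.
    others = [(w, cosine(EMB[word], v)) for w, v in EMB.items() if w != word]
    others.sort(key=lambda p: p[1], reverse=True)
    return [w for w, _ in others[:k]]

print(knn_substitutions("good"))  # nearest neighbors of "good"
```

In this toy space, "bad" has low cosine similarity with "good" (the effect counter-fitting is designed to produce for antonyms), so it is not retained among the top neighbors.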
Masked Language Model (MLM) Mask-Infilling Transformation constructs S x i by masking x i in x ori and asking an MLM to predict the masked token; the MLM's top-k predictions for the masked token form the word substitution set of x i . Widely adopted MLMs include BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Using MLM mask-infilling to generate a candidate set relies on the belief that MLMs can generate fluent and semantically consistent substitutions for x ori . This method is used in BERT-ATTACK (Li et al., 2020) and CLARE (Li et al., 2021).
MLM Reconstruction Transformation also uses MLMs. To generate the candidate set, one simply feeds the MLM the original sentence x ori without masking any tokens. Here, the MLM is not performing mask-infilling but reconstructing the input tokens from the unmasked inputs. For each word x i , one takes its top-k token reconstruction predictions as the candidates. This transformation relies on the intuition that reconstruction can generate more semantically similar words than mask-infilling. This method is used in BAE (Garg and Ramakrishnan, 2020).
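The difference between the two MLM-based transformations is only in what the model is shown at position i. The sketch below uses a stub in place of a real MLM; the stub's score tables are invented purely to illustrate the typical bias, namely that reconstruction, which sees x_i itself, tends to rank x_i and close variants at the top.

```python
def stub_mlm(tokens):
    # Hypothetical per-position token->score distributions standing in for
    # an MLM's output; the numbers are invented for illustration.
    scores = {
        "[MASK]": {"film": 0.4, "show": 0.3, "movie": 0.2, "worst": 0.1},
        "movie":  {"movie": 0.9, "film": 0.08, "flick": 0.02},
    }
    return [scores.get(t, {t: 1.0}) for t in tokens]

def topk(dist, k):
    return [w for w, _ in sorted(dist.items(), key=lambda p: -p[1])[:k]]

def mask_infilling_candidates(tokens, i, k=3):
    # Mask position i, then take the MLM's top-k predictions there.
    masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    return topk(stub_mlm(masked)[i], k)

def reconstruction_candidates(tokens, i, k=3):
    # No masking: the model sees x_i itself, biasing predictions toward it.
    return topk(stub_mlm(tokens)[i], k)

sent = ["a", "good", "movie"]
print(mask_infilling_candidates(sent, 2))   # candidates without seeing "movie"
print(reconstruction_candidates(sent, 2))   # candidates conditioned on "movie"
```

With a real MLM one would replace `stub_mlm` with, e.g., a BERT forward pass and a softmax over the vocabulary at position i.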

Constraints
When an x ori is perturbed by swapping some of its words, constraints are used to check whether the perturbed sentence, x swap , is semantically and grammatically valid. We write x swap instead of x adv here because x swap does not necessarily flip the model's prediction and is thus not necessarily an adversarial sample.
Word Embedding Cosine Similarity requires a word x i and its perturbed counterpart x ′ i to be close enough in the counter-fitted GloVe embedding space, in terms of cosine similarity. A substitution is valid if the cosine similarity between its word embedding and the original word's embedding is higher than a pre-defined threshold. This is used in the Genetic Algorithm Attack (Alzantot et al., 2018) and TextFooler (Jin et al., 2020).
Sentence Embedding Cosine Similarity demands that the sentence embedding cosine similarity between x swap and x ori be higher than a pre-defined threshold. Most previous works (Jin et al., 2020; Li et al., 2020; Garg and Ramakrishnan, 2020; Morris et al., 2020a) use the Universal Sentence Encoder (USE) (Cer et al., 2018) as the sentence encoder; A2T (Yoo and Qi, 2021) uses a DistilBERT (Sanh et al., 2019) fine-tuned on STS-B (Cer et al., 2017).
In some previous work (Li et al., 2020), the sentence embeddings are computed from the whole sentences x ori and x swap . But most previous works (Jin et al., 2020; Garg and Ramakrishnan, 2020) only extract a context window around the currently swapped word in x ori and x swap to compute the sentence embeddings. For example, if x i is substituted in the current step, one computes the sentence embedding similarity between x ori [i − w : i + w + 1] and x swap [i − w : i + w + 1], where w determines the window size; w is set to 7 in Jin et al. (2020) and Garg and Ramakrishnan (2020).
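The window extraction itself is a one-liner, with one subtlety: the left boundary must be clipped at 0, since a naive Python slice with a negative start would wrap around to the end of the sentence. A minimal sketch:

```python
def windowed_context(tokens, i, w=7):
    # Extract tokens[i-w : i+w+1] around the swapped position i, clipping
    # the left boundary at 0 (a negative slice start would wrap around).
    return tokens[max(0, i - w): i + w + 1]

sent = [f"w{j}" for j in range(20)]
print(windowed_context(sent, 10, w=7))  # 15 tokens around position 10
print(windowed_context(sent, 2, w=7))   # clipped at the left boundary
```

The two windowed subsentences (one from x ori, one from x swap) are then fed to the sentence encoder instead of the full sentences.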
LanguageTool (language-tool python, 2022) is an open-source grammar tool that detects spelling errors and grammar mistakes in an input sentence. It is used in TextFooler-Adj (Morris et al., 2020a) to evaluate the grammaticality of adversarial samples.

Problems with the Transformations in SSAs
In this section, we show that the transformations introduced in Section 2.2 are largely to blame for the invalid adversarial samples in SSAs: the substitution set S x i for x i is mostly invalid, either semantically or grammatically.

WordNet Synonym Substitution Set Ignores Word Senses
In WordNet, each word is associated with one or more word senses, and each word sense has its own synonym set. Thus, the substitution set S x i proposed by WordNet is the union of the synonym sets of the different senses of x i . When swapping x i with a synonym from WordNet, it would be more sensible to first identify the word sense of x i in x ori and use the synonym set of that very sense as the substitution set. However, current attacks using WordNet synonym substitution neglect the sense differences within the substitution set (Ren et al., 2019), which may result in adversarial samples that semantically deviate from the original input. As a working example, consider a movie review that reads "I highly recommend it". The word "recommend" here corresponds to the word sense of "express a good opinion of" according to WordNet and has the synonym set {recommend, commend}. Aside from the above word sense, "recommend" also has another word sense, "push for something", as in "The travel agent recommends not to travel amid the pandemic". This second word sense has the synonym set {recommend, urge, advocate} 2 . Apparently, the only valid substitution is "commend", which preserves the semantics of the original movie review. While "urge" is a synonym of "recommend", it obviously does not fit the context and should not be considered a possible substitution. We call substituting x i with a synonym that matches the word sense of x i in x ori a matched sense substitution, and we use mismatched sense substitution to refer to swapping a word with a synonym that belongs to the synonym set of a different word sense.
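The distinction can be made concrete with a small sketch. The sense inventory below is hard-coded to mirror the "recommend" example above (a real implementation would query WordNet, e.g. via `nltk.corpus.wordnet`, so the structure here is a simplified, hypothetical stand-in):

```python
# Toy sense inventory mirroring the "recommend" example in the text.
SENSES = {
    "recommend": {
        "express a good opinion of": {"recommend", "commend"},
        "push for something": {"recommend", "urge", "advocate"},
    },
}

def all_synonyms(word):
    # What current attacks effectively use: the union over ALL senses
    # (sense-blind), excluding the word itself.
    return set().union(*SENSES[word].values()) - {word}

def matched_sense_synonyms(word, sense):
    # What a sense-aware attack should use: synonyms of the one sense
    # the word carries in context.
    return SENSES[word][sense] - {word}

print(all_synonyms("recommend"))                              # includes "urge"
print(matched_sense_synonyms("recommend", "express a good opinion of"))
```

The sense-blind set contains "urge" and "advocate", which do not fit "I highly recommend it"; only the matched-sense set is restricted to "commend".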

Experiments
To illustrate that mismatched sense substitution is a problem in practical attack algorithms, we conduct the following analysis. We examine the adversarial samples generated by PWWS (Ren et al., 2019), which substitutes words using the WordNet synonym set. We use a benchmark dataset (Yoo et al., 2022) that contains the adversarial samples generated by PWWS against a BERT-based classifier fine-tuned on AG-News (Zhang et al., 2015). AG-News is a news topic classification dataset that aims to classify a piece of news into four categories: world, sports, business, and sci/tech. The attack success rate on the testing set of 7.6K samples is 57.25%. More statistics about the datasets can be found in Appendix B. We categorize the words replaced by PWWS into three disjoint categories: matched sense substitution, mismatched sense substitution, and morphological substitution. The last category, morphological substitution, refers to substituting a word with one that differs only in inflectional morphemes 3 or derivational morphemes 4 . We specifically isolate morphological substitution since it is hard to categorize into either matched or mismatched sense substitution.
The detailed procedure for categorizing a replaced word's substitution type is as follows. Given a pair (x ori , x adv ), we first use NLTK (Bird et al., 2009) to perform word sense disambiguation on each word x i in x ori . We use LemmInflect and NLTK to generate the morphological substitution set ML x i of x i . The matched sense substitution set M x i is constructed from the WordNet synonym set of the word sense of x i in x ori ; since this synonym set includes the original word x i and may also include some words in ML x i , we remove x i and the words already included in ML x i from the synonym set, forming the final matched sense substitution set M x i . The mismatched sense substitution set MM x i is constructed by first collecting, using WordNet, all synonyms of x i that belong to word senses other than that of x i in x ori , and then removing all words already included in ML x i and M x i .

2 The word senses and synonyms are from WordNet.

3 Inflectional morphemes are suffixes that change a grammatical property of a word, such as a verb's tense or a noun's number, without creating a new word. For example, recommends→recommend.

4 Derivational morphemes are affixes that change the form of a word and create a new word, such as changing a verb into a noun. For example, recommend→recommendation.
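The set construction can be sketched as follows. The lookups that would come from NLTK, LemmInflect, and WordNet are replaced by hard-coded lists for the "recommend" example, so every input below is a hypothetical stand-in; only the removal logic mirrors the procedure described above.

```python
def build_substitution_sets(word, morphological, matched_raw, mismatched_raw):
    # ML: morphological variants (from LemmInflect/NLTK in the real pipeline).
    ML = set(morphological)
    # M: synonyms of the disambiguated sense, minus the word itself and
    # anything already counted as morphological.
    M = set(matched_raw) - {word} - ML
    # MM: synonyms of the other senses, minus everything seen so far.
    MM = set(mismatched_raw) - {word} - ML - M
    return ML, M, MM

ML, M, MM = build_substitution_sets(
    "recommend",
    morphological=["recommends", "recommended", "recommendation"],
    matched_raw=["recommend", "commend"],
    mismatched_raw=["recommend", "urge", "advocate", "commend"],
)
print(ML, M, MM)
```

Because the three sets are built with successive removals, they are disjoint by construction, which is what allows each replaced word to be assigned exactly one substitution type.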
After inspecting 4140 adversarial samples produced by PWWS, we find that among the 26600 words swapped by PWWS, only 5398 (20.2%) fall into the category of matched sense substitution. A majority of 20055 (75.4%) word substitutions are mismatched sense substitutions, which should be considered invalid since mismatched sense substitution cannot preserve the semantics of x ori and makes x adv incomprehensible. Last, about 3.8% of words are substituted with morphologically related words, such as converting the part of speech (POS) from verb to noun or changing the verb tense. These substitutions, while maintaining the semantics of the original sentence and perhaps remaining human-readable, are mostly ungrammatical and lead to unnatural adversarial samples. These statistics show that only about 20% of the word substitutions produced by PWWS are real synonym substitutions; the high attack success rate of 57.25% should thus not be surprising, since most word replacements are highly questionable.

Experiments
To understand what those substitution sets look like, we conduct the following experiment. We use the benchmark dataset generated by Yoo et al. (2022) that attacks 7.6K samples in the AG-News testing data using TextFooler. For each word x i in x ori that is perturbed into another word x ′ i in x adv , we use the following three transformations to obtain the candidate substitution set: counter-fitted GloVe embedding space, BERT mask-infilling, and BERT reconstruction. We only consider the substitution sets of words x i that are perturbed in x adv because not all words in x ori will be perturbed by an SSA, and it is thus more reasonable to consider only the words that are actually perturbed. We set the k in the kNN of the counter-fitted GloVe embedding space transformation and in the top-k predictions of BERT mask-infilling/reconstruction to 30, a reasonable number compared with many previous works.
We categorize the candidate words into five disjoint word substitution types. Aside from the three word substitution types discussed in Section 3.1.1, we include two other substitution types. The first is antonym substitution, obtained by querying the antonyms of a word x i using WordNet. Unlike synonym substitutions, we do not separate antonyms into those that match the word sense of x i in x ori and sense-mismatched ones, since neither should be considered a valid swap in SSAs. The other substitution type is others, which simply consists of the candidate words not falling into the categories of synonyms, antonyms, or morphological substitutions.
In Table 1, we show how the different substitution types make up, on average, the 30 words in the candidate set for each transformation. It is easy to see that only a small proportion of the substitution set consists of synonym substitutions for all three transformation methods; counter-fitted GloVe embedding substitution contains the most synonyms among the three, but still only about one word on average. Moreover, the synonym substitutions are mostly mismatched sense substitutions. When using BERT mask-infilling as the transformation, there are only 0.08 matched sense substitutions in the top 30 predictions. When using BERT reconstruction to produce the candidate set, the number of matched sense substitutions increases slightly compared with mask-infilling, but still accounts for less than one word in the top-30 reconstruction predictions of BERT. Within the substitution set, there is on average about one word that is a morphological substitution of the original word. Surprisingly, when using MLM mask-infilling or reconstruction as the transformation, there is a slight chance that the candidate set contains antonyms of the original word. It is highly doubtful whether the semantics is preserved when words in the original sentence are swapped with antonyms.
The vast majority of the substitution set is composed of words that do not fall into the previous four categories. We provide examples of the substitution sets proposed by the different transformations in Table 6 in the Appendix, showing that the candidate words in the others substitution type are mostly unrelated words that should not be used for word replacement. It is understandable that words falling into the others substitution type are invalid candidates: the core of SSAs is to replace words with semantically close counterparts to preserve the semantics of the original sentence, and if a substitution word does not belong to the synonym set proposed by WordNet, it is unlikely that swapping the original word with this word can preserve the semantics of x ori . We also show some randomly selected adversarial samples generated by different SSAs using different transformations in Table 5 in the Appendix, which likewise shows that when a word substitution is neither a synonym nor a morphological swap, there is a high chance that it is semantically invalid. Hauser et al. (2021) use human evaluation to show that the adversarial samples generated by TextFooler, BERT-Attack, and BAE do not preserve the meaning of x ori , which also backs up our statement.
Decreasing k may reduce the number of invalid substitution words. However, a smaller k often leads to lower attack success rates, as shown in Li et al. (2020), so it is not common to use a smaller k to ensure the validity of the words in the candidate sets. In practical attacks, whether the words in the candidate sets can be considered valid depends on the constraints. But can those constraints really filter out invalid substitutions? We show in the next section that, sadly, the answer is no.

Problems with the Constraints in SSAs
In this section, we show that the constraints commonly used in SSAs cannot fully filter invalid word substitutions proposed by the transformations.

Word Embedding Similarity Cannot Distinguish Valid/Invalid Swaps Well
Setting a threshold on word embedding cosine similarity to filter invalid word substitutions relies on the hypothesis that valid word swaps indeed have higher cosine similarity with the word to be substituted than invalid word replacements do. We investigate whether this hypothesis holds with the following experiment. We reuse the 7.6K AG-News testing samples attacked by TextFooler from Section 3.2 and gather all pairs (x ori , x adv ). For each word x i in x ori that is perturbed in x adv , we follow the same procedure as in Section 3.2 to obtain the morphological substitution set, matched sense substitution set, mismatched sense substitution set, and antonym set.
We then query the counter-fitted GloVe embedding space to obtain the word embeddings of all those words and calculate their cosine similarity with the word embedding of x i . As a random baseline, we also randomly sample high-frequency and low-frequency words from the AG-News training set and compute the cosine similarity between those words and x i . How these high-frequency and low-frequency words are sampled is detailed in Appendix D.2.
To quantify how hard it is to use word embedding cosine similarity to distinguish a valid substitution (a matched sense substitution) from a given type of invalid substitution, we calculate the area under the precision-recall curve (AUPR) of a threshold-based detector that predicts whether a perturbed x ′ i is a valid substitution based on its cosine similarity with x i . Given an x i and a perturbed x ′ i , the threshold-based detector measures the word embedding cosine similarity between x i and x ′ i and labels the swap as positive (a valid substitution) if the cosine similarity is higher than the threshold. A perfect detector has an AUPR of 1.0, while a random detector's AUPR equals the fraction of positive samples (0.5 when the two types are balanced). Note that the detector discussed here is only presented with two types of substitution: matched sense substitutions and one other substitution type.
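The threshold-based detector and its AUPR can be sketched as follows, with AUPR computed as average precision (sweeping the threshold over the sorted scores). The cosine-similarity scores below are invented purely for illustration.

```python
def average_precision(scores, labels):
    # AUPR as average precision: rank swaps by cosine similarity and
    # accumulate precision at each positive. 1 = valid (matched sense),
    # 0 = invalid substitution.
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    ap = 0.0
    n_pos = sum(labels)
    for _, label in ranked:
        if label == 1:
            tp += 1
            ap += tp / (tp + fp)  # precision at this recall point
        else:
            fp += 1
    return ap / n_pos

# Invented scores: matched-sense swaps (label 1) vs. mismatched (label 0).
scores = [0.92, 0.88, 0.80, 0.90, 0.85, 0.79]
labels = [1,    1,    1,    0,    0,    0]
print(average_precision(scores, labels))
```

When the two score distributions interleave, as in this toy example, the AUPR falls well below 1.0 even though every positive score exceeds some negatives, which is exactly the regime Table 2 reports for matched versus mismatched sense substitutions.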
We show the AUPR in Table 2. First, we notice that when using word embedding cosine similarity to distinguish matched sense substitutions from mismatched ones, the AUPR is as low as 0.627. While this is better than random, it is far from a useful detector, showing that word embedding cosine similarity constraints cannot effectively remove invalid substitutions such as mismatched sense words. The AUPR for morphological substitutions is even lower than 0.5, implying that the word embedding cosine similarity between x i and its morphologically similar words is higher than the similarity scores of matched sense synonyms. This means that setting a higher cosine similarity threshold keeps more morphological swaps instead of valid matched sense substitutions. While morphological substitutions have meanings similar or related to the original word, as we argued previously, they are mostly ungrammatical.
The AUPR when using a threshold-based detector to separate matched sense substitutions from antonym substitutions is almost perfect, at 0.980. This should not be surprising, since the counter-fitted word embedding space is designed to make synonyms and antonyms have dissimilar embeddings. Last, the AUPR for separating random substitutions from matched sense substitutions is also high, meaning that it is possible to remove random, unrelated substitutions based on word embedding cosine similarity. Based on the results in Table 2, setting a threshold on word embedding cosine similarity may filter out antonyms and random substitutions but still fails to remove the other types of invalid substitutions.

Sentence Encoder Is Insensitive to Invalid Word Substitutions
To test whether sentence encoders can really filter invalid word substitutions in SSAs, we conduct the following experiment. We use the same attacked AG-News samples as in Section 3.2.1.
For each pair (x ori , x adv ) in that dataset, we first collect the swapped index set I = {i | x i ̸ = x ′ i } that represents the positions of the swapped words in x adv . We shuffle the elements of I to form an ordered list O. Using x ori and O, we construct a sentence x n swap by swapping n words in x ori . The n positions where the substitutions are made in x n swap are the first n elements of the ordered list O; at each substitution position, the word is replaced by a word randomly selected from one type of candidate word set. All n replaced words in x n swap are of the same word substitution type. We conduct experiments with six types of candidate word substitution sets: matched sense, mismatched sense, morphological, antonym, random high-frequency, and random low-frequency word substitutions. After obtaining x n swap , we compute the cosine similarity between the sentence embeddings of x n swap and x ori using USE, setting the window size w to 7, following Jin et al. (2020) and Garg and Ramakrishnan (2020). We vary the number of replaced words from 1 to 10. 6 This experiment shows how the cosine similarity changes when words are swapped using different types of candidate word sets. More details on this experiment are in Appendix D.3 and Figure 2 in the Appendix.
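Why a sentence encoder barely reacts to a few swaps can be illustrated with a deliberately crude stand-in encoder: a bag-of-words count vector. This is not USE, and the sentence and swap choices below are invented, but it shows the underlying geometry: when two sentences share almost all tokens, their vectors are nearly parallel regardless of what the differing tokens mean.

```python
import math
from collections import Counter

def encode(tokens):
    # Stub "sentence encoder": a bag-of-words count vector.
    return Counter(tokens)

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

x_ori = "the team won the final game last night in front of the home crowd".split()
# Swap n=2 words with meaning-flipping replacements (hypothetical choices):
x_swap = list(x_ori)
x_swap[2] = "lost"     # won  -> lost
x_swap[12] = "away"    # home -> away
print(cosine(encode(x_ori), encode(x_swap)))
```

Even with the sentence's meaning reversed, the cosine similarity stays at 0.9, above the roughly-0.85 thresholds that practical SSAs use. Trained encoders like USE are far better than bag-of-words, but the same high-overlap effect dominates when only a few tokens change.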
The results are shown in Figure 1. While replacing more words in x ori does decrease its cosine similarity with x ori , the cosine similarity when substituting random high-frequency words is still roughly above 0.80. Considering that practical SSAs often set the cosine similarity threshold to around 0.85 or even lower, depending on the SSA and dataset, it is doubtful whether this constraint and threshold can really filter invalid word substitutions. We can also observe that when substituting words with antonyms, the sentence embedding cosine similarity with the original sentence closely follows the trend of substituting words with synonyms, regardless of whether the synonym substitution matches the word sense or not. Recalling from Table 1 that the candidate sets proposed by BERT can contain antonyms, the results here indicate that the sentence embedding similarity constraint cannot filter this type of faulty word substitution. Of the two types of synonym substitutions, only matched sense substitutions are valid replacements that preserve the semantics of the original sentence; however, the sentence embeddings of the two types of synonym substitutions are equally similar to the sentence embedding of x ori . The highest cosine similarity is obtained when the words in x ori are swapped with their morphological substitutions, which is expected since morphological substitutions barely change the semantics.

6 Attacking AG-News using TextFooler perturbs about 9 out of 38.6 words in a benign sample on average.
In Figure 1, we only show the average cosine similarity and omit the variance for each substitution type. In Figure 3 in the Appendix, we show the distribution of the cosine similarity for the different substitution types. The main observation from Figure 3 is that the cosine similarity distributions of the different substitution types (for the same n) overlap heavily, making it impossible to distinguish valid word swaps from invalid ones simply by thresholding the sentence embedding cosine similarity.
Overall, the results in Figure 1 demonstrate that USE tends to generate similar sentence embeddings when two sentences differ in only a few tokens, no matter whether the replacements change the sentence meaning or not. While we only show the results for USE, we show in Appendix E that other sentence encoders behave similarly. Moreover, when we use the whole sentence instead of a windowed subsentence to calculate the sentence embedding, the cosine similarity is even higher than that shown in Figure 1, as shown in Appendix E. Again, these sentence encoders fail to separate invalid word substitutions from valid ones. While frustrating, this result should not be surprising, since most sentence encoders are not trained to distinguish sentences with high word overlap.

LanguageTool Cannot Detect False Verb Inflectional Form
LanguageTool is used in TextFooler-Adj (TF-Adj) (Morris et al., 2020a) to prevent the attack from inducing grammar errors. TF-Adj also uses stricter word embedding and sentence embedding cosine similarity constraints to ensure that the semantics of x ori are preserved in x adv . However, when browsing the adversarial samples generated by TF-Adj, we observe that the word substitutions made by TF-Adj are often ungrammatical morphological swaps that change a verb's inflectional form. This indicates that LanguageTool may not be capable of detecting verb inflectional form errors.
To verify this hypothesis, we conduct the following experiment. For each sample in the AG-News test set for which LanguageTool reports no grammatical errors, we convert the inflectional forms of the verbs in the sample using a hand-crafted rule that always makes a grammatical sentence ungrammatical; this rule is listed in Appendix D.4. We then use LanguageTool to count the grammar errors in the verb-converted sentences.
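A toy version of such a verb-corruption rule is sketched below. It is a hypothetical simplification of the rule in Appendix D.4, covering only subject-verb agreement for regular present-tense verbs; the point is that swapping the inflection the context requires for the one it forbids ("he run", "they runs") always yields an ungrammatical sentence.

```python
def corrupt_verb(verb, third_person_singular):
    # Given a verb that is grammatical in its context, return the form with
    # the wrong agreement. A 3rd-person-singular context requires the -s
    # form, so we strip it; any other context forbids it, so we add it.
    # (Toy rule: regular verbs only; real rules must handle irregulars.)
    if third_person_singular:
        return verb[:-1] if verb.endswith("s") else verb
    return verb if verb.endswith("s") else verb + "s"

print(corrupt_verb("runs", third_person_singular=True))   # "run": now wrong
print(corrupt_verb("run", third_person_singular=False))   # "runs": now wrong
```

Feeding the corrupted sentences to a grammar checker (e.g. `LanguageTool.check` from the `language_tool_python` package) and counting reported errors then measures how many of the injected agreement errors the checker actually catches.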
We summarize the experimental results as follows. For the 1039 grammatical sentences in AG-News, the above procedure perturbed 4.37 verbs per sentence on average. However, the average number of grammar errors identified by LanguageTool is only 0.97, meaning that LanguageTool cannot detect all incorrect verb forms. From this simple experiment and the results in Table 2 and Figure 1, we can understand why the attack results of TF-Adj are often ungrammatical morphological substitutions: higher cosine similarity constraints favor morphological substitutions, and these often ungrammatical substitutions cannot be detected by LanguageTool. Thus, aside from showing that a text classifier trained on AG-News is susceptible to inflectional perturbations, TF-Adj actually exposes that LanguageTool itself is vulnerable to inflectional perturbations.

Related Works
Some prior works discuss questions similar to the one we study in this paper. Morris et al. (2020a) use human evaluation to reveal that SSAs sometimes produce low-quality adversarial samples. They attribute this to the insufficiency of the constraints and use stricter constraints together with LanguageTool to generate better adversarial samples. Our work further points out that the problem lies not only in the constraints; we show that the transformations are the fundamental problem in SSAs. We further show that the LanguageTool used by Morris et al. (2020a) cannot detect ungrammatical verb inflectional forms, and reveal that the adversarial samples generated by TF-Adj exploit this weakness of LanguageTool and often consist of ungrammatical morphological substitutions. Hauser et al. (2021) use human evaluations and probabilistic statements to show that the adversarial samples of SSAs are of low quality and do not preserve the original semantics. Our work can be seen as an attempt to understand the cause of the observations in Hauser et al. (2021).
Morris (2020) also questions the validity of using sentence encoders as semantic constraints. They attack sentence encoders by swapping words in a sentence with their antonyms, where the attack goal is to maximally preserve the cosine similarity between the swapped sentence's embedding and the original sentence's embedding. This is related to our experiments in Section 4.2. The main differences between our experiments and theirs are: (1) We only swap the words that are actually swapped by TextFooler; in contrast, the words swapped in Morris (2020) are not necessarily words that would be substituted in an SSA. The words swapped when attacking a sentence encoder and when attacking a text classifier can be significantly different. Since our goal is to verify how sentence encoders behave when used in SSAs, it makes more sense to only swap the words that are really replaced by an SSA. (2) Morris (2020) only uses antonyms for word substitution.

Discussion and Conclusion
This paper discusses how the elements in SSAs lead to invalid adversarial samples. We highlight that the candidate word sets generated by all four word substitution methods contain only a small fraction of semantically matched and grammatically correct word replacements. While these transformations produce inappropriate candidate words, this alone does not account for the invalid adversarial samples. Their inferiority should be largely attributed to the deficiency of the constraints that ought to guarantee the quality of the perturbed sentences: word embedding cosine similarity is not always larger for valid word substitutions, sentence encoders are insensitive to invalid word swaps, and LanguageTool fails to detect grammar mistakes. Together, these produce adversarial samples that are distinguishable by humans, unreasonable, and mostly inexplicable. Such adversarial samples are not suitable for evaluating the vulnerability of NLP models because they are not reasonable inputs.
The results and observations in the main content of our paper are not unique to BERT fine-tuned on AG-News, which is the only attacked model shown in Section 3 and Section 4. We include supplementary analyses in Appendix F for different model types and datasets, which support all the claims and observations in the main content. In this paper, we follow previous papers on SSAs in attacking the victim model only once and not reporting the performance variance due to the random seed and hyperparameters used during fine-tuning of the victim model (Ren et al., 2019; Li et al., 2020; Jin et al., 2020). This is because conducting an SSA is very time-consuming. In our preliminary experiments, we used TextAttack to attack three BERT models fine-tuned on AG-News, crafting adversarial samples for 100 samples in the testing data for each model. The three models were fine-tuned with three different sets of hyperparameters. We find that our observations in Section 3.2 and Section 4 do not change across the three victim models. Overall, the observations shown in the paper are not an exception but rather a general phenomenon in SSAs.
Through the analyses in this paper, we show that we may still be far from real SSAs, and how to construct valid synonym substitution adversarial samples remains an unresolved problem in NLP. While there is still a long way to go, it is essential to recognize that prior works have contributed significantly to constructing valid SSAs. Although prior SSAs may not always produce reasonable adversarial samples, they are still valuable since they pave the way for designing better SSAs and help us uncover the inadequacy of the transformations and constraints for constructing real synonym substitution adversarial samples. As an initiative to stimulate future research, we provide some possible directions and guidelines for constructing better SSAs, based on the observations in our paper.
1. Consider the word senses when making a replacement with WordNet.
2. Use better sentence encoders that are sensitive to token replacements that change the semantics of the original sentence. For example, DiffCSE (Chuang et al., 2022) has been shown to distinguish tiny differences between sentences.

The problems outlined in this paper may be familiar to those with experience in lexical substitution (Melamud et al., 2015; Zhou et al., 2019), but they have not yet been widely recognized in the field of SSAs. Our findings on why SSAs fail can serve as a reality check for a field that has been hindered by overestimating prior SSAs. We hope our work will guide future researchers in cautiously building more effective SSAs.

Limitations
In this paper, we only discuss SSAs in English, as English has been the most predominantly studied language in adversarial attacks in NLP. The authors are not sure whether SSAs in a different language suffer from the shortcomings discussed in this paper. However, if an SSA in a non-English language uses the transformations or constraints discussed here, there is a high chance that the attack will produce low-quality results for the same reasons shown in this paper. Still, this claim needs to be verified by extensive human evaluation and further language-specific analyses.
In our paper, we use WordNet as the gold standard for word senses since WordNet is a widely adopted and accepted tool in the NLP community. Some annotations in WordNet, though very few, may be imperfect, and this is a possible limitation of our work. It is also possible that a matched sense synonym found by WordNet is not always a valid substitution even if the WordNet annotation is perfect. For example, the collocating words of the substitution word may not match those of the original word, and the substitution word may not fit the original context. However, if a word is not even a synonym, it is even less likely to be a valid substitution. Thus, being a synonym in WordNet is a minimum requirement, and we use WordNet synonym sets to evaluate the validity of a word substitution.
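The minimum-requirement check described above can be sketched as follows. This is a simplified illustration we add here: a tiny hand-coded synset table stands in for WordNet, which a real implementation would query via NLTK's `wordnet` corpus, and `is_wordnet_synonym` is a hypothetical helper name.

```python
# Minimal sketch of the "being a WordNet synonym is a minimum requirement"
# check. The toy synset table below stands in for WordNet; a real
# implementation would use nltk.corpus.wordnet instead.
TOY_SYNSETS = {
    # word -> list of synonym sets (one per word sense)
    "world": [{"world", "earth", "globe"}, {"world", "cosmos", "universe"}],
    "blood": [{"blood"}, {"blood", "lineage", "descent"}],
}

def is_wordnet_synonym(original: str, candidate: str) -> bool:
    """Return True if candidate appears in ANY synset of the original word.

    Passing this check does not guarantee a valid substitution (the sense
    may still be mismatched), but failing it almost certainly means the
    substitution is invalid.
    """
    return any(candidate in synset for synset in TOY_SYNSETS.get(original, []))
```

Note that the check deliberately accepts synonyms of any sense; distinguishing matched from mismatched senses requires an additional word sense disambiguation step.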
Last, we do not conduct human evaluations on what the other substitution types in Table 1 are. As stated in Section 3.2.1, while we do not perform human evaluations on this, readers can browse through Table 6.

Ethics Statement and Broader Impacts
The goal of our paper is to highlight the overlooked details in SSAs that cause their failures. Mitigating the problems pointed out in our paper can lead to two possible consequences:
1. One may find that no real synonym substitution adversarial samples exist, and that the NLP models currently used are robust. This causes no ethical concerns, since it indicates that no harm will result from our work; previous observations on model vulnerability would just be the product of low-quality adversarial samples.
2. Real synonym substitution adversarial samples may exist, and resolving the issues mentioned in this paper could make it easier for malicious users to find them. This would become a potential risk in the future. The best way to mitigate this issue is to construct new defenses against real SSAs.
While our goal is to draw attention to whether SSAs are really SSAs, we are not advocating that malicious users attack text classifiers using better SSAs. Instead, we would like to highlight that there is still an unknown risk, real SSAs, against text classifiers, and that we researchers should devote more effort to studying this topic and developing defenses against such attacks before they are adopted by adversarial users.
Another major ethical consideration in our paper is that we challenge prior works on the quality of their SSAs. While we reveal the shortcomings of previously proposed methods, we still highly acknowledge their contributions. As emphasized in Section 6, we do not intend to devalue those past works. We scientifically and objectively discuss the possible risks of those transformations and constraints, and our ultimate goal is to push research on adversarial attacks in NLP a step forward; from this perspective, we believe that we share common ground with prior works.
In the main content of our paper, we only use two datasets: the adversarial samples obtained by using PWWS to attack BERT fine-tuned on AG-News, and the adversarial samples obtained by using TextFooler to attack BERT fine-tuned on AG-News. The test set of AG-News contains 7.6K samples; the adversarial samples obtained by attacking these datasets number fewer than 7.6K since the attack success rates of the two SSAs are not 100%. We summarize the details of these two datasets in Table 3.
The victim models are fine-tuned by the TextAttack (Morris et al., 2020b) toolkit and are publicly available at https://textattack.readthedocs.io/en/latest/3recipes/models.html and as Huggingface models. For example, the BERT fine-tuned on AG-News is at https://huggingface.co/textattack/bert-base-uncased-ag-news. The hyperparameters used to fine-tune those models can be found in the model cards and config.json files, and we do not list them here to save space.

C Synonym Substitution Attacks
We list the transformations and constraints of the SSAs discussed or mentioned in our paper in Table 4. We only include the semantic and grammaticality constraints in Table 4 and omit other constraints such as word-level overlap constraints. The "window" in the sentence encoder cosine similarity constraint indicates whether to use a window around the current substitution word or the whole sentence. "Compare with x ori" indicates that x n swap will be compared against the sentence embedding of x ori, and "compare with x n−1 swap" means that x n swap will be compared against the sentence embedding of x n−1 swap, that is, the sentence before the current substitution step.

C.1 Random Adversarial Samples
To illustrate that the adversarial samples generated by SSAs are largely made up of invalid word replacements, we randomly sample two adversarial samples generated by each of PWWS (Ren et al., 2019), TextFooler (Jin et al., 2020), BAE (Garg and Ramakrishnan, 2020), and TextFooler-Adj (Morris et al., 2020a). To avoid any suspicion of cherry-picking adversarial samples that support our claims, we simply select the first and the last successfully attacked samples in AG-News for the four SSAs in the dataset generated by Yoo et al. (2022). Since the dataset is not generated by us, we cannot control which samples appear first and last in the dataset, so we are unable to cherry-pick the adversarial samples that support our claims.
The adversarial samples are listed in Table 5. The blue words in x ori are the words that will be perturbed in x adv; the red words are the swapped words. Readers can verify the claims in our paper using those adversarial samples. We recap some of our claims as follows:
• PWWS uses mismatched sense substitutions: this can be observed in all the word substitutions of PWWS in Table 5. For example, the word "world" in the second example of PWWS has the word sense "the 3rd planet from the sun; the planet we live on", but it is swapped with the word "cosmos", which is a synonym of the word sense "everything that exists anywhere".
• The counter-fitted embedding substitution set contains a large proportion of the "others" substitution types, which are mostly invalid: this can be observed in virtually all word substitutions of TextFooler.
• The BERT reconstruction substitution set contains a large proportion of the "others" substitution types, which are mostly invalid: this can be observed in virtually all word substitutions of BAE.
• Morphological substitutions are mostly ungrammatical: this can be observed in the first adversarial sample of TextFooler-Adj.
• TextFooler-Adj prefers morphological swaps due to its strict constraints: this can be observed in almost all substitutions of TextFooler-Adj, excluding goods→wares.
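The mismatched sense phenomenon in the "world"→"cosmos" example can be illustrated with a simplified Lesk-style sketch. This is an illustration we add here, not the paper's actual word sense disambiguation pipeline; the toy glosses and synonym sets stand in for WordNet entries, and `sense_matches` is a hypothetical helper.

```python
# Simplified Lesk-style sketch of how a mismatched sense substitution can be
# detected: pick the sense whose gloss overlaps the context most, then check
# whether the substitution word belongs to that sense's synonym set.
# Glosses and synsets below are toy stand-ins for WordNet entries.
SENSES_OF_WORLD = [
    {"gloss": {"planet", "sun", "live"}, "synonyms": {"earth", "globe"}},
    {"gloss": {"everything", "exists", "anywhere"}, "synonyms": {"cosmos", "universe"}},
]

def sense_matches(context_words, substitution, senses):
    """Disambiguate by gloss overlap, then test synonym-set membership."""
    context = set(context_words)
    best = max(senses, key=lambda s: len(s["gloss"] & context))
    return substitution in best["synonyms"]
```

In a context like "the planet we live on", the first sense wins the gloss overlap, so "cosmos" (a synonym of the second sense) is flagged as a mismatched sense substitution while "globe" is accepted.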

C.1.1 Example of the Word Substitution Sets of Different Transformations
In this section, we show the substitution sets obtained using different transformations. We only show one example here: the second successful attack example in the adversarial sample dataset (Yoo et al., 2022) that attacks a BERT classifier fine-tuned on AG-News using TextFooler. We do not use the first sample in Table 5 because we would like to show readers a different adversarial sample from the dataset.
x ori : The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com)SPACE.com-TORONTO, Canada -A second team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for privately funded suborbital space flight, has officially announced the first launch date for its manned rocket.
x adv : The Race is Around: Second Privy Remit Set Lanza Timeline for Hummanitarian Spaceflight (SEPARATION.com)SEPARATION.com -CANADIENS, Countries -para second squad of rocketeers suitors for the #36;10 billion Ansari X Nobel, a contestant for convertly championed suborbital spaceship plane, had solemnly proclaim the first began timeline for its desolate bomb.
We show the substitution sets for the first four words substituted by TextFooler in Table 6. We do not show the substitution sets for all the attacked words simply because doing so would occupy too much space, and our claim in the main content that the "others" substitution sets of counter-fitted embedding substitution and BERT mask-infilling/reconstruction mostly consist of invalid swaps can already be observed in Table 6.

D.1 Experiment Details of Section 3
In this section, we give details on how we obtain the different word substitution types for an x ori. The whole process is summarized in Algorithm 1, where the reader can also find how the perturbed indices list I used in Section 4.2 is obtained.
An important detail not mentioned in the main content is that when computing how many synonyms are in the substitution set of BERT MLM substitution, we perform lemmatization on the top-30 predictions of BERT. Consider the following example: BERT proposes the word "defines" to replace the original word "sets" (the third-person present tense of the verb "set"), and the word "define" happens to be a synonym according to WordNet. Without lemmatization, the word "defines" would not be counted as a synonym substitution, even though it should be, since it is the third-person present tense of "define". Lemmatizing the predictions of BERT partially solves this problem. However, if the lemmatized word is already in the top-30 predictions of BERT, we do not perform lemmatization; this process is detailed on Line 6 of Algorithm 2. It ensures that such words can be recognized as synonyms while words that should be considered morphological swaps are mostly unaffected.
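The lemmatization step above can be sketched as follows. This is a simplified illustration: the toy `lemma` function merely strips a trailing "s" (a real pipeline would use a proper lemmatizer), and `normalize_predictions` is a hypothetical helper name.

```python
# Sketch of the lemmatization applied to BERT's top-k predictions
# (cf. Algorithm 2, Line 6): lemmatize each prediction unless its lemma is
# already among the top-k, so that morphological variants of in-list words
# stay distinct while out-of-list inflections can match WordNet synonyms.
def lemma(word: str) -> str:
    # Toy lemmatizer: only strips a trailing "s" for illustration.
    return word[:-1] if word.endswith("s") else word

def normalize_predictions(topk):
    out = []
    for w in topk:
        lw = lemma(w)
        # Keep w unchanged if its lemma already appears in the predictions.
        out.append(w if lw in topk else lw)
    return out
```

With predictions ["defines", "sets", "set"], "defines" is lemmatized to "define" (which can now match the WordNet synonym set of "set"), while "sets" is kept as-is because "set" is already among the predictions.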

D.2 Experiment Details of Section 4.1
Here, we explain how the random high/low-frequency words are sampled in Section 4.1. First, we use the tokenizer of BERT-base-uncased to tokenize all the samples in the training dataset of AG-News. Next, we count the occurrence of each token in the vocabulary of BERT-base-uncased and sort the tokens by their occurrence in the training set in descending order. The vocabulary size of BERT-base-uncased is 30522, including five special tokens, some subword tokens, and some unused tokens. We define the high-frequency

Genetic Algorithm Attack (Alzantot et al., 2018). Transformation: counter-fitted GloVe embedding kNN substitution with k = 8. Constraints: word embedding mean square error distance with threshold 0.5; language model perplexity (as a grammaticality constraint).

PWWS (Ren et al., 2019). Transformation: WordNet synonym set substitution. Constraints: none.

TextFooler (Jin et al., 2020). Transformation: counter-fitted GloVe embedding kNN substitution with k = 50. Constraints: USE sentence embedding cosine similarity with threshold 0.878, window size w = 7, compare with x ori; word embedding cosine similarity with threshold 0.5; disallow swapping words with different POS but allow swapping verbs with nouns or the reverse.

BERT-Attack (Li et al., 2020). Transformation: BERT mask-infilling substitution with k = 48. Constraints: sentence embedding cosine similarity with different thresholds for different datasets (the highest threshold is 0.7), no window, compare with x ori.

BAE (Garg and Ramakrishnan, 2020). Transformation: BERT reconstruction substitution. Constraints: USE sentence embedding cosine similarity with threshold 0.936, window size w = 7, compare with x n−1 swap.

TextFooler-Adj (Morris et al., 2020a). Transformation: counter-fitted GloVe embedding kNN substitution with k = 50. Constraints: USE sentence embedding cosine similarity with threshold 0.98, window size w = 7, compare with x ori; word embedding cosine similarity with threshold 0.9; disallow swapping words with different POS but allow swapping verbs with nouns or the reverse; the adversarial sample should not introduce new grammar errors, checked by LanguageTool.

A2T (Yoo and Qi, 2021). Transformation: counter-fitted GloVe embedding kNN substitution with k = 20 or BERT reconstruction with k = 20. Constraints: word embedding cosine similarity with threshold 0.8; DistilBERT fine-tuned on STS-B sentence embedding cosine similarity with threshold 0.9, window size w = 7, compare with x ori; disallow swapping words with different POS.

CLARE (Li et al., 2021). Transformation: DistilRoBERTa mask-infilling substitution; instead of using top-k, they select the predictions whose probability is larger than 5 × 10−3 (this set contains 42 tokens on average). Constraints: USE sentence embedding cosine similarity with threshold 0.7, window size w = 7, compare with x ori.

Table 4: Detailed transformations and constraints of the SSAs mentioned in our paper.
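As an illustration of how the constraints in Table 4 act as a filter on candidate substitutions, the following sketch checks a TextFooler-style candidate. The thresholds are taken from Table 4; the embeddings here are dummy vectors (a real attack uses counter-fitted word vectors and USE), and `passes_constraints` is a hypothetical helper we introduce for illustration.

```python
import math

# Sketch of a TextFooler-style constraint check (Table 4): a word embedding
# cosine similarity threshold (0.5), a POS rule (same POS, or a verb/noun
# swap), and a sentence embedding cosine similarity threshold (0.878).
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def passes_constraints(word_vecs, sent_vecs, pos_ori, pos_sub):
    """word_vecs / sent_vecs are (original, substituted) embedding pairs."""
    word_ok = cosine(*word_vecs) >= 0.5
    pos_ok = pos_ori == pos_sub or {pos_ori, pos_sub} == {"VERB", "NOUN"}
    sent_ok = cosine(*sent_vecs) >= 0.878
    return word_ok and pos_ok and sent_ok
```

Note that, per Table 4, the POS rule deliberately allows verb/noun swaps, which is one source of the invalid substitutions discussed in the main content.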

PWWS
x ori: Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.
x adv: Ky. Company profits yield to bailiwick Peptides (AP) AP - amp company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.

PWWS
x ori: Around the world Ukrainian presidential candidate Viktor Yushchenko was poisoned with the most harmful known dioxin, which is contained in Agent Orange, a scientist who analyzed his blood said Friday.
x adv: Around the cosmos Ukrainian presidential candidate Viktor Yushchenko was poisoned with the most harmful known dioxin, which is contained in Agent Orange, a scientist who analyzed his lineage said Friday.

TextFooler
x ori: Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.
x adv: Fears for T percent pension after debate Syndicates portrayal worker at Turner Newall say they are 'disappointed' after chatter with bereaved parenting corporations Canada Mogul.

TextFooler
x ori: 5 of arthritis patients in Singapore take Bextra or Celebrex ... SINGAPORE: Doctors in the United States have warned that painkillers Bextra and Celebrex may be linked to major cardiovascular problems and should not be prescribed.
x adv (partially recovered): ... of bursitis patients in ...

TextFooler-Adj
x adv (x ori not recovered): Venezuela Prepares for Chavez Recall Voted Supporters and rivals warn of possible fraud; government says Chavez's defeat could produce turmoil in world oil marketed.

TextFooler-Adj
x ori: EU to Lift U.S. Sanctions Jan. 1 BRUSSELS (Reuters) - The European Commission is sticking with its plan to lift sanctions on $4 billion worth of U.S. goods on Jan. 1 following Washington's repeal of export tax subsidies in October, a spokeswoman said on Thursday.
x adv: EU to Lift U.S. Sanctions Jan. 1 BRUSSELS (Reuters) - The European Commission is sticking with its plan to lift sanctions on $4 billion worth of U.S. wares on Jan. 1 following Washington's repeal of export taxation subsidies in October, a spokeswoman said on Thursday.
Algorithm 1 (process of obtaining the substitution sets, summarized; Require: x ori, x adv): initialize the perturbed indices list I; for each perturbed word x i in x ori, obtain the morphological substitution set S ml, obtain the matched sense synonym set S ms and the mismatched sense synonym set S mms by first applying word sense disambiguation and then WordNet synonym sets, and obtain the antonyms from WordNet; remove overlapping elements to make the sets disjoint (S ml ← S ml \ {x i}; S ms ← S ms \ S ml \ {x i}; S mms ← S mms \ S ms \ S ml \ {x i}); finally, label each substitution as a morphological, matched sense, mismatched sense, or antonym substitution.

words as the top-50 to top-550 words in the training dataset. The reason we omit the top 50 words from the high-frequency tokens is that these words are often stop words, and they are seldom used as word substitutions in SSAs. The low-frequency words are the top-10K to top-10.5K occurring words in AG-News' training set.
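The high/low-frequency pool construction described above can be sketched as follows. This is a minimal illustration: real tokenization uses the BERT-base-uncased tokenizer, the rank ranges are those from Section 4.1, and `frequency_pools` is a hypothetical helper name.

```python
from collections import Counter

# Sketch of the high/low-frequency word pools of Section 4.1: count token
# occurrences in the (tokenized) training set, sort in descending order of
# frequency, then slice ranks 50-550 (high-frequency, skipping the top-50
# mostly-stop-word tokens) and ranks 10000-10500 (low-frequency).
def frequency_pools(tokenized_corpus, hi=(50, 550), lo=(10_000, 10_500)):
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    ranked = [tok for tok, _ in counts.most_common()]
    return ranked[hi[0]:hi[1]], ranked[lo[0]:lo[1]]
```

On a tiny toy corpus the same slicing logic applies with smaller rank ranges, since the corpus has far fewer than 10.5K distinct tokens.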

D.3 Experiment Details of Section 4.2
Here, we give more details on the sentence embedding similarity experiment in Section 4.2. Readers can refer to Algorithm 1 to see how we obtain the different types of word substitution sets, the substituted indices set I, and the ordered list O from a pair of (x ori, x adv).
We also provide a figurative illustration of how we obtain x n swap in Figure 2, which shows how the matched sense substitution set is used to replace words in x ori based on the ordered list O. As can be seen in the figure, we swap the words in x ori according to the order determined by O; since the first element in O is 5, we first replace x 5 in x ori with one of its matched sense synonyms, obtaining x 1 swap. To compute the sentence embedding similarity between x 1 swap and x ori, we extract a context around the word just replaced; in this case, we extract the context around the fifth word in x 1 swap and x ori. Different from what we really use in our experiment, we set the window size w to 1 in Figure 2, because w = 7 is too large for this example. Thus, we should extract x 1 swap[4 : 7] and x ori[4 : 7]; however, since the sentences only have 5 words, the context to be extracted would exceed the sentence length. In this case, we simply extract the context until the end of both sentences. The parts used for computing the sentence embeddings of each sentence are outlined with a dark blue box in Figure 2. Next, we follow a similar process to obtain x 2 swap and x 3 swap and compare their sentence embedding cosine similarity with x ori.
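The window extraction with boundary clamping described above can be sketched as follows. This is a minimal illustration with 0-indexed positions; `extract_window` is a hypothetical helper, and w = 7 yields the 15-word window used in the experiments.

```python
# Sketch of the context-window extraction performed before computing
# sentence embeddings: take w words on each side of the just-substituted
# index and clamp the slice at the sentence boundaries, as in the w = 1
# example of Figure 2.
def extract_window(tokens, idx, w=7):
    lo = max(0, idx - w)
    hi = min(len(tokens), idx + w + 1)
    return tokens[lo:hi]
```

When the window would run past the end of the sentence, the clamping simply extracts the context until the sentence ends, matching the behavior described above.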

D.4 Experiment Details of Section 4.3
In this experiment, we use the POS tagger in NLTK to identify the verb forms, and obtain the inflectional forms of the verbs using LemmInflect. The verb inflectional form conversion rules are:
• Each third-person singular present verb is converted to the verb's base form.
• Each past tense verb is converted to the verb's gerund/present participle form (V+ing).
• All verbs that are neither third-person singular present nor past tense are converted to the third-person singular present form.
We provide three random examples from the test set of AG-News perturbed using the above rules in Table 7. It can easily be seen that all the perturbed sentences are ungrammatical. Interestingly, LanguageTool detects no grammar errors in any of the six samples in Table 7.
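The three conversion rules above can be sketched as follows. This is a simplified, self-contained illustration: the paper uses NLTK's POS tagger and LemmInflect, whereas here a tiny hand-coded inflection table and Penn Treebank-style tags (VBZ, VBD, VBG, VB) stand in for them, and `perturb_verb` is a hypothetical helper.

```python
# Simplified sketch of the verb-conversion rules of Appendix D.4. The toy
# lexicon below maps a lemma to its inflected forms; a real implementation
# would look these up with LemmInflect.
INFLECTIONS = {
    "win":  {"VBZ": "wins", "VBD": "won", "VBG": "winning", "VB": "win"},
    "set":  {"VBZ": "sets", "VBD": "set", "VBG": "setting", "VB": "set"},
    "miss": {"VBZ": "misses", "VBD": "missed", "VBG": "missing", "VB": "miss"},
}

def perturb_verb(lemma: str, tag: str) -> str:
    """Apply the rules: VBZ -> base form; VBD -> gerund; other forms -> VBZ."""
    forms = INFLECTIONS[lemma]
    if tag == "VBZ":
        return forms["VB"]
    if tag == "VBD":
        return forms["VBG"]
    return forms["VBZ"]
```

Because every rule forces a verb form that disagrees with its context, applying `perturb_verb` to a grammatical sentence is guaranteed to make it ungrammatical, which is exactly what the experiment in Section 4.3 relies on.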
Figure 2 shows the example x ori "I highly recommend the movie" and x swap "I inordinately advocate the picture". The substitution type used for constructing x n swap is the matched sense synonyms. The subsentences outlined in dark blue in the bottom three x ori and x n swap are the parts used to compute the sentence embeddings; the window size w is set to 1 for ease of illustration.

E Supplementary Materials for Experiments of Sentence Encoders

E.1 Distribution of the Sentence Embedding Cosine Similarity of Different Substitution Types

In Figure 3, we show the distribution of the USE sentence embedding cosine similarity of different word replacement types for different numbers of word replacements n. The left subfigure shows the distribution of the cosine similarity between x ori and x 1 swap, and the right subfigure shows the similarity distribution between x ori and x 8 swap. While Figure 1 shows that the sentence embedding cosine similarity of different word substitution types is sometimes separable on average, we still cannot separate valid and invalid word substitutions using a single threshold. This is because the sentence embedding cosine similarity scores of different word substitution types overlap heavily, as is evident from Figure 3. This is true for different n of x n swap; we only show n = 1 and n = 8 for simplicity.
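The single-threshold separability claim can be made concrete with a small sketch: given similarity scores for valid and invalid substitutions, we can compute the accuracy of the best possible single threshold. The function below is an illustration we add here, not part of the paper's pipeline; `best_threshold_accuracy` is a hypothetical helper name.

```python
# Sketch of the separability check behind Figure 3: sweep every candidate
# threshold and report the accuracy of the best one, classifying a
# substitution as "valid" when its similarity score is >= the threshold.
# Heavily overlapped score distributions keep this accuracy low.
def best_threshold_accuracy(valid_scores, invalid_scores):
    thresholds = sorted(set(valid_scores) | set(invalid_scores))
    total = len(valid_scores) + len(invalid_scores)
    best = 0.0
    for t in thresholds:
        correct = sum(s >= t for s in valid_scores) + sum(s < t for s in invalid_scores)
        best = max(best, correct / total)
    return best
```

Perfectly separated distributions give an accuracy of 1.0, while fully overlapping distributions cannot beat chance, mirroring the observation that no single cosine similarity threshold filters out invalid swaps.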

E.2 Different Methods for Computing Sentence Embedding Similarity

In this section, we show some supplementary figures for the experiments in Section 4.2. Recall that in the main content, we only show the sentence embedding cosine similarity results when comparing x n swap with x ori within a 15-word window around the n-th substituted word. But as mentioned in Section 2.3, this is not always what is done. In Figure 4, we show the result when comparing x n swap with x ori using the whole sentence. It can easily be observed that it is still difficult to separate valid swaps from invalid ones using a threshold on the cosine similarity. One can also observe that the similarities in Figure 4 are a lot higher than those in Figure 1.
Another important implementation detail of the sentence encoder similarity constraint is that some previous works do not calculate the similarity of the current x swap with x ori. Instead, they calculate the similarity between the current x swap and the x swap from the previous substitution step (Garg and Ramakrishnan, 2020). That is, if 6 words in x ori were swapped in the previous substitution step and we are about to make the 7th substitution, then the sentence embedding similarity is computed between the 6-word substituted sentence and the 7-word substituted sentence.

Table 7 examples (original sentence vs. verb-perturbed sentence):
Original: Storage, servers bruise HP earnings update Earnings per share rise compared with a year ago, but company misses analysts' expectations by a long shot.
Perturbed: Storage, servers bruises HP earnings update Earnings per share rise compares with a year ago, but company miss analysts' expectations by a long shot.
Original: IBM to hire even more new workers By the end of the year, the computing giant plans to have its biggest headcount since 1991.
Perturbed: IBM to hires even more new workers By the end of the year, the computes giant plans to has its biggest headcount since 1991.
Original: Giddy Phelps Touches Gold for First Time Michael Phelps won the gold medal in the 400 individual medley and set a world record in a time of 4 minutes 8.26 seconds.
Perturbed: Giddy Phelps Touches Gold for First Time Michael Phelps winning the gold medal in the 400 individual medley and sets a world record in a time of 4 minutes 8.26 seconds.
In Figure 5, we show the result when comparing x n swap with x n−1 swap within a 15-word window around the n-th substituted word. This setting is adopted in Garg and Ramakrishnan (2020), according to TextAttack (Morris et al., 2020b). Last, in Figure 6 we show the result when comparing x n swap with x n−1 swap using the whole sentence; this setting is not used in any previous work, and we include it for completeness. All the sentence encoders used in Figures 1, 4, 5, and 6 are USE.
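The two comparison schemes discussed in this section (comparing each x n swap with x ori versus with x n−1 swap) can be sketched as follows. This is an illustration we add here: a bag-of-words vector stands in for a real sentence encoder such as USE, and `similarity_traces` is a hypothetical helper name.

```python
import math
from collections import Counter

# Sketch contrasting the two sentence encoder comparison schemes: comparing
# each x^n_swap with x_ori versus with x^{n-1}_swap (the previous step).
def bow_encode(tokens):
    # Toy stand-in for a sentence encoder: a bag-of-words count vector.
    return Counter(tokens)

def cos(c1, c2):
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

def similarity_traces(x_ori, swaps):
    """Per substitution step, similarity to x_ori and to the previous step."""
    prev = x_ori
    vs_ori, vs_prev = [], []
    for x_swap in swaps:
        vs_ori.append(cos(bow_encode(x_swap), bow_encode(x_ori)))
        vs_prev.append(cos(bow_encode(x_swap), bow_encode(prev)))
        prev = x_swap
    return vs_ori, vs_prev
```

Because consecutive swapped sentences differ by only one word, the compare-with-previous scheme tends to yield systematically higher similarities than comparing with x ori, so a fixed threshold filters even fewer invalid swaps under that scheme.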

E.3 Different Sentence Encoders
We show in Figure 7 the result when comparing x n swap with x ori within a 15-word window around the n-th substituted word using a DistilBERT fine-tuned on STS-B, which is the sentence encoder used in Yoo and Qi (2021). Figure 7 shows that the fine-tuned DistilBERT distinguishes antonym swaps from synonym swaps better than the USE in Figure 1. However, it still cannot distinguish matched from mismatched synonym substitutions very well. Interestingly, this model is flagged as deprecated on Huggingface because it produces low-quality sentence embeddings. We also show the result when using a DistilRoBERTa fine-tuned on STS-B in Figure 8. Interestingly, this sentence encoder can also better distinguish antonym substitutions from synonym substitutions on average. This might indicate that models fine-tuned only on STS-B can have some ability to distinguish valid and invalid swaps.
In Figure 9, we show the result when comparing x n swap with x ori within a 15-word window around the n-th substituted word using sentence-transformers/all-MiniLM-L12-v2. This model has 110M parameters and is the 4th best pre-trained sentence encoder in the sentence-transformers package (Reimers and Gurevych, 2019); it is trained on 1 billion text pairs. We report the result for this sentence encoder because it is the best model that is smaller than USE, which has 260M parameters. The trend in Figure 9 highly resembles that in Figure 1, indicating that even a very strong sentence encoder is not suitable as a constraint in SSAs.
We also include the result when using the best sentence encoder in the sentence-transformers package, all-mpnet-base-v2, which has 420M parameters. The result is in Figure 10, and it is clear that it is still practically impossible to use this sentence encoder to filter invalid swaps.

F Statistics of Other Victim Models and Other Datasets
In this section, we show some statistics on the adversarial samples in the datasets generated by Yoo et al. (2022). The main takeaway of this part is: our observations in Section 3 hold across different types of victim models (LSTM, CNN, BERT, RoBERTa), different SSAs, and different datasets.

F.1 Proportion of Different Types of Word Replacement
First, we show the proportions of the different word substitution types that make up the adversarial samples of AG-News. We show the results for four models and four SSAs in

Figure 3: The USE sentence embedding cosine similarity distribution between x ori and the series of sentences obtained by replacing words in x ori with one type of word substitution. The window size is the same as in Figure 1. The left subfigure shows the distribution of the cosine similarity between x ori and x 1 swap, and the right subfigure shows the similarity distribution between x ori and x 8 swap.

F.2 Statistics of Different Datasets
In this section, we show the statistics of the types of word substitution for another two datasets in Yoo et al. (2022). The results are in Table 12. Clearly, our observation that valid word substitutions are scarce can also be observed in both SST-2 and IMDB.

Figure 1: The USE sentence embedding cosine similarity between x ori and the series of sentences x n swap obtained by replacing words in x ori with one type of word substitution.

Figure 2: An example illustrating the process of obtaining I, O, and x_swap^n from a pair (x_ori, x_adv). Here, the substitution type used for constructing x_swap^n is matched sense synonyms. The subsentences outlined in dark blue in the bottom three x_ori and x_swap^n are the parts used to compute the sentence embedding by the sentence encoder. In the figure, we set the window size w of the sentence encoder to 1 for ease of illustration.

Figure 5: The USE sentence embedding cosine similarity between x_ori and the series of sentences obtained by replacing words in x_ori with one type of word substitution. Different from Figure 1, we compare x_swap^n with x_swap^{n-1} for n ≥ 2. The sentence embedding is calculated using a 15-word window around the n-th substituted word, as in Figure 1.

Figure 6: The USE sentence embedding cosine similarity between x_ori and the series of sentences obtained by replacing words in x_ori with one type of word substitution. The sentence embedding similarity shown in this figure is calculated over the whole sentence without windowing, and the cosine similarity is calculated between x_swap^n and x_swap^{n-1}.

Figure 7: Using DistilBERT fine-tuned on STS-B as the sentence encoder. Sentence embedding cosine similarity between x_ori and the series of sentences obtained by replacing words in x_ori with one type of word substitution.

Figure 8: Using DistilRoBERTa fine-tuned on STS-B as the sentence encoder. Sentence embedding cosine similarity between x_ori and the series of sentences obtained by replacing words in x_ori with one type of word substitution.

Figure 9: Using sentence-transformers/all-MiniLM-L12-v2 as the sentence encoder. Sentence embedding cosine similarity between x_ori and the series of sentences obtained by replacing words in x_ori with one type of word substitution.

Figure 10: Using sentence-transformers/all-mpnet-base-v2 as the sentence encoder. Sentence embedding cosine similarity between x_ori and the series of sentences obtained by replacing words in x_ori with one type of word substitution.

Table 1: The average number of words of different substitution types in the candidate word set of k = 30 words. Syn. is short for Synonym.

Table 2: The AUPR when using a threshold-based detector to separate matched sense synonyms from another type of invalid substitution.
in the Appendix to see what the other substitutions are. It will be interesting to see what human evaluators think about the other substitutions in the future.
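The AUPR reported in Table 2 can be computed as average precision over the detector's scores. The sketch below is a minimal self-contained illustration, assuming the detector assigns each candidate word a score (e.g., an embedding similarity) and that matched sense synonyms are the positive class; the numbers in the usage example are toy values, not results from the paper.

```python
def average_precision(scores, labels):
    """AUPR via average precision: the mean of precision at each positive,
    with candidates ranked by detector score (descending).

    scores: detector scores, higher = more likely a matched sense synonym.
    labels: 1 for matched sense synonyms, 0 for invalid substitutions.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, total_prec = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:                      # a matched sense synonym
            hits += 1
            total_prec += hits / rank       # precision at this positive
    return total_prec / hits if hits else 0.0
```

For example, `average_precision([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])` yields 5/6 ≈ 0.833: the positives sit at ranks 1 and 3, with precisions 1.0 and 2/3.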

Table 3: Details of the adversarial sample datasets obtained by attacking a BERT fine-tuned on AG-News using PWWS and TextFooler.

Table 5: Adversarial samples from the benchmark dataset generated by

Check the substitution types of each word in S_embed by comparing with S_ml, S_ms, S_mms, A
Check the substitution types of each word in S_MLM by comparing with S_ml, S_ms, S_mms, A
Check the substitution types of each word in S_recons by comparing with S_ml, S_ms, S_mms, A
32: if S_ml, S_ms, S_mms, A ≠ ∅ then

Algorithm 2 GetMLMSwaps(x_i, x_ori)
Require: x_i, x_ori, BERT, Lemmatizer
1: x_mask ← {x_1, ..., x_{i-1}, [MASK], x_{i+1}, ..., x_n}   ▷ Get masked input
2: Candidates ← top-k prediction of x_mask using BERT
3: New_Candidates ← []
4: for w ∈ Candidates do
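The masking-and-filtering step of Algorithm 2 (GetMLMSwaps) can be sketched in Python as below. This is an illustrative sketch, not the paper's exact implementation: the BERT top-k prediction is abstracted as a caller-supplied `topk_fn`, and `lemmatize` stands in for the lemmatizer; both names are placeholders of our own.

```python
def get_mlm_swaps(tokens, i, topk_fn, lemmatize, k=50):
    """Propose MLM-based substitutions for tokens[i], following Algorithm 2.

    topk_fn(masked_tokens, k) -> list of candidate words (stand-in for BERT's
    top-k predictions at the [MASK] position).
    lemmatize(word) -> lemma (stand-in for the lemmatizer).
    """
    # Get masked input: replace the i-th token with [MASK].
    masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
    candidates = topk_fn(masked, k)
    new_candidates = []
    for w in candidates:
        # Discard the original word and its mere inflections (same lemma).
        if w != tokens[i] and lemmatize(w) != lemmatize(tokens[i]):
            new_candidates.append(w)
    return new_candidates
```

With a toy `topk_fn` that always proposes `["films", "movie", "book"]` and a toy lemmatizer that strips a trailing "s", masking the word "film" keeps only "movie" and "book", since "films" shares the lemma of the original word.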

Table 7: Examples of the verb-perturbed sentences. The perturbed verbs are highlighted in red, and their unperturbed counterparts are highlighted in blue.

The results in Tables 8, 9, 10, and 11 are obtained by a procedure similar to that in Section 3.1.1.

Table 8: Attack statistics of other models on AG-News. The SSA used to attack the models is PWWS.

Table 9: Attack statistics of other models on AG-News. The SSA used to attack the models is TextFooler.

Table 10: Attack statistics of other models on AG-News. The SSA used to attack the models is BAE.

Table 11: Attack statistics of other models on AG-News. The SSA used to attack the models is TextFooler-Adj.

Table 12: Attack statistics of BERT models fine-tuned on other datasets. The SSA used to attack the models is TextFooler.

The average number of words of different substitution types in the candidate word set with 50 words for each transformation. If the average number of words of a substitution type is less than 1.7, we do not show the average number in the bar.