Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP

Textual adversarial samples play important roles in multiple subfields of NLP research, including security, evaluation, explainability, and data augmentation. However, most work mixes all these roles, obscuring the problem definitions and research goals of the security role, which aims to reveal the practical concerns about NLP models. In this paper, we rethink the research paradigm of textual adversarial samples in security scenarios. We discuss the deficiencies in previous work and propose our suggestions that research on Security-oriented adversarial NLP (SoadNLP) should: (1) evaluate methods on security tasks to demonstrate real-world concerns; (2) consider real-world attackers' goals, instead of developing impractical methods. To this end, we first collect, process, and release a collection of security datasets, Advbench. Then, we reformalize the task and adjust the emphasis on different goals in SoadNLP. Next, we propose a simple method based on heuristic rules that can easily fulfill the actual adversarial goals, to simulate real-world attack methods. We conduct experiments on both the attack and the defense sides on Advbench. Experimental results show that our method has higher practical value, indicating that the research paradigm in SoadNLP may start from our new benchmark. All the code and data of Advbench can be obtained at https://github.com/thunlp/Advbench.


Introduction
Natural language processing (NLP) models based on deep learning have been employed in many real-world applications (Badjatiya et al., 2017; Zhang et al., 2018; Niklaus et al., 2018; Han et al., 2021). Meanwhile, there is a concurrent line of research on textual adversarial samples that are intentionally crafted to mislead models' predictions (Samanta and Mehta, 2017; Papernot et al., 2016). Previous work shows that textual adversarial samples play important roles in multiple subfields of NLP research. We categorize and summarize the roles in Table 1. We argue that the problem definitions, including the priorities of goals and experimental settings, differ depending on the role of the adversarial samples. However, most previous work in adversarial NLP mixes all the different roles, including the security role of revealing real-world concerns about NLP models deployed in security scenarios. This leads to problem definitions and research goals that are inconsistent with real-world cases. As a consequence, although most existing work on textual adversarial attacks claims that their methods reveal security issues, they often follow a security-irrelevant research paradigm. To fix this problem, we focus on the security role and try to refine the research paradigm for future work in this direction.

Security | Adversarial samples can reveal the practical concerns of NLP models deployed in security situations.
Evaluation | Adversarial samples can be employed to benchmark models' robustness to out-of-distribution data (diverse user inputs).
Explainability | Adversarial samples can explain part of the models' decision processes.
Augmentation | Adversarial training based on adversarial sample augmentation can improve performance and robustness.
Table 1: Roles of textual adversarial samples.
There are two core issues explaining why previous textual adversarial attack work can hardly help real-world security problems. First, most work does not consider security tasks and datasets (Ren et al., 2019; Zang et al., 2020b) (see Table 7); irrelevant tasks like sentiment analysis and natural language inference are often used in the evaluation instead. Second, it does not consider real-world attackers' goals and makes unrealistic assumptions or adds unnecessary restrictions (e.g., the imperceptibility requirement) to the adversarial perturbations (Li et al., 2020; Garg and Ramakrishnan, 2020). Consider the case where attackers want to bypass detection systems to send an offensive message to the web. They can only access the decisions (e.g., pass or reject) of the black-box detection systems, without the concrete confidence scores. Their adversarial goals are to convey the offensive meaning and bypass the detection systems. So, there is no need for them to make the adversarial perturbations imperceptible, as supposed in previous work. See Table 2 for an example. Besides, most methods have an inefficiency problem (i.e., many queries and long running time), which makes them less practical and perhaps not a good choice for attackers in the real world. We refer readers to Section 6 for a further discussion of previous work.

Original: I was all over the fucking place because the toaster had tits.
PWWS (Ren et al., 2019): I was all over the bally topographic because the wassailer have breast.
Real-World Attack: I was all over the fuc king place because the toaster had tits.!!!peace peace peace
Table 2: Comparison between a real-world attack and a method proposed in the NLP community. Obviously, the real-world attack method is easier to implement and preserves the adversarial meaning better.
To address the issue of a security-irrelevant evaluation benchmark, we first summarize five security tasks and search for corresponding open-source datasets. We collect, process, and release these datasets as a collection named Advbench to facilitate future research. To address the issue of an ill-defined problem definition, we refer to the intentions of real-world attackers to reformalize the task of textual adversarial attack and adjust the emphasis on different adversarial goals. Further, to simulate real-world attacks, we propose a simple attack method based on heuristic rules summarized from various sources, which can easily fulfill actual attackers' goals.
We conduct comprehensive experiments on Advbench to evaluate methods proposed in the NLP community and our simple method. Experimental results overall demonstrate the superiority of our method, considering the attack performance, the attack efficiency, and the preservation of adversarial meaning (validity). We also consider the defense side and show that the SOTA defense method cannot handle our simple heuristic attack algorithm. The overall experiments indicate that the research paradigm in SoadNLP may start from our new benchmark.
To summarize, the main contributions of this paper are as follows:
• We collect, process, and release a security datasets collection, Advbench.
• We reconsider the attackers' goals and reformalize the task of textual adversarial attack in security scenarios.
• We propose a simple attack method that fulfills the actual attackers' goals to simulate real-world attacks, which can facilitate future research on both the attack and the defense sides.

Motivation
We first survey previous work on adversarial attacks in NLP regarding the tasks and datasets considered in their experiments (see Table 7). We find that most tasks considered are not security-relevant (e.g., sentiment analysis). Thus, lacking a security evaluation benchmark, the real-world concerns claimed in their experiments are not well grounded in reality.
To this end, we suggest that future researchers evaluate their methods on security tasks to demonstrate real-world harmfulness and practical concerns. Thus, a collection of security datasets is needed to facilitate future research.

Tasks
We summarize five security tasks: misinformation, disinformation, toxic, spam, and sensitive information detection. The task descriptions and our motivation for choosing these tasks are given in Appendix B. Due to the label-imbalance issue of some datasets, we release both balanced and unbalanced processed versions. The dataset statistics are listed in Table 8. All datasets are processed through a general pipeline that removes duplicate, missing, and unusual values.
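As an illustration, the general cleaning pipeline could be sketched as follows; the `(text, label)` layout and the notion of an "unusual value" are our assumptions, since the exact filters are not spelled out here.

```python
def clean_dataset(samples):
    """Remove duplicate, missing, and unusual values from a list of
    (text, label) pairs -- a minimal sketch of the general pipeline."""
    seen = set()
    cleaned = []
    for text, label in samples:
        if text is None or label is None:         # missing values
            continue
        text = text.strip()
        if not text:                              # empty after stripping
            continue
        if not any(ch.isalnum() for ch in text):  # "unusual": no real content
            continue
        if text in seen:                          # duplicates
            continue
        seen.add(text)
        cleaned.append((text, label))
    return cleaned

raw = [("good product", 1), ("good product", 1), ("", 0), (None, 1), ("!!!", 0)]
print(clean_dataset(raw))  # [('good product', 1)]
```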

Misinformation
LUN. Our LUN dataset is built on the Labeled Unreliable News dataset (Rashkin et al., 2017), consisting of articles from news media and human fact-checking annotations. We merge the satirical news from the Onion, hoaxes from the American News, and propaganda from the Activist Report into one category labeled untrusted. The articles collected from Gigaword News are labeled trusted. Because there is too little data in the original testing set, we mix the original training and testing sets and re-partition them with a 7:3 ratio.
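The merge-and-re-partition step could look like the following sketch; the fixed seed is our assumption, added for reproducibility.

```python
import random

def repartition(samples, train_ratio=0.7, seed=0):
    """Mix the original training and testing sets and re-split them,
    e.g., with a 7:3 ratio as done for LUN. Shuffling is seeded so the
    split is reproducible."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

merged = [(f"article {i}", i % 2) for i in range(10)]
train, test = repartition(merged)
print(len(train), len(test))  # 7 3
```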
SATNews. The Satirical News dataset (Yang et al., 2017) is a collection of satirical and verified news. The satirical news articles are collected from 14 websites that explicitly declare that they offer satire. The verified news articles are collected from major news outlets and Google News using FLORIN (Liu et al., 2015). The original training and validation sets are merged as our training set, and the testing set remains unchanged.

Disinformation
Amazon-LB. The Amazon Luxury Beauty Review dataset is a collection of reviews in the Luxury Beauty category of Amazon, with verification information, from Amazon Review Data (2018) (Ni et al., 2019). Amazon Review Data (2018) is an updated version of the Amazon Review Dataset (He and McAuley, 2016; McAuley et al., 2015) released in 2014, which contains 29 types of data for different scenarios. We extract the Luxury Beauty data from the "small" subsets that are reduced from the full sets, due to the appropriate quantity and diversity of this category. We only keep the content and label (whether the review is verified or not) of each review and split the data into training and testing sets with a 7:3 ratio.
CGFake. The Computer-generated Fake Review Dataset (Salminen et al., 2022) is a collection of real product reviews and fake reviews generated by a language model.

Spam

Enron. Our Enron dataset is built on the Enron-Spam datasets (Metsis et al., 2006). We mix all the datasets and split them into training and testing sets. We only keep the content of each email, without other information such as subject and address.
SpamAssassin. SpamAssassin is a collection of emails consisting of three categories: easy-ham, hard-ham, and spam. We merge easy-ham and hard-ham into the ham class. Then we mix all samples and split them equally into training and testing sets because of the lack of data. Each email is preprocessed in the same way as Enron.
Sensitive Information

EDENCE. EDENCE (Neerbek, 2019a) contains samples with auto-generated parsing-tree structures in the Enron corpus. The annotated labels come from the TREC LEGAL (Tomlinson, 2010; Cormack et al., 2010) labels for Enron documents. We restore the tree-structured samples to normal texts and map sensitive information labels back to each sample. Then we combine the training and validation sets as our training set, and the testing set remains unchanged.
FAS. FAS (Neerbek, 2019b) also contains samples with parsing-tree structures built from the Enron corpus and is modified for sensitive information detection using TREC LEGAL labels annotated by domain experts. The samples in FAS are compliant with Financial Accounting Standards and are preprocessed in the same way as EDENCE in our work.

Motivation
In our survey, we find that the current problem definition and research goals concerning the security role of adversarial samples (revealing practical concerns) are ill-defined and ambiguous. We attribute this to the failure to distinguish the several roles of adversarial samples (see Table 1). The problem definitions differ depending on the role of the adversarial samples. For example, when adversarial samples are adopted to augment existing datasets for adversarial training, we may aim for high-quality samples; thus, the minor-perturbation restriction is important. On the contrary, on the security side, we should focus more on the preservation of adversarial meaning and attack efficiency instead of imperceptible perturbations. See Section 6 for a further discussion. Thus, we need to separate the research on the different roles of adversarial samples. On the security side, most work does not consider realistic situations and the actual adversarial goals, which may result in unrealistic assumptions or unnecessary restrictions when developing attack or defense methods.
To make the research in this field more standardized and in-depth, a reformalization of this problem needs to be conducted. Note that we focus on the security role of textual adversarial samples in this paper.

Formalization
Overview. Without loss of generality, we consider the text classification task. Given a classifier f : X → Y that makes a correct prediction on the original input text x:

f(x) = argmax_{y ∈ Y} P(y | x) = y_true,

where y_true is the gold label of x, the attackers apply perturbations δ to craft an adversarial sample x* that fools the classifier:

f(x*) = argmax_{y ∈ Y} P(y | x*) ≠ y_true.

Refinement. The core part of adversarial NLP is finding appropriate perturbations δ. We identify four deficiencies in the common research paradigm of SoadNLP.
(1) Most attack methods iteratively search for better δ relying on access to the victim models' confidence scores or gradients (Alzantot et al., 2018; Ren et al., 2019; Zang et al., 2020b; Li et al., 2020). However, this assumption is unrealistic in real-world security tasks (e.g., hate-speech detection). We argue that research in adversarial NLP considering practical concerns should focus on the decision-based setting, where only the decisions of the victim models can be accessed.
(2) Previous work attempts to make δ imperceptible by imposing restrictions on the searching process, like ensuring that the cosine similarity of the adversarial and original sentence embeddings is higher than a threshold (Li et al., 2020; Garg and Ramakrishnan, 2020), or considering the adversarial samples' perplexity (Qi et al., 2021). However, why should adversarial perturbations be imperceptible? The goals of attackers are to (1) bypass the detection systems and (2) convey the malicious meaning. So, the attackers only need to preserve the adversarial contents (e.g., the hate speech in messages), no matter how many perturbations are added to the original sentence to bypass the detection systems (consider Table 2). Thus, we argue that these constraints are unnecessary and the quality of adversarial samples is a secondary consideration.
(3) Adversarial attacks based on word substitution or sentence paraphrase are the most widely studied. However, current attack algorithms are very inefficient and need to query victim models hundreds of times to craft adversarial samples, which makes them unlikely to happen in reality. We argue that adversarial attacks should be computationally efficient, both in running time and in the number of queries to the victim models, to better simulate practical situations.
(4) A body of work assumes that the attackers are experienced NLP practitioners and incorporates external knowledge bases (Ren et al., 2019; Zang et al., 2020b) or NLP models (Li et al., 2020; Qi et al., 2021) into the attack algorithms. However, everyone can be an attacker in reality. Consider the hate-speechers on social platforms. They often try different heuristic strategies to escape detection without any NLP knowledge (see Appendix D for cases). Besides, research in the security community confirms that real-world attackers only use simple heuristic attack methods to propagate illicit online promotion, instead of the complicated ones proposed in the computer vision domain (Yuan et al., 2019). We argue that besides the professional approaches that have been extensively studied, research on adversarial attack and defense should also pay attention to the simple, heuristic methods that many real-world attackers are currently employing.
In general, we make two suggestions for future research: consider the decision-based experimental setting, and consider attack methods that are free of expertise. Besides, we adjust the emphasis on different adversarial goals, corresponding to real-world attack situations (see Table 3). Note that the validity requirement (preservation of adversarial meaning) of adversarial samples is task-specific; we discuss it in Appendix C. Compared to previous work, we set different priorities for different goals and put more emphasis on the preservation of adversarial meaning and computational efficiency, while down-weighting the attention to minor perturbations and sample quality.
Note that we do not mean that the quality of adversarial samples is unimportant. For example, spam emails and fake news will obtain more attacker-expected feedback if they are more fluent and look more natural. Our intention in this paper is to decrease the priority of the secondary adversarial goals when there is a trade-off among all adversarial goals, to better simulate real-world attack situations.
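As a minimal sketch of the decision-based setting we advocate, the wrapper below exposes only hard labels and counts queries; the class, its names, and the toy detector are illustrative assumptions, not code from this paper.

```python
class DecisionOnlyVictim:
    """Wraps a classifier so that attackers see only its final decision
    (e.g., pass or reject), never confidence scores or gradients."""

    def __init__(self, predict_proba, max_queries=100):
        self._predict_proba = predict_proba  # hidden scoring function
        self.max_queries = max_queries
        self.queries = 0

    def decide(self, text):
        if self.queries >= self.max_queries:
            raise RuntimeError("query budget exhausted")
        self.queries += 1
        scores = self._predict_proba(text)
        return max(scores, key=scores.get)  # hard label only

# Toy "toxicity detector": flags texts containing a blacklisted word.
victim = DecisionOnlyVictim(
    lambda t: {"toxic": 1.0, "ok": 0.0} if "stupid" in t else {"toxic": 0.0, "ok": 1.0}
)
print(victim.decide("you are stupid"))  # toxic
print(victim.queries)                   # 1
```

An attack running against such a wrapper can neither rank candidate perturbations by confidence nor estimate gradients, which is exactly the constraint the decision-based setting imposes.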

Our Method
To simulate the adversarial strategies employed by real-world attackers, we propose a simple method named ROCKET (Real-wOrld attaCK based on hEurisTic rules) that can fulfill the actual adversarial goals. Our algorithm can be divided into two parts: heuristic perturbation rules and a black-box searching algorithm.
Perturbation Rules. To make our heuristic perturbation rules better simulate real-world attackers, we survey and summarize common perturbation rules from several sources, including (1) real adversarial user data (some cases are shown in Appendix D), (2) senior practitioners' experience, (3) papers in the NLP community (Jia and Liang, 2017; Ebrahimi et al., 2017), (4) reports of adversarial competitions, and (5) our intuition from the attackers' point of view. We filter the rules and retain only those that are common, computationally efficient, and easy to implement without any external knowledge (see Table 4). The big difference between ROCKET and previous methods (e.g., DeepWordBug) is its easy-to-implement property, which allows it to actually be employed by real-world attackers without any external knowledge.
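Since Table 4 is not reproduced in the text, the snippet below is only a hedged illustration of the kind of character-level rules ROCKET draws on (space insertion, adjacent-character swaps, character repetition); the actual rule set may differ.

```python
import random

# Illustrative heuristic rules in the spirit of rule-1..rule-5;
# the concrete rules in Table 4 of the paper may differ.
def insert_space(word, rng):          # e.g., "fucking" -> "fuc king"
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word))
    return word[:i] + " " + word[i:]

def swap_adjacent(word, rng):         # e.g., "toaster" -> "tosater"
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def repeat_char(word, rng):           # e.g., "place" -> "plaace"
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i] + word[i:]

RULES = [insert_space, swap_adjacent, repeat_char]

print(insert_space("fucking", random.Random(0)))
```

Each rule needs only the word itself, which is what makes this family of perturbations implementable by attackers without any external knowledge.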
We now specify how we find distracting words (rule-6). For each task, we first gather some realistic data and, by calculating word frequencies, obtain the words that occur relatively more often in samples with the attacker-specified label (e.g., non-spam in the spam detection task). Then we heuristically select distracting words that will not interfere with the original task. Finally, we add an appropriate number of selected words at the beginning or end of the original sentence, ensuring that the semantics of the sentence are not affected.
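The distracting-word mining described above can be sketched as a word-frequency-ratio computation; the tokenization, smoothing constant, and `top_k` cut-off are our assumptions.

```python
from collections import Counter

def distracting_words(benign_texts, malicious_texts, top_k=3):
    """Pick words that occur relatively more often in samples with the
    attacker-specified label (e.g., non-spam) -- a minimal sketch of how
    rule-6 candidates could be mined. Add-one smoothing avoids division
    by zero for words absent from the malicious samples."""
    benign = Counter(w for t in benign_texts for w in t.lower().split())
    malicious = Counter(w for t in malicious_texts for w in t.lower().split())
    ratio = {w: (benign[w] + 1) / (malicious[w] + 1) for w in benign}
    return [w for w, _ in sorted(ratio.items(), key=lambda kv: -kv[1])[:top_k]]

benign = ["peace and love", "peace to everyone", "have a nice day"]
spam = ["buy now cheap pills", "cheap cheap offer"]
print(distracting_words(benign, spam))
```

Selected words can then be appended to the beginning or end of the original sentence, as in the `peace peace peace` suffix of the Table 2 example.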
Searching Algorithm. We need to heuristically apply perturbation rules to search for adversarial samples in the black-box setting, because only the victim models' decisions are available. We first apply rule-6 to the original sentence and filter stop words to get the semantic word list L of the modified sentence. We then repeat the word perturbation process until the victim model is fooled. Specifically, one iteration of the word perturbation process starts by sampling a batch of words w from L. For each word in w, we repeatedly sample actions r from rule-1 to rule-5 and query the victim model, until a threshold is reached or the attack succeeds. Then w is removed from L.

(3) Attack efficiency (Query) is defined as the average number of queries to the victim models when crafting adversarial samples. (4) Perturbation degree is measured by Levenshtein distance. (5) Quality is measured by the relative increase in perplexity and the absolute increase in grammar errors when crafting adversarial samples.
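The searching algorithm described above can be sketched as follows; `perturb_word` stands in for sampling among rule-1 to rule-5, rule-6 is assumed to have been applied beforehand, and the batch size, stop-word list, and per-word query threshold are illustrative.

```python
import random

def rocket_search(sentence, is_rejected, perturb_word,
                  stop_words=frozenset({"the", "a", "i", "was"}),
                  batch_size=2, per_word_tries=5, seed=0):
    """Decision-based search sketch: keep perturbing sampled content
    words until the detector no longer rejects the sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    # semantic word list L: indices of non-stop-words
    L = [i for i, w in enumerate(words) if w.lower() not in stop_words]
    while L and is_rejected(" ".join(words)):
        batch = rng.sample(L, min(batch_size, len(L)))
        for i in batch:
            for _ in range(per_word_tries):      # per-word query threshold
                words[i] = perturb_word(words[i], rng)
                if not is_rejected(" ".join(words)):
                    return " ".join(words)       # attack succeeded
        for i in batch:                          # remove w from L
            L.remove(i)
    return " ".join(words)

# Toy detector rejects any sentence containing the exact token "spam".
adv = rocket_search(
    "this is spam indeed",
    is_rejected=lambda s: "spam" in s.split(),
    perturb_word=lambda w, rng: w[:1] + " " + w[1:] if len(w) > 2 else w,
)
print(adv)
```

Because every query asks only for a decision, the same loop works against any black-box filter, which is the setting we argue SoadNLP research should target.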

Experimental Results
The experimental details can be found in Appendix F.
First Priority Metrics. We list the results of attack success rate and average query times in Table 5. Our findings are as follows:
• Considering all previous attack methods, we find that it is extremely hard to craft adversarial samples in some tasks (e.g., Misinformation, Spam). The attack performance of all methods drops compared to the results in the original papers. We attribute this to the tough decision-based attack setting and the distinct features of these security tasks (the victim model achieves high accuracy on all these datasets).
• Most previous methods are inefficient when launching adversarial attacks. Usually, they need to query the victim model hundreds of times to craft a successful adversarial sample.
• Our simple ROCKET shows overall superiority considering attack performance and attack efficiency on Advbench. To further demonstrate the efficiency of ROCKET, we restrict the maximum number of queries to the victim model and test the attack success rate on Amazon-LB, HSOL, and EDENCE. The results are shown in Figure 2. We conclude that ROCKET shows stronger attack performance when the number of queries is restricted, which is more consistent with real-world situations.
We also conduct a human evaluation of the validity of adversarial samples (see Table 6). The details of the human evaluation process are described in Appendix G. We conclude that character-level perturbations (e.g., DeepWordBug) can preserve adversarial meaning to the greatest extent possible, while strong word-level attacks (e.g., BERT-Attack) seriously destroy the original adversarial meaning, which we suspect is caused by very uncommon word substitutions (see Table 2). Besides, ROCKET achieves overall great validity compared to the baselines.
Note that ROCKET is designed to better simulate real-world adversarial attacks. The results on first priority metrics, along with its simple and easy-to-implement nature, show that this method has higher practical value. Thus, ROCKET can be treated as a simple baseline to facilitate future research in this direction.
Secondary Priority Metrics. We evaluate secondary priority metrics on the Disinformation, Toxic, and Sensitive tasks because successful adversarial samples on the other tasks are limited, which would result in inaccurate measures. We list the results in Table 9. Our findings are as follows:
• Considering all attack methods, previously overlooked character-level attacks (e.g., DeepWordBug) achieve great success in perturbation degree (Levenshtein distance) and grammaticality (∆I).
• While achieving superiority in first priority metrics, ROCKET adds more aggressive perturbations and breaks grammaticality more severely. However, as we argue, it is reasonable to trade off these secondary priority metrics for the first priority ones.
• Surprisingly, we find that ROCKET crafts more fluent adversarial samples according to the perplexity scores calculated by the language model. We suspect that the pretraining data that large language models fit on contains much informal text (e.g., Twitter), which may resemble the adversarial samples crafted by ROCKET.
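The Levenshtein distance used for the perturbation-degree metric is the standard string edit distance; a textbook dynamic-programming implementation:

```python
def levenshtein(a, b):
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))   # 3
print(levenshtein("fucking", "fuc king")) # 1
```

Character-level tricks like space insertion score very low on this metric (a single edit), while word substitutions such as "toaster" → "wassailer" incur much larger distances.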

Evaluation on the Defense Side
We give the details and results of experiments on the defense side in Appendix E. Table 10 shows that DeepWordBug and ROCKET consistently outperform word-level attack methods, indicating that adversarial defense methods still need to be improved to tackle real-world harmfulness.
5 Related Work

Adversarial Attack
Textual adversarial attack methods can be roughly categorized into character-level, word-level, and sentence-level perturbation methods.
Character-level attacks make small perturbations to words, including swapping, deleting, and inserting characters (Karpukhin et al., 2019; Gao et al., 2018; Ebrahimi et al., 2018). These perturbations are indeed the ones most employed by real-world attackers, because they are free of external knowledge and easy to implement. Word-level attacks can be modeled as a combinatorial optimization problem comprising finding substitution words and searching for adversarial samples. Previous work makes different choices in these two stages (Ren et al., 2019; Alzantot et al., 2018; Zang et al., 2020b; Li et al., 2020). These methods mostly rely on external knowledge bases and are inefficient, rendering them rare in reality. Sentence-level attacks paraphrase original sentences to transform the syntactic pattern (Iyyer et al., 2018), the text style (Qi et al., 2021), or the domain (Wang et al., 2020c). These methods rely on a paraphrasing model and are thus also unlikely to occur in reality.
There is also some work that does not fall into any of these categories, including multi-granularity attacks (Wang et al., 2020a; Chen et al., 2021b), token-level attacks (Yuan et al., 2021), and universal adversarial triggers (Wallace et al., 2019; Xu et al., 2022).

Security NLP
Research on security NLP is not only about adversarial attacks at inference time, but also includes several other topics that have broad and significant impact in this field, including privacy attacks (Shokri et al., 2017; Pan et al., 2020), backdoor learning (Kurita et al., 2020; Chen et al., 2021a; Cui et al., 2022), data poisoning attacks (Wallace et al., 2021; Marulli et al., 2021), outlier detection (Hendrycks et al., 2020; Arora et al., 2021), and so on. Our Advbench can also be employed by other research on security NLP to better reveal security issues and highlight practical significance.

Discussion
Research on Adversarial Attack. Note that we do not intend to discredit previous work in this paper. Most previous methods are very useful for the roles of adversarial samples other than the security role. For example, although synonym substitution-based methods may not actually be employed by real-world attackers (Ren et al., 2019; Zang et al., 2020b; Li et al., 2020), the adversarial samples, if crafted properly, are very useful for evaluating models' robustness to out-of-distribution data, explaining models' behaviors, and adversarial training.
But from the perspective of separating the roles of adversarial samples, the research significance of adversarial attack methods that assume access only to the confidence scores of the victim models may be limited. When adversarial samples are employed to reveal security issues, attackers can only access the models' decisions. When adversarial samples are used for other purposes, their role is to help improve the models at hand. In this case, these methods should be granted access to the victim model's parameters (i.e., white-box attack), though some methods employ "behavioral testing" (black-box testing) even when access to model parameters is granted (Ribeiro et al., 2020; Goel et al., 2021). Here we only give our considerations on this problem. Future research and discussion should continue to refine the problem definition in this field.
Research on Adversarial Defense. Adversarial defense methods have two functions: making models more robust to out-of-distribution data, and resisting malicious adversarial attacks. We likewise recommend that researchers study these two functions separately. For improving models' out-of-distribution robustness, existing work has made many good attempts (Si et al., 2021; Wang et al., 2021c). However, the impact of existing work on real-world adversarial concerns may be limited, because it mostly considers synonym substitution-based attacks that may be less practical in reality (Wang et al., 2021b; Zhou et al., 2021). Thus, we recommend that future research on adversarial defense on the security side consider attack methods that are actually employed by real-world attackers, like the simple ROCKET proposed in this paper.
Research on Security NLP. We also conduct a pilot survey of research in the security community. We find that there exists a research gap between the NLP and security communities on security research topics. While the NLP community puts more emphasis on methods' novelty, work in the security community usually revolves around actual security scenarios (Liao et al., 2016; Yuan et al., 2018; Wang et al., 2020b). Both directions are significant and impactful, but more accurate claims are needed. We recommend that future research on adversarial NLP state clearly what actual goal it aims to achieve (e.g., revealing security concerns or evaluating models' robustness) and develop methods under a reasonable problem definition.

Conclusion
In this paper, we rethink the research paradigm in SoadNLP. We identify two major deficiencies in previous work and propose our refinements. Specifically, we propose a security datasets collection, Advbench. We then reconsider the actual adversarial goals and reformalize the task. Next, we propose a simple method, summarized from different sources, that fulfills real-world attackers' goals. We conduct comprehensive experiments on Advbench on both the attack and the defense sides. Experimental results show the superiority of our method considering the first priority adversarial goals. The overall experimental results indicate that the current research paradigm in SoadNLP may need to be adjusted to better cope with real-world adversarial challenges.
In the future, we will reconsider and discuss other roles of textual adversarial samples to make this whole story complete.

Ethical Consideration
In this section, we discuss the potential wider implications and ethical considerations of this paper.
Intended Use. In this paper, we construct a security benchmark and propose a simple method that can effectively attack real-world SOTA models. Our motivation is to better simulate real-world adversarial attacks and reveal practical concerns. This simple method can serve as a simple baseline to facilitate future research on both the attack and the defense sides. Future work can start from our benchmark and propose methods to address real-world security issues.
Broad Impact. We rethink the research paradigm in adversarial NLP from the perspective of separating the different roles of adversarial samples. Specifically, in this paper, we focus on the security role of adversarial samples and identify two major deficiencies in previous work. For each deficiency, we make some refinements to previous practices. In general, our work makes the problem definition in this direction more standardized and better simulates real-world attack situations.
Energy Saving.We describe our experimental details in Appendix F to prevent people from making unnecessary hyper-parameter adjustments and to help researchers quickly reproduce our results.

Limitation
In experiments, we employ BERT-base as the testbed and evaluate existing textual adversarial attack methods and our proposed ROCKET on our constructed benchmark datasets. We only consider one victim model in our experiments because our benchmark includes up to ten datasets and our computing resources are limited. Thus, more comprehensive experiments spanning different model architectures and training paradigms are left for future work.

A Survey on Previous Work
We conduct a survey on previous adversarial attack methods about the specific tasks and datasets they employ in their evaluation.The results are listed in Table 7.

B Task Description
The task statistics are listed in Table 8. We give the task descriptions and our motivations for choosing these tasks below.

B.1 Misinformation
Words in news media and political discourse have considerable power in shaping people's beliefs and opinions. As a result, their truthfulness is often compromised to maximize impact on society (Zhang and Ghorbani, 2020; Zhou et al., 2019a; Fonseca et al., 2016). We generally regard misinformation as fake information caused by objective factors such as misdeclarations, misdescriptions, or misuse of terminology. This task is to detect misinformation that contains deceptive or unverified information, including rumors, misreported news, and satirical news.

B.2 Disinformation
In addition to misinformation caused by objective reasons, there is also a type of fake information caused by subjectively distorting facts. This type of information mainly concentrates in online comments and reviews on online shopping malls and restaurant/hotel reservation websites, to lure customers into consumption (Mukherjee et al., 2013; Sun et al., 2016; Patel and Patel, 2018). We define this task as disinformation detection. In general, the task is dedicated to identifying deliberate fabrications of facts, including (1) artificial comments that invert black and white, and (2) generated nonexistent information.

B.3 Toxic
The rapid growth of information on social networks such as Facebook, Twitter, and blogs makes it challenging to monitor what is being published and spread on social media. Abusive comments are widespread on social networks, including cyberbullying, cyberterrorism, sexism, racism, and hate speech. Thus, the primary objective of toxic detection is to identify toxic content on the web, which is an essential ingredient of anti-bullying policies and the protection of individual rights on social media (Pereira-Kohatsu et al., 2019; Bosco et al., 2018; Watanabe et al., 2018; Risch and Krestel, 2020).

B.4 Spam
In recent years, unwanted commercial bulk emails have become a huge problem on the internet. Spam emails prevent users from making good use of their time. More importantly, some spam emails contain fraud and phishing messages that can also cause financial damage to users (Fonseca et al., 2016). The spam classification task is to detect spam, including scams, harassment, advertising, and promotion, in emails, SMS, and even chat messages to avoid unnecessary losses for users (Cormack et al., 2008).

B.5 Sensitive Information
Text documents shared with third parties or published publicly contain sensitive information by nature. Detecting sensitive information in unstructured data is crucial for preventing data leakage. This task is to detect sensitive information, including intellectual property and product progress from companies, trading and strategic information of public institutions and organizations, and private information of individuals (Berardi et al., 2015; Chow et al., 2008; Grechanik et al., 2014).

C Definition of Validity
In general, the validity metric measures the preservation of adversarial meaning in the crafted adversarial samples. The adversarial meaning is task-specific and should be considered separately for each task, so the definition of validity depends on the specific adversarial goal of the specific security task. In our Advbench, the adversarial meanings are exaggerated and satirical content (Misinformation), inauthentic and untrue comments (Disinformation), abusive language (Toxic), illegal or time-wasting messages (Spam), and sensitive information embedded in ordinary comments (Sensitive Information). The ultimate goal of attackers is thus to spread the adversarial meaning, no matter how many perturbations they introduce to other, unrelated content.

D Real-world Adversarial Attack
We present some real-world adversarial cases collected from social media in Figure 1. Although these cases are written in Chinese, the perturbation rules are general and widely applicable. Case-1, case-2, and case-5 employ character-level perturbations, including substitution, deletion, and insertion. Case-3 and case-4 employ the strategy of adding irrelevant and distracting words to the original sample. These samples can be easily comprehended by humans but easily fool the detection system. We employ these strategies in our simple method to simulate real-world adversarial attacks.
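The two perturbation families observed in these cases can be sketched in a few lines. This is a minimal illustration, not the paper's actual attack implementation; the function names and the random perturbation policy are our own assumptions.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def char_perturb(word, rng):
    """Apply one random character-level perturbation to a word:
    substitution, deletion, or insertion (as in cases 1, 2, and 5)."""
    i = rng.randrange(len(word))
    op = rng.choice(["sub", "del", "ins"])
    if op == "sub":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    if op == "del":
        return word[:i] + word[i + 1:]
    return word[:i] + rng.choice(ALPHABET) + word[i:]

def add_distractors(sentence, distractor, n=3):
    """Pad the sentence with n copies of an irrelevant distracting
    word at both ends (as in cases 3 and 4)."""
    pad = " ".join([distractor] * n)
    return f"{pad} {sentence} {pad}"
```

Both transformations keep the sentence readable for humans while shifting the token statistics the detector relies on.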

E.1 Attack Efficiency
Figure 2 shows the results of the attack success rate under the restriction of maximum query times.

E.2 Secondary Priority Metrics
We list the results of secondary priority metrics in Table 9.

E.3 Evaluation on the Defense Side
The results are shown in Table 10. We employ the SOTA defense method proposed in the NLP community (Mozes et al., 2021). This method identifies adversarial word substitutions by the frequency difference between a word in the input and its candidate substitution. The frequency distribution of words is obtained on the training set, and the detector is tuned on the validation set. The detector can then be employed to identify and restore adversarial samples at inference time.
For each attack method, we input N adversarial samples (which successfully attack the model) into the trained detector to obtain the number of samples detected as adversarial (n_det) and the number of samples successfully restored (n_res). The detection rate (R_det) and the restore rate (R_res) are then calculated as R_det = n_det / N and R_res = n_res / N.
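The detection rate and restore rate are simple ratios over the N adversarial samples fed to the detector; a minimal sketch (the function name is ours):

```python
def defense_rates(N, n_det, n_res):
    """Detection rate R_det and restore rate R_res over N
    successful adversarial samples fed to the detector."""
    return n_det / N, n_res / N

# e.g. 100 adversarial samples, 60 flagged, 35 correctly restored
r_det, r_res = defense_rates(100, 60, 35)  # (0.6, 0.35)
```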

F Experimental Details
For the sake of calculation speed and fairness, we truncate all sentences to the first 480 words. We then empirically set the hyper-parameters, including the distracting words, the number of distracting words inserted at the beginning and end of sentences, the perturbation batch size, and the number of perturbation epochs, according to the attack performance and the preservation of adversarial meaning. We only attack the original content in sentences, leaving out the adversarial content introduced by our perturbations. The comprehensive hyper-parameter settings are shown in Table 11. Here we give our intuition for choosing the distracting words for each task. For misinformation detection, we find that newspaper names often appear at the beginning or end of news articles, so we insert a few occurrences of "Reuters" before and after the sentence without affecting the validity of the main content. For disinformation detection, adding several encouraging words such as "up" does not affect the judgment of the authenticity of the comments, so we use a number of "up" tokens as distractors. For toxic detection, we need friendly and harmonious words to fool detectors, so we insert many occurrences of "peace" into the sentences. For spam detection, we find that ">" sometimes appears in emails to separate quoted text, so we use a large number of them as inserted words, which does not affect the nature of the original sentence. For sensitive information detection, we employ "any", as we find that samples classified as non-sensitive often contain adverbs at the end.
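The truncation and per-task distractor choices above can be summarized as a small configuration table plus a preprocessing step. This is a hypothetical re-implementation for illustration only; the actual hyper-parameters (insertion counts, batch size, epochs) are those in Table 11.

```python
# Per-task distracting words, mirroring the intuitions described above.
DISTRACTORS = {
    "misinformation": "Reuters",
    "disinformation": "up",
    "toxic": "peace",
    "spam": ">",
    "sensitive": "any",
}

def prepare_attack_input(text, task, n_insert=5, max_words=480):
    """Truncate to the first max_words words, then insert the
    task-specific distracting word at both ends of the sentence."""
    words = text.split()[:max_words]
    pad = [DISTRACTORS[task]] * n_insert
    return " ".join(pad + words + pad)
```

Because the distractors are appended outside the original content, the adversarial meaning of the core sentence is left untouched.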

G Human Evaluation Details
We set up a human evaluation to further assess the validity of adversarial samples. We choose the disinformation and toxic detection tasks because their validity definitions are clear and can be easily understood by annotators. For each task, we consider 2 corresponding datasets and sample 100 original-adversarial sample pairs for each attack method. For each pair, we ask 3 human annotators to evaluate whether the adversarial meaning is preserved in the adversarially crafted sample (validity). They give a validity score from 0 to 2 for each pair, where 2 means that the adversarial meaning is perfectly preserved, 1 means that the sentence meaning is ambiguous but may still preserve some adversarial meaning, and 0 means that the crafted adversarial sample does not preserve any adversarial meaning of the original sample. We use a voting strategy to produce the validity annotation for each adversarial sample, and then average the scores over all 100 samples in each task as the final validity score for each attack method. The results are shown in Table 6.
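The vote-then-average aggregation can be sketched as follows. The text specifies only a voting strategy, so the tie-break when all three annotators disagree (we fall back to the median score) is our assumption.

```python
from collections import Counter
from statistics import median

def aggregate_validity(score_triples):
    """score_triples: one (a, b, c) tuple of annotator scores in {0, 1, 2}
    per original-adversarial pair. Majority vote per pair (median as an
    assumed tie-break when all three disagree), then average over pairs."""
    per_pair = []
    for triple in score_triples:
        top_score, top_count = Counter(triple).most_common(1)[0]
        per_pair.append(top_score if top_count > 1 else median(triple))
    return sum(per_pair) / len(per_pair)
```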

Figure 2 :
Figure 2: Attack success rate under the restriction of maximum query times.

Table 3 :
The priority of adversarial goals and corresponding evaluation metrics.

Table 5 :
Results on the first-priority metrics considering the attack performance and the attack efficiency. We choose BERT-base (Devlin et al., 2019) as the victim model and evaluate attack methods on our Advbench.

Table 6 :
The validity scores. The upper bound is 2, which means that all selected adversarial samples preserve the adversarial meaning.

Table 7 :
Survey on previous work. SA stands for sentiment analysis; NC stands for news classification. Adversarial-oriented tasks and datasets are highlighted in red.

Table 8 :
Dataset statistics. The ratio refers to the proportion of fake/hate/spam/sensitive samples in the corresponding datasets.

Table 10 :
Results on the defense side.

Table 11 :
Hyper-parameters of ROCKET on each task.

Figure 1: Real-world cases of adversarial attacks. Adversarially modified content is highlighted in red.