OpenAttack: An Open-source Textual Adversarial Attack Toolkit

Textual adversarial attacking has received wide and increasing attention in recent years. Various attack models have been proposed, which are enormously distinct and implemented with different programming frameworks and settings. These facts hinder quick utilization and fair comparison of attack models. In this paper, we present an open-source textual adversarial attack toolkit named OpenAttack to solve these issues. Compared with other existing textual adversarial attack toolkits, OpenAttack has unique strengths in its support for all attack types, multilinguality, and parallel processing. Currently, OpenAttack includes 15 typical attack models that cover all attack types. Its highly inclusive modular design not only supports quick utilization of existing attack models, but also enables great flexibility and extensibility. OpenAttack has broad uses including comparing and evaluating attack models, measuring the robustness of a model, assisting in developing new attack models, and adversarial training. Source code and documentation can be obtained at https://github.com/thunlp/OpenAttack.


Introduction
Deep neural networks (DNNs) have been found to be susceptible to adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015). The attacker uses adversarial examples, which are maliciously crafted by imposing small perturbations on the original input, to fool the victim model. With the wide application of DNNs to practical systems, accompanied by growing concern about their security, research on adversarial attacking has become increasingly important. Moreover, adversarial attacks are also helpful for improving the robustness and interpretability of DNNs (Wallace et al., 2019a).

(* Indicates equal contribution. † Work done during internship at Tsinghua University. ‡ Corresponding author. Email: liuzy@tsinghua.edu.cn)
In the field of natural language processing (NLP), diverse adversarial attack models have been proposed. These models vary in their accessibility to the victim model (ranging from full knowledge to total ignorance) and perturbation level (character-, word- or sentence-level). In addition, they were originally proposed to attack different victim models on different NLP tasks under different evaluation protocols.
This immense diversity causes serious difficulty for fair and apt comparison between different attack models, which is unfavourable to the development of textual adversarial attacking. Further, although most attack models are open-source, they use different programming frameworks and settings, which leads to unnecessary time and effort when implementing them.
To tackle these challenges, a textual adversarial attacking toolkit named TextAttack (Morris et al., 2020) has been developed. It implements several textual adversarial attack models under a unified framework and provides interfaces for utilizing existing attack models or designing new attack models. So far, TextAttack has attracted considerable attention and facilitated the birth of new attack models such as BAE (Garg and Ramakrishnan, 2020).
In this paper, we present OpenAttack, which is also an open-source toolkit for textual adversarial attacking. Similar to TextAttack, OpenAttack adopts a modular design to assemble various attack models, in order to enable quick implementation of existing or new attack models. But OpenAttack is different from and complementary to TextAttack mainly in the following three aspects: (1) Support for all attacks. TextAttack utilizes a relatively rigorous framework to unify different attack models. However, this framework is naturally not suitable for sentence-level adversarial attacks, an important and typical kind of textual adversarial attack. Thus, no sentence-level attack models are included in TextAttack. In contrast, OpenAttack adopts a more flexible framework that supports all types of attacks, including sentence-level attacks.

Model | Accessibility | Perturbation | Main Method
--- | --- | --- | ---
SEA (Ribeiro et al., 2018) | Decision | Sentence | Rule-based paraphrasing
SCPN (Iyyer et al., 2018) | Blind | Sentence | Paraphrasing
GAN (Zhao et al., 2018) | Decision | Sentence | Text generation by encoder-decoder
TextFooler | Score | Word | Greedy word substitution
PWWS (Ren et al., 2019) | Score | Word | Greedy word substitution
Genetic (Alzantot et al., 2018) | Score | Word | Genetic algorithm-based word substitution
SememePSO (Zang et al., 2020) | Score | Word | Particle swarm optimization-based word substitution
BERT-ATTACK | Score | Word | Greedy contextualized word substitution
BAE (Garg and Ramakrishnan, 2020) | Score | Word | Greedy contextualized word substitution and insertion
FD (Papernot et al., 2016b) | Gradient | Word | Gradient-based word substitution
TextBugger | Gradient, Score | Word+Char | Greedy word substitution and character manipulation
UAT (Wallace et al., 2019a) | Gradient | Word, Char | Gradient-based word or character manipulation
HotFlip (Ebrahimi et al., 2018) | Gradient | Word, Char | Gradient-based word or character substitution
VIPER (Eger et al., 2019) | Blind | Char | Visually similar character substitution
DeepWordBug (Gao et al., 2018) | Score | Char | Greedy character manipulation

Table 1: Textual adversarial attack models involved in OpenAttack, among which the three sentence-level models SEA, SCPN and GAN, together with FD, UAT and VIPER, are not included in TextAttack for now. "Accessibility" is the accessibility to the victim model, and "Perturbation" refers to the perturbation level. "Sentence", "Word" and "Char" denote sentence-, word- and character-level perturbations. In the Accessibility and Perturbation columns, "A, B" means that the attack model supports both A and B, while "A+B" means that the attack model conducts A and B simultaneously.
(2) Multilinguality. TextAttack only covers English textual attacks, while OpenAttack currently supports both English and Chinese, and its extensible design enables quick support for more languages.
(3) Parallel processing. Running some attack models may be very time-consuming; for example, it takes over 100 seconds to attack one instance with the SememePSO attack model (Zang et al., 2020). To address this issue, OpenAttack additionally provides support for multi-process running of attack models to improve attack efficiency.
Moreover, OpenAttack is fully integrated with HuggingFace's transformers (https://github.com/huggingface/transformers) and datasets (https://github.com/huggingface/datasets) libraries, which allows convenient adversarial attacks against thousands of NLP models (especially pre-trained models) on diverse datasets. OpenAttack also has great extensibility. It can easily be used to attack any customized victim model, regardless of the programming framework used (PyTorch, TensorFlow, Keras, etc.), on any customized dataset.
OpenAttack can be used to (1) provide various handy baselines for attack models; (2) comprehensively evaluate attack models using its thorough evaluation metrics; (3) assist in quick development of new attack models; (4) evaluate the robustness of an NLP model against various adversarial attacks; and (5) conduct adversarial training (Goodfellow et al., 2015) to improve model robustness by enriching the training data with generated adversarial examples.
Recent years have witnessed the rapid development of adversarial attacks in computer vision (Akhtar and Mian, 2018), which has been promoted by many visual attack toolkits such as CleverHans (Papernot et al., 2018), Foolbox (Rauber et al., 2017), AdvBox (Goodman et al., 2020), etc. We hope OpenAttack, together with TextAttack and other similar toolkits, can play a constructive role in the development of textual adversarial attacks.

Formalization and Categorization of Textual Adversarial Attacking
We first formalize the task of textual adversarial attacking for text classification; the formalization can be trivially adapted to other NLP tasks. For a given text sequence x that is correctly classified as its ground-truth label y by the victim model F, the attack model A is supposed to transform x into x̂ by small perturbations, such that the ground-truth label of x̂ is still y but the classification result given by F is ŷ ≠ y. Next, we introduce the categorization of textual adversarial attack models from three perspectives.
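In compact form, the attack objective can be sketched as follows (the distance d and bound ε are illustrative notation we introduce for the "small perturbations" constraint, not symbols from the definition above):

```latex
% Given a victim model F with F(x) = y (the ground-truth label),
% the attack model A seeks
\hat{x} = A(x), \qquad \hat{y} = F(\hat{x}) \neq y,
\qquad \text{s.t.} \quad d(x, \hat{x}) \leq \epsilon,
% while the ground-truth label of \hat{x} remains y.
```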
According to the attack model's accessibility to the victim model, existing attack models can be categorized into four classes, namely gradient-based, score-based, decision-based and blind models. First, gradient-based attack models are also called white-box attack models, which require full knowledge of the victim model to conduct gradient updates. Most of them are inspired by the fast gradient sign method (Goodfellow et al., 2015) and the forward derivative method (Papernot et al., 2016a) in visual adversarial attacking.
In contrast to white-box attack models, black-box models do not need complete information about the victim model, and can be subcategorized into score-based, decision-based and blind models. Blind models have no knowledge of the victim model at all. Score-based models require the prediction scores (e.g., classification probabilities) of the victim model, while decision-based models only need its final decision (e.g., the predicted class).
According to the level of perturbations imposed on the original input, textual adversarial attack models can be classified into sentence-level, word-level and character-level models. Sentence-level attack models craft adversarial examples mainly by adding distracting sentences (Jia and Liang, 2017), paraphrasing (Iyyer et al., 2018; Ribeiro et al., 2018) or text generation by encoder-decoder (Zhao et al., 2018). Word-level attack models mostly conduct word substitution, namely substituting some words in the original input with semantically identical or similar words such as synonyms (Ren et al., 2019; Alzantot et al., 2018). Some word-level attack models also use operations including deleting and adding words (Zhang et al., 2019; Garg and Ramakrishnan, 2020). Character-level attack models usually carry out character manipulations including swap, substitution, deletion, insertion and repetition (Eger et al., 2019; Ebrahimi et al., 2018; Belinkov and Bisk, 2018).
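As a minimal illustration of the score-based, word-level substitution family described above, the following sketch greedily swaps words from a synonym table until a victim's positive-class score drops below a threshold. The victim, synonym table and threshold here are toy stand-ins for illustration only, not part of OpenAttack:

```python
# Toy synonym table; real attacks use WordNet, embeddings or sememes.
SYNONYMS = {"good": ["fine", "great"], "movie": ["film", "picture"]}

def toy_victim_score(words):
    # Stand-in for P(positive); here the word "good" drives the prediction.
    return 0.9 if "good" in words else 0.2

def greedy_word_substitution(sentence, threshold=0.5):
    """Greedily substitute words (assuming the original is classified
    positive) until the victim's score falls below the threshold."""
    words = sentence.split()
    for i, w in enumerate(words):
        for sub in SYNONYMS.get(w, []):
            candidate = words[:i] + [sub] + words[i + 1:]
            if toy_victim_score(candidate) < threshold:
                return " ".join(candidate)  # successful adversarial example
    return None  # attack failed
```

For example, `greedy_word_substitution("a good movie")` flips the toy victim by replacing "good" with "fine".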
Finally, adversarial attack models can also be categorized into targeted and untargeted models based on whether the wrong classification result given by the victim model (ŷ) is pre-specified (mainly for multi-class classification). Most existing attack models support (with minor adjustments) both targeted and untargeted attacks, and we give no particular attention to this attribute in this paper. Currently, OpenAttack includes 15 different attack models, which cover all the victim model accessibility and perturbation level types. Table 1 lists the attack models involved in OpenAttack.

Toolkit Design and Architecture
In this section, we describe the design philosophy and modular architecture of OpenAttack.
We extract and properly reorganize the commonly used components of different attack models, so that any attack model can be assembled from them. Considering the significant distinctions among attack models, especially those between sentence-level and word/char-level models, it is hard to embrace all attack models within a unified framework like TextAttack's. Therefore, we leave considerable freedom in the skeleton design of attack models, and focus more on streamlining the general processing of adversarial attacking and providing common components used in attack models. Next we introduce the modules of OpenAttack one by one; Figure 1 illustrates an overview of all the modules.
• TextProcessor. This module is aimed at processing the original input to assist attack models in generating adversarial examples. It consists of several functions used for tokenization, lemmatization, delemmatization, word sense disambiguation (WSD), named entity recognition (NER) and dependency parsing. Currently it supports English and Chinese, and support for other languages can be realized simply by extending the TextProcessor base class.
• Victim. This module wraps the victim model. It supports both neural network-based models implemented with any programming framework (especially HuggingFace's transformers) and traditional machine learning models. It mainly comprises three functions that obtain, respectively, the gradient with respect to the input, the prediction scores, and the predicted class of a victim model.
• Attacker. This is the core module of OpenAttack. It comprises various attack models and can generate adversarial examples for given input against a victim model.
• AttackAssist. This is an assistant module of Attacker. It mainly packs different word and character substitution methods that are widely used in word- and character-level attack models. Attacker queries this module to get substitutions for a word or character. It currently includes word embedding-based (Alzantot et al., 2018), synonym-based (Ren et al., 2019) and sememe-based (Zang et al., 2020) word substitution methods, and a visual character substitution method (Eger et al., 2019). In addition, some useful components used in sentence-level attack models are also included, such as paraphrasing based on back-translation.
• Metric. This module provides several adversarial example quality metrics, which can serve either as constraints on the adversarial examples during attacking or as evaluation metrics for adversarial attacks. It currently includes the following metrics: (1) language model prediction score for a given word in a context, given by the Google one-billion-words language model (Jozefowicz et al., 2016) (this metric can be used only as a constraint on adversarial examples); (2) word modification rate, the percentage of modified words in an adversarial example compared with the original example; (3) formal similarity between the adversarial example and the original example, measured by Levenshtein edit distance (Levenshtein, 1966), character- and word-level Jaccard similarity (Jaccard, 1912) and BLEU score (Papineni et al., 2002); and (4) semantic similarity between the adversarial example and the original example, measured by Universal Sentence Encoder (Cer et al., 2018). In Table 2, "Higher" and "Lower" mean that the higher/lower the metric is, the better an attack model performs.
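Two of the quality metrics above, word modification rate and Levenshtein edit distance, are simple enough to sketch in pure Python. These are illustrative reimplementations, not OpenAttack's internal code:

```python
def word_modification_rate(orig, adv):
    """Fraction of word positions changed (assumes equal length,
    as in substitution-only attacks)."""
    o, a = orig.split(), adv.split()
    assert len(o) == len(a)
    return sum(x != y for x, y in zip(o, a)) / len(o)

def levenshtein(s, t):
    """Character-level edit distance via one-row dynamic programming."""
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal cell
        for j, ct in enumerate(t, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (cs != ct))
    return dp[len(t)]
```

For instance, `word_modification_rate("a good movie", "a fine movie")` is 1/3, and `levenshtein("kitten", "sitting")` is 3.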
• AttackEval. This module is used to evaluate textual adversarial attacks from different perspectives, including attack effectiveness, adversarial example quality and attack efficiency: (1) the attack effectiveness metric is the attack success rate, i.e., the percentage of attacks that successfully fool the victim model; (2) adversarial example quality is measured by the last five metrics in the Metric module; and (3) attack efficiency has two metrics: the average number of victim model queries and the average running time for attacking one instance. Table 2 lists all the evaluation metrics in OpenAttack.
Multi-process running is realized in this module with the help of the Python multiprocessing library. In addition, this module can also visualize and save attack results, e.g., displaying the original input and adversarial examples and highlighting their differences.
• DataManager. This module manages all the data as well as saved models that are used in other modules. It supports accessing and downloading data/models. Specifically, it deals with the data used in the AttackAssist module such as character embeddings, word embeddings and WordNet synonyms, the models used in the TextProcessor module such as NER model and dependency parser, the built-in trained victim models, and auxiliary models used in Attacker such as the paraphrasing model for the paraphrasing-based attack models. This module helps efficiently and handily utilize data.

Toolkit Usage
OpenAttack provides a set of easy-to-use interfaces that can meet almost all the needs in textual adversarial attacking, such as preprocessing text, generating adversarial examples to attack a victim model and evaluating attack models. Moreover, OpenAttack has great flexibility and extensibility and supports easy customization of victim models and attack models. Next, we showcase some basic usages of OpenAttack.

Built-in Attack and Evaluation
OpenAttack builds in some commonly used NLP models such as LSTM (Hochreiter and Schmidhuber, 1997) and BERT (Devlin et al., 2019) that have been trained on commonly used NLP datasets. Users can use the built-in victim models to quickly conduct adversarial attacks. The following code snippet shows how to use Genetic (Alzantot et al., 2018) to attack BERT on the test set of SST-2 (Socher et al., 2013) with 4-process parallel running:
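A sketch of such a snippet follows. The identifiers used here (DataManager.loadVictim, the "BERT.SST" victim key, GeneticAttacker, AttackEval, the num_workers argument, and the dataset_mapping helper) reflect our reading of the OpenAttack and HuggingFace datasets APIs and may differ across versions; the attack itself is wrapped in a function so that nothing is downloaded at import time:

```python
def dataset_mapping(x):
    # Rename HuggingFace SST fields to the "x"/"y" keys OpenAttack
    # expects, binarizing the sentiment label.
    return {"x": x["sentence"], "y": 1 if x["label"] > 0.5 else 0}

def attack_bert_on_sst(num_workers=4):
    # Heavy imports are kept inside the function so the sketch can be
    # read (and the helper above tested) without OpenAttack installed.
    import OpenAttack as oa
    import datasets

    victim = oa.DataManager.loadVictim("BERT.SST")  # built-in fine-tuned BERT
    attacker = oa.attackers.GeneticAttacker()       # Genetic (Alzantot et al., 2018)
    dataset = datasets.load_dataset("sst", split="test").map(dataset_mapping)
    attack_eval = oa.AttackEval(attacker, victim)
    # num_workers=4 runs the attack with 4-process parallelism.
    return attack_eval.eval(dataset, visualize=True, num_workers=num_workers)
```

Calling attack_bert_on_sst() downloads the victim model and dataset and prints per-instance attack results along with the summary metrics.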

Customized Victim Models
It is very common for users to launch attacks against their own models that have been trained on specific datasets, particularly when evaluating the robustness of a victim model. It is impossible to exhaustively build in all victim models. Thus, easy customization for victim models is very important.
OpenAttack provides simple and convenient interfaces for victim model customization. For a trained model implemented with any programming framework, users just need to implement, under the Victim class, a few model access interfaces that provide the accessibility required by the attack model. The following code snippet shows how to use Genetic to attack a customized sentiment analysis model, a statistical model in NLTK (Bird et al., 2009), on the test set of SST.
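A sketch of such a customized victim follows. The pure helper maps an NLTK VADER compound score to two-class probabilities; the Classifier base class with get_prob/get_pred methods reflects our reading of the OpenAttack interface and, like the concrete NLTK model choice, may differ from the paper's actual snippet:

```python
def vader_to_probs(compound):
    # Map an NLTK VADER compound score in [-1, 1] to [P(negative), P(positive)].
    pos = (compound + 1) / 2
    return [1 - pos, pos]

def build_nltk_victim():
    # Kept inside a function so the sketch is readable (and the helper
    # above testable) without OpenAttack or NLTK installed.
    import numpy as np
    import OpenAttack as oa
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    class NLTKSentiment(oa.Classifier):
        def __init__(self):
            self.vader = SentimentIntensityAnalyzer()

        def get_prob(self, input_):
            # Prediction scores: one [P(neg), P(pos)] row per input sentence.
            return np.array([
                vader_to_probs(self.vader.polarity_scores(sent)["compound"])
                for sent in input_
            ])

        def get_pred(self, input_):
            # Final decision: the argmax class.
            return self.get_prob(input_).argmax(axis=1)

    return NLTKSentiment()
```

An attack then proceeds exactly as with a built-in victim, e.g., passing build_nltk_victim() together with a GeneticAttacker instance to AttackEval.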

Evaluation
In this section, we conduct evaluations for all the attack models included in OpenAttack.
We use SST-2 as the evaluation dataset and choose BERT, specifically BERT-Base, as the victim model. After fine-tuning on the training set, BERT achieves 90.31% accuracy on the test set. Due to the great diversity of attack models, it is hard to impose many constraints on attacks, unlike previous work that focuses on a specific kind of attack; we only restrict the maximum number of victim model queries to 500. In addition, to improve evaluation efficiency, we randomly sample 1,000 correctly classified instances from the test set as the original input to be perturbed. We use the original default hyper-parameter settings of all attack models. Table 3 shows the evaluation results. By comparison with originally reported results, we confirm the correctness of our implementation. We also observe that multi-processing can effectively improve the attack efficiency of most attack models (the speedup is greater than 1). For some very efficient attack models whose average running time is quite short (like GAN), the additional time cost of multi-processing may reduce efficiency instead. (Example scripts are available at https://github.com/thunlp/OpenAttack/tree/master/examples.)

Related Work
There have been quite a few open-source libraries of generating adversarial examples for continuous data, especially images, such as CleverHans (Papernot et al., 2018), Foolbox (Rauber et al., 2017), Adversarial Robustness Toolbox (ART) (Nicolae et al., 2018) and AdvBox (Goodman et al., 2020). These libraries enable practitioners to easily make adversarial attacks with different methods and have greatly facilitated the development of adversarial attacking for continuous data.
As for discrete data, particularly text, few adversarial attack libraries exist. As far as we know, TextAttack (Morris et al., 2020) is the only such library. It utilizes a relatively rigorous framework to unify many attack models and provides interfaces for using existing attack models or designing new ones. As mentioned in the Introduction, our OpenAttack is mainly different from and complementary to TextAttack in all-attack-type support, multilinguality and parallel processing.
There are also some other toolkits concerned with textual adversarial attacking. TEAPOT (Michel et al., 2019) is an open-source toolkit to evaluate the effectiveness of textual adversarial examples from the perspective of preservation of meaning. It is mainly designed for attacks against sequence-to-sequence models, but can also be geared towards text classification models. AllenNLP Interpret (Wallace et al., 2019b) is a framework for explaining the predictions of NLP models, where adversarial attacking is one of its interpretation methods. It focuses on the interpretability of NLP models and only incorporates two attack models.

Conclusion and Future Work
In this paper, we present OpenAttack, an open-source textual adversarial attack toolkit that provides a wide range of functions for textual adversarial attacking. It is a great complement to existing counterparts because of its unique strengths in all-attack-type support, multilinguality and parallel processing. Moreover, it has great flexibility and extensibility and provides easy customization of victim models and attack models. In the future, we will keep OpenAttack updated to incorporate more up-to-date attack models and support more functions to facilitate research on textual adversarial attacks.

Broader Impact Statement
There is indeed a possibility that OpenAttack could be misused to maliciously attack some NLP systems. But we believe that we should face up to the potential risks of adversarial attacks rather than pretend not to notice them. As in the development of adversarial learning in computer vision, studies on adversarial attacks actually promote studies on adversarial defenses and model robustness. We hope more people in the NLP community come to recognize the robustness issue and that OpenAttack can play a constructive role.