SHIELD: Defending Textual Neural Networks against Multiple Black-Box Adversarial Attacks with Stochastic Multi-Expert Patcher

Even though several methods have been proposed to defend textual neural network (NN) models against black-box adversarial attacks, they often target a specific text perturbation strategy and/or require re-training the models from scratch. This leads to a lack of generalization in practice and redundant computation, particularly for state-of-the-art transformer models (e.g., BERT, RoBERTa) that demand substantial time and computational resources. Borrowing an idea from software engineering to address these limitations, we propose a novel algorithm, SHIELD, which modifies and re-trains only the last layer of a textual NN and thus "patches" and "transforms" it into a stochastic weighted ensemble of multi-expert prediction heads. Because most current black-box attacks rely on iterative search mechanisms to optimize their adversarial perturbations, SHIELD confuses the attackers by automatically utilizing different weighted ensembles of predictors depending on the input. In other words, SHIELD breaks a fundamental assumption of the attack: that a victim NN model remains constant during an attack. Through comprehensive experiments, we demonstrate that CNN, RNN, BERT, and RoBERTa-based textual NNs, once patched by SHIELD, exhibit a relative enhancement of 15%-70% in accuracy on average against 14 different black-box attacks, outperforming 6 defensive baselines across 3 public datasets. All code will be released.


Introduction
Adversarial Text Attack and Defense. After being trained to maximize prediction performance, textual NN models frequently become vulnerable to adversarial attacks (Papernot et al., 2016). In the NLP domain, adversaries generally perturb an input sentence such that its semantic meaning is preserved while the target NN model is led to output a desired prediction. Text perturbations are typically generated by replacing or inserting critical words (e.g., HotFlip (Ebrahimi et al., 2018), TextFooler (Jin et al., 2019)) or characters (e.g., DeepWordBug (Gao et al., 2018), TextBugger (Li et al., 2018)) in a sentence, or by manipulating a whole sentence (e.g., SCPNA (Iyyer et al., 2018), GAN-based (Zhao et al., 2018)).
Since many recent NLP models are known to be vulnerable to adversarial black-box attacks (e.g., fake news detection (Le et al., 2020; Zhou et al., 2019b), dialog systems (Cheng et al., 2019), and so on), robust defenses for textual NN models are required. Even though several papers have proposed to defend NNs against such attacks, they were designed for either a specific type of attack (e.g., word or synonym substitution (Wang et al., 2021; Dong et al., 2021; Mozes et al., 2020), misspellings (Pruthi et al., 2019)) or a specific attack level (e.g., character-level (Pruthi et al., 2019) or word-level (Le et al., 2021)). Even though there exist some general defensive methods, most of them enrich NN models by re-training them with adversarial data augmented via known attack strategies (Miyato et al., 2016; Pang et al., 2020) or with external information such as knowledge graphs (Li and Sethy, 2019).
However, these augmentations often induce substantial overhead in training or are still limited to only a small set of predefined attacks (e.g., (Zhou et al., 2019a)). Hence, we are in search of defense algorithms that directly enhance NN models' structures (e.g., (Li and Sethy, 2019)) while achieving higher generalization capability without the need to acquire additional data.

Figure 1: Motivation of SHIELD. An attacker optimizes a step objective function (score) to search for the best perturbation by iteratively replacing each of the original 5 tokens with a perturbed one. (A) The attacker assumes the model remains unchanged and (B) receives a coherent signal during the iterative search, resulting in the true best attack: "dirty"→"dirrty". (C) A model patched with SHIELD utilizes a weighted ensemble of 3 diverse heads depending on the input; therefore, the ensemble weights keep changing during the adversary's perturbation search (line width represents the ensemble weights). (D) SHIELD confuses the attacker with 3 varying distributions of the score, resulting in a sub-optimal attack "people"→"pe0ple".

Motivation (Fig. 1). Different from white-box attacks, black-box attacks do not have access to a target model's parameters, which are crucial for crafting effective attacks. Hence, attackers often query the target model repeatedly to acquire the information necessary for optimizing their strategy. From our analysis of 14 black-box attacks published during 2018-2020 (Table 1), all of them, except for SCPNA (Iyyer et al., 2018), rely on a search algorithm (e.g., greedy, genetic) that iteratively replaces each character/word in a sentence with a perturbation candidate in order to optimize which characters/words to attack and how they should be crafted (Fig. 1A). Even though this process is effective in terms of attack performance, it assumes that the model's parameters remain "unchanged" and that the model outputs "coherent" signals during the iterative search (Fig. 1A and 1B). Our key intuition is, therefore, to obfuscate the attackers by breaking this assumption. Specifically, we want to develop an algorithm that automatically utilizes a diverse set of models during inference. This could be done by training multiple sub-models instead of a single prediction model and randomly selecting one of them during inference to obfuscate the iterative search mechanism. However, this introduces impractical computational overhead during both training and inference, especially when one wants to maximize prediction accuracy by utilizing complex SOTA sub-models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b). Moreover, it does not guarantee that the trained models are sufficiently diverse to fool attackers. Furthermore, applying this strategy to existing NN models would require re-training everything from scratch, rendering the approach impractical.
Proposal. To address these challenges, we borrow ideas from software engineering, where bugs can be readily removed by an external installation patch. Specifically, we develop a novel neural patching algorithm, named SHIELD, which patches only the last layer of an already deployed textual NN model (e.g., CNN, RNN, transformers (Vaswani et al., 2017; Bahdanau et al.)) and transforms it into an ensemble of multiple experts or prediction heads (Fig. 1C).

Table 1: Different attack methods with (i) how they search for adversarial perturbations, (ii) their attack level, and (iii) whether they maintain the original semantics (Sem. Presv.), pursue the naturalness of the perturbed sentence (Natr. Presv.), or both.

During inference, SHIELD automatically utilizes a stochastic weighted ensemble of experts for prediction depending on the input. This obfuscates adversaries' perturbation search, making black-box attacks much more difficult regardless of attack type, e.g., character- or word-level attacks (Fig. 1C, D). By patching only the last layer of a model, SHIELD also introduces lightweight computational overhead and requires no additional training data. In summary, our contributions are as follows:
• We propose SHIELD, a novel neural patching algorithm that transforms an already-trained NN model into a stochastic ensemble of multi-experts with little computational overhead.
• We demonstrate the effectiveness of SHIELD: CNN, RNN, BERT, and RoBERTa-based textual models patched by SHIELD achieve a 15%-70% increase in robustness across 14 different black-box attacks, outperforming 6 defensive baselines on 3 public NLP datasets.
• To the best of our knowledge, this work includes by far the most comprehensive evaluation of defenses against black-box attacks.

The Proposed Method: SHIELD
We introduce the Stochastic Multi-Expert Neural Patcher (SHIELD), which patches only the last layer of an already trained NN model f(x, θ) and transforms it into an ensemble of multiple expert predictors with stochastic weights. These predictors are designed to be strategically selected with different weights during inference depending on the input. This is realized by two complementary modules, namely (i) a Stochastic Ensemble (SE) module that transforms f(·) into a randomized ensemble of different heads and (ii) a Multi-Expert (ME) module that uses Neural Architecture Search (NAS) to dynamically learn the optimal architecture of each head to promote their diversity.

A Stochastic Ensemble (SE) Module
This module extends the last layer of f(·), which is typically a fully-connected layer (followed by a softmax for classification), into an ensemble of K prediction heads, denoted H = {h_j(·)}_{j=1}^{K}. Each head h_j(·), parameterized by θ_{h_j}, is an expert predictor that is fed the feature representation learned by f(·) up to its second-to-last layer and outputs a prediction logit score:

$$\tilde{y}_j = h_j\big(f(x, \theta^*_{L-1});\, \theta_{h_j}\big), \qquad h_j : \mathbb{R}^{Q} \rightarrow \mathbb{R}^{M}, \qquad (1)$$

where θ*_{L-1} are the fixed parameters of f up to the last prediction layer, Q is the size of the feature representation of x generated by the base model f(x, θ*_{L-1}), and M is the number of labels. To aggregate the logit scores returned from all heads, a classical ensemble method would average them as the final prediction: ŷ* = (1/K) Σ_{j=1}^{K} ỹ_j. However, this simple aggregation assumes that each h_j(·) ∈ H learns from very similar training signals.
Hence, when θ*_{L-1} already encodes some of the task-dependent information, H will eventually converge not to a set of experts but to very similar predictors. To resolve this issue, we introduce stochasticity into the process by assigning the prediction heads stochastic weights during both training and inference. Specifically, we introduce a new aggregation mechanism:

$$\hat{y} = \sum_{j=1}^{K} (\alpha_j \cdot w_j)\, \tilde{y}_j, \qquad (2)$$

where w_j weights ỹ_j according to head j's expertise on the current input x, and α_j ∈ [0, 1] is a probabilistic scalar representing how much of the weight w_j should be accounted for. Let us denote w, α ∈ R^K as the vectors containing all scalars w_j and α_j, respectively, and ỹ ∈ R^{K×M} as the concatenation of all vectors ỹ_j returned by the heads. We calculate w and α as follows:

$$w = W^{\top}\big[\tilde{y} \oplus f(x, \theta^*_{L-1})\big] + b, \qquad (3)$$
$$\alpha = \text{Gumbel-Softmax}(w, g, \tau), \qquad (4)$$

where W ∈ R^{(K×M+Q)×K} and b ∈ R^K are trainable parameters, and g ∈ R^K is a noise vector sampled from the standard Gumbel distribution; thus, the probability vector α is sampled via a technique known as Gumbel-Softmax (Jang et al., 2016), controlled by the noise vector g and the temperature τ. Unlike the standard Softmax, the Gumbel-Softmax is able to learn a categorical distribution (over K heads) optimized for a downstream task (Jang et al., 2016). Annealing τ→0 encourages a pseudo one-hot vector (e.g., [0.94, 0.03, 0.01, 0.02] when K=4), which makes Eq. (2) a mixture of experts (Avnimelech and Intrator, 1999). Importantly, α is sampled in an inherently stochastic way depending on the Gumbel noise g. While W and b are learned to deterministically assign larger weights w to heads that are experts for each input x (Eq. (3)), α introduces stochasticity into the final logits. The multiplication α_j·w_j in Eq. (2) then enables us to use different sets of weighted ensemble models while still maintaining the ranking of the most important head. Thus, this further diversifies the learning of each expert and confuses attackers when they iteratively try different inputs to find good adversarial perturbations.
Finally, to train this module, we use Eq. (2) as the final prediction and train the whole module with the Negative Log-Likelihood (NLL) loss, following the objective:

$$\mathcal{L}_{SE} = \sum_{(x, y) \in D_{train}} \mathrm{NLL}(\hat{y}, y). \qquad (5)$$

Algorithm 1: Training SHIELD Algorithm.
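To make the SE module concrete, below is a minimal PyTorch sketch of the stochastic ensemble head described by Eqs. (1)-(4). The feature size, simple linear heads, and the use of torch.nn.functional.gumbel_softmax are our illustrative assumptions; this is a sketch, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticEnsembleHead(nn.Module):
    """Sketch of the SE module: K expert heads over a frozen feature extractor."""
    def __init__(self, feat_dim: int, num_labels: int, num_heads: int = 5, tau: float = 0.1):
        super().__init__()
        self.tau = tau
        # K expert prediction heads (simple linear heads here for illustration).
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_labels) for _ in range(num_heads)]
        )
        # W, b in Eq. (3): maps [all head logits ; features] -> K head weights.
        self.head_scorer = nn.Linear(num_heads * num_labels + feat_dim, num_heads)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, feat_dim), i.e., f(x, theta*_{L-1}) from the frozen base model.
        logits = torch.stack([h(features) for h in self.heads], dim=1)   # (batch, K, M)
        flat = logits.flatten(start_dim=1)                               # (batch, K*M)
        w = self.head_scorer(torch.cat([flat, features], dim=-1))        # Eq. (3): (batch, K)
        # Eq. (4): stochastic, near one-hot weights via Gumbel-Softmax.
        alpha = F.gumbel_softmax(w, tau=self.tau, hard=False)            # (batch, K)
        # Eq. (2): weighted combination of expert logits.
        return ((alpha * w).unsqueeze(-1) * logits).sum(dim=1)           # (batch, M)
```

In a training loop, the returned logits would simply be fed to a standard NLL/cross-entropy loss as in Eq. (5).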

A Multi-Expert (ME) Module
While the SE module facilitates a stochastic weighted ensemble among heads, the ME module searches for the optimal architecture of each head to maximize the diversity in how they make predictions. To do this, we utilize the DARTS algorithm (Liu et al., 2019a) as follows. Let us denote O_j = {o_{j,t}(·)}_{t=1}^{T}, where T is the number of possible architectures that can be selected for h_j ∈ H. We want to learn a one-hot encoded selection vector β_j ∈ R^T that assigns h_j(·) ← o_{j, argmax(β_j)}(·) during prediction. Since the argmax(·) operation is not differentiable, during training we relax the categorical assignment of the architecture for h_j(·) ∈ H to a softmax over all possible networks in O_j:

$$h_j(\cdot) = \sum_{t=1}^{T} \frac{\exp(\beta_{j,t})}{\sum_{t'=1}^{T} \exp(\beta_{j,t'})}\, o_{j,t}(\cdot). \qquad (6)$$

However, the original DARTS algorithm only optimizes prediction performance. In our case, we also want to promote diversity among the heads.
To do this, we force each h_j(·) to specialize in different features of an input, i.e., in how it makes predictions. This can be achieved by maximizing the difference among the gradients of the word embedding e_i of input x_i w.r.t. the outputs of each h_j(·) ∈ H. Hence, given a fixed set of parameters θ_O of all possible networks for all heads, we train the selection vectors {β_j}_{j=1}^{K} by optimizing the objective:

$$\min_{\{\beta_j\}} \sum_{p \neq q} d\big(\nabla_{e_i} J_p,\, \nabla_{e_i} J_q\big), \qquad (7)$$

where d(·) is the cosine-similarity function and J_j is the NLL loss computed as if we only used a single prediction head h_j. In this module, however, not only do we want to maximize the differences among the gradient vectors, but we also want to ensure that the selected architectures eventually converge to good prediction performance. Therefore, we train the whole ME module with the following objective:

$$\mathcal{L}_{ME} = \sum_{j=1}^{K} J_j + \gamma \sum_{p \neq q} d\big(\nabla_{e_i} J_p,\, \nabla_{e_i} J_q\big). \qquad (8)$$
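A minimal sketch of the gradient-diversity term in Eq. (7) is shown below; it sums pairwise cosine similarities between the per-head loss gradients w.r.t. the word embeddings. The helper name and the use of torch.autograd.grad are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def gradient_diversity_loss(per_head_losses, embeddings):
    """Eq. (7) sketch: sum of pairwise cosine similarities between the gradients of
    each head's NLL loss J_j w.r.t. the word embeddings. Minimizing this term
    encourages heads to rely on different input features."""
    grads = []
    for loss_j in per_head_losses:  # J_j computed as if head j were used alone
        (g,) = torch.autograd.grad(loss_j, embeddings, retain_graph=True, create_graph=True)
        grads.append(g.flatten())
    div = embeddings.new_zeros(())
    for p in range(len(grads)):
        for q in range(len(grads)):
            if p != q:
                div = div + F.cosine_similarity(grads[p], grads[q], dim=0)
    return div
```

In Eq. (8), this term would be added (scaled by γ) to the sum of the per-head NLL losses.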

Overall Framework
To combine the SE and ME modules, we substitute Eq. (6) into Eq. (1) and optimize the overall objective:

$$\min_{W,\, b,\, \theta_{O},\, \{\beta_j\}} \; \mathcal{L}_{SE} + \mathcal{L}_{ME}. \qquad (9)$$

We employ an iterative training strategy (Liu et al., 2019a) with the Adam optimization algorithm (Kingma and Ba, 2014), as in Alg. 1. By alternately freezing and training W, b, θ_O, and {β_j}_{j=1}^{K} using a training set D_train and a validation set D_val, we aim to (i) achieve high-quality prediction performance through Eq. (5) and (ii) select the optimal architecture for each expert to maximize their specialization through Eq. (7).
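The alternating optimization in Alg. 1 could look roughly like the following sketch, where se_params stands for W, b and the head parameters, nas_params for the architecture-selection vectors {β_j}, and se_loss_fn/me_loss_fn are hypothetical callables computing Eqs. (5)/(8) and Eq. (7); the exact freezing schedule is our assumption based on the DARTS-style bi-level training described in the text.

```python
import torch

def train_shield(model, se_loss_fn, me_loss_fn, train_loader, val_loader,
                 se_params, nas_params, epochs=10, lr=5e-3):
    """Sketch of the iterative training in Alg. 1: alternately update the SE
    parameters on D_train and the selection vectors {beta_j} on D_val."""
    opt_se = torch.optim.Adam(se_params, lr=lr)
    opt_nas = torch.optim.Adam(nas_params, lr=lr)
    for _ in range(epochs):
        for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
            # Step 1: freeze {beta_j}; train W, b, theta_O on the training set.
            opt_se.zero_grad()
            se_loss_fn(model, x_tr, y_tr).backward()
            opt_se.step()
            # Step 2: freeze the SE parameters; update {beta_j} on the validation set.
            opt_nas.zero_grad()
            me_loss_fn(model, x_val, y_val).backward()
            opt_nas.step()
    return model
```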

Set-up
Datasets & Metric. Table 2 shows the statistics of all experimental datasets: Clickbait detection (CB) (Anand et al., 2017), Hate Speech detection (HS) (Davidson et al., 2017), and Movie Reviews classification (MR) (Pang and Lee, 2005). We split each dataset into train, validation, and test sets with a ratio of 8:1:1 whenever standard public splits are not available. To report prediction performance on clean examples, we use the weighted F1 score to take the distribution of prediction labels into consideration.
To report robustness, we report prediction accuracy under adversarial attacks (Morris et al., 2020), i.e., the # of failed attacks over the total # of examples. A failed attack is counted only when the attacker fails to perturb the input (i.e., fails to flip the label of a correctly predicted clean example).
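In other words, the reported robustness is simply (a restatement of the definition above, not a new formula):

$$\text{Accuracy under attack} \;=\; \frac{\#\,\text{failed attacks}}{\#\,\text{total examples}}.$$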
Defense Baselines. We defend four textual NN models (base models) of different architectures, namely CNN (Kim, 2014), RNN with GRU cells (Chung et al., 2014), and the transformer-based BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b). We compare SHIELD with the following six defensive baselines:
• Ensemble (Ens.) is the classical ensemble of 5 different base models. We use the average of all NLL losses from the base models as the final training loss.
Attack Methods. The 14 black-box attacks search for adversarial perturbations in different ways (e.g., through fixed templates, greedy, or genetic-based search). Apart from lexical constraints such as limiting the # or % of words to manipulate in a sentence, ignoring stop-words, etc., many of them also preserve the semantic meaning of a generated adversarial text by constraining the l2 distance between its representation vector and that of the original text, produced by either the Universal Sentence Encoder (USE) (Cer et al., 2018) or GloVe embeddings (Pennington et al., 2014). Moreover, to ensure that the perturbed texts still look natural, a few of the attack methods employ an external pre-trained language model (e.g., BERT (Devlin et al., 2019), L2W (Holtzman et al., 2018)) to optimize the log-likelihood of the adversarial texts.

Table 4: Accuracy under adversarial attacks before (Bef.) and after (Aft.) being patched with SHIELD (Bold: no worse than the base model; Red: decreased from the base model). The results of CNN-based models are presented in the Appendix.
SHIELD Settings. We patch each base model with 5 prediction heads (K=5) and γ=0.5. For each expert, we set O_j to T=3 possible networks: FCNs with 1, 2, and 3 hidden layer(s). For each dataset, we use grid search to find the best τ value from {1.0, 0.1, 0.01, 0.001} based on the average defense performance on the validation set under TextFooler (Jin et al., 2019) and DeepWordBug (Gao et al., 2018). We use 10% of the training set as a separate development set during training with early stopping to prevent overfitting. We report the performance of the best single model across all attacks on the test set. The Appendix includes all details on all models' parameters and implementation.

Results
Fidelity. We first evaluate SHIELD's prediction performance without adversarial attacks. Table 3 shows that all base models patched by SHIELD still maintain similar F1 scores on average across all datasets. Although SHIELD with RNN has a slight decrease in fidelity on the Hate Speech dataset, this is negligible compared to the adversarial robustness benefits that SHIELD provides (more below).
Computational Complexity. Regarding space complexity, SHIELD extends an NN into an ensemble model with only a marginal increase in the # of parameters. Specifically, with B denoting the # of parameters of the base model, SHIELD has a space complexity of O(B+KU), where U≪B, while Ensemble, DT, and ADP all have a complexity of O(KB). In the case of BERT with K=5, SHIELD only requires an additional 8.3% of parameters, whereas traditional ensemble methods require as many as 4 extra copies of the base model. During training, SHIELD only trains O(KU) parameters, while other defense methods, including ones using data augmentation, update all of them. Specifically, with K=5, SHIELD trains only 8% of the parameters of the base model and 1.6% of the parameters of the other BERT-based ensemble baselines. During inference, SHIELD is also 3x faster than the ensemble-based DT and ADP on average.

Robustness against Attacks. Table 4 shows the performance of SHIELD compared to the base models. Overall, SHIELD consistently improves the robustness of the base models in 154/168 (92%) cases across 14 adversarial attacks, regardless of their attack strategies. In particular, all CNN, RNN, BERT, and RoBERTa-based textual models patched by SHIELD achieve relative improvements in average prediction accuracy from 15% to as much as 70%. Especially in the case of detecting clickbait, SHIELD can recover to within a 5% margin of the performance on clean examples in many cases. This demonstrates that SHIELD provides a versatile neural patching mechanism that can quickly and effectively defend against black-box adversaries without making any assumptions about the attack strategies.

Comparison against Defense Baselines. We then compare SHIELD with all defense baselines under the TextFooler (TF), DeepWordBug (DW), and PWWS (PS) attacks. These attacks are selected because (i) they are among the strongest attacks and (ii) they provide foundational mechanisms upon which other attacks are built. Table 5 shows that SHIELD achieves the best robustness across all attacks and datasets. On average, SHIELD obtains an absolute improvement of +9% to +18% in accuracy over the second-best defense algorithm (DT in the case of RNN, and AdvT in the case of BERT and RoBERTa). Moreover, SHIELD outperforms the other ensemble-based baselines (DT, ADP), and can be applied on top of a pre-trained BERT or RoBERTa model with only around 8% additional parameters; that number would increase to 500% (K←5) in the case of DT and ADP, requiring over half a billion parameters.
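To illustrate the O(B+KU) vs. O(KB) comparison above, here is a back-of-the-envelope calculation; the head sizes (FCNs over a 768-dimensional BERT representation) and the ~110M parameter count for BERT-base are our own assumptions for illustration, not exact figures from the experiments.

```python
# Rough parameter-overhead comparison for a BERT-base victim model (B ~ 110M parameters).
B = 110_000_000          # parameters of the base model (assumed)
K = 5                    # number of expert heads
Q, M = 768, 2            # assumed feature size and number of labels
# Assume each expert head is an FCN with up to 3 hidden layers of width Q.
U = 3 * Q * Q + Q * M    # roughly 1.8M parameters per head (upper bound)

shield_extra = K * U                 # SHIELD: O(B + K*U) -> only the extra K*U is new
ensemble_extra = (K - 1) * B         # classical ensemble / DT / ADP: O(K*B)
print(f"SHIELD overhead:   {shield_extra / B:.1%} of the base model")    # ~8%
print(f"Ensemble overhead: {ensemble_extra / B:.0%} of the base model")  # 400%
```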

Discussion
Performance under Budgeted Attacks. SHIELD not only improves the overall robustness of the patched NN model under a variety of black-box attacks, but also induces a computational cost that can greatly discourage malicious actors from exercising adversarial attacks in practice. We define computational cost as the # of queries on a target NN model required for a successful attack. Since adversaries usually have an attack budget on the # of model queries (e.g., a monetary budget or limited API access to the black-box model), the higher the # of queries required, the less vulnerable a target model is to adversarial threats. A larger budget is especially crucial for genetic-based attacks because they usually require a larger # of queries than greedy-based strategies. We have demonstrated in Sec. 3.2 that SHIELD is robust even when the attack budget is unlimited. Fig. 2 shows that the performance of RoBERTa patched by SHIELD also degrades at a slower rate than the base RoBERTa model as the attack budget increases, especially under greedy-based attacks.
Effects of Stochasticity on SHIELD's Performance. Stochasticity in SHIELD comes from two sources, namely (i) the assignment of the main prediction head during each inference call and (ii) the randomness of the Gumbel-Softmax outputs. Regarding (i), during a typical iterative black-box attack, an attacker tries different manipulations of a given text, so the input to the model changes at every step of the search. This leads to changes in the prediction-head assignment because each prediction head is an expert at different features (e.g., words or phrases in an input sentence). Given an input, however, the assignment of the expert predictors for a specific set of manipulations stays the same. Therefore, even if an attacker repeatedly queries the model with a specific change to the original sentence, the attacker will not gain any additional information. Regarding (ii), even though the Gumbel-Softmax outputs are not deterministic, they maintain the relative ranking of the expert predictors during each inference call given a sufficiently small value of τ. In other words, this randomness does not affect the fidelity of the model across different runs.
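The temperature effect described in (ii) can be illustrated with a small PyTorch snippet (a toy example of Gumbel-Softmax behavior, not taken from the paper): with a small τ, each draw of α is close to one-hot, and when one head's weight clearly dominates, that head wins most of the draws.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
w = torch.tensor([3.0, 0.5, 0.3, 0.1, 0.0])  # head weights with a clear leader (head 0)

for tau in (1.0, 0.1, 0.01):
    samples = torch.stack([F.gumbel_softmax(w, tau=tau) for _ in range(1000)])
    top1 = (samples.argmax(dim=-1) == 0).float().mean()      # how often head 0 dominates
    sharpness = samples.max(dim=-1).values.mean()             # ~1.0 means near one-hot
    print(f"tau={tau:<5} mean max weight={sharpness:.2f}  head-0 wins {top1:.0%} of draws")
```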

Parameter Sensitivity Analyses.
Training SHIELD requires the hyper-parameters K, T, γ, and τ. We observe that the arbitrary values γ=0.5, K=5, and T=3 work well across all experiments. Although we did not observe any clear pattern in the effect of K on robustness, any K≥3 performs well across all attacks. By contrast, different pairs of training and inference temperatures τ yield varied performance on different datasets. τ gives us the flexibility to control the sharpness of the probability vector α: as τ→0, α gets closer to a one-hot encoded vector, i.e., only one head is used at a time.
Ablation Tests. This section tests SHIELD with only the SE or only the ME module. Table 6 shows that SE and ME perform differently across datasets and models. Specifically, ME performs better than SE on the Clickbait dataset, SE performs better than ME on the Movie Reviews dataset, and we observe mixed results on the Hate Speech dataset. Nevertheless, the full SHIELD model, which comprises both the SE and ME modules, consistently performs the best across all cases. This shows that the ME and SE modules are complementary and both are crucial for SHIELD's robustness.

Limitations and Future Work
In this paper, we limit the architecture of each expert to an FCN with a maximum of 3 hidden layers (on top of the base model). If we included more options for this architecture (e.g., attention (Luong et al., 2015)), the sub-models' diversity would significantly increase. The design of SHIELD is model-agnostic and is also applicable to other complex and large-scale NNs such as transformer-based models. Especially with the recent adoption of the transformer architecture in both NLP and computer vision (Carion et al., 2020; Chen et al., 2020), potential future work includes extending SHIELD to patch other complex NN models (e.g., T5 (Raffel et al., 2020)) or other tasks and domains such as Q&A and language generation. Although our focus is not on the transferability of robustness, SHIELD can accommodate it simply by unfreezing the base layers f(x, θ*_{L-1}) in Eq. (1) during training, at the cost of some additional training.

Related Work
Defending against Black-Box Attacks. Most previous works in adversarial defense (e.g., Le et al., 2021; Keller et al., 2021; Pruthi et al., 2019; Dong et al., 2021; Mozes et al., 2020; Wang et al., 2021; Jia et al., 2019) are designed either for a specific type of attack (e.g., word or synonym substitution as in certified training (Jia et al., 2019), misspellings (Pruthi et al., 2019)) or for a specific attack level (e.g., character- or word-based). Thus, they are usually evaluated against a small subset of (≤4) attack methods. Although there are works that propose general defense methods, they are often built upon adversarial training (Goodfellow et al., 2015), which requires training everything from scratch (e.g., Si et al.; Miyato et al., 2016; Zhang et al., 2018), or are limited to a set of predefined attacks (e.g., Zhou et al., 2019a). Although adversarial training-based defenses work well against several attacks on BERT and RoBERTa, their performance is far outperformed by SHIELD (Table 5).
In contrast to previous approaches, SHIELD addresses not the characteristics of the perturbations produced by the attackers but their fundamental attack mechanism, which is, most of the time, an iterative perturbation-optimization process (Fig. 1). This allows SHIELD to effectively defend against 14 different black-box attacks (Table 1), showing its effectiveness in practice. To the best of our knowledge, this work also evaluates against by far the most comprehensive set of attack methods in the adversarial text defense literature.
Ensemble-based Defenses. SHIELD is distinguishable from previous ensemble-based defenses in two aspects. First, previous approaches such as DT (Kariyappa and Qureshi, 2019) and ADP (Pang et al., 2019) are mainly designed for computer vision. Applying these models to the NLP domain faces a practical challenge: training multiple memory-intensive SOTA sub-models such as BERT or RoBERTa can be very costly in terms of space and time complexity.
In contrast, SHIELD is able to "hot-fix" a complex NN by replacing and training only the last layer, removing the need to re-train the entire model from scratch. Second, previous methods (e.g., DT and ADP) mainly aim to reduce the dimensionality of the adversarial subspace, i.e., the subspace that contains all adversarial examples, by forcing the adversaries to attack a single fixed ensemble of diverse sub-models at the same time. This then helps improve the transferability of robustness across different tasks. However, our approach mainly aims to dilute direct, rather than transfer, attacks by forcing the adversaries to attack stochastic, i.e., different, ensemble variations of sub-models at every inference pass. This helps SHIELD achieve much better defense performance than DT and ADP across several attacks (Table 5).

Conclusion
This paper presents a novel algorithm, SHIELD, which consistently improves the robustness of textual NN models under black-box adversarial attacks by modifying and re-training only their last layers. By extending textual NN models of various architectures (e.g., CNN, RNN, BERT, RoBERTa) into a stochastic ensemble of multiple experts, SHIELD utilizes differently weighted sets of prediction heads depending on the input. This helps SHIELD defend against black-box adversarial attacks by breaking their most fundamental assumption, namely that target NN models remain unchanged during an attack. SHIELD achieves average relative improvements of 15%-70% in prediction accuracy under 14 attacks on 3 public NLP datasets, while still maintaining similar performance on clean examples. Thanks to its model- and domain-agnostic design, we expect SHIELD to work properly in other NLP domains.

Broad Impact
We address two practical adversarial attack scenarios and how SHIELD can help defend against them. First, adversaries can attempt to abuse social media platforms such as Facebook by posting ads or recruitment for human-trafficking or protests, or by spreading misinformation, e.g., vaccine-related content. To do so, the adversaries can directly use one of the black-box attacks in the literature to iteratively craft a posting that will not be easily detected and removed by the platform. In some cases, a good attack method only requires a few trials to successfully fool such platforms. Our method can help confuse the attackers with inconsistent signals, hence reducing their chance of success. Second, many popular services and platforms such as the NYTimes, the Southeast Missourian, OpenWeb, Disqus, and Reddit rely on 3rd-party APIs such as the Perspective API for detecting toxic comments online, e.g., racist comments, offensive language, or personal attacks. However, these public APIs have been shown to be vulnerable to black-box attacks in the literature (Li et al., 2018). An attacker can use a black-box attack method to attack these public APIs in an iterative manner, then retrieve the adversarial toxic comments and use them on these platforms without the risk of being detected and removed by the system. Since such malicious behavior can endanger public safety and undermine the quality of online information, our work has practical value and broad societal impact.

A ADDITIONAL RESULTS
• CNN: We use a CNN (Kim, 2014) with three 2D CNN layers, each of which is followed by a Max-Pooling layer. The concatenation of the outputs of all Max-Pooling layers is fed into a Dropout layer with 0.5 probability, then an FCN + Softmax for prediction. We use an Embedding layer of size 300 with a pre-trained GloVe embedding matrix to transform each discrete text token into continuous input features before feeding them into the CNN network. The CNN layers use 150 kernels of size 2, 3, and 4, respectively.
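A sketch of the CNN classifier described above is shown below (our own PyTorch rendering of the description, with the vocabulary size, number of labels, and GloVe loading left as assumed arguments; not the authors' code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Kim (2014)-style text CNN as described above: 300-d GloVe embeddings,
    150 kernels of sizes 2/3/4, max-pooling, dropout 0.5, FCN + softmax."""
    def __init__(self, vocab_size, num_labels, emb_dim=300, num_kernels=150,
                 kernel_sizes=(2, 3, 4), glove_matrix=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        if glove_matrix is not None:                # pre-trained GloVe weights (assumed tensor)
            self.embedding.weight.data.copy_(glove_matrix)
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_kernels, (k, emb_dim)) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_kernels * len(kernel_sizes), num_labels)

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).unsqueeze(1)  # (batch, 1, seq_len, emb_dim)
        pooled = [F.relu(conv(x)).squeeze(3).max(dim=2).values for conv in self.convs]
        out = self.dropout(torch.cat(pooled, dim=1))
        return F.log_softmax(self.fc(out), dim=-1)
```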
• RNN: Because the original PyTorch implementation of the RNN does not support double backpropagation on CuDNN, which is required by DT and SHIELD to run the model on a GPU, we use a publicly available Just-in-Time (JIT) version of a GRU with one hidden layer as the RNN cell. We use an Embedding layer of size 300 with a pre-trained GloVe embedding matrix to transform each discrete text token into continuous input features before feeding them into the RNN layer. We flatten all outputs of the RNN layer, followed by a Dropout layer with 0.5 probability, then an FCN + Softmax for prediction.
• BERT & RoBERTa: We use the transformers library from HuggingFace to fine-tune the BERT and RoBERTa models. We use the bert-base-uncased version of BERT and the roberta-base version of RoBERTa.

B.2.2 Vocabulary and Input Length
Due to limited GPU memory, we set the maximum length of inputs for transformer-based models, i.e., BERT and RoBERTa, to 128 during training. For CNN and RNN-based models, we use all the vocabulary tokens that can be extracted from the training set, and we use all of the vocabulary tokens provided by pre-trained models for BERT and RoBERTa-based models.

1. SHIELD:
For the hyper-parameters γ, K, and T, we arbitrarily set γ←0.5, K←5, and T←3, and they work well across all datasets. For τ, we described how to choose the best pair of τ values for training and testing in Sec. 3.1.

2. Ensemble:
We train an ensemble model of 5 sub-models, all of which have the same architecture as the base model. We use the average loss of all sub-models as the final loss to train the model.
3. DT: We follow the implementation described in Section 3 of the original paper (Kariyappa and Qureshi, 2019) and train an ensemble DT model with 5 sub-models, all of which have the same architecture as the base model. We set the hyper-parameter λ ← 0.5 as suggested by the original paper.
4. ADP: We follow the implementation described in Section 3 of the original paper (Pang et al., 2019) and train an ensemble ADP model with 5 sub-models, all of which have the same architecture as the base model. We set the hyperparameters required by ADP to default values (α ← 1.0 and β ← 0.5) as suggested by the original implementation.

5. Mix-up Training (Mix):
We sample λ ∼ Beta(1.0, 1.0) as suggested by the implementation provided by the original paper (Zhang et al., 2018).
6. Adversarial Training: We use a 1:1 ratio between original training samples and adversarial training samples as suggested by (Miyato et al., 2016). We specifically use the AT method as described in Sec. 3 of the original paper (Miyato et al., 2016).

B.4 Experimental Settings for Attack Methods
Since we use the external open-source TextAttack (Morris et al., 2020) (https://github.com/QData/TextAttack) and OpenAttack (Zeng et al., 2021) frameworks for evaluating the performance of SHIELD and all defense baselines under adversarial attacks, the implementations of all the attacks are publicly available. Specifically, we use the TextAttack framework for all the word- and character-level attacks, and OpenAttack for the sentence-level attack SCPNA.
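For reference, evaluating a (patched) HuggingFace model under one of these attacks with TextAttack might look like the sketch below; the model name and dataset are placeholders, and the exact API may differ across TextAttack versions.

```python
import textattack
import transformers

# Placeholder victim model; in our setting this would be the SHIELD-patched model.
model = transformers.AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

# Build a word-level attack recipe (TextFooler) against the wrapped model.
attack = textattack.attack_recipes.TextFoolerJin2019.build(model_wrapper)
dataset = textattack.datasets.HuggingFaceDataset("rotten_tomatoes", split="test")

attack_args = textattack.AttackArgs(num_examples=100)
attacker = textattack.Attacker(attack, dataset, attack_args)
results = attacker.attack_dataset()  # accuracy under attack = fraction of failed attacks
```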

B.5 Experimental Settings for Training and Evaluation
For every dataset, we train a single SHIELD model with the best τ parameters and evaluate this model with all of the adversarial attacks. In other words, since we have a total of 3 datasets (Movie Reviews, Hate Speech, Clickbait) and 4 base architectures (CNN, RNN, BERT, RoBERTa), we train a total of 12 SHIELD models for evaluation. This is done to ensure that we can evaluate the versatility of SHIELD's robustness against different types of attacks without making any assumptions on their strategies. During training, we use a batch size of 32, learning rate of 0.005, gradient clipping of 10.0. For every attack evaluation, we generate a new set of adversarial examples for every pair of attack method and target model. In other words, since we have a total of 14 different attack methods, 3 datasets, and 4 possible architectures for the base models, this results in a total of 168 different sets of adversarial examples to evaluate in Table 4.