Towards Building a Robust Toxicity Predictor

Recent NLP literature pays little attention to the robustness of toxicity language predictors, even though these systems are most likely to be used in adversarial contexts. This paper presents a novel adversarial attack, \texttt{ToxicTrap}, which introduces small word-level perturbations to fool SOTA text classifiers into predicting toxic text samples as benign. \texttt{ToxicTrap} exploits greedy-based search strategies to enable fast and effective generation of toxic adversarial examples. Two novel goal function designs allow \texttt{ToxicTrap} to identify weaknesses in both multiclass and multilabel toxic language detectors. Our empirical results show that SOTA toxicity text classifiers are indeed vulnerable to the proposed attacks, which attain over 98\% attack success rates in multilabel cases. We also show how vanilla adversarial training and an improved version of it can help increase the robustness of a toxicity detector even against unseen attacks.


Introduction
Deep learning-based natural language processing (NLP) plays a crucial role in detecting toxic language content (Ibrahim et al., 2018; Zhao et al., 2019; Djuric et al., 2015; Nobata et al., 2016; MacAvaney et al., 2019). Toxic content often includes abusive language, hate speech, profanity or sexual content. Recent methods have mostly leveraged transformer-based pre-trained language models (Devlin et al., 2019; Liu et al., 2019a) and achieved high performance in detecting toxicity (Zampieri et al., 2020). However, directly deploying NLP models could be problematic for real-world toxicity detection. This is because toxicity filtering is mostly needed in security-relevant industries like gaming or social networks, where models are constantly being challenged by social engineering and adversarial attacks.
In this paper, we study the adversarial robustness of toxicity language predictors1 and propose a new set of attacks we call "ToxicTrap". ToxicTrap generates targeted adversarial examples that fool a target model into benign predictions. Our design is motivated by the fact that most toxicity classifiers are deployed as API services and used for flagging toxic samples. Figure 1 shows one ToxicTrap adversarial example. The perturbed text replaces one word with its synonym, and the resulting phrase fooled the transformer-based detector into a failed detection (as "benign").
We propose novel goal functions to guide greedy word importance ranking to iteratively replace each word with small perturbations. Samples generated by ToxicTrap are toxic, and can fool a victim toxicity predictor model into classifying them as "benign" and not as any of the toxicity classes or labels. The proposed ToxicTrap attacks can pinpoint the robustness of both multiclass and multilabel toxicity NLP models. To the authors' best knowledge, this paper is the first work that introduces adversarial attacks2 to fool multilabel NLP tasks. Designing multilabel ToxicTrap is challenging, since coordinating multiple labels all at once and quantifying the attacking goals is tricky when aiming at multiple targeted labels.
Empirically, we use ToxicTrap to evaluate BERT (Devlin et al., 2018) and DistillBERT (Liu et al., 2019b) based modern toxicity text classifiers on the Jigsaw (Jig, 2018) and Offensive Tweet (Davidson et al., 2017) datasets. We then use adversarial training to make these models more resistant to ToxicTrap adversarial attacks. In adversarial training, a target model is trained on both original examples and adversarial examples (Goodfellow et al., 2014). We improve the vanilla adversarial training with an ensemble strategy that trains with toxic adversarial examples generated from multiple attacks. Our contributions are as follows:
• ToxicTrap reveals that SOTA toxicity classifiers are not robust to small adversarial perturbations.
• We conduct a thorough set of analyses comparing variations of ToxicTrap designs.
• We empirically show that the unk greedy search with the composite transformation is preferred.
• Adversarial training can improve the robustness of toxicity detectors.

Method
Methods that generate text adversarial examples introduce small perturbations into the input data and check whether a target model's output changes significantly. These adversarial attacks help identify whether an NLP model is susceptible to word replacements, misspellings, or other variations that are commonly found in real-world data. For a given NLP classifier F : X → Y and a seed input x, searching for an adversarial example x′ from x is:

find x′ = T(x, ∆x) such that G(F, x′) is achieved and C_i(x, x′) holds for all i   (1)

Here T(x, ∆x) denotes the transformations that perturb text x to x′. G(F, x′) represents a goal function that defines the purpose of an attack, for instance flipping the output. {C_i(x, x′)} denotes a set of constraints that filters out undesirable x′ to ensure that the perturbed x′ preserves the semantics and fluency of the original x.
To solve Equation (1), adversarial attack methods design search strategies (brute force is not feasible given the length and dictionary size of natural language text) to transform x to x′ via the transformation T(x, ∆x), so that x′ fools F by achieving the fooling goal G(F, x′) while fulfilling the set of constraints {C_i(x, x′)}. Therefore, designing adversarial attacks focuses on designing four components: (1) goal function, (2) transformation, (3) search strategy, and (4) constraints between the seed and its adversarial examples (Morris et al., 2020b).
We propose a suite of ToxicTrap attacks to identify vulnerabilities in SOTA toxicity NLP detectors. ToxicTrap attacks focus on intentionally generating perturbed texts that contain the same highly abusive language as the original toxic text, yet receive significantly lower toxicity scores and get predicted as "benign" by a target model.

Word transformations in ToxicTrap
For T (x, ∆x), many possible word perturbations exist, including embedding based word swap, thesaurus based synonym substitutions, or character substitution (Morris et al., 2020a).
Attacks by Synonym Substitution: Our design focuses on transformations that replace words from an input with their synonyms.
• (1) Swap words with their N nearest neighbors in the counter-fitted GloVe word embedding space, where N ∈ {5, 20, 50}.

Attacks by Character Transformation: Another group of word transformations generates perturbed words via character manipulations. This includes character insertion, deletion, neighboring character swap and/or character substitution by Homoglyph. These transformations change a word into one that a target toxicity detection model does not recognize, while producing character sequences that a human reader could easily correct into the original words. Language semantics are thus preserved, since human readers can easily correct the misspellings.

Composite Transformation:
We also propose to combine the above transformations to create new composite transformations.For instance, one composite transformation can include both perturbed words from character substitution by Homoglyph and from word swaps using nearest neighbors from GloVe embedding (Pennington et al., 2014).
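As an illustration, a composite transformation of this kind can be built with the TextAttack library roughly as follows; the particular combination and candidate count below are illustrative choices rather than the exact configuration used in our recipes.

```python
from textattack.transformations import (
    CompositeTransformation,
    WordSwapEmbedding,                 # synonym swap in the counter-fitted embedding space
    WordSwapHomoglyphSwap,             # character substitution with visually similar glyphs
    WordSwapNeighboringCharacterSwap,  # swap adjacent characters inside a word
)

# Combine word-level synonym swaps with character-level perturbations.
# max_candidates=20 corresponds to the N = 20 nearest-neighbor setting.
composite_transformation = CompositeTransformation([
    WordSwapEmbedding(max_candidates=20),
    WordSwapHomoglyphSwap(),
    WordSwapNeighboringCharacterSwap(),
])
```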

Novel goal functions for ToxicTrap
A goal function G(F, x′) module defines the purpose of an attack. Different kinds of goal functions exist in the literature to define whether an attack is successful in terms of a victim model's outputs.
Toxicity language detection has centered on a supervised classification formulation, including binary toxicity detection, multiclass toxicity classification, and multilabel toxicity detection, which assigns a set of target labels to x. See Section A.2 for more details. As aforementioned, toxicity classifiers are used mostly as API services to filter out toxic text. Therefore, their main vulnerability lies in samples that should be detected as toxic but instead fool detectors into the wrong prediction of "benign". Now let us define G(F, x′) for ToxicTrap attacks. We propose two designs, depending on the target model type.
Multiclass or Binary Toxicity: When the toxicity detector F handles binary or multiclass outputs, we define the ToxicTrap attack's goal function as:

G(F, x′) is achieved if and only if F(x′) = b, given that the seed satisfies F(x) ≠ b   (2)

Here b denotes the "0:benign" class.
Multilabel Toxicity: Next, we study how to fool multilabel toxicity predictors. In real-world applications of toxicity identification, an input text may be associated with multiple labels, like identity attack, profanity, and hate speech at the same time. The existence of multiple toxic labels at the same time provides more opportunities for attackers, but also poses design challenges4. Multilabel toxicity detection assigns a set of target labels to x. The output y = {y_1, y_2, ..., y_L} is a vector of L binary labels and each y_i ∈ {0, 1} (Zhao et al., 2019). For example, in the Jigsaw dataset (Jig, 2018), each text sample is associated with six binary labels, namely {benign, obscene, identity attack, insult, threat, and sexual explicit}. We introduce a novel goal function for attacking such models as follows:

G(F, x′) is achieved if and only if F_b(x′) = 1 and F_t(x′) = 0 for all t ∈ T, given that F_b(x) = 0   (3)

Here, T = {y_2, ..., y_L} denotes the set of toxic labels, and b = {y_1: Benign} is the non-toxic or benign label. {F_b(x′) = 1; F_b(x) = 0} denotes that "x′ gets predicted as Benign, though x is toxic". And {F_t(x′) = 0, ∀t ∈ T} denotes that x′ is not predicted as any toxicity type. In summary, our ToxicTrap attacks focus on perturbing correctly predicted toxic samples.

Language constraints in ToxicTrap
In Equation (1), we use a set of language constraints to filter out undesirable x′ to ensure that the perturbed x′ preserves the semantics and fluency of the original x, with as few perturbations as possible. A variety of possible constraints exist in the NLP literature (Morris et al., 2020b).
In ToxicTrap, we use the following list (instantiated in the sketch below):
• Limit on the ratio of words to perturb: 10%
• Minimum angular similarity from the universal sentence encoder (USE): 0.84
• Part-of-speech match.
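A minimal sketch of how this constraint set could be expressed with TextAttack follows; the two pre-transformation constraints are common library defaults added for completeness, and mapping the list above onto these specific classes is an illustrative choice rather than necessarily the exact configuration in our recipes.

```python
from textattack.constraints.pre_transformation import (
    RepeatModification,
    StopwordModification,
)
from textattack.constraints.overlap import MaxWordsPerturbed
from textattack.constraints.semantics.sentence_encoders import UniversalSentenceEncoder
from textattack.constraints.grammaticality import PartOfSpeech

constraints = [
    RepeatModification(),                 # do not perturb the same word twice
    StopwordModification(),               # leave stopwords untouched
    MaxWordsPerturbed(max_percent=0.1),   # perturb at most 10% of the words
    UniversalSentenceEncoder(threshold=0.84, metric="angular"),  # semantic similarity
    PartOfSpeech(),                       # replacement must keep the original POS tag
]
```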

Greedy Search strategies in ToxicTrap
Solving Equation (1) is a combinatorial search task: searching over all potential transformations to find those that result in a successful adversarial example, i.e., achieving the fooling goal function while satisfying the constraints. Due to the exponential nature of the search space, many heuristic search algorithms exist in the recent literature, including greedy search, beam search, and population-based search (Zhang et al., 2019).
For a seed text x = (w_1, ..., w_i, ..., w_n), a perturbed text x′ can be generated by swapping w_i with an altered w′_i. The main role of a search strategy is to decide which word w_i from x to perturb next. We center the search design of ToxicTrap on greedy search with word importance ranking, iteratively replacing one word at a time to generate adversarial examples.
The main idea is that the words of x are first ranked according to an importance function. Four possible choices are:
(1) unk: a word's importance is determined by how much a heuristic score (details later) changes when the word is substituted with an UNK token.
(2) delete: a word's importance is determined by how much the heuristic score changes when the word is deleted from the original input.
(3) weighted saliency (wt-saliency): words are ordered using the change in score when the word is substituted with an UNK token, multiplied by the maximum score gained by perturbing the word.
(4) gradient: each word's importance is calculated from the gradient of the victim's loss with respect to the word, taking its L1 norm as the word's importance.
After the words in x are sorted in descending order of importance, each word w_i is substituted with w′_i using an allowed transformation, until the fooling goal is achieved or the number of perturbed words reaches an upper bound. The heuristic scoring function used in the word importance ranking depends on the victim model and the fooling goal. For instance, when we work with a binary toxicity model, the heuristic score of ToxicTrap equals the model's output score for the target "0:benign" class. Algorithm 1 shows our design of the score function for the multilabel ToxicTrap attacks. In experiments, we also evaluate two other search strategies: "beam search" and "genetic search". Our empirical results indicate strong performance from greedy search with word importance ranking over the other strategies.
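As a rough sketch of the unk variant for a binary toxicity model, the ranking step can be written as follows; predict_benign_prob is a hypothetical helper returning the victim model's score for the "0:benign" class, and the exact scoring used by our attacks follows Algorithm 1.

```python
def rank_words_by_unk_importance(words, predict_benign_prob, unk_token="[UNK]"):
    """Rank word positions by how much masking each word with UNK changes
    the heuristic score (here: the benign-class probability)."""
    base_score = predict_benign_prob(" ".join(words))
    importance = []
    for i in range(len(words)):
        masked = words[:i] + [unk_token] + words[i + 1:]
        importance.append(abs(predict_benign_prob(" ".join(masked)) - base_score))
    # The most important words (largest score change) are perturbed first.
    return sorted(range(len(words)), key=lambda i: importance[i], reverse=True)
```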
Algorithm 1 provides the pseudo code of how we implement Equation (3) as a new goal function in the TextAttack python library (Morris et al., 2020a). The implemented MultilabelClassificationGoalFunction extends the TextAttack library to multilabel tasks.
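The core logic behind Equation (3) can be sketched in a framework-agnostic way as follows; the 0.5 decision threshold matches ϵ_maximize in Algorithm 1, while the particular heuristic score (benign score minus the strongest toxic score) is an illustrative choice rather than the exact scoring inside MultilabelClassificationGoalFunction.

```python
import numpy as np

BENIGN = 0  # index of the "benign" label in the multilabel output vector

def is_goal_complete(label_probs: np.ndarray, threshold: float = 0.5) -> bool:
    """Equation (3): x' must be predicted as benign and as no toxic label."""
    toxic_probs = np.delete(label_probs, BENIGN)
    return bool(label_probs[BENIGN] >= threshold and np.all(toxic_probs < threshold))

def heuristic_score(label_probs: np.ndarray) -> float:
    """Score maximized during greedy search: push the benign score up
    while pushing the strongest remaining toxic score down."""
    toxic_probs = np.delete(label_probs, BENIGN)
    return float(label_probs[BENIGN] - toxic_probs.max())
```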

Harden with Adversarial Training
Our ultimate goal in designing ToxicTrap attacks is to improve toxicity NLP models' adversarial robustness. Adversarial Training (AT) has been a major defense for improving adversarial robustness (Madry et al., 2018).

ToxicTrap Recipes and Extensions
The modular design of ToxicTrap allows us to implement many different ToxicTrap attack recipes in a shared framework, combining different goal functions, constraints, transformations and search strategies. In Section 3, we conduct a thorough empirical analysis to compare possible ToxicTrap recipes and recommend the unk greedy search and the composite transformation for most use cases. Table 1 lists our recommended recipe.
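For the binary and multiclass cases, an attack in the spirit of the recommended recipe can be assembled with TextAttack roughly as follows; the model path is a placeholder, and the built-in TargetedClassification goal function is used here as a stand-in for the goal in Equation (2). This is an illustrative sketch, not our released recipe code.

```python
import transformers
from textattack import Attack
from textattack.models.wrappers import HuggingFaceModelWrapper
from textattack.goal_functions import TargetedClassification
from textattack.search_methods import GreedyWordSwapWIR
from textattack.transformations import CompositeTransformation, WordSwapEmbedding, WordSwapHomoglyphSwap
from textattack.constraints.overlap import MaxWordsPerturbed
from textattack.constraints.semantics.sentence_encoders import UniversalSentenceEncoder
from textattack.constraints.grammaticality import PartOfSpeech

# Hypothetical fine-tuned toxicity classifier; the path is a placeholder.
model = transformers.AutoModelForSequenceClassification.from_pretrained("path/to/toxicity-model")
tokenizer = transformers.AutoTokenizer.from_pretrained("path/to/toxicity-model")
victim = HuggingFaceModelWrapper(model, tokenizer)

goal_function = TargetedClassification(victim, target_class=0)   # target the "0:benign" class
transformation = CompositeTransformation([WordSwapEmbedding(max_candidates=20),
                                           WordSwapHomoglyphSwap()])
constraints = [MaxWordsPerturbed(max_percent=0.1),
               UniversalSentenceEncoder(threshold=0.84, metric="angular"),
               PartOfSpeech()]
search_method = GreedyWordSwapWIR(wir_method="unk")               # unk word importance ranking

attack = Attack(goal_function, constraints, transformation, search_method)
result = attack.attack("some toxic seed text", 1)                 # 1 = ground-truth toxic label index
```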
Algorithm 1 Attack with MultilabelClassificationGoalFunction. Input: an original text x, a multilabel classifier F, a set of targeted labels T (for which scores are to be maximized), a set of other labels N (for which scores are to be minimized), a maximization threshold ϵ_maximize = 0.5, a search method S() := Greedy-WIR, transformations T, a set of constraints C, and the maximum number of trials I.
[Algorithm body, partially recovered: iteratively apply transformations following the greedy word importance ranking; keep a candidate x′ when its goal score improves (goal′ > goal); return x′ on success, otherwise return Attack Failed after at most I trials.]

Besides, we select and adapt another five SOTA combinations of transformations and constraints from popular general NLP adversarial example recipes in the literature to create the ToxicTraps Extend attacks in Table 14, covering a good range of transformations and constraints. Table 15 shows adversarial examples generated using these attacks.

Experiments
We conducted a series of experiments covering three different toxicity classification tasks: binary, multilabel, and multiclass; over two different transformer architectures: DistillBERT and BERT; and across two datasets: the large-scale Wikipedia Talk Page dataset, Jigsaw (Jig, 2018), and the Offensive Tweet hate speech detection dataset (Davidson et al., 2017). Table 3 lists the two datasets' statistics; Section A.4 provides more details.

Base Toxicity Models: Our experiments use three base models, {Jigsaw-BL, Jigsaw-ML, HTweet-MC}, to cover the three types of toxicity prediction tasks. See details in Section A.5.

Implementation:
We implement all of our ToxicTrap and ToxicTraps Extend attacks using the NLP attack package TextAttack5 (Morris et al., 2020a). When generating adversarial examples, we only attack seed samples that are correctly predicted by a victim model (an adversarial attack does not make sense if the target model cannot predict the seed sample correctly). In our setup, this means we only use toxic seed samples when attacking the three base models. This setup simulates real-world situations in which people intentionally craft creative toxic examples to circumvent toxicity detection systems.
We report the attack success rate (ASR), which measures how successful each ToxicTrap attack is against a victim model. To measure the runtime of each algorithm, we use the average number of queries to the victim model as a proxy. We also report the average percentage of words perturbed by an attack. In addition, for models trained with adversarial training, we report both the model's prediction performance and its robustness (by attacking the robust model again).

For the second seed text, the character transformation (second and third columns) generates the replacement word "traitors", where the second "t" is replaced with a monospace Unicode character "t".

Comparing Constraints in ToxicTrap
We then study the effect of the part-of-speech match (POS) constraint on attack performance.
Table 5 shows that the use of the POS constraint lowers the average number of queries sent to the victim model. We observe this phenomenon across all three tasks and all three search methods (gradient, delete, unk). For example, when attacking the Jigsaw-ML model using unk with and without the POS constraint, the average numbers of queries are 49.38 and 57.08, respectively. We also observe that for most of the recipes, using the POS constraint slightly decreases the attack success rate (ASR). Considering the empirical results in Table 5 and the anecdotal examples in Table 2, we conclude that the POS constraint is preferred.

Comparing Search in ToxicTrap
Table 5 also compares the effect of different search methods on attack performance. It shows that greedy search methods are preferred over genetic and beam. For example, compared to unk, genetic and beam require almost 10x as many queries on average for all three tasks. Beam search results in higher ASR values on all three tasks, while genetic only outperforms the greedy methods when attacking HTweet-MC. In addition, when attacking Jigsaw-BL and Jigsaw-ML, beam only slightly outperforms the greedy methods. Among the greedy search methods, unk is a good choice, as it provides consistently good ASR performance on all three tasks. It is worth noting that unk outperforms the other three greedy search methods, except for wt-saliency when attacking HTweet-MC. However, attacking the HTweet-MC model with wt-saliency requires more than 3x as many queries as the unk method.

Comparing Synonym Transformations
Now we compare word synonym substitutions, using the unk search method with character manipulations excluded (to single out their effect). Table 6 shows that glove with N = 20 nearest neighbors is an optimal choice for all three tasks. We observe that the wordnet and mlm transformations result in lower ASRs than glove. Also, glove with N = 50 only slightly lifts ASRs compared to glove with N = 20, while sending over 50% more queries to the victim models. We include the analysis of using different transformations with three different search methods (delete, unk, wt-saliency) in Table 10 in the Appendix. These results also confirm that glove with N = 20 is preferred.

Results from Adversarial Training
Empirically, we explore how AT1 and AT2 impact both prediction performance and adversarial robustness. Table 7 and Table 8 present AT results on the HTweet-MC task (Table 12 and Table 13 cover the two other tasks). In Table 7, we observe that AT1-delete and AT1-unk both maintain the same regular prediction performance as the base model. Table 8 shows the attack success metrics when we use ToxicTrap to attack the retrained robust HTweet-MC models. The models trained with the AT1-delete and AT1-unk attacks show significant improvements in robustness after AT1 adversarial training. We recommend AT1-unk as the default choice for hardening general toxic language predictors, since in both tables AT1-unk slightly outperforms AT1-delete.
For the AT2 robust model, ToxicTrap attacks are "unseen" (we used the five ToxicTraps Extend attacks from Section B to create the AT2 model in our experiments). Our results show that AT2 can harden the HTweet-MC model not only against the attacks it is trained on (ToxicTraps Extend) but also against attacks it has not seen before (ToxicTrap). This could be attributed to the hypothesis that an unseen attack may share similar underlying patterns with the attack ensemble the AT2 model was trained on. In Table 7, AT2 slightly underperforms the base model on regular predictions, since it was trained with more adversarial examples from multiple attacks.

Conclusion
Text toxicity prediction models are not designed to operate in the presence of adversaries. This paper proposes a suite of ToxicTrap attacks to identify weaknesses in SOTA toxicity language predictors that could potentially be exploited by attackers. We also evaluate how adversarial training improves model robustness against seen and unseen attacks. As next steps, we plan to investigate other strategies like virtual adversarial training, disentangled representation learning, or generative methods, and pinpoint how they influence the robustness of toxicity predictors (Qiu et al., 2022).

A.1 Related Works
To the authors' best knowledge, the toxicity NLP literature includes very limited studies on adversarial attacks. Only one study, from Hosseini et al. (2017), tried to deceive Google's Perspective API for toxicity identification by misspelling the abusive words or by adding punctuation between the letters. This paper conducts a comprehensive study by introducing a wide range of novel attack recipes and improving adversarial training to enable robust text toxicity predictors.
Next, we also point out that existing studies on text adversarial examples center on generating adversarial examples against binary and multiclass classification models (Morris et al., 2020a). To the authors' best knowledge, no multilabel adversarial attacks exist in the NLP literature. Our work is the first to design attacks against multilabel toxicity predictors. The design of multilabel adversarial examples is challenging, since coordinating multiple labels all at once and quantifying the attacking goals is tricky: it is harder to achieve multiple targeted labels. Simply adapting attacks for binary or multiclass models (Morris et al., 2020a) to the multilabel setup is not feasible. In multilabel prediction, each instance can be assigned multiple labels. This differs from the multiclass setting, in which classes are mutually exclusive and one sample can only be associated with one class (label). The existence of multiple labels at the same time provides better opportunities for attackers, but also poses design challenges. Our design in Equation (3) and Algorithm 1 paves a path for multilabel text adversarial example research.

A.2 Toxicity Detection from Text
The mass growth of social media platforms has enabled efficient exchanges of opinions and ideas between people with diverse backgrounds. However, this also brings the risk of user-generated toxic content that may include abusive language, hate speech or cyberbullying. Toxic content may lead to incidents that hurt individuals or groups (Johnson et al., 2019), calling for automated tools to detect toxicity and maintain healthy online communities.
Automatic content moderation uses machine learning techniques to detect and flag toxic content. It is critical for online platforms to prohibit toxic language, since such content makes online communities unhealthy and may even lead to real crimes (Johnson et al., 2019; Committee et al., 2017).
Past literature on toxicity language detection has centered on a supervised classification formulation (Zhao et al., 2019; Djuric et al., 2015; Nobata et al., 2016; MacAvaney et al., 2019). We denote F : X → Y as a supervised classification model, for example a deep neural network classifier. X denotes the input language space and Y represents the output space. For a sample (x, y), x ∈ X denotes the textual content6 and y ∈ Y denotes its toxicity label(s). The toxicity detection task varies with what y stands for. The literature includes three main cases: (1) binary toxicity detection, where y is from {0: benign, 1: toxic}; (2) multilabel toxicity detection, which assigns a set of target labels to x; here y = {y_1, y_2, ..., y_L} is a vector of L binary labels and each y_i ∈ {0, 1} (Zhao et al., 2019). For example, in the Jigsaw dataset (Jig, 2018), each text sample is associated with six binary labels, namely {benign, obscene, identity attack, insult, threat, and sexual explicit}; (3) multiclass toxicity classification, where y is a discrete integer. For example, the Offensive Tweet dataset (Davidson et al., 2017) has three classes, {0: benign, 1: offensive, 2: hate} (one class per sample).
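For concreteness, the three output formats might look as follows for a single sample; the specific values are made up for illustration.

```python
# Illustrative label encodings for the three task types (values are made up).
binary_y = 1                        # binary: {0: benign, 1: toxic}
multilabel_y = [0, 1, 0, 1, 0, 0]   # Jigsaw-style: [benign, obscene, identity attack, insult, threat, sexual explicit]
multiclass_y = 2                    # Offensive Tweet: {0: benign, 1: offensive, 2: hate}
```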
The literature on toxicity has mostly focused on improving accuracy, via feature engineering (Zampieri et al., 2020), deep representation learning (Wang and Zhang, 2021), and fine-tuning from pretrained large language models (Ghosh and Kumar, 2021). Recent literature has extended to investigating these classifiers' interpretability (Xiang et al., 2021) and fairness (Hartvigsen et al., 2022).

A.3 Basics of Text Adversarial Examples
Research has shown that current deep neural network models lack the ability to make correct predictions on adversarial examples (Szegedy et al., 2014). The field of investigating the adversarial robustness of NLP models has seen growing interest.

Selected adversarial example rows (scores in parentheses; see Table 9):
(0.863) America is white trash and so are all Americans.
(0.899) America is ashen trash and so are all Americans.
(0.784) America is utter trash and so are all Americans.
(0.897) America is pristine trash and so are all Americans.
Fluffy is a retard homeboy is calling a baby rattlesnake friend and kis...
(0.566) Fluffy is a retarded homeboy is calling a baby rattlesnake friend and kis...
(0.520) Fluffy is a moron homeboy is calling a baby rattlesnake friend and kissin...
(0.897) ugh your with Tebow?You're officially a poof now and in followed.
(0.908) ugh your with Tebow?You're officially a fatigue now and in followed.
(0.907) ugh your with Tebow?You're officially a wolf now and in followed.
(0.907) ugh your with Tebow?You're officially a poofter now and in followed.
Adversarial examples were generated using the unk greedy search method. The POS row marked with an "x" indicates that the part-of-speech matching constraint was used. The Transformation row indicates which word substitution method was used (glove, wordnet, mlm), with the number of nearest neighbors N specified in parentheses; chars indicates that character transformations were applied. We used different word transformations for synonym substitution with varying numbers of nearest neighbors (20 or 50). Two recipes used character transformations, while the other three did not. Also, one recipe did not use the part-of-speech match (POS) constraint, while it was included in the rest of the recipes. All five recipes used the unk greedy search method.
A.4 Datasets
The Jigsaw dataset was derived from the Wikipedia Talk Pages dataset published by Google and Jigsaw on Kaggle (Jig, 2018). Wikipedia Talk Pages allow users to discuss improvements to articles via comments. The comments are anonymized and labeled with toxicity levels. Here "obscene", "threat", "insult" and "identity hate" are four sub-labels of "toxic" and "severe toxic" (hence they may co-occur for a comment). The "toxic" comments that are not "obscene", "threat", "insult" or "identity hate" are assigned to either "toxic" or "severe toxic". Comments that are not assigned any of the six toxicity labels are labeled "non toxic".
The authors of Davidson et al. (2017) used a crowdsourced hate speech lexicon from Hatebase.org to collect tweets containing hate speech keywords. They then used crowdsourcing to label these tweets into three categories: those containing hate speech, those with only offensive language, and those with neither.
A.5 Base Model Setup: We build three base models, {Jigsaw-BL, Jigsaw-ML, HTweet-MC}, to cover the three types of toxicity prediction tasks.

In recent NLP studies, adversarial training (AT) is typically only performed to show that such training can make models more resistant to the attack they were originally trained with. This observation is not surprising. The literature has pointed out the importance of robustness against unseen attacks, and it is generally recommended to use different attacks to evaluate the effectiveness of a defense strategy (Carlini et al., 2019).
Adversarial Training with Multiple Attacks (AT2): A simple way to revise vanilla AT is to train a model using both clean examples and adversarial examples from different attacks. This procedure, which we call Adversarial Training with Multiple Attacks (AT2), trains a target model on a combination of adversarial examples. AT2 aims to help a model become more robust not only against the attacks it is trained on but also against attack recipes it has not seen before. Algorithm 2 presents our pseudo code for AT2.
AT1 vs AT2: In the rest of the paper, we refer to vanilla adversarial training as single adversarial training (AT1). In Section 3 and Section B, our results show that models trained with AT2 can be more effective in protecting against unseen text adversarial attacks than AT1 models trained on a single attack. This could be attributed to the hypothesis that an unseen attack may share similar underlying attributes and patterns with the attack ensemble the model is trained on.
Selecting which attacks to use in AT2 is important. Part of the reason we adapt the five popular recipes into ToxicTraps Extend in Table 14 is that these attacks cover a good range of popular transformations and constraints. Table 14 includes three word-based attacks and two character-based attacks; TT-TBug is a combined character- and word-level attack. In our experiments, we simulated potential AT2 use cases by leaving one attack out as "unseen" and training AT2 models using the rest, as sketched below. For instance, when a target model never uses examples from TT-TFool in training, the AT2-trained model may still have learned relevant information about similar word transformations, since similar transformations are used by other attacks in the ensemble.
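A minimal sketch of this leave-one-out setup is below; the placeholder recipe names, the generate_adv_examples helper, and the γ mixing ratio are illustrative stand-ins for Algorithm 2 rather than our exact implementation.

```python
import random

# Placeholder recipe names; the actual five recipes are listed in Table 14.
ALL_RECIPES = ["TT-TFool", "TT-TBug", "recipe_3", "recipe_4", "recipe_5"]

def build_at2_training_set(clean_data, unseen_recipe, generate_adv_examples, gamma=0.2):
    """Leave one attack recipe out as 'unseen'; augment clean data with
    adversarial examples generated by the remaining recipes (AT2 sketch)."""
    train_recipes = [r for r in ALL_RECIPES if r != unseen_recipe]
    budget = int(gamma * len(clean_data))            # cap |D_adv| at gamma * |D|
    adv_data = []
    for recipe in train_recipes:
        adv_data.extend(generate_adv_examples(recipe, clean_data, budget // len(train_recipes)))
    augmented = clean_data + adv_data
    random.shuffle(augmented)                        # shuffle D' before the adversarial epochs
    return augmented
```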

Figure 1: ToxicTrap successfully fooled a SOTA toxicity predictor by perturbing one word in the original text using word synonym perturbation. After adversarial training (AT), the improved toxicity predictor correctly flags the perturbed text as toxic.
The vanilla adversarial training process augments the training data with adversarial examples generated by perturbing the training data in the input space. Two variations of adversarial training exist: (1) if the augmented adversarial examples are generated from a single attack approach, we call this process AT1; (2) if the augmented examples are generated from multiple attack methods, we call it AT2.

Let L(F(x), y) denote the loss function on input text x with label y, and let A(F, x, y) be the adversarial attack that produces adversarial example x′. The vanilla adversarial training objective is

min_F E_(x,y)∼D [ L(F(x), y) + λ · L(F(x′ = A(F, x, y)), y) ]   (4)

Adversarial training uses both clean and adversarial examples to train a model. This aims to minimize both the loss on the original training dataset and the loss on the adversarial examples.
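The combined objective in Equation (4) can be sketched in PyTorch as a single training step; the model, batch format, and weighting coefficient lam here are assumptions for illustration rather than our training code, and the adversarial inputs are assumed to have been generated offline by an attack A(F, x, y).

```python
import torch
import torch.nn.functional as F_nn

def adversarial_training_step(model, optimizer, clean_batch, adv_batch, lam=1.0):
    """One step minimizing clean loss + lam * adversarial loss (Equation 4).
    clean_batch / adv_batch are (input_tensor, label_tensor) pairs."""
    model.train()
    optimizer.zero_grad()
    x_clean, y_clean = clean_batch
    x_adv, y_adv = adv_batch
    loss = F_nn.cross_entropy(model(x_clean), y_clean)
    loss = loss + lam * F_nn.cross_entropy(model(x_adv), y_adv)
    loss.backward()
    optimizer.step()
    return loss.item()
```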

Table 2 :
Selected toxic adversarial examples generated by attacking the HTweet-MC model. Perturbed scores are reported in parentheses. Adversarial examples were generated using the unk search method, with and without the POS constraint, and using three word synonym substitution transformations with the number of nearest neighbors specified in parentheses; chars indicates that character transformations were applied. More examples are in Table 9.

Table 3 :
Overview of the data statistics.

Table 4 :
Overview of the base model statistics.

Table 5 :
Effect of different search strategies on attack performance. The Search column identifies the type of search method. The POS column identifies whether the part-of-speech matching constraint is used. The composite transformation is used: glove with N = 20 plus the character transformations. Results from the "unk + POS" rows can be compared with the "unk" rows in Table 6.

Glove-based synonym substitution is a better choice than mlm or wordnet. For the second and third seed texts, mlm did not generate toxic phrases. In addition, we see that the recipe using glove (50) (last column) often generates similar examples to glove (20) (third column). Finally, we observe that using character manipulation can generate adversarial examples with the same toxic meaning that fool the classifier.

Table 6 :
Comparing synonym transformations only. No character transformations used. Reporting attack performance when using the unk greedy search. The same constraints as in Table 5, with the POS (part-of-speech) match.

Table 7 :
Effect of adversarial training on model performance. Macro-average metrics for the HTweet-MC model.

Table 9 :
Selected toxic adversarial examples. Here we only show adversarial examples generated by attacking the base model HTweet-MC on Offensive Tweet text, since it contains much shorter messages than Jigsaw. Perturbed scores for adversarial examples are reported in parentheses.

Table 10 :
Effect of word transformations on attack performance. Comparing synonym transformations only. No character transformations used. Reporting attack performance when using the delete, unk, and wt-saliency greedy searches. The same constraints as in Table 5, with the POS (part-of-speech) match.

Algorithm 2 (AT2 adversarial training). Input: the set of all attack recipes S_recipes, the number of attack recipes to exclude N_exc, and the set of attack recipes to use for adversarial training S_attack, created by choosing (|S_recipes| − N_exc) recipes.
[Algorithm body, partially recovered: generate adversarial examples while |D_adv| < γ · |D| and i ≤ |D|; randomly shuffle D′; then run adversarial epochs 1, ..., N_adv.]

Table 15 :
Selected toxic adversarial examples. We show adversarial examples generated by attacking the base model HTweet-MC. To conserve space, we only show results from Offensive Tweet, which contains much shorter messages than Jigsaw.