Profanity-Avoiding Training Framework for Seq2seq Models with Certified Robustness

Seq2seq models have demonstrated remarkable effectiveness in a wide variety of applications. However, recent research has shown that inappropriate language in training samples and well-designed test cases can induce seq2seq models to output profanity. Such outputs may hurt the usability of seq2seq models and offend end-users. To address this problem, we propose a training framework with certified robustness that eliminates the causes of profanity generation. The proposed framework leverages merely a short list of profanity examples to prevent seq2seq models from generating a much broader spectrum of profanity. The framework is composed of a pattern-eliminating training component that suppresses the impact of profane language patterns in the training set, and a trigger-resisting training component that provides certified robustness for seq2seq models against profanity-triggering expressions intentionally injected into test samples. In the experiments, we consider two representative NLP tasks to which seq2seq models can be applied, i.e., style transfer and dialogue generation. Extensive experimental results show that the proposed training framework successfully prevents the models from generating profanity.


Introduction
In the past decade, the research community has witnessed machine learning models achieving impressive performance in various sequence-to-sequence (Seq2seq) NLP tasks such as style transfer and dialogue generation. Despite their great success, recent studies and reports have shown that widely used models trained on crowd-sourced corpora like user reviews and online community discussions may produce inappropriate language such as profanity. Such inappropriate language may hurt the usability of these models and cause conflicts and anxiety among users.
For instance, in 2016, Microsoft released an AI chatbot named Tay, which was claimed to be able to improve itself by communicating with social media users. Nevertheless, within just 24 hours of its release, the chatbot started to generate misogynistic and racist words. Microsoft had to suspend the chatbot's account and conceded that the chatbot had suffered a "coordinated attack by a subset of people". This incident demonstrates the vulnerability of existing Seq2seq methods in the face of user abuse.
There are two major causes that can make a Seq2seq model produce profanity. First, in the training phase, Seq2seq models capture the language patterns within the training corpus. Consequently, language patterns with profanity (referred to as profanity patterns) in the training corpus can also be learned and concealed in the trained model. Second, in the testing phase, specific tokens in the input can act as triggering expressions that induce the Seq2seq model to generate profanity. Such triggering expressions can be unnoticeable from a cognitive perspective or even counterintuitive.
In this paper, we conduct a pioneering study on designing a training framework with certified robustness to prevent Seq2seq models from generating profanity. Our paper addresses two key questions. First, existing systems typically handle profanity by creating a comprehensive list of profanity examples and removing them from the vocabulary. However, it is extremely difficult to exhaust all possible profanity words. So the first question is: is it possible to leverage a small set of profanity examples to prevent Seq2seq models from producing profanity? To answer this question, we propose an efficient and effective training method named pattern-eliminating training (PET) that generalizes the training loss computed on the small set of profanity examples to other expressions that are not covered. In particular, PET generates a set of augmented sentence pairs via perturbations and then minimizes the worst-case loss over the augmented data. Our theoretical analysis shows that PET can be regarded as gradient normalization.
Second, even though most profanity in the training set is removed before training, some may remain. The small amount of remaining profanity in the training set, though unlikely to incur inappropriate outputs on normal input sentences, may be exploited by malicious users. Existing literature (Cheng et al., 2020) shows that in the testing phase, one can make well-designed modifications to the input sentence to induce the Seq2seq model to generate specific tokens. Malicious users can leverage such techniques to manipulate input sentences and trigger profanity output. Thus, another critical question is: in the testing phase, is it possible to ensure that the output of the Seq2seq model remains unchanged when it is fed manipulated input sentences? In this paper, we leverage the randomized smoothing technique, which achieves state-of-the-art certified robustness for deep learning models, to propose a training method named trigger-resisting training (TRT) that hardens Seq2seq models against testing-phase adversaries. In our proposed TRT, we choose the von Mises-Fisher distribution as the random noise generator and derive new theoretical results for this design.
We evaluate the proposed approach via text generation on a real-world dataset. The experimental results show that the proposed framework can consistently prevent Seq2seq models from generating profanity under different settings.

Preliminaries
For Seq2seq models, let X = [x_1, x_2, ..., x_M] and Y = [y_1, y_2, ..., y_L] denote the input sentence of length M and the output sentence of length L, respectively. Each x_i or y_j stands for a single token.
In a Seq2seq model, the major components are an encoder h and a decoder g. The encoder learns a hidden vector representation h_enc containing the semantics and context of each token. The decoder turns the vector representation back into output tokens conditioned on the previously generated sequence. Formally, we denote the Seq2seq model as f(X) = g(h(X)) → Y : N^M → R^{L×c}, where c denotes the vocabulary size. For a decoding step t, the model outputs a distribution over all possible tokens, i.e., f_t(X) ∈ R^c, where f_t denotes the t-th decoding step. The model then picks the token with the largest probability as the output of step t.
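The greedy decoding rule above (pick the highest-probability token at each step) can be sketched as follows; the toy score vectors here are illustrative, not outputs of an actual model:

```python
import numpy as np

def greedy_decode_step(scores):
    """One decoding step: return the index of the highest-scoring token.

    `scores` is the length-c vector f_t(X) over the vocabulary; since
    softmax is monotone, the argmax over raw scores equals the argmax
    over probabilities.
    """
    return int(np.argmax(scores))

def greedy_decode(step_scores):
    """Greedy-decode a whole output sequence, one score vector per step."""
    return [greedy_decode_step(s) for s in step_scores]
```

For example, two decoding steps over a 3-token vocabulary yield one token id per step: `greedy_decode([np.array([0.1, 2.0, -1.0]), np.array([3.0, 0.0, 0.0])])` returns `[1, 0]`.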
In practice, Seq2seq models typically employ neural network architectures such as LSTM (Hochreiter and Schmidhuber, 1997), GRU (Cho et al., 2014), and Transformers (Vaswani et al., 2017) as the encoder and the decoder. To facilitate our discussion, we focus on one of the most representative architectures, the GRU encoder-decoder with attention (Luong et al., 2015). All the methodologies proposed in this paper are independent of any particular network architecture.
With the notations and concepts defined above, we formulate the problem of profanity-avoiding training. Definition 2.1 (Profanity-Avoiding Training). Given a set of sentence pairs C and a set of profanity examples (referred to as profanity seeds) S = {S_1, S_2, ..., S_P}, our goal is to train a Seq2seq model that generates fluent sentences with a minimal ratio of profanity.

Methodology
As mentioned in the introduction section, there are two causes that can make the Seq2seq model produce profanity. First, in the training phase, Seq2seq models capture the language patterns within the training corpus. Thus, a Seq2seq model may also learn the profanity patterns from the training corpus. Second, in the testing phase, some manipulated expressions may be fed to the Seq2seq model to trigger profanity outputs from it.
In the rest of this section, we present a profanity-avoiding training framework with certified robustness to handle these two causes of profanity. The framework has two components: the pattern-eliminating training (PET) component, which blocks profanity patterns in the training phase (Section 3.1), and the trigger-resisting training (TRT) component, which maintains the robustness of the generation model against triggering expressions in the testing phase (Section 3.2). We also provide a theoretical analysis that estimates the robustness of the proposed TRT component, i.e., under what attack strength (in terms of the perturbation radius) the TRT-trained model remains certifiably robust.

Pattern-Eliminating Training
Consider an input-output training corpus C = {(X_i, Y_i)}. The learning objective of the Seq2seq model is

L_S2S(θ) = Σ_{(X,Y) ∈ C} l_S2S(X, Y; θ),   (1)

where θ denotes the vector of model parameters and l_S2S denotes the loss associated with a sentence pair (X, Y), such as the cross-entropy loss. As mentioned at the beginning of this section, the profanity patterns in the training set can trigger profanity. To alleviate the effect of sentences with profanity patterns, we propose an efficient and effective training method, PET. PET includes a similarity-based loss that penalizes cases where the semantics of a generated sentence is close to that of a phrase in the profanity seed set. In essence, PET first generates a set of output sentences by perturbing the representation of the input sentence in a sentence pair; these sentences serve as diverse variants of the original output sentence. PET then minimizes the maximum of the similarity-based loss over these variants. These two steps enhance the generalization ability of PET (Figure 1).

To implement PET, for each sample (X_i, Y_i) ∈ C, we utilize the sequence model to generate a series of output sentences PC_i = {Ŷ_ij}_{j=1}^m by perturbing the encoded representation of X_i. With these augmented outputs and the set of profanity seeds, we define the penalty term that keeps the generated outputs away from the profanity as

L_pen(θ) = Σ_{(X_i, Y_i) ∈ C} max_{j,k} l_h(Ŷ_ij, S_k),   (2)

where

l_h(Ŷ_ij, S_k) = max(0, ζ − d(Ŷ_ij, S_k))   (3)

is a hinge loss and d(·) is a distance metric.
Here, we choose the cosine distance function, which has proven effective for quantifying the similarity of high-dimensional data samples such as encoded sentence representations. The distance d(Ŷ_ij, S_k) is calculated by first transforming the sentences Ŷ_ij and S_k into their vector representations ŷ_ij and s_k via the encoder, and then computing the cosine distance between ŷ_ij and s_k. This hinge loss penalizes generated samples that are within distance ζ of some S_k. In practice, the penalty is added to the conventional training loss L_S2S to form the overall objective function, i.e.,

L(θ) = L_S2S(θ) + λ · L_pen(θ),   (4)

where λ balances the two terms.

Moreover, in this paper, we obtain the perturbed data PC_i by adding i.i.d. noise vectors generated from a von Mises-Fisher (vMF) distribution (Fisher et al., 1993) around the encoded representation h_enc of an input sentence. The vMF distribution is a directional distribution over unit vectors in R^p. Its probability density function for a p-dimensional unit vector x is given by

f_p(x; µ, κ) = C_p(κ) exp(κ µ^T x),   (5)

where µ (with ‖µ‖ = 1) and κ (κ > 0) are the mean direction and the concentration parameter, respectively. The mean direction µ acts as a semantic focus on the unit sphere, and κ describes how tightly the generated high-dimensional representations concentrate around it: the larger κ, the higher the concentration of the distribution around the mean direction µ. The normalization constant is C_p(κ) = κ^{p/2−1} / ((2π)^{p/2} I_{p/2−1}(κ)), where I_v denotes the modified Bessel function of the first kind at order v.

The reason we choose the vMF distribution to generate perturbations is two-fold. First, the vMF distribution naturally matches the cosine similarity used in this paper. Second, the vMF distribution models the representation vectors from a holistic perspective instead of a per-dimension perspective. Therefore, the perturbations on h_enc tend to produce augmented representation vectors with diverse overall directions rather than minor differences in every single dimension. Note that the augmented samples can also be constructed with other transformations such as embedding dropout (Gal and Ghahramani, 2016).
Different augmentation methods will not fundamentally impact the theoretical results in this paper.
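For illustration, the vMF perturbation sampling described above can be sketched with Wood's (1994) rejection sampler; this is a generic implementation we supply for exposition, not the paper's released code:

```python
import numpy as np

def sample_vmf(mu, kappa, rng=None):
    """Draw one sample from a von Mises-Fisher distribution on the unit
    sphere in R^p with unit-norm mean direction `mu` and concentration
    `kappa`, using Wood's (1994) rejection sampler."""
    if rng is None:
        rng = np.random.default_rng()
    mu = np.asarray(mu, dtype=float)
    p = mu.size
    # Step 1: sample the component w along the mean direction by rejection.
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa**2 + (p - 1.0) ** 2)) / (p - 1.0)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (p - 1.0) * np.log(1.0 - x0**2)
    while True:
        z = rng.beta((p - 1.0) / 2.0, (p - 1.0) / 2.0)
        w = (1.0 - (1.0 + b) * z) / (1.0 - (1.0 - b) * z)
        if kappa * w + (p - 1.0) * np.log(1.0 - x0 * w) - c >= np.log(rng.uniform()):
            break
    # Step 2: sample a uniform direction in the tangent space of e1.
    v = rng.standard_normal(p - 1)
    v /= np.linalg.norm(v)
    sample = np.concatenate(([w], np.sqrt(max(1.0 - w**2, 0.0)) * v))
    # Step 3: rotate e1 onto mu with a Householder reflection.
    e1 = np.zeros(p)
    e1[0] = 1.0
    h = e1 - mu
    if np.linalg.norm(h) < 1e-12:
        return sample  # mu is already e1; no rotation needed
    h /= np.linalg.norm(h)
    return sample - 2.0 * h * (h @ sample)
```

The larger `kappa`, the more tightly the returned unit vectors cluster around `mu`, matching the concentration behavior described above.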
Theoretical Remarks: Compared with conventional regularization terms such as minimizing the expected similarity, optimizing Eq. (2) can be more efficient since the optimization process requires far fewer derivative operations. Through theoretical analysis, we can regard Eq. (2) as adding a gradient-norm penalty to the conventional expectation-minimization objective. Please refer to Appendix B for the detailed derivation.
Besides, Eq. (2) should not be confused with the adversarial training objective (Madry et al., 2017). The major difference is that Eq. (2) does not involve an inner optimization problem; instead, we simply pick the perturbed sample with the maximum similarity for subsequent optimization.
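A minimal sketch of the PET penalty with the max-selection just described; the function names and the exact reduction over perturbed outputs and seeds are our reading of the text, not the authors' implementation:

```python
import numpy as np

def cosine_distance(a, b):
    """d(a, b) = 1 - cos(a, b) for encoded sentence vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pet_penalty(perturbed_outputs, seeds, zeta):
    """Hinge penalty over augmented outputs: among all perturbed output
    vectors, take the one closest to any profanity seed (the worst case)
    and penalize it if it falls within margin `zeta`."""
    return max(
        max(0.0, zeta - cosine_distance(y, s))
        for y in perturbed_outputs
        for s in seeds
    )
```

A perturbed output pointing in the same direction as a seed incurs a penalty of `zeta`; an orthogonal one (cosine distance 1) incurs zero penalty for any `zeta` up to 1.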

Trigger-Resisting Training
As mentioned at the beginning of this section, apart from the profanity patterns in the training set, another cause that may result in profanity is the well-designed adversarial inputs in the testing phase, like (Cheng et al., 2020). This section presents a theoretically-provable trigger-resisting training (TRT) method to enhance the robustness of Seq2seq models. We extend the randomized smoothing technique (Cohen et al., 2019) to get a smoothed model with the provable robustness guarantee given possible perturbations on the input sequence X. Particularly, we derive new theoretical results on using vMF distribution as random noise for randomized smoothing.
Input sentences are typically perturbed by substituting one or more tokens, which results in changes to the encoded representation h_enc. Here we certify the robust radius with respect to the encoded representation h_enc instead of the input X. The reason is two-fold. First, Seq2seq models typically take discrete token sequences as inputs and learn word embeddings from scratch, so it is difficult to specify a radius measure for such sparse discrete data. Second, the possible modifications of the input sentence X are various, such as word replacement and adding additional text, and some of them are difficult to model as perturbations of single-word embeddings. Nevertheless, almost all such changes are reflected in the encoded representation of the entire sentence. That is why we choose to certify the robust radius of h_enc.
Let g denote the decoder of the base Seq2seq model and g̃ the smoothed decoder. They share the same architecture, and the parameters of their encoders are identical; thus, given an input sentence X, their encoding results h_enc are the same. Given an input X, the smoothed model outputs exactly the same sequence as g whenever the modifications on X cause the encoded representation h_enc to deviate within a radius R. A smoothed Seq2seq model thus enjoys certified robustness against evasion attack samples. Formally, we can write the t-th step output of the smoothed model g̃(X) as

g̃_t(X) = (g_t ∗ P(ε; φ))(h_enc),

where P(ε; φ) stands for the distribution of the random noise ε, parameterized by φ, ∗ is the convolution operator, and g_t denotes the decoding function at step t. In this section, we continue to use the vMF distribution to implement the sampling distribution P(ε; φ); the noise is applied to the concatenation of the input representation vectors, and both h_enc and the noise samples lie on S^D, the sphere in D dimensions.
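In practice, the convolution above is approximated by Monte Carlo sampling: average the base decoder's step distribution over noisy copies of h_enc. A sketch, with `noise_sampler` standing in for the vMF noise:

```python
import numpy as np

def smoothed_step_distribution(g_t, h_enc, noise_sampler, n_samples=100):
    """Monte Carlo estimate of the smoothed decoder's step-t distribution:
    average the base decoder's token distribution g_t over perturbed
    copies of the encoded representation h_enc.

    `noise_sampler(h_enc)` returns one perturbed representation; here it
    is a placeholder for the paper's vMF noise."""
    return np.mean([g_t(noise_sampler(h_enc)) for _ in range(n_samples)], axis=0)

def smoothed_step_output(g_t, h_enc, noise_sampler, n_samples=100):
    """Token emitted by the smoothed decoder at step t (argmax rule)."""
    return int(np.argmax(smoothed_step_distribution(g_t, h_enc, noise_sampler, n_samples)))
```

With small noise, the smoothed output agrees with the clean argmax; the gap between the top and runner-up averaged probabilities is what the certified radius is derived from.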
With the smoothed decoder defined, we now derive the radius within which the model's robustness is guaranteed. In particular, given vMF as the random noise distribution, we can prove the following robustness guarantee for the smoothed model. For simplicity, and without loss of generality, we narrow the discussion to the generation of one specific token, i.e., the t-th token in the output.
[Algorithm 1: the overall training procedure — generate noise samples, store augmented samples in a set D, and update the parameters of the decoder using the augmented samples in D.]

Theorem 3.1 (Certified Radius). Consider a specific decoding step t and let h_enc denote the encoded representation of X. Let k* and k′ be the tokens that the generator returns with top and runner-up probability, i.e., k* = arg max_k g_t,k(h_enc) and k′ = arg max_{k ≠ k*} g_t,k(h_enc). For any perturbation ε on h_enc within radius R, the output of g̃_t(X) is unchanged, i.e., g̃_t(h_enc) = g̃_t(h_enc + ε). Here, R is determined by g_t,k*(h_enc) and g_t,k′(h_enc), the probabilities of generating k* and k′ at step t.
We leave the detailed proof of Theorem 3.1 in the Appendix A.
Having obtained the radius R, we now follow existing work (Cohen et al., 2019; Yang et al., 2020; Salman et al., 2019) and present the practical training method for obtaining the smoothed model g̃. Since the method is tightly coupled with PET, we illustrate the overall training framework that involves both strategies in Algorithm 1.
In Algorithm 1, we first conduct PET by minimizing Eq. (4) (lines 1-7). After that, we adopt TRT to update the decoder of the smoothed Seq2seq model, which is built upon the base model, using the augmented samples (lines 8-18). The encoder is not updated, so the encoded representations remain stable. The augmented samples are generated to imitate a testing-phase attack against the smoothed Seq2seq model, so that the model is trained to be more robust (lines 13-16). Here, for a sentence pair (X_i, Y_i), we write the augmented sample as (h_enc_i, Y_i), since h_enc_i can be considered fixed in this phase. Now consider an augmented sample (h_enc_i, Y_i) and a specific decoding step t. From the attacker's perspective, we wish to find a perturbed representation h_enc_* that maximizes the loss of generating the ground-truth output Y_it (i.e., the t-th token in the output sequence Y_i), while staying within a ball around h_enc under the distance metric d. Ideally, the augmented samples should therefore satisfy

h_enc_* = arg max_{d(h', h_enc_i) ≤ r} L(h', Y_it),

where L(h_enc_*, Y_it) is derived from Eq. (4) by replacing the encoding network with the encoded representation h_enc, and r is the radius of the ball. Finally, we use the augmented samples to update the decoder of the Seq2seq model.
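The worst-case selection over candidate perturbations can be sketched as follows; this is a simplification in which we pick the worst sample among drawn noise candidates rather than solving the constrained maximization exactly:

```python
def worst_case_representation(candidates, loss_fn):
    """Among candidate perturbed representations (e.g., vMF samples drawn
    around h_enc), return the one that maximizes the training loss.

    `loss_fn(h)` is assumed to return a scalar loss for generating the
    ground-truth token from representation h; the returned representation
    is the adversarial surrogate used to update the decoder."""
    return max(candidates, key=loss_fn)
```

This mirrors the max-selection already used in PET: the outer minimization then trains the decoder on the selected worst case.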

Experiments
In this section, we perform training phase and testing phase manipulations using both heuristic and state-of-the-art attack methods to evaluate the proposed framework's effectiveness in different scenarios. Experimental results show that the proposed training framework can consistently prevent Seq2seq models from generating profanity.

Datasets
We use one of the classic NLP tasks that commonly suffers from profanity issues, text style transfer, to evaluate the effectiveness of the proposed framework. In particular, we conduct experiments on a subset of the widely used Yelp dataset. The dataset consists of product reviews paired with sentiment ratings from 1 to 5. We binarize the ratings by treating ratings below three as negative (0) and the others as positive (1). After data cleaning, we use the method presented in (Li et al., 2018b) to construct pseudo sentence pairs, a common practice in the style transfer field. We then randomly select 240 thousand sentence pairs for training, one thousand for validation, and one hundred for testing. Our task is to transfer sentences from positive sentiment to negative. We only use a small test set because these test samples are used to evaluate the outcome of the attacks rather than the original task. Finally, we obtain a vocabulary of inappropriate tokens from (RobertJGabriel) and select the top-50 high-frequency entries from the vocabulary as our profanity seeds.

Evaluation Metrics
As mentioned in Section 2, a well-safeguarded Seq2seq model should generate sentences with minimal profanity expressions. However, since a Seq2seq model's ultimate goal is to generate fluent sentences and accomplish the corresponding task (e.g., sentiment transfer or machine translation), we also need to evaluate the framework from the linguistic perspective. Hence, we consider both adversarial-related metrics and linguistic-related metrics.
In this paper, we use the ratio of sentences with profanity (ROP), measured as the fraction of generated sentences containing one or more inappropriate tokens, to evaluate the effectiveness of the proposed training framework. Since no comprehensive list of all possible profanity exists, ROP is evaluated with both automatic matching and human effort. We treat sentences containing the phrases listed in (RobertJGabriel) as profanity. We then recruit three annotators and provide each of them with the profanity reference list (RobertJGabriel). The annotators are instructed to find sentences with inappropriate expressions, especially those in (RobertJGabriel). In addition, we cross-check the human annotations against automatic filters implemented with regular expressions.
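The automatic pass of the ROP metric can be sketched as follows; whole-word, case-insensitive matching is our assumption, since the paper does not specify the matching rule:

```python
import re

def ratio_of_profanity(sentences, profanity_list):
    """ROP: fraction of generated sentences containing at least one token
    from the profanity reference list (automatic pass; human annotation
    complements it in the paper). Matching is whole-word and
    case-insensitive."""
    patterns = [re.compile(r"\b" + re.escape(w) + r"\b", re.IGNORECASE)
                for w in profanity_list]
    flagged = sum(1 for s in sentences if any(p.search(s) for p in patterns))
    return flagged / len(sentences)
```

For example, with the single-entry list `["damn"]`, the sentences `["this is damn good", "nice place", "DAMN cold pizza"]` yield an ROP of 2/3.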

Parameters and Implementation Details
We use a GRU encoder-decoder as the target Seq2seq model. We set the word embeddings to be 300-dimensional. The encoder is a 1-layer bidirectional GRU, and the decoder is a 1-layer unidirectional GRU. The hidden sizes of both the encoder and the decoder are set to 256. The mean direction µ and concentration parameter κ of the vMF distribution are set to µ = [1/256, ..., 1/256] and κ = 10, respectively. We use the Adam optimizer for model training and set the batch size to 64.

Adversarial Attack Baselines
In this section, we consider the profanity in the training set and the triggering expressions in the testing samples.
First, we use the reference list (RobertJGabriel) to roughly find possible sentence pairs with profanity in the training set. After that, we duplicate some of these samples to construct synthetic datasets with different profanity ratios. These datasets are used to evaluate the impact of different profanity ratios in training data on the proposed framework's performance.
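The construction of synthetic datasets with a target profanity ratio can be sketched as follows; the exact duplication rule is our assumption:

```python
import math
import random

def inflate_profanity_ratio(clean_pairs, profane_pairs, target_ratio, seed=0):
    """Duplicate profane sentence pairs until they make up roughly
    `target_ratio` of the combined corpus, imitating the synthetic
    datasets described above. A sketch: the paper does not specify its
    duplication procedure."""
    rng = random.Random(seed)
    n_clean = len(clean_pairs)
    # Solve n_prof / (n_clean + n_prof) = target_ratio for n_prof.
    n_prof = math.ceil(target_ratio * n_clean / (1.0 - target_ratio))
    duplicated = [rng.choice(profane_pairs) for _ in range(n_prof)]
    return clean_pairs + duplicated
```

For instance, with 100 clean pairs and a 10% target, 12 profane duplicates are appended, giving a ratio of 12/112 ≈ 0.107.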
Second, we include two testing-phase adversarial attack approaches that modify the testing samples and inject triggering expressions: Random Replacement (Random) and Seq2sick (Cheng et al., 2020). Seq2sick is the state-of-the-art white-box testing-phase attack method against Seq2seq models. It crafts adversarial examples to force a Seq2seq model to produce specific tokens in its outputs.
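The Random Replacement baseline can be sketched as below; `trigger_vocab` is a hypothetical candidate list, while Seq2sick instead selects replacements by following model gradients:

```python
import random

def random_replacement_attack(tokens, trigger_vocab, n_replace=1, seed=0):
    """`Random` baseline: replace `n_replace` randomly chosen positions in
    the input token list with tokens drawn from a candidate trigger
    vocabulary. A heuristic sketch, not an optimized attack."""
    rng = random.Random(seed)
    attacked = list(tokens)
    positions = rng.sample(range(len(attacked)), k=min(n_replace, len(attacked)))
    for pos in positions:
        attacked[pos] = rng.choice(trigger_vocab)
    return attacked
```

The sentence length is preserved; only the chosen positions change, which keeps the comparison with gradient-based attacks at an equal edit budget.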

Baseline Defense Approach
We primarily compare our framework with the data sanity (DS) approach. Specifically, we train word embeddings (Mikolov et al., 2013) on the Yelp corpus. With these word embeddings, we expand the original profanity vocabulary to twice its size by including each word's nearest neighbors in the word embedding space. We then remove the listed tokens from the generated sentences while keeping the remaining tokens unchanged.
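The DS vocabulary expansion and token filtering can be sketched as follows, with a toy embedding matrix standing in for the word2vec embeddings trained on Yelp:

```python
import numpy as np

def expand_profanity_vocab(seed_words, vocab, embeddings, n_neighbors=1):
    """For each seed profanity word, add its `n_neighbors` nearest
    neighbors by cosine similarity in a word embedding matrix whose rows
    align with `vocab`. A sketch of the DS baseline's expansion step."""
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(vocab)}
    expanded = set(seed_words)
    for w in seed_words:
        if w not in index:
            continue
        sims = norm @ norm[index[w]]
        sims[index[w]] = -np.inf  # exclude the word itself
        for j in np.argsort(sims)[::-1][:n_neighbors]:
            expanded.add(vocab[j])
    return expanded

def sanitize(tokens, blocklist):
    """DS filtering step: drop blocklisted tokens, keep the rest intact."""
    return [t for t in tokens if t.lower() not in blocklist]
```

Note that simply deleting tokens, as `sanitize` does, is exactly what can hurt fluency in the case study below: the sentence keeps its remaining tokens but may lose grammaticality.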

Results and Analysis
Effectiveness of the Defense Approaches. The performance of different defense approaches is shown in Table 2. We report PET and PET+TRT, both proposed in this paper, separately as ablations. We do not report the performance of TRT alone: TRT's goal is to prevent modifications of the input sentence from influencing the Seq2seq model's output, so it cannot by itself defend the model against profanity. As one can see, the proposed PET and PET+TRT consistently achieve the best performance in all cases. For instance, on the Yelp dataset, the ROP of the output sentences decreases significantly even when we use the state-of-the-art Seq2sick attack to modify the testing samples. Moreover, as the ratio of profanity rises from 0.5% to 3%, the proposed PET+TRT suffers less than 15% performance deterioration. These results show that the proposed approach can effectively prevent Seq2seq models from producing profanity, even in the presence of substantial profanity in the training set and advanced testing-phase attacks.
Impact on the Quality of Text Generation. Furthermore, we study the impact of the proposed methods on the text generation quality of the Seq2seq model. Here we only analyze the scenarios with different profanity ratios, since testing-phase adversarial attack baselines such as Random and Seq2sick do not impact the quality of text generation. The results are shown in Table 1. As one can see, the proposed training framework PET+TRT does not significantly impact the generation quality. For instance, on the Yelp dataset, PET+TRT training generally incurs only about a 2-point drop in BLEU. Hence, we conclude that the proposed training framework maintains good text generation performance while preventing the generation of profanity.
Validation of the Certification. In this experiment, we investigate whether TRT indeed provides the certified robustness specified by our theoretical analysis. Table 3 reports the ratio of successfully attacked cases whose h_enc deviation exceeds the certified radius. We find that the overwhelming majority of successfully attacked cases lie beyond the certified radius. This implies that, with the proposed TRT strategy, an attack cannot successfully manipulate the Seq2seq model with only limited modifications to the input sentence.
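The check reported in Table 3 can be sketched as below; we use the Euclidean norm for the deviation of h_enc, which is an assumption about the paper's distance measure:

```python
import numpy as np

def beyond_radius_ratio(clean_reps, attacked_reps, radii):
    """For each successfully attacked case, compare the deviation of its
    encoded representation against that case's certified radius, and
    report the fraction of cases whose deviation exceeds the radius."""
    deviations = np.linalg.norm(
        np.asarray(attacked_reps) - np.asarray(clean_reps), axis=1
    )
    return float(np.mean(deviations > np.asarray(radii)))
```

A ratio close to 1 means successful attacks required perturbations larger than the certified radius, which is consistent with the guarantee of Theorem 3.1.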

Case Study
Finally, we show some example sentences generated by the Seq2seq model trained with the proposed framework. Due to the space limit, we only show cases in which Seq2sick is used as the adversarial baseline. As one can see, when there are no countermeasures, Seq2sick successfully induces the Seq2seq model to generate profanity, including inappropriate words like sh** and di*k. DS can remove some inappropriate words from the sentences, but such removals may hurt sentence fluency. Finally, the outputs of the Seq2seq model trained with the proposed PET+TRT contain no profanity: the model generates appropriate outputs that are suitable for the corresponding task.

Related Work
In this section, we review related literature from the following three aspects.

Hatred Handling in NLP. There is an extensive body of work on handling hate speech in the NLP field.

[Table 4: example outputs under Seq2sick — Input: "...food is always a complete waste of money."; Input: "and the pizza was cold, greasy, and generally quite awful."; Seq2sick+Seq2seq: "and the chicken was like di*k"; Seq2sick+DS: "and the chicken was like"; Seq2sick+PET+TRT: "and the chicken was bland".]

A majority of these research efforts concentrate on hate speech detection (Warner and Hirschberg, 2012; Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018; MacAvaney et al., 2019; Gitari et al., 2015) and textual interventions (Benesch; Wright et al., 2017; Stroud and Cox, 2018; Mathew et al., 2019). The former typically employ features such as lexical resources (Burnap and Williams, 2015; Gitari et al., 2015), sentiment characteristics (Burnap and Williams, 2015), and multimodal information (Hosseinmardi et al., 2015) to build classifiers. The latter generates responses to hate speech to alleviate its consequences. To our knowledge, no existing work discusses approaches to prevent text generators from producing hatred or other inappropriate words.

Adversarial Attacks against NLP Models. Our work is also related to adversarial attacks against NLP models, which aim to find malicious samples that cause NLP models to make mistakes. These adversarial attack approaches obtain adversarial samples by modifying characters in words (Jin et al., 2020), substituting words in sentences (Li et al., 2018a), or generating new adversarial sentences (Zhao et al., 2017). The victims of these adversarial attacks include text classification (Li et al., 2018a), machine comprehension (Jia and Liang, 2017), and knowledge inference (Bowman et al., 2015). Recent work (Cheng et al., 2020) proposes an attack strategy to precisely force Seq2seq models to include specific tokens in their output sequences.
This is accomplished by adding triggers into the test-phase inputs. We include it as an adversarial baseline in our experiment.
Provable Defense in Adversarial Learning. There are mainly three categories of methods that offer certified robustness with theoretical guarantees. The first category (Dvijotham et al., 2018; Raghunathan et al., 2018; Wong and Kolter, 2018) formulates robustness certification as an optimization problem and solves it via convex relaxation or duality. The second category derives outer approximations layer by layer through the network via perturbed inputs (Weng et al., 2018; Singh et al., 2018). However, these two categories of methods are not feasible on large-scale networks and heavily depend on the models' architectures. The third category uses randomized smoothing to certify robustness. Randomized smoothing was first proposed in (Cao and Gong, 2017). Later, (Cohen et al., 2019) and (Lecuyer et al., 2019) derived tight ℓ2 robustness guarantees for randomized smoothing. Most recent papers extend the robustness guarantee to other norms such as ℓ0 (Levine and Feizi, 2020) and ℓ∞ (Zhang et al., 2020).

Conclusion
Seq2seq models have shown their success in various NLP tasks. However, inappropriate language in the training set and the testing sentences may cause Seq2seq models to produce profanity. This paper proposes the first training framework with certified robustness to handle profanity in both the training and testing phases. Experimental results show that the proposed framework can successfully prevent Seq2seq models from producing profanity while maintaining satisfactory text generation quality.