Language Detoxification with Attribute-Discriminative Latent Space

Transformer-based Language Models (LMs) have achieved impressive results on natural language understanding tasks, but they can also generate toxic text such as insults, threats, and profanity, limiting their real-world applications. To overcome this issue, a few text generation approaches aim to detoxify toxic texts using additional LMs or perturbations. However, previous methods require excessive memory, computation, and time, which are serious bottlenecks in their real-world application. To address such limitations, we propose an effective yet efficient method for language detoxification using an attribute-discriminative latent space. Specifically, we project the latent space of an original Transformer LM onto a discriminative latent space that well-separates texts by their attributes, using a projection block and an attribute discriminator. This allows the LM to control the text generation to be non-toxic with minimal memory and computation overhead. We validate our model, the Attribute-Discriminative Language Model (ADLM), on detoxified language and dialogue generation tasks, on which our method significantly outperforms baselines in both performance and efficiency.


Introduction
Pre-training language models (LMs) on large-scale web text corpora (e.g., Common Crawl and OpenWebTextCorpus (Gokaslan and Cohen, 2019)) has significantly improved their language generation performance (Radford et al., 2019; Yang et al., 2019; Dai et al., 2019; Shoeybi et al., 2019; Li et al., 2020; Brown et al., 2020) by allowing them to learn meaningful relations between words. However, since the models are trained on massive web-crawled text data that is not exhaustively filtered, they are prone to generating unexpected and undesired texts (Sheng et al., 2019; Wallace et al., 2019), which are often also inappropriate (see Table 1). Specifically, LMs trained on unfiltered texts can randomly generate racial slurs and sexually explicit or violent expressions, which are highly toxic (Groenwold et al., 2020; Luccioni and Viviano, 2021; Xu et al., 2021; Dale et al., 2021a). This is one of the main obstacles to deploying pretrained LMs in real-world applications (e.g., conversational agents). Furthermore, as demonstrated in Gehman et al. (2020); Baheti et al. (2021); Dale et al. (2021b), LMs are prone to generating toxic language even from non-toxic prompts or contexts. One simple and straightforward approach to tackle this problem is to eliminate toxic and biased texts by detecting them in the training dataset (Zhou et al., 2021; Zampieri et al., 2019). However, as the size of LMs increases, the training corpora have also expanded enormously (Brown et al., 2020; Du et al., 2021). Thoroughly removing or filtering out all toxic words or sentences from such a large-scale corpus and retraining the LM from scratch could be costly and impractical (Bender et al., 2021).

* Equal contribution; ordering determined by coin toss. Warning: this paper contains content that may be offensive.

Figure 1: Max Toxicity. Comparison of the toxicity of generated texts between previous language detoxification methods and ours, against the number of model parameters and the inference time per 100 generated texts on a single GPU. Toxicity is calculated on random-10K prompts from RealToxicityPrompts (Gehman et al., 2020). Our model achieves the best language detoxification performance while being time- and memory-efficient.

Figure 2: Both non-toxic and toxic sentences are used as input. We tag the attribute information to each latent vector. Then the discriminative projector (i.e., the projection block) projects onto a new latent space in which toxic and non-toxic texts are separable through the discriminator. To make the latent space attribute-discriminative, the discriminator learns to predict the attribute of each latent vector. To preserve the relationships of the learned word embeddings and maintain fluency, ADLM regularizes the projector with EWC. The resulting attribute-discriminative latent space is visualized on the right.
To overcome such challenges, previous works have proposed to control pre-trained LMs by utilizing attribute-labeled datasets (e.g., toxic and non-toxic). They modify the decoding process either by adversarially perturbing the LM with a toxicity discriminator (Dathathri et al., 2020) or by using additional LMs finetuned on targeted attribute data to suppress toxic logits and amplify non-toxic logits of the base LM (Krause et al., 2020; Liu et al., 2021a). However, existing methods for language detoxification are impractical because of their high inefficiency. The perturbation-based method (Dathathri et al., 2020) slows down the inference of the original GPT-2 (Radford et al., 2019) by 40 times due to the high cost of gradient computation. While the methods of Krause et al. (2020) and Liu et al. (2021a) are as fast as GPT-2, both additionally require auxiliary LMs to shift the logits toward those of non-toxic texts, which is memory-inefficient.
In this paper, we propose a novel and effective language detoxification method that utilizes a single LM and is also time- and memory-efficient. To prevent toxic language generation from the original GPT-2 latent space, we found that, without additional LMs to control the logits, simply projecting the original latent space onto a controllable, discriminative latent space suffices to steer the LM toward non-toxic generation. Specifically, we use a projection block and an attribute discriminator to project samples onto a latent space that is well-separated by the target attribute. We refer to this model as an Attribute-Discriminative LM (ADLM) (Figure 2). To the best of our knowledge, this is the first work on language detoxification that performs controlled text generation in the latent space and that does not require excessive computation at inference time or additional LMs.
To verify the effectiveness and efficiency of the proposed ADLM, we validate our method on two language detoxification tasks: detoxified language and dialogue generation. With 10K random prompts from the RealToxicityPrompts dataset (Gehman et al., 2020), we conduct a generic language modeling experiment for detoxification. The experimental results demonstrate that our ADLM generates non-toxic continuations for the given prompts, regardless of whether they are toxic or non-toxic, outperforming all compared baselines with high efficiency. On the language detoxification task for dialogue generation (Baheti et al., 2021; Sun et al., 2022), our ADLM generates safer responses than the baselines on the ToxiChat and DiaSafety datasets. Lastly, to show the general applicability of our method to any attribute-controlled text generation task, we validate ADLM on a sentiment-controlled text generation task (Socher et al., 2013), on which our model also achieves impressive performance (Appendix D). Moreover, we verify the quality of the sentences generated by our model via a human study, which further confirms that it generates fluent and non-toxic sentences. In summary, our contributions are as follows:

• We propose a novel LM for language detoxification, with a projected attribute-discriminative latent space learned by training a discriminator to classify texts by their attributes.
• We introduce a time- and memory-efficient language detoxification method using our attribute-discriminative language model (ADLM), which does not require excessive computational or memory overhead at inference time (Figure 1).
• Our method largely outperforms existing methods on both generic language detoxification and real-world dialogue detoxification tasks.

Table 1: Examples of generated continuations.

Prompt: "Every Monday we publish the most unbelievable stories"
GroundTruth: of climbing stupidity submitted by our readers.
GPT-2: in our week of Twitter chit chat. After the shut up, one of the women who worked
ADLM: from the Archives. New interest in emerging electronic technologies is fueled by an interest in educating the public.

Prompt: "A former CNN anchor who survived a harrowing robbery attempt said on Monday that her right to bear"
GroundTruth: arms saved her life, and anyone who believes otherwise can "shut the fuck up."
GPT-2: arms in 1995 had become the new norm. "I was against it," she said. "It was foolish. It was not anything I could find. It was a throwback to '90s America, which was essentially insane."
ADLM: arms acquire new defense concepts.

Related Work
Pre-trained language models (LMs) (Radford et al., 2019; Shoeybi et al., 2019; Gao et al., 2020; Brown et al., 2020; Du et al., 2021) mostly concentrate on human-like text generation, focusing on the structure of the generated texts rather than on their content, and often are not controllable. To design LMs that can generate texts with desired properties, additional modifications are necessary (Yu et al., 2017; Hu et al., 2017; Ziegler et al., 2019; Lee et al., 2021). Story generation (Fan et al., 2018; Guan et al., 2020), attribute (e.g., sentiment, topic, or emotion) controlled generation (Yang and Klein, 2021; Khalifa et al., 2021; Chan et al., 2021; Liu et al., 2021b), and summarization (Chu and Liu, 2019) are active topics of research on controlled text generation. While the literature on controlled text generation is vast, in this paper we mainly focus on methods for language detoxification, as it has been a critical problem in deploying LMs to real-world applications (Gehman et al., 2020).

The simplest methods for language detoxification either pre-train LMs on datasets that contain only the desired attributes, as done by Domain-Adaptive Pretraining (DAPT) (Gururangan et al., 2020), or conditionally prepend a prefix to each text, as done by Conditional Transformer Language (CTRL) (Keskar et al., 2019) and Attribute Conditioning (ATCON) (Gehman et al., 2020). Since these approaches rely on a single attribute token prepended to the sequence, they do not control the generated sequences well, and when the models are exposed to toxic texts in the pretraining phase, controlled language generation becomes even more difficult.

Another approach to the language detoxification problem is to train auxiliary LMs that guide the base LM in the decoding phase. Generative Discriminator (GeDi) (Krause et al., 2020) employs an ATCON model as the discriminator, and Decoding-time Experts (DExperts) (Liu et al., 2021a) uses expert and anti-expert LMs, each of which is a DAPT model trained only on the non-toxic or toxic subset of the dataset. However, such auxiliary-LM approaches are highly memory-inefficient. On the other hand, Plug-and-Play Language Model (PPLM) (Dathathri et al., 2020) uses a single LM and utilizes an attribute discriminator to generate gradient perturbations toward the given attributes. However, it takes considerably longer at inference time because it samples each word through multiple backward passes. In contrast, our method requires only a single LM and does not suffer from the memory and computational inefficiency of existing methods, while obtaining better performance.

Method
We now describe a novel language detoxification method using our Attribute-Discriminative Language Model (ADLM), which can efficiently perform controlled text generation for a given attribute using a projected discriminative latent vector. In Section 3.1, we first briefly describe the base LM architecture, general language modeling, previous detoxified language modeling, and dialogue generation modeling. Then, in Section 3.2, we describe our model architecture, training objective, and sampling method.

Background
Language models. A Language Model (LM) predicts the next words for a given text sequence by learning the joint probability distribution over words in given texts (Bengio et al., 2003;Mikolov et al., 2010). An LM can be trained either in an autoregressive or autoencoder manner to learn the distributed representations of words. The autoregressive approaches (Radford et al., 2019;Keskar et al., 2019;Dai et al., 2019;Kitaev et al., 2020;Yang et al., 2019) learn to predict the next word given the sequence of previously generated words, whereas autoencoder approaches (Devlin et al., 2019;Lan et al., 2020;Liu et al., 2019;Sanh et al., 2019;Clark et al., 2020) learn to anticipate the missing or masked words utilizing bidirectional contexts.
In this paper, we use an autoregressive LM, GPT-2 (Radford et al., 2019), as our base model. GPT-2 is composed of a Transformer and a head layer. The Transformer (Vaswani et al., 2017) consists of multiple blocks, each of which is composed of a position-wise feed-forward network, multi-head self-attention, and layer normalization. The Transformer encodes the contextual embedding of the given input sequence $x_{1:t-1}$, where $i{:}j$ denotes the $i$-th through $j$-th tokens of the sequence. The head layer is a linear layer that predicts the logits $o_t$ of the possible next token $x_t$ based on the hidden states $h_{1:t-1} = [h_1, h_2, \ldots, h_{t-1}] \in \mathbb{R}^{(t-1) \times d}$, which are the outputs of the Transformer layers. Formally, we can define an LM succinctly as follows:

$$h_{1:t-1} = \mathrm{Transformer}(x_{1:t-1}; \theta_T), \qquad o_t = \mathrm{Head}(h_{1:t-1}; \theta_H), \quad (1)$$

where $o_t \in \mathbb{R}^{|V|}$, $|V|$ is the vocabulary size, and $\theta_T$ and $\theta_H$ are the Transformer's and head layer's parameters, respectively.
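To make Eq. 1 concrete, the following sketch (assuming HuggingFace Transformers; variable names are ours, not from the paper) extracts the Transformer hidden states $h_{1:t-1}$ and the head-layer logits $o_t$ from GPT-2:

```python
# A sketch of Eq. 1 with HuggingFace GPT-2; variable names are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

x = tokenizer("Every Monday we publish", return_tensors="pt")
with torch.no_grad():
    out = model(**x, output_hidden_states=True)

h = out.hidden_states[-1]   # Transformer hidden states h_{1:t-1}, shape (1, t-1, d)
o_t = out.logits[:, -1, :]  # head-layer logits o_t over |V| for the next token x_t
```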
General language model. In generic language modeling, the initially given input sequence is called a prompt, $x_{1:m-1} = (x_1, \ldots, x_{m-1})$, and the text sequence generated following it is called a continuation, $x_{m:n} = (x_m, \ldots, x_n)$. The goal of language modeling is then to generate a coherent continuation $x_{m:n}$ for the preceding prompt $x_{1:m-1}$:
$$P(x_{m:n} \mid x_{1:m-1}) = \prod_{i=m}^{n} P(x_i \mid x_{1:i-1}), \quad (2)$$

where $P$ is the softmax function that calculates the probability of the next token from the input $x_{1:i-1}$. The model learns the distribution of the next token $x_i$ conditioned on the previously generated tokens, using the chain rule of probability as in Equation 2.
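As an illustration of Eq. 2, the log-probability of a continuation given a prompt is a sum of per-token log-probabilities; below is a minimal sketch reusing the model and tokenizer from the sketch above (names are ours):

```python
# Log-probability of a continuation under the chain rule of Eq. 2 (illustrative names).
import torch
import torch.nn.functional as F

prompt = tokenizer("Every Monday we publish", return_tensors="pt").input_ids
cont = tokenizer(" the most unbelievable stories", return_tensors="pt").input_ids
ids = torch.cat([prompt, cont], dim=1)

with torch.no_grad():
    logits = model(ids).logits                      # shape (1, n, |V|)

log_probs = F.log_softmax(logits[:, :-1], dim=-1)   # P(x_i | x_{1:i-1}) at each position
targets = ids[:, 1:]
token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
# Sum over the continuation tokens: log P(x_{m:n} | x_{1:m-1}).
cont_logprob = token_lp[:, prompt.size(1) - 1:].sum()
```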
Detoxified language model. Detoxified language modeling can be considered a controlled attribute text generation task, except that the model always has to generate non-toxic sequences, even from toxic prompts. This task, referred to as language detoxification, is challenging because it requires strong attribute control while preserving the fluency of the LM. For language detoxification, the objective is to learn to generate texts toward the desired attribute $a$ (i.e., non-toxic) as follows:

$$P(x_{m:n} \mid a, x_{1:m-1}) = \prod_{i=m}^{n} P(x_i \mid a, x_{1:i-1}), \quad (3)$$

where $x_{m:n} = (x_m, x_{m+1}, \ldots, x_n)$ denotes the continuation that corresponds to the desired attribute $a$. The objective is to learn the distribution of the sequence $x_{m:n}$ conditioned on $a$ in an autoregressive manner.
Dialogue generation model. In dialogue generation, the input sequence is called a context and the generated sequence is called a response. The dialogue generation model learns to generate context-related, human-like responses. Since dialogue generation models interact with users, language detoxification is an essential task for their real-world application. Similar to the detoxified language model, the dialogue generation model learns the distribution of the response sequence $x_{m:n}$ conditioned on the attribute $a$ and the context sequence $x_{1:m-1}$ with an LM.

ADLM: Attribute-Discriminative Language Model
Previously, language detoxification was applied only at decoding time, either by using additional LMs, further trained on each attribute dataset, to guide the logits of the pre-trained large base LM, or by perturbing the LM. However, these approaches are computation- and memory-inefficient, and thus we propose a novel single-LM approach for language detoxification that uses a latent space to control the attributes of the generated texts. Specifically, we learn a projected latent embedding space in which texts are well-discriminated by their attributes, and use it to control the attribute of generated text sequences. We discuss the ADLM's architecture, objective, and sampling method in the following paragraphs.
Model architecture. Our model consists of a single LM, a projection block, and an attribute discriminator (Figure 3a). The projection block, ProjB, learns to project the original latent space onto a discriminative latent space that embeds the attribute information. The attribute is first embedded through a single embedding layer, AttEmb, and then projected together with the contextual embeddings by the projection block, as follows:

$$z^a = \mathrm{AttEmb}(a; \theta_a), \qquad \tilde{h}_{1:t-1} = \mathrm{ProjB}(h_{1:t-1}, z^a; \theta_B), \quad (4)$$

where $\theta_a$ and $\theta_B$ are the parameters of each component, and $\tilde{h}_{1:t-1}$ are the contextual embeddings projected under the attribute embedding $z^a$.

Figure 3: Overview of ADLM. We design ADLM by introducing a projection block on top of a frozen LM and a discriminator for learning an attribute-discriminative latent space. During inference, ADLM generates two types of logits and suppresses the toxic logits while amplifying the non-toxic logits.
To learn a discriminative latent space $\tilde{h}_{1:t-1}$ in which the contextualized word embeddings are well separated by their attributes, we use an attribute discriminator (Disc):

$$y = \mathrm{Disc}\big(\mathrm{AvgPool}(\tilde{h}_{1:t-1}); \theta_D\big), \quad (5)$$

where $y \in \mathbb{R}^{|A|}$ is the output logit predicting the attribute $a$, $|A|$ is the cardinality of the attribute set, and $\theta_D$ are the parameters of the discriminator. The module performs average pooling over $\tilde{h}_{1:t-1}$ to condense the overall representation, and then passes the averaged vector through an affine transformation to determine the corresponding attribute $a$. The discriminator classifies $\tilde{h}_{1:t-1}$, which renders the newly constructed latent space attribute-discriminative (see Figure 2).
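The following PyTorch sketch shows one plausible instantiation of the three components; the layer shapes, the exact form of ProjB, and the additive combination of the attribute embedding with the hidden states are our assumptions, not the authors' released implementation:

```python
# One plausible instantiation of AttEmb, ProjB, and Disc (shapes/combination assumed).
import torch
import torch.nn as nn

class ADLMHead(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_attributes=2):
        super().__init__()
        self.att_emb = nn.Embedding(n_attributes, d_model)  # AttEmb(a; theta_a)
        self.proj = nn.TransformerEncoderLayer(             # ProjB(.; theta_B)
            d_model, n_heads, batch_first=True)
        self.disc = nn.Linear(d_model, n_attributes)        # Disc(.; theta_D)

    def forward(self, h, a):
        # h: hidden states of the frozen LM, shape (B, T, d); a: attribute ids, shape (B,)
        z_a = self.att_emb(a).unsqueeze(1)   # attribute embedding z^a
        h_tilde = self.proj(h + z_a)         # projected latent (assumed: additive conditioning)
        y = self.disc(h_tilde.mean(dim=1))   # average-pool, then predict the attribute
        return h_tilde, y
```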
Training objective. We jointly train the components of ADLM in an end-to-end manner. Let us denote the dataset $D = \{X, A\}$, where $x \in X$ is a training text sequence and $a \in A$ is its corresponding attribute label, and let the set of model parameters be $\theta = \{\theta_a, \theta_B, \theta_D\}$.
Our training objective consists of three terms. The first is the autoregressive LM loss for conditional language modeling, which learns to reconstruct the given input text $x^i$ conditioned on the preceding tokens $x^i_{<t}$ and the attribute $a^i$:

$$\mathcal{L}_{\mathrm{LM}} = -\sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \log P_\theta(x^i_t \mid x^i_{<t}, a^i), \quad (6)$$

where $T_i$ is the total length of the $i$-th input $x^i$. The second objective directly enforces the projected embeddings to be attribute-discriminative:

$$\mathcal{L}_{\mathrm{Disc}} = -\sum_{i=1}^{|D|} \log P_\theta(a^i \mid x^i). \quad (7)$$

Lastly, we also propose a regularizer for the projected latent space that preserves the relationships between the word embeddings in the original latent space, to alleviate the potential negative impact of strong detoxification on fluency. To this end, we apply Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), a regularization often used in continual learning that uses the Fisher information matrix to put higher regularization weights on the updates of more important parameters:

$$\mathcal{L}_{\mathrm{EWC}} = \lambda \sum_{j=1}^{|\theta_B|} F_j \big(\theta_{B,j} - \theta^*_{B,j}\big)^2, \quad (8)$$

where $j$ indexes the parameters of $\theta_B$, $\theta^*_B$ are the parameters of ProjB trained without the discriminator, $F$ is the Fisher information matrix that puts more weight on the parameters important to $\theta^*_B$, and $\lambda$ is a scale controlling how strongly $\theta_B$ is kept close to $\theta^*_B$. Our final combined objective minimizes the sum of the two cross-entropy loss terms and the EWC regularizer:

$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \mathcal{L}_{\mathrm{Disc}} + \mathcal{L}_{\mathrm{EWC}}. \quad (9)$$

Minimizing the total loss $\mathcal{L}$ allows our ADLM to control the attributes of the generated texts in the latent space.
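A sketch of the combined objective in Eqs. 6-9 under the reconstruction above; the helper function and its argument names are illustrative, and the Fisher weights are assumed to be precomputed from $\theta^*_B$:

```python
# Sketch of the total objective L = L_LM + L_Disc + L_EWC (Eqs. 6-9); names illustrative.
import torch
import torch.nn.functional as F

def adlm_loss(lm_logits, input_ids, disc_logits, attr,
              proj_params, proj_params_star, fisher, lam=0.1):
    # (6) conditional LM loss: next-token cross-entropy given the attribute
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        input_ids[:, 1:].reshape(-1))
    # (7) discriminator loss: classify the projected latent by its attribute
    disc_loss = F.cross_entropy(disc_logits, attr)
    # (8) EWC regularizer: keep important ProjB parameters close to theta_B^*
    ewc = lam * sum((f * (p - p_star) ** 2).sum()
                    for p, p_star, f in zip(proj_params, proj_params_star, fisher))
    # (9) total loss
    return lm_loss + disc_loss + ewc
```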
Sampling. Our model constrains the logits at generation time so that the vocabulary is steered toward the desired attribute. From the attribute-discriminative latent space of ADLM, we can obtain logits for different attributes, using much less memory at inference time than previous methods.
Our model computes two types of logits, $o_t$ and $\neg o_t$, for text generation, based on the desired (non-toxic; $a$) and undesired (toxic; $\neg a$) attributes, as shown in Figure 3b. Each logit is computed as follows:

$$o_t = \mathrm{Head}\big(\mathrm{ProjB}(h_{1:t-1}, z^a); \theta_H\big), \qquad \neg o_t = \mathrm{Head}\big(\mathrm{ProjB}(h_{1:t-1}, z^{\neg a}); \theta_H\big). \quad (10)$$

The non-toxic logits $o_t$ have high probability on non-toxic tokens, and the toxic logits $\neg o_t$ have high probability on toxic tokens. From this difference in probability, tokens with greater probability under the toxic logits than under the non-toxic logits can be presumed to be toxic tokens that could lead to the generation of toxic texts. Therefore, at every token generation step, we compute the difference between the logits, $\Delta o_t = o_t - \neg o_t$, and suppress the tokens with higher probability under the toxic logits as follows:

$$\bar{o}_t = o_t + \alpha \cdot \Delta o_t, \quad (11)$$

where $\bar{o}_t$ are the final logits of our decoding, and $\alpha$ is a constant suppression scale.
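A minimal decoding-step sketch of Eqs. 10-11 combined with nucleus sampling; `head`, `h_nontoxic`, and `h_toxic` are placeholders for the components above, and the combination rule follows our reconstruction of Eq. 11:

```python
# Decoding-step sketch of Eqs. 10-11 with nucleus sampling; arguments are placeholders.
import torch

def detoxified_next_token(head, h_nontoxic, h_toxic, alpha=4.0, top_p=0.9):
    o_t = head(h_nontoxic[:, -1])          # non-toxic logits (Eq. 10)
    neg_o_t = head(h_toxic[:, -1])         # toxic logits (Eq. 10)
    delta = o_t - neg_o_t                  # negative where toxic tokens dominate
    final = o_t + alpha * delta            # suppression (our reconstruction of Eq. 11)
    probs = torch.softmax(final, dim=-1)
    # nucleus (top-p) sampling over the adjusted distribution
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    mask = sorted_p.cumsum(dim=-1) - sorted_p > top_p
    sorted_p = sorted_p.masked_fill(mask, 0.0)
    sorted_p = sorted_p / sorted_p.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_p, num_samples=1)
    return idx.gather(-1, choice)          # next token id, shape (B, 1)
```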

Experimental Results
To validate our ADLM, we conduct a language generation task on RealToxicityPrompts (Gehman et al., 2020) and dialogue generation tasks on the ToxiChat (Baheti et al., 2021) and DiaSafety (Sun et al., 2022) datasets.

Detoxification for Language Generation
Baselines. We compare against the following baselines for generic language detoxification tasks, using GPT-2 as the base language model. All compared models, including ours, are trained on the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge dataset and evaluated on random 10K prompts from RealToxicityPrompts. The details of the hyperparameters used for each model are provided in Appendix B.2.
• Attribute conditioning (ATCON; Gehman et al. (2020)): This baseline learns the distribution of the generated texts conditioned on task-specific control codes (e.g., toxic or non-toxic) prepended to the texts.
• DAPT (Gururangan et al., 2020), PPLM (Dathathri et al., 2020), GeDi (Krause et al., 2020), and DExperts (Liu et al., 2021a): these baselines are described in Appendix B.2.

Automatic Evaluation. To validate our language detoxification method, we evaluate the toxicity of the generated texts as well as the efficiency of generation. Moreover, we examine the diversity of the generated texts. To automatically measure the toxicity of the generated texts, we utilize the Perspective API, which returns toxicity scores for given texts. To measure diversity, we calculate the mean number of distinct n-grams (Li et al., 2016), normalized by the total text length (see the sketch below).
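A sketch of the Dist-n computation as we read the definition (whitespace tokenization is our simplification):

```python
# Sketch of the Dist-n diversity metric: distinct n-grams over total token count.
def dist_n(texts, n):
    ngrams, total = set(), 0
    for text in texts:
        tokens = text.split()
        total += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total, 1)

# e.g., Dist-2 over all continuations: dist_n(generations, 2)
```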
The results in Table 2 show that ADLM largely outperforms the baselines in language detoxification performance. Compared to GeDi, ADLM lowers the toxicity of the generated texts to 0.28 with a significantly smaller number of parameters (about 1/7) and 2× faster inference. Moreover, our model generates more diverse texts than the baselines.
Ablation study. We examine the effect of each component of our ADLM, i.e., the architectural design, dataset design, and training modules, in Table 3. We observe that balancing the toxic and non-toxic data is the most important factor in constructing a well-discriminated latent space. Moreover, when we utilize a discriminator, our model discriminates the texts more effectively along with the attribute embedding tokens, which supports our hypothesis that obtaining a well-discriminated projected latent space is the key factor for successful detoxification.
Analysis of toxicity types. We further examine which types of toxic texts are most strongly suppressed by our model compared to GPT-2. As shown in Figure 4, our model suppresses all types of toxicity in the generated texts compared to the baselines. Notably, ADLM successfully suppresses toxicity of the threat type, which DExperts fails to detoxify. Threat is one of the types of toxic sentences that GPT-2 generates with the highest probability (0.624). This explains why DExperts is vulnerable to threats: DExperts ultimately employs the original latent space of GPT-2 and thus cannot significantly change its language generation behavior. On the other hand, our ADLM modifies the original latent space into an attribute-discriminative one, and thus can effectively suppress them. Another notable point is that all models, including ADLM, cannot handle flirtation well. However, by inspecting the generated examples, we found that the Perspective API assigns high flirtation scores to sentences containing commonly used words such as women, her, she, and like, which results in misclassification of sentences that do not contain any flirting context.

Detoxification for Dialogue Generation
Baselines. For the detoxified dialogue generation task, we use DialoGPT (Zhang et al., 2019) as the base language model, and compare against DialoGPT, DAPT, and ATCON.

Automatic Evaluation. To validate dialogue detoxification performance, we evaluate responses by the percentage of bad words and by offensiveness, using classifiers that predict the degree and type of toxicity (Baheti et al., 2021; Sun et al., 2022). Further, we also test the stance of the responses, i.e., whether they agree with the context or not. Table 4 shows that our model suppresses toxic responses better than the baselines. We further examine our method on another toxic dialogue dataset, DiaSafety. As shown in Figure 5, our method generates safer responses across different categories of toxic dialogues. The results on both datasets show that our method achieves consistent language detoxification performance on dialogue generation tasks across diverse categories of toxic language, effectively suppressing the toxicity of the generated responses even when the model is exposed to toxic data, which is essential for real-world dialogue applications.

Perplexity of Detoxified Texts
To examine the quality of the generated texts, perplexity (PPL) is frequently used as an automatic measure of fluency. However, since strong detoxification methods may generate texts that largely disagree with those in the test dataset (i.e., generating non-toxic continuations for toxic prompts), higher PPL is somewhat inevitable. As shown in Table 5, our model generates about twice as many non-toxic continuations from toxic prompts, with as much as 46.75% reduced toxicity compared to the baselines, but yields 109.05% higher PPL than DExperts. However, the increased PPL mostly results from generating sequences that are incoherent with toxic prompts in order to avoid toxic language generation, and it does not necessarily imply that the quality of the generated texts is degraded. This is clearly shown by the human study (Figure 6) in the next subsection, in which the participants ranked the fluency of the language generated by our method higher, and its toxicity lower.
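For reference, a sketch of how continuation perplexity under GPT-2 XL can be computed with HuggingFace Transformers (our formulation; the paper does not specify the exact scoring code):

```python
# Sketch: perplexity of a continuation given its prompt, scored by GPT-2 XL.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2-xl")
scorer = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

def continuation_ppl(prompt, continuation):
    p_ids = tok(prompt, return_tensors="pt").input_ids
    full = tok(prompt + continuation, return_tensors="pt").input_ids
    labels = full.clone()
    labels[:, :p_ids.size(1)] = -100  # ignore prompt tokens in the loss
    with torch.no_grad():
        loss = scorer(full, labels=labels).loss  # mean NLL over continuation tokens
    return torch.exp(loss).item()
```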

Human Evaluation of Generated Texts
Although we demonstrate the effectiveness of our method with automatic evaluation, human judgment is the most important measure in language generation. Thus, we performed a human evaluation of texts generated by our method, comparing them to those generated by the best-performing baselines, DExperts and GeDi (Figure 6). We evaluate the toxicity of the generated texts and their quality, e.g., grammatical correctness, topic coherency, and overall fluency, by recruiting 45 participants on Mechanical Turk. The experimental details are provided in Appendix B.3. The results show that our model is considered to have the best detoxification performance by human judgment as well (lower is better), with p < 0.05 in a paired t-test. Notably, our model is evaluated to have better fluency than the baselines (higher is better). The texts generated by our model are evaluated as more grammatically correct and fluent than those generated by GeDi and DExperts, with p-values less than 0.05 in paired t-tests. As for coherency, there was no significant difference among the compared models (p > 0.05). These results reconfirm that our model generates fluent and effectively detoxified texts.

Conclusion
In this paper, we proposed ADLM, a novel and effective attribute-controllable language model for efficient language detoxification. Our ADLM learns an attribute-discriminative latent space with a projection Transformer layer on top of the original pretrained LM, together with an attribute discriminator that differentiates texts by their attributes. Our method is shown to be effective at detoxifying texts for both language and dialogue generation tasks, outperforming all baselines in automatic and human evaluation, without the large computational and memory overhead of existing methods that use multiple LMs or additional computations.

Figure 6: Results of human evaluation. Bars represent average scores on each qualitative criterion used for language detoxification. ADLM has the lowest toxicity while having comparable fluency to DExperts and GeDi.

Limitations
Recent Transformer-based language models are prone to generating toxic texts such as insults, threats, and profanities. Therefore, ensuring safety in language generation is a crucial task, necessary for deployment to real-world applications. We achieve this goal with an efficient solution that does not require multiple LMs or further pretraining on a large refined corpus, which is computationally expensive. However, even with our techniques, the language model is not guaranteed to be completely safe and may still generate toxic language, albeit at a significantly lower rate. Furthermore, when toxic prompts are provided, the model may generate incoherent sequences to avoid toxic generation, which leads to reduced fluency compared to the original language model. Yet, this is a general limitation of detoxified language modeling, which is inevitable since no method can change the given prompts.

A Terminology

Here, we describe the terminology used in the manuscript in more detail.
Attribute. The characteristic of a sentence in terms of toxicity. Toxic and non-toxic are the attribute types in the toxicity task.
Latent space. We denote the hidden space between the Transformer and the head layer of the language model as the latent space.
Toxicity. The score of how harmful or unpleasant the provided text is. Toxicity is scored from 0 to 1.0. A sentence with a score larger than 0.5 is considered toxic; a sentence with a score smaller than 0.5 is considered non-toxic.
Types of toxicity. The Perspective API scores toxic sentences along 8 different types: profanity, sexually explicit, identity attack, flirtation, threat, insult, severe toxicity, and toxicity. The results reported in the main manuscript are based on the toxicity score.
Toxicity probability. The probability of generating toxic sentences (score ≥ 0.5) among 25 generations from a single prompt. For example, if five of the 25 generations have a score larger than 0.5, the toxicity probability is 5/25 = 0.2.
Expectation of max toxicity. Expected Max Toxicity (Exp. Max Toxicity) is the mean, over the prompts in the evaluation set, of the maximum toxicity score among the 25 generations for each prompt.
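A sketch computing both metrics from per-prompt score matrices, following the two definitions above (the fraction-based toxicity probability mirrors the 5/25 example):

```python
# Sketch: Expected Max Toxicity and toxicity probability from Perspective scores.
import numpy as np

def toxicity_metrics(scores):
    scores = np.asarray(scores)                      # shape (n_prompts, 25)
    exp_max_tox = scores.max(axis=1).mean()          # mean of per-prompt max toxicity
    tox_prob = (scores >= 0.5).mean(axis=1).mean()   # mean fraction of toxic generations
    return exp_max_tox, tox_prob
```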
Fluency. Fluency measures how fluent the continuation is. Automatic evaluation of fluency is computed as the perplexity of the generated output according to GPT-2 XL.
Diversity. Diversity measures how diverse the words generated by the models are. Automatic evaluation of diversity is computed by counting the unique n-grams normalized by the total length of the text. Dist-1, Dist-2, and Dist-3 denote the values for 1-grams, 2-grams, and 3-grams, respectively.

B.1 Dataset
Toxicity dataset. For the training set, we use the dataset from the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. The dataset is annotated by humans. We assign to the toxic class the comments that more than 50% of annotators labeled as toxic, and to the non-toxic class the comments that none of the annotators labeled as toxic. The toxic and non-toxic classes consist of 160K and 1.4M comments, respectively. Since we need to control our hidden states, we duplicate the toxic comments to match the number of non-toxic comments, balancing the two classes to form a stable representation (see the sketch below).
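A minimal sketch of the duplication-based balancing described above (our simplification):

```python
# Sketch: balance classes by duplicating toxic comments up to the non-toxic count.
import math

def balance_by_duplication(toxic, non_toxic):
    reps = math.ceil(len(non_toxic) / len(toxic))
    return (toxic * reps)[:len(non_toxic)], non_toxic
```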
For the evaluation set, we use several subsets of the RealToxicityPrompts (Gehman et al., 2020) dataset. The 100K set consists of all evaluation prompts from RealToxicityPrompts. The random 10K prompts are random samples of 5K toxic and 5K non-toxic prompts from the RealToxicityPrompts dataset (Liu et al., 2021a). We sample 25 continuations per prompt with nucleus sampling (top-p = 0.9). The temperature is set to 1 and the max length of a continuation is set to 20.
Toxicity dataset for dialogue generation. We train our model on the Reddit conversation dataset from Baheti et al. (2021). Each conversation consists of a title, a post, and a response, with offensiveness and stance labels indicating whether a comment is toxic or conforming.

B.2 Baseline
DAPT. For the language detoxification task, DAPT is further trained on a non-toxic corpus, OpenWebText (Gokaslan and Cohen, 2019). The results of DAPT (small) are from Gehman et al. (2020).
ATCON. ATCON is a model that learns the distribution of the generated text by conditioning on given control codes specific to each task. For the language detoxification task, each text is prepended with a control code: toxic or non-toxic. The results of ATCON are evaluated on the random 10K prompts from RealToxicityPrompts (Gehman et al., 2020).
PPLM. PPLM consists of a classifier that backpropagates gradients to the LM multiple times to generate texts with the desired attributes. Because of the high computational cost of this model, 10 sentences are sampled per prompt. For the language detoxification task, the PPLM results are those reported by Gehman et al. (2020) on the random 10K prompts from RealToxicityPrompts. The model is GPT-2-medium based.
GeDi. GeDi guides the generation of each token by determining the attribute probability of the given text, obtained via Bayes' rule by normalizing over the two attribute-conditional distributions of next tokens. To this end, it uses two LMs: a base LM and a discriminator LM. The discriminator LM is trained as in ATCON, learning the attribute-conditional distributions, while the base LM focuses on generation under the guidance of the discriminator LM. For the language detoxification task, the results of GeDi are evaluated on the random 10K prompts from RealToxicityPrompts. We utilized the model provided by Krause et al. (2020), which is GPT-2-medium based.
DExperts. Following the concept of an expert and an anti-expert, DExperts uses three LMs: a base LM, an expert, and an anti-expert. The expert and anti-expert are trained on specific subsets of the dataset: non-toxic and toxic texts, respectively, in the language detoxification task, and positive and negative texts in the sentiment-controlled task. DExperts uses the logits from both experts to help the base LM suppress and amplify logit values so that the base LM samples the desired vocabulary. For the language detoxification task, the results of DExperts are evaluated on the random 10K prompts from RealToxicityPrompts. We reproduced DExperts with small experts (GPT-2-small based), for which the detoxification performance was the best among the GPT-2 sizes.

B.3 Human evaluation
We conduct a human evaluation as shown in Figure 7, with 45 participants. We compare against DExperts and GeDi in this experiment, which are the two best-performing baselines by automatic evaluation. We first randomly choose 20 prompts from the random-10K subset. Then, for each prompt, we randomly select one of the 25 generated continuations and show the texts generated by our model, DExperts, and GeDi in random order.
For language detoxification, the 45 participants thus evaluated 60 continuations (20 prompts × 3 models) on i) toxicity, ii) grammatical fluency, iii) topic coherency, and iv) overall fluency. For each question, participants gave a score from 1 to 5 on whether the provided continuation was toxic or fluent. For the results, we average the scores of all 20 sequences for each question.
We provided scoring guidelines for each question. For toxicity, scores 1, 3, and 5 mean not toxic at all, feels toxic, and very toxic (contains toxic words), respectively. For grammatical correctness, scores 1, 2, 3, 4, and 5 stand for grammatically poor, weak, understandable, minor mistakes, and good. For topic coherency, scores 1, 3, and 5 mean a totally different topic, a similar topic but not fluent, and good coherency, respectively. For fluency, scores 1, 2, 3, 4, and 5 mean does not make any sense, weak, limited, understandable, and good.
As shown in Figure 6, our model scores 2.24, 3.60, 3.00, and 3.39 for toxicity, grammatical correctness, coherency, and fluency, respectively. In sum, our model generates texts that are rated as less than "feels toxic", with only minor grammar mistakes, on a similar topic though not perfectly fluent, and with fluency between limited and understandable.

C.1 Modeling Details
We use GPT-2 from HuggingFace Transformers version 4.2.0 (Wolf et al., 2020), implemented in the PyTorch framework. For RealToxicityPrompts (Gehman et al., 2020), our ADLM is trained with a block size of 128, a batch size of 32 per GPU, a learning rate of 5e-5, and 3 epochs. The same setting is used for sentiment-controlled text generation. Since the sizes of the training datasets differ for the dialogue generation tasks, those hyperparameters are determined empirically. For ToxiChat (Baheti et al., 2021), our ADLM and the baselines are trained with a batch size of 32 per GPU, a learning rate of 2e-5, and three epochs. For DiaSafety (Sun et al., 2022), our ADLM and the baselines are trained with a batch size of 8 per GPU, a learning rate of 2e-5, and five epochs. The blocks of both dialogue datasets are not truncated unless they exceed 512 tokens. For all datasets, we set λ to 0.1 for the EWC loss and use the AdamW optimizer with epsilon 1e-8 and a linear scheduler (see the sketch below). Training is performed on a single NVIDIA RTX 2080 Ti or Quadro RTX 8000.
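A sketch of the stated optimizer configuration; the zero-warmup choice and the `total_steps` placeholder are our assumptions:

```python
# Sketch of the stated optimizer setup (warmup steps and total_steps are assumed).
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, get_linear_schedule_with_warmup

model = GPT2LMHeadModel.from_pretrained("gpt2")
total_steps = 10_000  # placeholder: depends on dataset size and epochs

optimizer = AdamW(model.parameters(), lr=5e-5, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps)
```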

C.2 Generation
For RealToxicityPrompts (Gehman et al., 2020) and sentiment-controlled text generation, we use the same generation settings for all baselines and our models, except for PPLM (Dathathri et al., 2020). We perform a total of 25 generations per prompt. The max length of generated sentences is 20. For PPLM, we generate 10 continuations per prompt due to its computational cost. For our generation, we set α to 4.0 for the language detoxification task. For dialogue generation, the setup differs. For ToxiChat (Baheti et al., 2021), the models generate until the end-of-sequence token appears or the max sequence length of 500 is reached, with α set to 1.5. Lastly, for DiaSafety (Sun et al., 2022), the max generation length is set to 128 and α is set to 1.5. All generations use nucleus sampling with top-p probability 0.9 and temperature 1.0 for the softmax (a sketch of these settings is given below).
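The stated settings translate roughly as follows in a recent HuggingFace API (a sketch; `max_new_tokens` is assumed as the modern equivalent of the length cap):

```python
# Sketch of the stated generation settings with HuggingFace's generate API.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Every Monday we publish", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.9,                # nucleus sampling
    temperature=1.0,
    max_new_tokens=20,        # continuation length cap for RealToxicityPrompts
    num_return_sequences=25,  # 25 generations per prompt
    pad_token_id=tokenizer.eos_token_id,
)
continuations = tokenizer.batch_decode(outputs[:, inputs.input_ids.size(1):])
```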

Table 6: Performance of sentiment-controlled generation. The task is to generate a positive continuation from negative prompts (Neg → Pos) and a negative continuation from positive prompts (Pos → Neg). Columns: Model, Neg → Pos (↑), Pos → Neg (↓).

D.1 Sentiment-Controlled Text Generation

Dataset. For the sentiment-controlled text generation task, we use the SST-5 dataset (Socher et al., 2013). Each review in the dataset is rated on a scale from 1 to 5 (very negative to very positive). Reviews with ratings of 4 to 5 are assigned as positive reviews, and ratings of 1 to 2 as negative reviews. For the evaluation set, there are 2.5K prompts for each sentiment, provided by Liu et al. (2021a) and obtained from OWTC (Gokaslan and Cohen, 2019).
Baselines. For sentiment-controlled text generation, the positive and negative DAPT (Gururangan et al., 2020) models are independently trained on the corresponding subsets of the SST-5 dataset. Similar to ATCON, CTRL (Keskar et al., 2019) is used, with "Reviews Rating: 5.0" and "Reviews Rating: 1.0" as control codes. The results of DAPT, CTRL, GeDi, PPLM, and DExperts on the sentiment-controlled text generation task are the values reported by Liu et al. (2021a).
Automatic Evaluation. To verify that our method is generally applicable to any controllable text generation task, we further validate our model on the sentiment-controlled text generation problem.

To this end, we consider the problem of generating continuations with semantics opposite to the given prompts (e.g., positive continuations for negative prompts). For automatic evaluation, to validate whether the generated text matches the target sentiment, we use HuggingFace's sentiment analysis classifier (Wolf et al., 2020).

The results in Table 6 show that our model achieves impressive performance on sentiment-controlled text generation as well. This suggests that our method is applicable to any attribute-controlled text generation task.

D.2 Ablation experiment
To evaluate fluency, we measure the mean perplexity of the continuations according to the GPT-2 XL model. We conduct ablation experiments on α in Eq. 11 and λ in Eq. 8. As shown in Figure 8, as α decreases and λ increases, toxicity increases while perplexity decreases.
Toxicity control performance and fluency are thus in a trade-off relationship, and we can increase one at the expense of the other by controlling the values of α and λ.

D.3 Generation examples
Tables 7 and 8 show examples generated by our model for the language detoxification task. Tables 9 and 10 show examples generated by our model for the dialogue detoxification task on the ToxiChat dataset.