GTA: Gated Toxicity Avoidance for LM Performance Preservation

Caution: This paper includes offensive words that could potentially cause unpleasantness. The fast-paced evolution of generative language models such as GPT-4 has demonstrated outstanding results in various NLP generation tasks. However, because these models can generate offensive words related to race or gender, various Controllable Text Generation (CTG) methods have been proposed to mitigate the occurrence of harmful words. Yet existing CTG methods not only reduce toxicity but also negatively impact several aspects of the language model's generation performance, including topic consistency, grammar, and perplexity. This paper explores the limitations of previous methods and introduces a novel solution in the form of a simple Gated Toxicity Avoidance (GTA) that can be applied to any CTG method. We evaluate the effectiveness of the proposed GTA by comparing it with state-of-the-art CTG methods across various datasets. Our findings reveal that gated toxicity avoidance achieves toxicity reduction comparable to the original CTG methods while preserving the generation performance of the language model.


Introduction
Large Language Models (LLMs) have demonstrated outstanding performance on various generation tasks, yet a persistent challenge is that they also easily generate toxic text. Unlike other performance issues, generating toxic text, including hate speech and profanity, significantly affects the deployment of LLM-based services. For example, Tay, a chatbot released by Microsoft in 2016, generated toxic tweets, and the service was suspended 16 hours after release. The recently released Llama-2 (Touvron et al., 2023) performs well on academic and professional exams and improved its safety, but it can still generate toxic text.
Figure 1: An LM can generate unintended toxic text. A controllable text generation method prevents the generation of toxic text but degrades LM performance. Our gated toxicity avoidance method preserves performance while reducing toxicity.
To remove toxicity from the language model, various Controllable Text Generation (CTG) methods (Dathathri et al., 2019; Krause et al., 2020; Liu et al., 2021; Zhang and Song, 2022) have been suggested and have shown decreased toxicity. We observed that CTG methods show less toxicity than modern approaches like Reinforcement Learning from Human Feedback (RLHF) (Appendix A). However, there are two difficulties in applying CTG in practice. 1) Quality degradation: as shown in Fig. 1, CTG methods degrade the generation performance of the LM in various aspects. There is a notable decrease in the intended topic accuracy of texts generated with CTG methods compared to those generated solely by the original LM. CTG methods also decrease the quality of grammar and fluency (Gu et al., 2022). 2) Inference overhead: CTG methods require additional inference time and memory. When we experimented with three well-known CTG methods, we found that they slowed generation by 6.75 times on average. This overhead can significantly affect multi-user interaction-based services that require near-real-time responses (e.g., chatbots).
This LM performance degradation is caused by the bias of the CTG method. To avoid toxic token generation, the CTG method's bias adjusts the probability distribution over tokens provided by the LM. However, this bias not only avoids toxic token generation but also affects the probabilities of tokens associated with particular topics.
For example, in Fig. 2, the CTG method not only reduces the probability of the toxic token ('stupid') but also decreases the probability of the negative emotion token ('bad'), while simultaneously increasing the probability of the positive emotion token ('good'). CTG methods thus encourage the LM to generate more positive and less negative text. Because of these bias-driven probability calibrations, CTG methods affect the performance of the LM, including degrading topic consistency. With a CTG method, a trade-off exists between toxicity avoidance and LM performance.
To resolve these problems, we propose a simple model-agnostic gating method that selectively applies a CTG method, called Gated Toxicity Avoidance (GTA). It preserves generation quality and can be applied to any CTG method. Additionally, it improves the generation speed of guided-decoding-based CTG methods. To the best of our knowledge, this paper is the first to address the holistic performance degradation of CTG methods. We validate the proposed gating method with various models on diverse datasets and empirically demonstrate that gating resolves the limitations of CTG methods. Our experimental code and results are publicly available.


Related Works

LM performance degradation analysis
Previous studies have analyzed the drawbacks of toxicity avoidance on LM performance under specific, limited criteria. Welbl et al. (2021) empirically demonstrate the potential bias of toxicity reduction through changes in LM loss: the loss of an LM pretrained on clean data to reduce toxicity fluctuates depending on gender, ethnicity, and demographics. Because of this bias, an LM pretrained on clean data fails to generate minority identity mentions and minority dialects (Xu et al., 2021). Gu et al. (2022) also showed that CTG methods degrade generation quality as text length increases. These studies only analyzed simple indicators such as gender and length. In contrast, we analyze CTG methods' holistic performance degradation (topic preservation, grammar, and fluency) and the inference overhead expected in an actual service.

Toxicity avoidance
Toxicity avoidance is a technique that prevents an LM from generating toxic sentences. There are three types of toxicity avoidance methods. The first, ranking, generates several texts and picks the least toxic one at generation time. Generating multiple candidates is expensive, which makes ranking hard to apply given the recent trend of increasing LM size. The second, text style transfer, regenerates nontoxic text from generated toxic text. However, it costs twice as much because there are two steps: generation and regeneration (detailed experimental results are in Appendix B). The third, Controllable Text Generation (CTG), controls the LM to generate text that satisfies a specific condition. Since CTG is easy to apply to any type of LM and more effective than the other two approaches, recent studies have used CTG.

Controllable Text Generation (CTG) for toxicity avoidance
Various CTG-based toxicity avoidance methods without additional inference costs have been proposed. For nontoxic text generation, model retraining with auxiliary information is used, including control codes, as in CTRL (Keskar et al., 2019), or reinforcement learning from human feedback, as in InstructGPT (Ouyang et al., 2022). To reduce the cost of retraining, prompt tuning methods have been proposed. DisCup (Zhang and Song, 2022) is a prompt tuning method using unlikelihood training assisted by a discriminator to generate nontoxic tokens without updating the LM. However, these methods still require inference by a language model during training.
As the size of the LM increases, additional training becomes more expensive; therefore, guided decoding methods have been proposed. Guided decoding changes the output distribution of the LM to generate an intended text by adjusting it with a pretrained toxicity discriminator. Discriminator training is much less expensive and easier to apply than retraining or prompt tuning. PPLM (Dathathri et al., 2019) calculates the toxicity of the token generated by the model by attaching a discriminator head to the last hidden state of the language model. The last hidden state is then updated toward nontoxicity using the gradient. However, this process is too slow.
To reduce this computation, methods that directly adjust the output probability using the discriminator have been proposed. These methods are much faster at inference and do not require the language model in the discriminator training phase. GeDi (Krause et al., 2020) learns a Class-Conditional Language Model (CC-LM) and calculates the nontoxic token probability using Bayes' rule to guide the LM. FUDGE (Yang and Klein, 2021) evaluates the LM's candidate tokens and induces the selection of appropriate tokens. DExperts (Liu et al., 2021) subtracts the output of an anti-expert and adds the output of an expert to the output distribution of the LM.
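As a concrete illustration of this kind of logit arithmetic, the following toy sketch (our own construction, not the authors' code) combines expert and anti-expert logits in the DExperts style; the three-word vocabulary and all logit values are hypothetical:

```python
import numpy as np

def dexperts_logits(lm_logits, expert_logits, antiexpert_logits, alpha=1.0):
    """DExperts-style ensemble: push probability mass toward tokens the
    (nontoxic) expert prefers and away from tokens the anti-expert prefers."""
    return lm_logits + alpha * (expert_logits - antiexpert_logits)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy vocabulary: ["good", "bad", "stupid"]
lm = np.array([1.0, 1.0, 1.0])       # base LM is indifferent
expert = np.array([2.0, 0.0, -2.0])  # nontoxic expert favors "good"
anti = np.array([-2.0, 0.0, 2.0])    # toxic anti-expert favors "stupid"

probs = softmax(dexperts_logits(lm, expert, anti, alpha=0.5))
# The toxic token's probability drops below the others after steering.
assert probs[2] < probs[1] < probs[0]
```

Note that the steering touches every token's probability, not only the toxic one, which is exactly the bias-driven side effect discussed in the Introduction.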
In this paper, we propose gated toxicity avoidance, a model-agnostic method for CTG that can be utilized without tuning the LLM.

Preliminaries
In this section, we introduce the basic terms used throughout the rest of the paper. With this notation, we formally define the detoxification problem and two well-known methods.

Text generation
The language model generates text using the conditional probability of the next token, learned from a large corpus of documents. The model selects the next token based on the conditional probability given the sequence so far, appends the selected token, and predicts the next token again with the updated sequence. Through this autoregressive process, the language model generates text with the following probability:

p(X) = \prod_{t=1}^{n} p(x_t \mid x_{<t})    (1)
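The autoregressive loop described above can be sketched with a toy next-token table standing in for a real LM (the table, tokens, and probabilities are illustrative only; a real model conditions on the full prefix rather than just the last token):

```python
import numpy as np

# A toy next-token model: p(x_t | x_{t-1}) as a lookup table, a stand-in
# for a real LM's conditional distribution over its vocabulary.
COND = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.7, "dog": 0.3},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def generate(rng, max_len=10):
    """Autoregressive loop: sample the next token, append it, repeat."""
    tokens = ["<s>"]
    for _ in range(max_len):
        dist = COND[tokens[-1]]
        choices, probs = zip(*dist.items())
        tokens.append(rng.choice(choices, p=probs))
        if tokens[-1] == "</s>":
            break
    return tokens

rng = np.random.default_rng(0)
print(generate(rng))  # e.g. ['<s>', 'the', 'cat', '</s>']
```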

Toxicity Avoidance
Toxicity avoidance is a method that prevents language models from generating toxic text, i.e., text that contains harmful, offensive, or abusive content.
Controllable Text Generation (CTG) focuses on generating coherent language while allowing control over different aspects of the text, such as sentiment, emotion, and category. With CTG, we can also control the LM to generate nontoxic sentences. If the condition we want the generated tokens to satisfy is c, CTG can be defined as follows:

p(X \mid c) = \prod_{t=1}^{n} p(x_t \mid x_{<t}, c)    (2)
One well-known method is guided decoding (Ghazvininejad et al., 2017), which freezes the LM and only modifies the probability of a token being chosen using a discriminator, an additional model that guides the language model. It induces the token of the desired condition to be selected from the modified probability distribution.
Another approach is prompt tuning (Lester et al., 2021). It also freezes the LM and trains only a few prompt embeddings to control language model generation. Prompt tuning can be defined as follows.

p(X) = \prod_{t=1}^{n} p(x_t \mid [P; X_{<t}])

where P is a set of trainable prompt embeddings and X is a set of text embeddings.
Toxicity avoidance can be realized as a CTG method applied to a language model. Its performance is evaluated by the reduced toxicity and the fluency of the text it generates. To preserve LM performance while avoiding toxic text generation, the CTG method should operate only when a toxic token would be generated. However, CTG methods are trained with sequence-level labeled data, which assumes that all tokens in a sentence are either toxic or not. This avoids not only toxic word generation but also the generation of text whose distribution is similar to the toxic-labeled data, and it induces text to be generated following the nontoxic-labeled data. Because of this, CTG methods avoid topics, tones, and expressions that often appear in toxic data, and performance degrades.
To minimize this influence, we propose a gated toxicity avoidance that selectively applies the CTG method during the autoregressive generation process. The gate model g(x) determines when the CTG method operates, where g(x) is a pretrained binary toxicity classifier. g(x) estimates the toxic probability of a given text x and returns 1 if the probability is greater than the gate threshold θ and 0 otherwise (for the gate model, we adopted a published toxicity classifier). When the token to be generated is toxic, the CTG method changes it to a nontoxic token. This gated toxicity avoidance can be applied to any token-level CTG method. Fig. 3 compares the generation processes of the LM, the CTG method, and our gated toxicity avoidance.
Formally, the gated toxicity avoidance for guided-decoding-based approaches (Dathathri et al., 2019; Krause et al., 2020; Yang and Klein, 2021; Liu et al., 2021) is defined as

p(x_t \mid X_{<t}, c) =
\begin{cases}
p_{\mathrm{CTG}}(x_t \mid X_{<t}, c) & \text{if } g(X_{<t} \oplus x_t) = 1 \\
p_{\mathrm{LM}}(x_t \mid X_{<t}) & \text{otherwise}
\end{cases}

where c denotes the topic and X represents the generated text.
For the prompt-tuning-based approaches (Zhang and Song, 2022), the gated toxicity avoidance is defined as

p(x_t) =
\begin{cases}
p(x_t \mid [P; X_{<t}]) & \text{if } g(X_{<t} \oplus x_t) = 1 \\
p(x_t \mid X_{<t}) & \text{otherwise}
\end{cases}

where P is a set of trainable prompt embeddings and X is a set of text embeddings.
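The gating rule above can be sketched as a decoding loop with stand-in functions for the base LM, the CTG-adjusted distribution, and the gate classifier (all three are hypothetical placeholders with made-up distributions, not the actual models used in the paper):

```python
import numpy as np

VOCAB = ["good", "bad", "stupid", "."]

def lm_next_dist(ctx):
    # Stand-in for the base LM's next-token distribution.
    return np.array([0.3, 0.3, 0.3, 0.1])

def ctg_next_dist(ctx):
    # Stand-in for a CTG-adjusted distribution (toxic mass removed).
    return np.array([0.55, 0.35, 0.0, 0.1])

def toxicity_score(text):
    # Stand-in gate model g(x): a pretrained binary toxicity classifier.
    return 0.9 if "stupid" in text else 0.0

def gated_generate(steps=5, theta=0.005, seed=0):
    rng = np.random.default_rng(seed)
    ctx = []
    for _ in range(steps):
        p = lm_next_dist(ctx)
        token = VOCAB[rng.choice(len(VOCAB), p=p)]
        # Gate: invoke the CTG method only when the candidate continuation
        # is judged toxic; otherwise keep the cheap, unbiased LM sample.
        if toxicity_score(" ".join(ctx + [token])) > theta:
            p = ctg_next_dist(ctx)
            token = VOCAB[rng.choice(len(VOCAB), p=p)]
        ctx.append(token)
    return ctx

out = gated_generate()
assert "stupid" not in out  # toxic candidates are resampled via the CTG method
```

Because the LM's own sample is kept whenever the gate stays closed, the CTG bias only touches the (rare) toxic steps, which is the mechanism behind the performance preservation reported later.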
Experimental Settings

Dataset
To analyze the impacts of topic distributions, we conduct experiments with various topic group datasets, including Sentiment, Emotion, and News.
Sentiment is a binary-topic (positive/negative) dataset of texts expressing opinions about restaurants. Emotion is a six-topic (anger/sadness/fear/surprise/joy/love) dataset of English Twitter messages. News is a five-topic (tech/sport/entertainment/politics/business) dataset of BBC news articles.

Evaluation Metrics
Toxicity: a metric of the text's harmfulness. We used the Perspective API, which is widely used to measure text toxicity with values between 0 and 1. The score represents the probability that a reader would perceive the text as containing a toxic attribute (rude, disrespectful, or unreasonable); a higher score indicates that more readers are likely to find the text toxic. If this score was 0.5 or higher, the text was classified as toxic. We measured how many of the generated texts were toxic.
Accuracy: a metric of topic consistency, indicating whether a text matches its given topic.
Grammar: a metric of the text's morphological and syntactic quality. To evaluate grammar, we adopted a published RoBERTa-base classifier (Krishna et al., 2020) fine-tuned on the CoLA (Warstadt et al., 2018) dataset. We use the classifier's average probability value as the score.
Details of the accuracy and grammar classifiers used for automatic evaluation are described in Appendix F.1.
Perplexity (PPL): used to evaluate the text's fluency. It represents how well an LM predicts a given sequence of words; a lower perplexity indicates that the model is more confident and accurate. Perplexity was evaluated with the baseline LM that generated each text.
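Two of these metrics reduce to simple formulas; the following minimal sketch uses hypothetical scores and token probabilities (not the actual Perspective API or evaluation pipeline):

```python
import math

def toxic_rate(scores, threshold=0.5):
    """Fraction of generations whose Perspective-style toxicity score
    reaches the threshold (i.e., are classified as toxic)."""
    return sum(s >= threshold for s in scores) / len(scores)

def perplexity(token_probs):
    """PPL = exp(-(1/N) * sum(log p)); lower means the evaluating LM
    finds the text more predictable (more fluent under that model)."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical toxicity scores for five generated texts: two cross 0.5.
assert toxic_rate([0.02, 0.10, 0.51, 0.97, 0.30]) == 0.4
# A model that assigns every token probability 0.5 has perplexity 2.
assert abs(perplexity([0.5, 0.5, 0.5]) - 2.0) < 1e-9
```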

LM and Baseline CTG methods
LM: the model for controllable text generation; we used two sizes of GPT2 (Radford et al., 2019): small (117M) and large (762M). To compensate for the low generation performance of GPT2-small, we fine-tuned the LM on the three topic group datasets so that, given a topic as a prompt, a related text is generated. Table 20 in the appendix shows examples of prompts and generated texts.
For GPT2-large, we used in-context learning to generate topic-matched texts, with a different number of shots for each topic group: 10 shots for Sentiment, 30 for Emotion, and 5 for News. Only nontoxic samples were used. In cases where a sample contained many sentences (especially in the Sentiment and News datasets), only the first three sentences were used.
We generated 1,000 sentences per topic with the small LM and 100 sentences per topic with the large LM. We used top-k sampling and nucleus sampling (Holtzman et al., 2020) to generate diverse texts. See Appendix D for detailed generation parameters.
• PPLM (Dathathri et al., 2019) requires a classification head, which we trained for each topic-generation LM using the code from the published PPLM repository.
• GeDi (Krause et al., 2020) calculates the nontoxic token probability with a Class-Conditional Language Model (CC-LM). We used the 345M-scale toxicity CC-LM published with the paper. The strength ω, which controls the degree of toxicity, was set to 15 and 30. A larger ω forces the LM to generate less toxic text.
• DExperts (Liu et al., 2021) adjusts the probability with two output distributions (expert and anti-expert). We used the paper's published large-scale (762M) toxicity expert and anti-expert. The guiding strength α, which controls the degree of toxicity, was set to 0.5 and 1.0. Like ω in GeDi, a larger α forces the LM to generate less toxic text.
• DisCup (Zhang and Song, 2022) is a state-of-the-art prompt tuning method. We used the published prompt embeddings for toxicity avoidance. After prompt tuning, GPT2-small showed low generation performance, so the experimental results of DisCup with GPT2-small were excluded.

Experimental Results
In this section, we answer the following research questions:

• RQ1) Do CTG methods truly degrade performance?
• RQ2) Can the gated toxicity avoidance reduce toxicity while preserving performance?
• RQ3) Does the degree of performance degradation vary by topic?
• RQ4) Can the gated toxicity avoidance also reduce inference time?

• RQ5) Does the problem still occur with a large-scale model?

RQ1: CTG Method Degradation Effects
We observed that all CTG methods reduce toxicity but degrade various aspects of LM performance. Specifically, performance degradation varies across the different topic groups and methods employed. PPLM on News surprisingly improves topic accuracy, but it repeatedly generates particular keywords, and this improvement comes at the cost of degraded grammar and perplexity scores.
DExperts performs the lowest in terms of topic accuracy, grammar, and perplexity. On average among the CTG methods, GeDi exhibits the lowest toxicity and the best grammar, perplexity, and topic accuracy. Nevertheless, even GeDi degrades performance on a specific topic group (Emotion).
Notably, among the three topic groups, the most significant performance degradation is observed in the Emotion group, which is also the most toxic (3.8%). This is due to significant degradation on the negative emotions (sadness, anger), which we discuss in detail in RQ3.
We selected the optimal parameter that significantly reduces toxicity while minimizing generation quality degradation. Each CTG method has its own strength parameter: the greater the strength, the lower the toxicity, but the worse the other performance metrics. See Appendix F.2 for the change in performance with strength.

RQ2: Overall Performance of the Gated Toxicity Avoidance
Table 3 shows that our gated toxicity avoidance resolves the performance degradation. It reduces toxicity to the same level as the original (non-gated) CTG method and preserves topic accuracy, grammar, and PPL at the same level as the baseline LM (GPT-2), regardless of the CTG method. In particular, for PPLM and DExperts, the generation quality improves significantly.
In the detailed per-topic results, the gated toxicity avoidance shows baseline (GPT-2) level performance and the same level of toxicity as the original CTG method regardless of the topic (see Appendix I). Since the toxicity reduction follows the performance of the original CTG method, GeDi GTA shows the best toxicity reduction.
Performance preservation depends on the gate threshold θ used by the gate model (the toxicity classifier). We used θ = 0.005 as the best threshold; this value may vary depending on the gate model (Appendix F.3). The gate model can significantly reduce toxicity at low thresholds because the distribution of its output probabilities is highly skewed.

RQ3: Effectiveness of Topics
To analyze the effect of each topic, we validated the performance of GeDi (see Fig. 4), which shows the best overall performance, on the Emotion topic group, which has the largest number of labels. Toxicity varies by topic: anger (13.1%) and sadness (7.8%) show notably high toxicity, compared to 1.85% on average across all topics, while the remaining topics show very low toxicity. High toxicity is related to the degradation of the CTG method: for the anger and sadness topics, GeDi demonstrates a more significant degradation in topic accuracy. GeDi improves the grammar and perplexity scores because it supplements grammatical elements missing from the original Emotion topic group, such as spaces, dots, and apostrophes.
For the Sentiment topic group, GeDi shows a 1.9-point decrease in topic accuracy, but the grammar degrades significantly (6.13 points). Both the positive and negative topics decline similarly. For the tech topic in News, GeDi shows a 2.9-point decrease in topic accuracy, with no significant change in the other topics. Changes in topic accuracy do not correlate with changes in grammar and perplexity, which means that degradation in text quality does not cause the change in topic accuracy. We posit that this difference in performance is due to the training data bias of the CTG method.

RQ4: Efficiency of the Gated Toxicity Avoidance
We validate the efficiency of our proposed gated toxicity avoidance. It not only preserves LM performance but also generates text more efficiently, making generation faster than with the original CTG method.
To evaluate efficiency, we averaged 100 text generation costs, where each generation cost was measured as the time it took to generate 100 tokens. For a fair evaluation, GeDi and DExperts used the same size of CC-LM and experts. Fig. 5 shows that the gated toxicity avoidance reduces the generation time regardless of the CTG method.
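The timing protocol amounts to averaging wall-clock costs over repeated generation calls; a minimal harness is sketched below (the `generate_fn` stand-in is hypothetical, substituting for "generate 100 tokens" with a real model):

```python
import time

def average_generation_cost(generate_fn, runs=100):
    """Average wall-clock time per generation call (e.g., 100 tokens each)."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn()  # one full generation (model call in practice)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# Hypothetical cheap stand-in for a 100-token generation call.
cost = average_generation_cost(lambda: sum(range(1000)), runs=10)
assert cost >= 0.0
```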
The gated toxicity avoidance invokes the CTG method only infrequently, which contributes to its efficiency. In contrast, guided-decoding-based CTG methods invoke their mechanisms at every token generation step. Although the gated toxicity avoidance also invokes a gate model at each step, a compact gate model is more cost-effective. This improvement will be even more remarkable in practice, where the CTG model may be many times larger than the gate model. We also evaluated performance according to the scale of the CTG method: after training experts and anti-experts of 117M parameters, we generated 100 texts of 100 tokens each with DExperts and evaluated them. The gate model had 110M parameters. See Appendix G for the experimental details.
Table 4 shows that small DExperts uses less memory and generates faster. However, it has higher toxicity than DExperts-large, and the performance degradation is more considerable. DExperts GTA ensures a level of toxicity similar to or better than the original DExperts and a level of generation quality similar to the baseline LM. One interesting observation about DExperts-small GTA is that it has lower toxicity than the original DExperts-small. This is presumably because the gate model triggers token resampling even when the small-scale CTG method's performance is low. If the small-scale CTG method has sufficiently low toxicity, the gated toxicity avoidance can be used to address the remaining degradation. For performance preservation, using a gated toxicity avoidance with a small-scale CTG method is more effective in memory and speed than using only a larger CTG method.


RQ5: CTG Method with a Large-Scale LM

In this section, we demonstrate that a large-scale LM with a CTG method reduces toxicity and degrades performance just as a small-scale LM does, and that the gated toxicity avoidance also works well with a large-scale LM. GeDi's performance degradation differs from that in the small-scale LM. There is significant topic accuracy degradation in the anger, positive, and negative topics, and no noticeable change in the other topics. The average grammar and perplexity are also degraded. The joy and sadness topics drive GeDi's grammar degradation, and perplexity increases in the anger, negative, and sadness topics.
Even though DisCup shows better topic accuracy than GeDi, it is still lower than the LM because of topic degradation in the News and Sentiment topic groups. However, DisCup GTA mitigates this topic degradation. DisCup shows better toxicity, grammar, and perplexity performance in the Sentiment and News topic groups, but the gaps are marginal in human evaluation (RQ6). More detailed results appear in Appendix F.4.

RQ6: Human Evaluation
We performed human evaluation on three metrics commonly used in previous studies (Gu et al., 2022; Zhang and Song, 2022): topic accuracy, toxicity, and fluency. Topic accuracy and toxicity are defined as in the automatic metrics. The evaluators read the text generated for the given topic and assigned 1 (match) or 0 (no match) according to whether a topic-matched text was generated. Toxicity was likewise rated 1 (toxic) or 0 (not toxic). Fluency evaluates how natural and fluent the text is, rated on a scale of 1 (not fluent at all) to 5 (very fluent). To reduce the bias of absolute evaluation, evaluators were asked to review all samples before evaluating. See Appendix H for evaluation details.

Compared to the automatic evaluation of the large-scale LM, the human evaluation results in Table 6 show greater degradation in topic accuracy. GeDi degrades significantly; DisCup shows better accuracy than GeDi but is still lower than the LM. In contrast, the gated toxicity avoidance shows a similar level of topic accuracy to the LM. GeDi shows the lowest toxicity, and GeDi GTA performs similarly. DisCup shows slightly higher toxicity than GeDi, and DisCup GTA has higher toxicity than DisCup. Fluency differs from the automatic evaluation: GeDi exhibits high fluency in the Emotion topic group, as with the small LM, but there is no noticeable difference among the methods on average. By supplementing grammatical elements (commas, quotation marks, etc.), GeDi is rated slightly better than the LM. Unlike in the automatic evaluation, DisCup GTA performs better than DisCup in fluency.

Conclusion
In this paper, we explored the effectiveness of state-of-the-art CTG methods, examining various aspects: topic consistency, grammar, and perplexity. Furthermore, we proposed a novel model-agnostic solution called the Gated Toxicity Avoidance. Our findings revealed that previous CTG methods exhibit varying degrees of performance degradation across topics, CTG methods, and scales. Regardless of these factors, the proposed gated toxicity avoidance successfully preserves the original language model's performance while achieving comparable toxicity reduction. Notably, it also demonstrates faster generation speed. These results highlight the potential of the gated toxicity avoidance as an effective and efficient solution for safe language models.

Limitations
One limitation of this paper relates to the scale of the language models. Recent advancements have led to the development of state-of-the-art language models with billions of parameters. However, due to the high computational cost of generating numerous samples with an LLM, we were restricted to LMs containing up to 762M parameters. In future work, it is crucial to explore whether the issues we found also occur in much larger LMs, as well as whether the gated toxicity avoidance can resolve them.
Furthermore, it is essential to consider additional indicators for evaluating the impact of CTG methods. Various metrics can evaluate different aspects, such as instruction-following, writing style (e.g., written or spoken, dialect), religious influence, racial sensitivity, and other topics. Developing effective methods for measuring wide and varied LM performance is crucial.
Although the probability of generating toxic text with our proposed method is extremely low, it is not entirely zero. In future work, our key goals are to eliminate toxic generation completely while preserving the original performance.

Ethics Statement
Controllable Text Generation (CTG) possesses the potential for misuse for unethical purposes. Nevertheless, toxicity avoidance using CTG is essential for the ethical utilization of LMs, and it stands as the most effective method currently available. We believe this study will make an outstanding contribution to the ethical application of LMs.
A Toxicity of the RLHF Model: Llama-2

To evaluate the toxicity of Reinforcement Learning from Human Feedback (RLHF), we employed the Llama-2-7b-chat-hf model, a published state-of-the-art model. Using Llama-2 with the prompts described in Table 8, we generated 1,000 texts for each emotion topic. In Table 7, we observe that Llama-2 with RLHF still has higher toxicity in the generated text than CTG methods (e.g., GeDi). It is noteworthy that, despite Llama-2 (7B)'s substantial size relative to smaller models such as GPT2-small (125M) and GPT2-large (775M), GeDi, DExperts, and DisCup exhibit notably lower toxicity than Llama-2-7b-chat-hf.

B Performance Degradation of the Text Style Transfer Method: ParaDetox

We employed ParaDetox (Logacheva et al., 2022), the state-of-the-art text style transfer method, to evaluate the performance degradation of text style transfer. Utilizing the toxicity classifier that served as our gate model, we classified toxic texts from the emotion dataset, transferred them to nontoxic text, and then classified the topic of the transferred texts. In Table 9, we observe that the topic preservation of ParaDetox depends on the topic and degrades notably on the anger topic.

C Hardware Details
Topic generation LMs were trained on a TPU-v2-8. The other tasks, training PPLM classification heads and text generation, were conducted with an NVIDIA RTX 3060. The in-context learning format is shown in Table 21.

E CTG Details

E.1 PPLM
We used balanced Jigsaw toxic comment classification data to train the classification head, which is a single MLP layer. Table 13 shows the hyperparameters used to train the PPLM classification head.

E.2 DExperts
We trained two experts (a toxic expert and a toxic anti-expert) for the small-scale DExperts efficiency experiment. The training dataset is the same as for PPLM. Table 14 shows the hyperparameters used to train the expert and anti-expert.

F Automatic Evaluation Details
The evaluation results for each topic are in the last sections. Small-scale LM results are in Appendix I, and large-scale LM results are in Appendix J. The actual generated outputs are in the 'output/' directory of the repository.

F.1 Classifiers
Table 15 shows which classifiers are used to evaluate the metrics.

F.2 Small-scale evaluation result according to CTG method strength
In Table 16, we observed the LM performance according to the CTG method strength. As the strength increases, toxicity reduces, but most of the other performance metrics also degrade. In the case of DExperts, topic accuracy improved when α was large (1.0). Text generation did not work properly when using a larger strength.

F.3 Evaluation Result According to the Gate Threshold

In Table 17, we observed the performance change of the gated GeDi according to the gate threshold. The lower the gate threshold, the lower the toxicity, but there was no significant difference in accuracy, grammar, or perplexity. When the gate threshold (θ) was set to 0.005, the gated GeDi showed the same toxicity as GeDi. Based on this, we used the same 0.005 threshold for the other CTG methods.

F.4 Automatic Evaluation Results of the Large-scale LM with CTG methods
Table 18 shows automatic evaluation results of large-scale LM experiments for each topic group.

G Efficiency Experiment Details
In the generation speed experiments with GeDi and DExperts, we used the same-size CC-LM and experts for a fair comparison: we used a 117M expert and anti-expert as GeDi's CC-LM instead of GeDi's original 345M CC-LM.

H Human Evaluation Details
We evaluated the texts with 10 English experts. We used a total of five methods: 1) a large-scale baseline LM, 2) GeDi, 3) DisCup, 4) GeDi GTA, and 5) DisCup GTA. Each method generated 100 texts per topic, from which two were randomly sampled for comparison. Evaluators assessed a total of 130 texts: five methods, 13 topics, and two samples per topic. The evaluation took 45 minutes on average. Our detailed results are shown in Table 19, and Fig. 6 shows the web UI where the human evaluation was performed.

Generation examples (Table 20, Prompt → Generated Example):
- "topic: positive" → "WOW! I have to go back to try the pho. I have had pho that has a ton of taste, and it was amazing."
- "topic: anger" → "i feel so angry with all of them"
- "topic: politics" → "federal ministers announce new plan to improve welfare to be given the best chance of saving £1bn every year by 2020 the latest in a series of measures designed to give states a greater level of welfare in return for their support. these measures have been announced at the autumn meeting of the uk s devolved administrations. their aim is to increase welfare by 2.4bn including £1bn for public-employee contracts to be delivered by 2020 the government said in its budget in advance. ministers also announced that the next generation of welfare would be made by 2021. public-sector workers would have"

Figure 2 :
Figure 2: The CTG method decreases toxic token probabilities but also alters the probabilities of negative tokens ('bad') and positive tokens ('good').

Figure 3 :
Figure 3: Operation of the LM, CTG method, and Gated Toxicity Avoidance (GTA)

Figure 5 :
Figure 5: Time to generate 100 texts of 100 tokens in length

13 https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

D Generation Details
D.1 Small-scale Topic Generation LM Details
Hyperparameters for training and generation are shown in Tables 10 and 11. Generation examples are shown in Table 20.
D.2 Large-scale In-context Learning Details
Hyperparameters for training and generation are shown in Table

Figure 6 :
Figure 6: Human evaluation web UI, which was developed using Streamlit

Table 1 :
LM's performance degradation by CTG method

Table 2 .
The LM easily generates toxic text for the anger topic. The texts generated by PPLM are not fluent at all, and DExperts generates overly long, unnatural text. GeDi's text quality is similar to the LM's, but its text does not convey anger.

Table 4 :
Performance of CTG methods varying scale

Table 5 :
Automatic evaluation result on large-scale LM

Table 6 :
Human evaluation result on large-scale LM

Table 7 :
Performance of the Llama-2-7b-chat-hf on emotion topic

Table 9 :
Topic accuracy after detoxification using Paradetox

Table 10 :
Hyperparameters for training small-scale Topic Generation LM

Table 11 :
Hyperparameters for generating text in the small-scale experiment

Table 12 :
Hyperparameters for generating text in the large-scale experiment

Table 13 :
Hyperparameters for training PPLM classification head

Table 14 :
Hyperparameters for training small-scale DExperts

Table 15 :
Published classifier to evaluate automatic evaluation metrics

Table 16 :
Change in performance of GeDi and DExperts according to strength (α, ω). The greater the value, the stronger the strength

Table 17 :
Change in performance of gated GeDi according to the gate threshold θ

Table 18 :
Automatic evaluation result of the large-scale LM with CTG method for each topic group

Table 19 :
Average human evaluation results by topic group

Table 20 :
Examples generated by fine-tuned small-scale LM

Table 21 :
In-context learning prompt examples for generating topic-related text in the large-scale experiment. The first line is the instruction, where the bold text indicates the topic to be generated. Shots are separated with '==='.

Table 29 :
Automatic evaluation result of the small-scale LM for surprise topic

Table 30 :
Automatic evaluation result of the small-scale LM for business topic

Table 31 :
Automatic evaluation result of the small-scale LM for entertainment topic

Table 32 :
Automatic evaluation result of the small-scale LM for politics topic

Table 33 :
Automatic evaluation result of the small-scale LM for sport topic

Table 34 :
Automatic evaluation result of the small-scale LM for tech topic

Table 35 :
Automatic evaluation result of the large-scale LM for negative topic

Table 36 :
Automatic evaluation result of the large-scale LM for positive topic

Table 37 :
Automatic evaluation result of the large-scale LM for sadness topic

Table 38 :
Automatic evaluation result of the large-scale LM for joy topic