Language Model Detoxification in Dialogue with Contextualized Stance Control

To reduce toxic degeneration in pretrained Language Models (LMs), previous work on Language Model detoxification has focused on reducing the toxicity of the generation itself (self-toxicity) without considering the context. As a result, a type of implicit offensive language, where the generation supports offensive language in the context, is ignored. Different from the LM control tasks in previous work, where the desired attributes are fixed for generation, the desired stance of the generation depends on the offensiveness of the context. Therefore, we propose a novel control method for context-dependent detoxification that takes the stance into consideration. We introduce meta prefixes to learn the contextualized stance control strategy and to generate the stance control prefix according to the input context. The generated stance prefix is then combined with the toxicity control prefix to guide the response generation. Experimental results show that our proposed method can effectively learn context-dependent stance control strategies while keeping the self-toxicity of the underlying LM low.


Introduction
Large pretrained Language Models, such as GPT2 (Radford et al., 2019), can produce coherent, almost human-like text, but they are prone to generating offensive language, which hinders their safe deployment (Gehman et al., 2020). An extensive body of work has focused on detoxifying pretrained LMs (Dathathri et al., 2020; Krause et al., 2020; Qian et al., 2022). However, the problem becomes more complicated when LMs are applied to downstream Natural Language Generation (NLG) tasks, such as dialogue response generation. When applied in dialogue, uncontrolled models tend to generate toxic content. In addition to explicitly offensive utterances, Baheti et al. (2021) suggest that these models can also implicitly insult a group or individual by aligning themselves with an offensive statement, as shown in Figure 1. Therefore, to detoxify a pretrained LM applied in dialogue, the stance of the generated response needs to be taken into consideration. In a normal dialogue, we do not need to control the stance, but if the user inputs offensive language, the model should not respond with a positive stance. In other words, the eligible stance is context-dependent, and we need to consider the dialogue context.
One straightforward solution is to design a control flow with a binary offensive language classifier that takes the dialogue context as input. If the context contains offensive language, an NLG model with both toxicity control and stance control is used for response generation: we would like the self-toxicity to be low and the stance not to be supportive. On the other hand, if the context does not contain offensive language, the stance does not need to be controlled, so another NLG model with only toxicity control is used for response generation. However, this Classify-then-Generate framework has several limitations. First, it requires training a classifier and the controlled NLG models separately, introducing additional model parameters. Second, its performance relies heavily on the classifier, which can become a bottleneck.
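The Classify-then-Generate control flow described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; `classify_offensive` and the two generator functions are hypothetical placeholders passed in by the caller.

```python
# Sketch of the Classify-then-Generate baseline. All three components
# (classifier and the two controlled generators) are hypothetical
# placeholders, injected as function arguments.

def classify_then_generate(context, classify_offensive,
                           generate_toxicity_controlled,
                           generate_toxicity_and_stance_controlled):
    """Route the dialogue context to one of two controlled NLG models."""
    if classify_offensive(context):
        # Offensive context: control both toxicity and stance
        # (the response must not be supportive).
        return generate_toxicity_and_stance_controlled(context)
    # Non-offensive context: only self-toxicity needs to be controlled.
    return generate_toxicity_controlled(context)
```

Note that any classifier error at the first step propagates directly into the choice of generation model, which is exactly the bottleneck the paper identifies.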
To address these limitations, we propose a novel method for context-dependent control, where offensive language classification is learned implicitly together with the stance control, instead of being learned explicitly by a classifier. Following Li and Liang (2021) and Qian et al. (2022), we use a prefix, a small continuous vector prepended to the LM, to achieve controllability, and we further introduce hierarchical prefixes for contextualized control. More specifically, meta prefixes are introduced to control the underlying LM to generate the desired stance prefix according to the dialogue context, which is then combined with the toxicity prefix to guide the response generation. Therefore, the model can be trained end-to-end, without the bottleneck of a classifier. Besides the Language Modeling loss, we introduce two novel training loss terms to push the model to learn the context-dependent control strategy. Experimental results show that our method effectively controls the stance according to the offensiveness of the user utterance while keeping the self-toxicity at a low level. Compared with the baselines, our controlling method has significantly less effect on the stance of the generations when the input user utterance is not offensive, and when it is offensive, our method achieves a lower support stance score.
To conclude, our main contributions are: • We propose a novel control framework that combines context-dependent and context-independent control utilizing hierarchical prefixes.
• We introduce novel contrastive training objectives to guide the meta prefixes to learn the control strategy implicitly.
• Experiments show that our proposed method can effectively learn the contextualized stance control while keeping a low self-toxicity of the NLG model.

Related Work
To reduce the offensive content generated by LMs, previous research on offensive language detection can be utilized to filter out undesired generations.

Offensive Language Detection
Neural text classifiers, especially Transformer-based classifiers, achieve state-of-the-art performance in offensive language detection. In the SemEval-2020 offensive language identification task (Zampieri et al., 2020), the top-10 teams used large pretrained models, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLM-RoBERTa (Conneau et al., 2020), or an ensemble of them. For example, Wang et al. (2020) use the pretrained multilingual model XLM-R (Conneau et al., 2020) and fine-tune it with labeled offensive language data. Similarly, in the SemEval-2021 Toxic Spans Detection task, the top-ranked team (Zhu et al., 2021) used an ensemble of a BERT-based token labeling approach and a BERT-based span extraction approach, while the team of the second-best performing system (Nguyen et al., 2021) used an ensemble of two approaches utilizing a domain-adaptive pretrained RoBERTa on a toxic comment classification task (Hanu and Unitary team, 2020). Despite achieving SOTA performance, Kennedy et al. (2020) find that neural classifiers fine-tuned for hate speech detection tend to be biased towards group identifiers, so they propose a novel regularization technique based on post-hoc explanations extracted from fine-tuned BERT classifiers to encourage models to better learn the hate speech context.

Controllable Text Generation
Generations classified as offensive can simply be discarded. However, this post-filtering strategy using classifiers is inefficient, and there may exist cases where no safe choice exists within a fixed number of generations (Wallace et al., 2019). To circumvent this limitation, recent research has focused on controlling the generation of Transformer-based models at the source. Keskar et al. (2019) propose a novel pretrained model, CTRL. CTRL achieves controllability at the expense of training a large conditional LM with 1.6 billion parameters from scratch, which is costly. Therefore, later research proposes controlling methods that do not require updating the parameters of LMs. Dathathri et al. (2020) freeze the parameters of GPT2 but stack an additional attribute model on top of it. It guides generation by iteratively updating the LM's hidden representations using gradients back-propagated from the attribute model. Instead of using updated hidden representations to guide generation, Krause et al. (2020) use two conditional LMs to directly re-weight the next-token probability given by the LM during generation. Li and Liang (2021) also keep the LM parameters frozen, but optimize a small continuous task-specific vector (called a prefix) to achieve results competitive with fine-tuning on downstream NLG tasks, and Qian et al. (2022) further improve the prefix-tuning method with contrastive training objectives to achieve better attribute alignment.
All the aforementioned work assumes that the desired attributes are pre-selected before generation. However, in our dialogue detoxification task, the desired stance attribute depends on a hidden attribute of the input context, which leads to an additional challenge.

Method
Given a user utterance c, our goal is to guide the generation model to deliver a contextually safe response, which involves a context-independent attribute, self-toxicity, and a context-dependent attribute, stance. Each example in the training dataset X is a tuple (c, r, t_c, t_r, s_r), where c is the user utterance text, r is the response to the user utterance, t_c and t_r are the offensiveness annotations of c and r respectively, and s_r is the stance annotation of the response r. Following Li and Liang (2021) and Qian et al. (2022), we use a prefix, a small continuous vector prepended to the LM's hidden representations, to control the generation. Note that during both the training and the application of the prefixes, the parameters of the underlying LM are kept frozen, so only the prefixes need to be optimized and stored. Since the self-toxicity control is context-independent and offensiveness annotations are available for training, we first train the toxicity control prefixes, denoted as H_β, following the supervised method in Qian et al. (2022). H_β contains two prefixes: h_β^0 = H_β[0, :, :] and h_β^1 = H_β[1, :, :]. h_β^0 corresponds to non-offensive text and h_β^1 is the opposite. Both h_β^0 and h_β^1 are vectors of dimension M × D, where M is the length of a prefix and D is the size of the hidden dimension (see Appendix A for a detailed explanation).
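The shape of the toxicity control prefixes can be illustrated with a small sketch. The values of M and D below are illustrative placeholders, not the paper's actual hyperparameters:

```python
import numpy as np

# Shape sketch of the toxicity control prefixes H_beta.
# M (prefix length) and D (hidden size) are illustrative values only.
M, D = 10, 1024
H_beta = np.random.randn(2, M, D)  # two prefixes stacked along axis 0

h0_beta = H_beta[0, :, :]  # corresponds to non-offensive text, shape (M, D)
h1_beta = H_beta[1, :, :]  # the opposite attribute, shape (M, D)
```

Only these prefix parameters are optimized during training; the underlying LM stays frozen.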
However, the controlled generation model should avoid not only generating a response that is offensive by itself, but also generating a response that supports an offensive user utterance. More specifically, there are four cases of training examples in contextualized stance control: • Case 1 (t_c = 0, s_r = 0): The user utterance is not offensive. The response r does not support the user utterance. It satisfies our stance requirement.
• Case 2 (t_c = 0, s_r = 1): The user utterance is not offensive. The response supports the user utterance. It satisfies our stance requirement.
• Case 3 (t_c = 1, s_r = 0): The user utterance is offensive, but the response does not support it. Our stance requirement is still satisfied.
• Case 4 (t_c = 1, s_r = 1): The user utterance is offensive, and the stance of the response is supportive. Our stance requirement is violated.
Note that the offensiveness annotations of the user utterances are not available during evaluation, so the model needs to learn it along with the controllability. One way to learn offensive language detection is to train a binary classifier explicitly. However, the errors from the detector will propagate to the generated responses (Section 4.2). Instead of learning offensive language detection explicitly, we propose to learn it implicitly along with stance control. We introduce another set of prefixes, the meta prefixes H_α, to achieve contextualized controllability. Meta prefixes are trained to generate the stance control prefix according to the user utterance, which is then combined with the toxicity control prefix mentioned above to guide the generation, as shown in the upper part of Figure 2. Same as H_β, H_α contains two prefixes: h_α^0 = H_α[0, :, :] indicates that the stance of the response meets our requirement (Cases 1, 2, and 3 above), while h_α^1 = H_α[1, :, :] means that the stance of the response violates our requirement (Case 4). More formally, given the annotations t_c and s_r, we define a binary meta prefix index m_r as follows: m_r = 1 if and only if t_c = 1 and s_r = 1. In all other cases, m_r = 0.
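The four-case definition of the meta prefix index above reduces to a one-line rule, shown here as a small sketch:

```python
def meta_prefix_index(t_c: int, s_r: int) -> int:
    """m_r = 1 iff the user utterance is offensive (t_c = 1) AND the
    response supports it (s_r = 1); in all other cases m_r = 0."""
    return int(t_c == 1 and s_r == 1)
```

Cases 1-3 (the acceptable stances) all map to index 0, so a single meta prefix h_α^0 covers every acceptable configuration.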
Given a training example, we first infuse the corresponding prefixes with the stance and offensiveness attributes by encouraging them to reconstruct the response r. As illustrated in Figure 2, according to m_r, we select the corresponding meta prefix h_α^{m_r} and prepend it to the user utterance c as input for the LM. The output of the LM is a generated prefix of dimension M × D. Then the generated prefix is combined with the toxicity prefix h_β^{t_r} by element-wise addition. The resultant prefix is prepended to the user utterance c to guide the LM to generate the response r. Therefore, the first part of the training loss is the Language Modeling loss L_LM.
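The prefix combination step can be sketched with placeholder tensors (the shapes M and D below are illustrative, and the random arrays stand in for the LM's actual outputs and the trained prefix parameters):

```python
import numpy as np

# Sketch of combining the generated stance prefix with the toxicity
# prefix by element-wise addition. Shapes are illustrative (M=10, D=1024).
M, D = 10, 1024
generated_stance_prefix = np.random.randn(M, D)  # LM output given h_alpha^{m_r} and c
toxicity_prefix = np.random.randn(M, D)          # h_beta^{t_r}

combined_prefix = generated_stance_prefix + toxicity_prefix  # element-wise sum
# The combined prefix is then prepended to the user utterance
# representation to guide the frozen LM in generating the response r.
```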
The computation of log p(r_t | r_{<t}, c, t_r, m_r) is parameterized as log p_{α,β,γ}(r_t | r_{<t}, c, h_α^{m_r}, h_β^{t_r}), where γ is the set of fixed LM parameters, and α, β represent the learnable prefix parameters.
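The extraction omits the displayed equation for L_LM; a standard autoregressive negative log-likelihood consistent with the parameterization above would be:

```latex
\mathcal{L}_{LM} = - \sum_{t} \log p_{\alpha,\beta,\gamma}\left(r_t \mid r_{<t},\, c,\, h_{\alpha}^{m_r},\, h_{\beta}^{t_r}\right)
```

This is a reconstruction from the surrounding text, not the paper's verbatim equation.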
Although L_LM infuses the corresponding prefixes with the stance and offensiveness attributes, the offensiveness annotation of the user utterance, t_c, is ignored in L_LM, and thus the meta prefixes are not pushed to learn the offensiveness of the user utterance and to rely on it to control the stance. To address this problem, we introduce two additional contrastive loss terms utilizing the annotation t_c.
In order to push the meta prefixes to learn about the stance requirement when the user utterance is offensive, we add a stance contrastive loss L_s to differentiate between Case 3 and Case 4 stated above. As shown in Figure 2, each offensive user utterance in the training dataset is combined with the two meta prefixes separately, and the distance between the two generated prefixes is used to calculate the stance contrastive loss L_s.
where m is a pre-set margin and 1 is the indicator function: 1_{t_r=1} = 1 if t_r = 1 and 1_{t_r=1} = 0 if t_r = 0. L_s is only calculated when the input example consists of an offensive user utterance (t_c = 1). d_s is the distance between the generated prefixes, as in the equation below.
where f_{α,γ} is the function corresponding to the underlying LM controlled by the meta prefixes, and f_{α,γ}(h_α^{m_r}, c) is the generated prefix given the meta prefix h_α^{m_r} and the user utterance c. ¬m_r = 1 − m_r is the opposite of m_r. Optimizing L_s pushes the prefix generated given h_α^{m_r} and c and the one generated given h_α^{¬m_r} and c away from each other by a margin m. In other words, it encourages the meta prefixes to learn that when the user utterance is offensive, the two meta prefixes should generate opposite stance prefixes. By combining L_LM and L_s, the meta prefixes are pushed to learn that when the user utterance is offensive, h_α^0 is supposed to generate a non-supportive stance prefix, while h_α^1 is supposed to generate a supportive stance prefix.
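Since the displayed equations for L_s and d_s are missing from this extraction, the following is one plausible instantiation of the margin-based stance contrastive loss, assuming a Euclidean distance between the two generated prefixes; the paper's exact formula may differ:

```python
import numpy as np

def stance_contrastive_loss(prefix_m, prefix_not_m, margin=0.8):
    """Hypothetical hinge form of L_s: push the prefix generated with
    h_alpha^{m_r} and the one generated with h_alpha^{not m_r} at least
    `margin` apart. Euclidean distance is an assumption here."""
    d_s = np.linalg.norm(prefix_m - prefix_not_m)
    return max(0.0, margin - d_s)
```

When the two generated prefixes coincide, the loss equals the full margin; once they are pushed further apart than the margin, the loss vanishes.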
Since the stance requirement is different when the user utterance is offensive and when it is inoffensive, the meta prefixes also need to learn about the context-dependency and about what stance to achieve when the user utterance is not offensive. We achieve this by introducing another context contrastive loss L_c to differentiate between the aforementioned Case 3 and the union of Cases 1 and 2. As illustrated in Figure 3, the meta prefix h_α^0 is combined with offensive user utterances and non-offensive user utterances respectively, and the distance between the generated prefixes in these two cases is used to calculate the context contrastive loss. e_0 is the average of the generated prefixes given the meta prefix h_α^0 and a non-offensive user utterance, while e_1 is the average of the generated prefixes given the meta prefix h_α^0 and an offensive user utterance. Since h_α^0 corresponds to the acceptable stances, L_c teaches the model that, when using h_α^0 to guide generation, the generated stance prefix should be different when the user utterance is offensive (t_c = 1) and when it is not offensive (t_c = 0). This loss term thus pushes the meta prefixes to consider the user utterance and to differentiate between offensive and inoffensive user utterances implicitly.
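The equation for L_c is likewise missing from this extraction; a plausible sketch, assuming the same hinge-on-distance form as L_s applied to the averaged prefixes e_0 and e_1, is:

```python
import numpy as np

def context_contrastive_loss(prefixes_nonoff, prefixes_off, margin=0.8):
    """Hypothetical sketch of L_c. e0 / e1 average the prefixes generated
    with h_alpha^0 on non-offensive / offensive user utterances; a hinge
    pushes the two averages apart. The exact formula is an assumption."""
    e0 = np.mean(prefixes_nonoff, axis=0)  # average over non-offensive contexts
    e1 = np.mean(prefixes_off, axis=0)     # average over offensive contexts
    d_c = np.linalg.norm(e0 - e1)
    return max(0.0, margin - d_c)
```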
The final training loss L is a weighted sum of the three loss terms described above.
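The weighted sum can be written as a one-line sketch. The default weights below match the ω values reported in the Experimental Settings, but which weight attaches to which loss term is an assumption here:

```python
def total_loss(l_lm, l_s, l_c, w1=0.5, w2=0.3, w3=0.4):
    """Weighted sum of the three training loss terms. The weight values
    mirror omega_1, omega_2, omega_3 from the Experimental Settings;
    the weight-to-term mapping is assumed, not confirmed by the paper."""
    return w1 * l_lm + w2 * l_s + w3 * l_c
```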
After training, the meta prefix h_α^1 and the toxicity prefix h_β^1 do not need to be saved. Only h_α^0 and h_β^0 are used to guide generation during evaluation.

Experimental Settings
We use a large pretrained response generation model, DialoGPT (Zhang et al., 2020), as the backbone model in our experiments. We use DialoGPT instead of GPT2 because DialoGPT is pretrained on Reddit data for conversational response generation, and its pretraining data excludes content from toxic subreddits or containing offensive language, identified by phrase matching against a large blocklist. As a result, the self-toxicity of DialoGPT tends to be relatively low (Baheti et al., 2021). In our experiments, we use the DialoGPT-medium model (345M parameters) as implemented by Hugging Face (Wolf et al., 2020).
Besides the uncontrolled DialoGPT model, we experimented with the following methods:

Prefix-Tuning (Li and Liang, 2021): We train a prefix to guide the generation towards low toxicity and appropriate stances. Therefore, we filter out the training examples where the responses are annotated as offensive or the response stance violates our requirements (Case 4 in Section 3). The remaining training examples are considered safe ones and are used to train the prefix. During training or generation, the prefix is prepended to the hidden states of the input user utterance.

Contrastive Prefixes (Qian et al., 2022): We train two prefixes simultaneously. One prefix guides the model to generate safe responses, while the other guides the model towards unsafe responses. Same as in Prefix-Tuning, the unsafe responses are either offensive themselves or support an offensive user utterance, while the other responses are considered safe. Thus the training dataset is separated into two categories, corresponding to the two prefixes. We set the weight of the Language Modeling loss to 0.8 and the weight of the discriminative loss to 0.2. The position of the prefix is the same as above, and the prefix corresponding to safe responses is used for evaluation.

Cls-Gen Flow: This is the Classify-then-Generate two-step control flow mentioned in Section 1. A RoBERTa (Liu et al., 2019) classifier is fine-tuned for offensive language detection. We also train the toxicity control prefixes and stance control prefixes separately following Qian et al. (2022). The weight of the Language Modeling loss is 0.8 and the weight of the discriminative loss is 0.2. During the evaluation, if the classifier predicts the user utterance as offensive, the toxicity control prefix and the non-supportive stance prefix are concatenated and prepended to the input user utterance for generation. If the classifier predicts it as non-offensive, only the toxicity control prefix is prepended to the input user utterance for generation.

Ours: We reuse the toxicity prefixes trained in the Cls-Gen Flow method to initialize H_β in our method. During evaluation, the meta prefix h_α^0, which corresponds to the contextually safe stance, and the toxicity control prefix h_β^0, which corresponds to low toxicity, are used to guide generation. We set ω_1 = 0.5, ω_2 = 0.3, ω_3 = 0.4, and m = 0.8.
For each testing example, 10 completions are generated and evaluated. Other hyperparameters and the training details are listed in Appendix A.
We use the ToxiChat dataset collected by Baheti et al. (2021) to train and evaluate our method. Intended for analyzing the stance of neural dialogue generation in offensive contexts, ToxiChat is a crowd-annotated English dataset of 2,000 Reddit threads and model responses labeled with offensive language and stance. Each example in the dataset consists of a list of Reddit user utterances and two machine-generated responses (one from DialoGPT and the other from GPT3 (Brown et al., 2020)), along with the stance and offensiveness annotations of each utterance and each machine-generated response. We use the same train, dev, and test splits as Baheti et al. (2021). In each training example, the last utterance in the utterance list is taken as the input text for all the experimented methods, and we use the machine-generated responses in the dataset for training. In the original dataset (Baheti et al., 2021), the offensiveness annotation is binary and the stance is annotated as agree, disagree, or neutral. Since our method assumes a binary stance, the data with a neutral stance can either be discarded or mixed with the data with a disagree stance. In our preliminary experiments, we find that mixing the two stances confuses the prefix-based models about the stances, while simply discarding the neutral stance data yields better results. Therefore, we discard the training examples with a neutral stance when training prefixes in all the baselines and our methods. When training the offensive language classifier in the Cls-Gen Flow method, we did not discard the neutral stance data because we find that keeping them results in better classification performance. In both cases, the training datasets are manually balanced with oversampling. We evaluate the methods from three aspects: stance alignment, self-toxicity, and linguistic quality. The linguistic quality is evaluated using the perplexity calculated by GPT2-XL (1.5B parameters). Self-toxicity refers to the offensiveness of the response
itself without consideration of the input user utterance. The Google Perspective API is used for self-toxicity evaluation. For stance evaluation, we use the GATE Cloud English stance classifier service (Li and Scarton, 2020), where the possible stances are support, deny, query, and comment. By controlling the response generation, we expect the toxicity of the generations to be low while the linguistic quality is not sacrificed much, no matter whether the user utterance is offensive or not. However, the controlling methods should have different effects depending on the offensiveness of the user input.
When the input user utterance is not offensive (t_c = 0), the controlling methods should not affect the stance of the generations. In other words, we would like the response stance of the controlled model to be close to that of the uncontrolled model. Therefore, when the user utterance is not offensive, we quantify the Stance Shift of a generated response r′ as follows: where Y_s is the set of stance classes and f_θ is the stance evaluation function. r′_dgpt is the response generated by the uncontrolled DialoGPT. We report both the 4-way stance shift and the 3-way stance shift. In the 4-way stance shift, Y_s consists of the 4 stance categories of the stance classification API mentioned above. In the 3-way stance shift, we do not differentiate between the stances comment and query, since both can be considered neutral stances.
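The displayed Stance Shift equation is missing from this extraction. Given the definitions of Y_s, f_θ, and r′_dgpt, one plausible reconstruction is an L1 (total variation) distance between the stance-class score distributions of the controlled and uncontrolled responses; this form is an assumption, not the paper's verbatim formula:

```python
def stance_shift(scores_controlled, scores_uncontrolled):
    """Hypothetical Stance Shift: sum over stance classes y in Y_s of
    |f_theta(r')[y] - f_theta(r'_dgpt)[y]|. Both arguments are dicts
    mapping stance class names to scores; the L1 form is assumed."""
    assert set(scores_controlled) == set(scores_uncontrolled)
    return sum(abs(scores_controlled[y] - scores_uncontrolled[y])
               for y in scores_controlled)
```

For the 3-way variant, the comment and query scores would be merged into a single neutral class before computing the shift.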
On the other hand, when the input user utterance is offensive (t c = 1), the controlling methods are expected to lower the supportive stance rate while increasing the non-supportive stance rate.Therefore, we compare the support stance scores achieved by each method.

Results
We compare our method to the aforementioned baselines. The experimental results are shown in Table 1. The results show that simply separating the training dataset into two categories and using Prefix-Tuning or Contrastive Prefixes for training cannot achieve the desired controllability. This shows the complexity and difficulty of our control task, where the stance control is context-dependent. Cls-Gen Flow utilizes the offensiveness annotations to train an offensive language classifier and also a toxicity control prefix, while the stance annotations are used to train the stance control prefix. However, it does not effectively guide the response generation towards our desired attributes. The reason is twofold. On one hand, the offensive language classifier does not make perfect predictions. It achieves an accuracy of 78.7% and an F1 of 70.9% on the testing dataset. Therefore, the offensive language classifier introduces mistakes from the beginning, which are then propagated to the generated responses. On the other hand, the trained toxicity control prefix has an implicit bias on stance, although we have manually balanced the training dataset. It achieves a support stance score of 0.403 and a deny stance score of 0.239 on the testing dataset. This results in a larger stance shift when the toxicity control prefix is used to guide generation given non-offensive user input, and when it is concatenated with a non-supportive stance prefix to guide generation, the support stance score is not lowered significantly, as shown in Table 1.
Instead of relying on an offensive language classifier to explicitly enforce the context-dependent control, our method implicitly pushes the model to learn the rule by introducing the meta prefixes and the novel contrastive loss terms described in Section 3. The results show that our method can effectively control the stance according to the offensiveness of the user utterance while keeping the self-toxicity at a low level. When the user utterance is non-offensive, our method achieves a low stance shift, and when the user utterance is offensive, the support stance score is lowered significantly. This indicates that our method learns to implicitly analyze the offensiveness of the user utterance and apply different control strategies accordingly. In addition, the perplexity score shows that our method achieves controllability without sacrificing much linguistic quality.
The ablation study (Table 1) shows that both the stance contrastive loss L_s and the context contrastive loss L_c are critical for our model to learn about the user utterance. Removing the context contrastive loss L_c results in a significant increase in both the 3-way and 4-way stance shifts, although the support stance score is close to that of the full model. This indicates that the model ignores the offensiveness of the user utterance and generates more responses with a denying stance and fewer responses with a supportive stance in both cases. This problem is further exacerbated by the additional removal of the stance contrastive loss L_s. We also find that removing the context contrastive loss results in slightly higher toxicity and much higher perplexity. One possible reason is that without L_c, the part of the training dataset where the user utterance is not offensive (t_c = 0) is not fully utilized for training, leading to slightly worse self-toxicity and a loss of linguistic quality. Table 2 shows the detailed stance and toxicity scores. Examples of the generated responses are shown in Table 3.

Conclusion
In this work, we propose a novel method for contextual detoxification, where a context-dependent attribute (stance) and a context-independent attribute (toxicity) are controlled within a unified hierarchical prefix framework. Experimental results show that our proposed method can successfully guide an NLG model to generate safer responses with the stance taken into consideration. Beyond the dialogue detoxification task we experimented with, our proposed framework can be extended to other combinations of context-dependent and context-independent control.

Limitations
Context is important for identifying offensive language, especially for implicit offensive language.
In this work, we consider one category of contextual offensive language, where a response supports a previous offensive utterance in the context. Other categories of contextual offensive language, such as sarcasm and circumlocution, are not covered in this work. Future work in this area may cover more types of contextual offensive language. Although experimental results show that our method can effectively lower the support stance score of the generations given an offensive input, it is not guaranteed that the model with our controlling method will always produce a generation with a safe stance.

Ethical Considerations
Our proposed method is intended for context-dependent detoxification with stance control. It can be extended to other combinations of context-dependent and context-independent control. However, it is not intended for hallucination or factuality control. After training, the prefixes h_α^1 and h_β^1 should be discarded, and only h_α^0 and h_β^0 should be used for evaluation or application. h_α^1 and h_β^1 should not be used to generate offensive language or responses supporting offensive language. Due to the sensitive nature of this work, examples in Figure 1 and Table 3 contain offensive language. We would like to clarify that the examples shown in this paper do not represent any opinion of the authors.

Figure 1 :
Figure 1: An illustration of two types of offensive responses. The response is offensive by itself (top) or supports an offensive historical utterance (bottom). Offensive words are masked.

Figure 2 :
Figure 2: An illustration of the training method and two loss terms: L_LM and the stance contrastive loss L_s. m_r denotes the meta prefix index of the training example, as defined in Section 3. ¬m_r is the opposite of m_r. h_α^{m_r} = H_α[m_r, :, :] is a meta prefix. h_β^{t_r} = H_β[t_r, :, :] is a toxicity control prefix. ⊕ means element-wise addition. The underlying Language Model is pretrained and its parameters are frozen during training.
The final training dataset consists of 1,000 examples for prefix training and 4,794 examples for stance classification training.The development and testing datasets consist of 300 examples each.
(Table 3 excerpt)
It was just a random pic of the girl
Contra. Prefixes: A friend of mine has a pair of them.
Cls-Gen Flow: I think she was wearing a dress
Ours: She was very cute and had the very best design of the UFC.

Table 3 :
Examples of the generations. The first column contains user utterances. The third column contains the generated responses. Contra. Prefixes: Contrastive Prefixes.