Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts

Dialogue models trained on human conversations inadvertently learn to generate toxic responses. In addition to producing explicitly offensive utterances, these models can also implicitly insult a group or individual by aligning themselves with an offensive statement. To better understand the dynamics of contextually offensive language, we investigate the stance of dialogue model responses in offensive Reddit conversations. Specifically, we create ToxiChat, a crowd-annotated dataset of 2,000 Reddit threads and model responses labeled with offensive language and stance. Our analysis reveals that 42% of human responses agree with toxic comments, whereas only 13% agree with safe comments. This undesirable behavior is learned by neural dialogue models, such as DialoGPT, which we show are two times more likely to agree with offensive comments. To enable automatic detection of offensive language, we fine-tuned transformer-based classifiers on ToxiChat that achieve 0.71 F1 for offensive labels and 0.53 Macro-F1 for stance labels. Finally, we quantify the effectiveness of controllable text generation (CTG) methods to mitigate the tendency of neural dialogue models to agree with offensive comments. Compared to the baseline, our best CTG model achieves a 19% reduction in agreement with offensive comments and produces 29% fewer offensive replies. Our work highlights the need for further efforts to characterize and analyze inappropriate behavior in dialogue models, in order to help make them safer.

Introduction
Despite significant progress toward data-driven conversational agents (Ritter et al., 2011; Li et al., 2016), dialogue models still suffer from issues surrounding safety and offensive language. Previous research has shown that dialogue models can produce utterances that are gender and racially biased (Wolf et al., 2017; Sheng et al., 2020; Dinan et al., 2020a). For example, OpenAI's GPT-3 (Brown et al., 2020), a 175 billion parameter neural network, has been shown to generate dangerous advice, such as recommending that a hypothetical patient kill themselves. Presenting users with content generated by a neural network presents new risks, as it is difficult to predict when the model might say something toxic or otherwise harmful.
Footnote 1: Our code and corpus are available at https://github.com/abaheti95/ToxiChat
Figure 1: Example of an offensive comment by a Reddit user ("Because religious people who think they have license from God to do whatever they want are f***ing psychotics") followed by three dialogue models' responses (e.g., "Thank you for saying what I was thinking!"). We also show the stance labels for the responses with respect to the preceding offensive comment.
A key challenge for conversational AI is that toxic language is often context-dependent (Dinan et al., 2019a), making it notoriously difficult to detect; text that seems innocuous in isolation may be offensive when considered in the broader context of a conversation. For example, neural chatbots will often agree with offensive statements, which is undesirable (see examples in Figure 1). The solution employed by current systems, such as GPT-3 or Facebook's Blender chatbot (Roller et al., 2021), is to stop producing output when offensive inputs are detected (Xu et al., 2020). This is problematic, because today's toxic language classifiers are far from perfect and often generate false positive predictions. Rather than completely shutting down, for some applications it may be preferable to simply avoid agreeing with offensive statements. However, we are most excited about the future potential for models that can gracefully respond with non-toxic counter-speech (Wright et al., 2017), helping to defuse toxic situations.
To better understand stance usage in offensive contexts, we recruited crowd-workers on Amazon Mechanical Turk to annotate TOXICHAT, a corpus of Reddit conversations that include automatically generated responses from DialoGPT (Zhang et al., 2020) and GPT-3 (Brown et al., 2020). Posts and comments are annotated for targeted offensiveness toward a particular person or group (Sap et al., 2020). We also annotate stance toward each of the previous comments in the thread. Using our annotated corpus, we show that 42% of human responses in offensive contexts exhibit agreement stance, whereas only 13% agree with safe comments. An analysis of 5 million Reddit comment threads across six months similarly finds that users are three times more likely to agree with offensive comments. Furthermore, we find that neural chatbots learn to mimic this behavior: DialoGPT, GPT-3, and Facebook's Blender chatbot are all more likely to agree with offensive comments.
Finally, we present initial experiments with two controllable text generation (CTG) methods that aim to control the stance of automatically generated replies. Our experiments suggest that domain adaptive pretraining (Gururangan et al., 2020) reduces the number of contextually offensive responses, although this does not completely eliminate the problem, suggesting the need for further research on controllable stance in neural text generation.
Our main contributions include: (1) We release TOXICHAT, a corpus of 2,000 Reddit conversations that are augmented with automatic responses from DialoGPT and GPT-3, and annotated with targeted offensive language and stance. (2) We present an analysis of stance in offensive and safe contexts using TOXICHAT, demonstrating that neural dialogue models are significantly more likely to agree with offensive comments. (3) We show TOXICHAT supports training and evaluating machine learning classifiers for stance in toxic conversations. (4) We conduct preliminary experiments on controlling the stance of neural responses to prevent models from agreeing with offensive statements.

Creating the TOXICHAT Corpus
Addressing problematic responses in neural conversation requires both understanding whether a response is offensive and whether it agrees with previous offensive utterances. We develop an interface to annotate these two concepts in conversations that are enriched with dialogue model responses.
Formally, a thread consists of k utterances $\{u_1, u_2, \ldots, u_k\}$, where the last comment, $u_k$, is generated by a dialogue model. For each $u_i$, we collect annotations of:
1) Offensiveness: We consider $u_i$ offensive if it is intentionally or unintentionally toxic, rude or disrespectful towards a group or individual, following Sap et al. (2020). This is a binary choice, where $u_i$ is either Offensive or Safe. For offensive comments, we further annotate target groups from a predefined list comprising identity-based groups of people (e.g., people of various sexuality/sexual-orientation/gender, people with disabilities, people from a specific race, political ideologies, etc.) and specific individuals (e.g., public figures, Reddit users, etc.). We present the list of selected target groups in Figure 7 in the Appendix.
2) Stance: We annotate the stance of $u_i$ towards each previous comment, $u_j$, $\forall j < i$. Stance is viewed as a linguistically articulated form of social action, in the context of the entire thread and sociocultural setting (Du Bois, 2007; Kiesling et al., 2018). Stance alignment between a pair of utterances is annotated as Agree, Disagree or Neutral. Our primary interest is in analyzing the stance taken towards offensive statements. We assume that a user or a chatbot can become offensive by aligning themselves with an offensive statement made by another user (see Figure 1).
Additionally, for dialogue model responses $u_k$, we also annotate their grammatical and contextual plausibility given the context. A screenshot of our annotation interface is shown in Figure 8 in the Appendix.

Data Collection
Our annotated dataset contains labeled Reddit conversations extended with dialogue model responses (§3.1). We gather Reddit posts and comments (Baumgartner et al., 2020) that were written between May and October 2019. From these, we construct threads, each of which comprises a title, post, and subsequent comment sequence. We extract threads from two sources: (1) Any SubReddits: threads from all SubReddits, and (2) Offensive SubReddits: threads from toxic SubReddits identified in previous studies (Breitfeller et al., 2019) and Reddit community reports (Appendix B).
We are most interested in responses generated by dialogue models in offensive contexts. However, offensive language is rare in a random sample (Davidson et al., 2017; Founta et al., 2018). Hence, we implement a two-stage sampling strategy:
(1) Random sample: From both sources, randomly sample 500 threads (1,000 in total).
(2) Offensive sample: From the remaining threads in both sources, sample an additional 500 threads (1,000 in total) whose last comment is predicted as offensive by a classifier. Specifically, we used high-precision predictions (probability ≥ 0.7) from a BERT-based offensive comment classifier (Devlin et al., 2019) that was fine-tuned on the Social Bias Inference Corpus (SBIC; Sap et al., 2020). This classifier achieves ≈ 85.4 F1 for the Offend label on the SBIC dev set.
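The two-stage strategy above can be sketched for a single source as follows. This is a minimal illustration, not the authors' released code: `two_stage_sample` and `offensive_prob` are hypothetical names, and `offensive_prob` stands in for the fine-tuned BERT classifier by returning the predicted probability that a thread's last comment is offensive.

```python
import random
from typing import Callable, List, Sequence


def two_stage_sample(
    threads: Sequence[str],
    offensive_prob: Callable[[str], float],
    n_random: int = 500,
    n_offensive: int = 500,
    threshold: float = 0.7,
    seed: int = 0,
) -> List[str]:
    """Sample threads from one source: a random stage, then an offensive stage."""
    rng = random.Random(seed)
    shuffled = list(threads)
    rng.shuffle(shuffled)

    # Stage 1: uniform random sample.
    random_part = shuffled[:n_random]
    remaining = shuffled[n_random:]

    # Stage 2: from the remaining threads, keep those whose last comment is
    # scored as offensive with probability >= threshold (high precision).
    offensive_part = [t for t in remaining if offensive_prob(t) >= threshold]
    return random_part + offensive_part[:n_offensive]
```

Running this once per source (Any SubReddits and Offensive SubReddits) with the default sizes would yield the 2,000 threads described in the text, provided each source has enough high-probability offensive threads.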

Generating Dialogue Model Responses
To study the behavior of neural chatbots in offensive contexts, we extend the sampled 2,000 Reddit threads with model-generated responses. We consider the following pretrained models in this study:
DGPT: A GPT-2 architecture trained on 147M Reddit comment threads (Zhang et al., 2020). To reduce the risk of offensive behavior, the authors filtered out comment threads containing offensive phrases during training. We use the DialoGPT-medium model (345M parameters) as implemented by huggingface (Wolf et al., 2020).
GPT-3: Recently, OpenAI released API access to the GPT-3 language model, a model equipped to solve many tasks using text-based interaction without additional training (Brown et al., 2020). We follow the API guidelines to use GPT-3 as a dialogue agent. To generate a response for a comment thread, we provide GPT-3 with the prompt "The following is a conversation thread between multiple people on Reddit. U1:$u_1$ U2:$u_2$ ...", where $u_1, u_2, \ldots$ are the user comments. The model then predicts the next turn in the conversation. We select the largest GPT-3 model, 'davinci' with 175B parameters, in our data construction.
Blender: More recently, Facebook released Blender Bot, a 2.7B parameter dialogue model (Roller et al., 2021). Blender bot is first pretrained on 1.5B Reddit comment threads (Baumgartner et al., 2020) and later fine-tuned on the Blended Skill Talk (BST) dataset. The BST dataset contains 5K polite conversations between crowdworkers and aims to blend 3 conversational skills into one dataset: 1) engaging personality (Zhang et al., 2018b; Dinan et al., 2020b), 2) empathetic dialogue (Rashkin et al., 2019), and 3) knowledge incorporation (Dinan et al., 2019b).
We only include the first two models during annotation but compare our controlled text generation models against all three dialogue models in §6.1. Responses for DGPT and GPT-3 are generated on the comments part of the threads using nucleus sampling (p = 0.9) (Holtzman et al., 2019). Blender bot uses beam search with a beam size of 10 and a minimum beam sequence length of 20 to generate responses.
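The GPT-3 prompt described above can be assembled with a few lines of string formatting. The sketch below is illustrative only: `build_prompt` is a hypothetical helper, and joining the header and turns with single spaces (and ending with a trailing space) is an assumption read off the quoted prompt, since the exact separators are not stated.

```python
from typing import List

PROMPT_HEADER = (
    "The following is a conversation thread between multiple people on Reddit."
)


def build_prompt(comments: List[str]) -> str:
    """Format a comment thread as 'U1:<comment> U2:<comment> ...'."""
    turns = [f"U{i}:{c}" for i, c in enumerate(comments, start=1)]
    # Assumption: header and turns are separated by single spaces.
    return " ".join([PROMPT_HEADER] + turns) + " "
```

The resulting string would then be sent to the language model, which predicts the next turn of the conversation.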

TOXICHAT Corpus Statistics
We recruited crowd-workers from the Amazon Mechanical Turk platform to annotate the 2,000 threads from our corpus, with five workers annotating each thread. Overall statistics for TOXICHAT are presented in Table 5 in the Appendix. The inter-rater agreement was measured using Krippendorff's alpha (Krippendorff, 2011) and pairwise agreement, which were found to be α = 0.42 and 82.8%, respectively, for offensive labels and α = 0.22 and 85.1% for stance labels. We found that Krippendorff's alpha on the human-only responses is somewhat higher (α = 0.45 for offensive and α = 0.26 for stance) than on the chatbot-only responses (α = 0.32 for offensive and α = 0.18 for stance). Lower agreement for chatbot responses is likely due to their higher proportion of incoherent responses. Approximately 25% of DGPT responses and 12.5% of GPT-3 responses were identified as not plausible.
Due to the inherent complexity of our MTurk annotation task (see the screenshot of the crowd annotation interface in Figure 8 in the Appendix), we observe relatively low agreement levels. However, we find that aggregating worker annotations produces gold labels of sufficiently high quality for training and evaluating models (we consider the gold label as offensive or agreeing if at least 2 of the five workers agree). We manually verified the quality of the aggregate labels by comparing them with 40 threads carefully labeled by an in-house annotator. The F1 score of the aggregate annotations was 0.91 and 0.94 for offensive language and stance, respectively, providing a human upper-bound estimate for identifying stance and offensive comments.
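The aggregation rule stated above (a positive gold label when at least 2 of the five workers choose it) is simple enough to write down directly. The function name below is hypothetical, and the sketch covers only the binary decision described in the text; how ties between Agree and Disagree are resolved is not specified there and is not modeled here.

```python
from typing import Sequence


def aggregate_binary(votes: Sequence[bool], min_votes: int = 2) -> bool:
    """Return True if at least `min_votes` workers marked the item positive."""
    return sum(bool(v) for v in votes) >= min_votes
```

For example, `aggregate_binary([True, False, True, False, False])` marks a comment as offensive because two of the five workers flagged it.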

Stance Dynamics in TOXICHAT
Directly vs Contextually Offensive Replies.
Our key finding is that most offensive responses are directly offensive, but the occurrence of contextually offensive dialogue responses is also non-trivial. To elucidate, a dialogue model can produce offensive language either 1) directly, by disrespecting a target group, or 2) contextually, by agreeing with previous offensive utterances (Figure 1). The distribution of these offensive responses from both dialogue models and human reply comments is presented in Figure 2. Compared to humans, dialogue model responses are overall less offensive, and GPT-3 (389 out of 2,000) is more offensive than DGPT (179 out of 2,000).

Agreement with Offensive vs Safe comments.
We also plot the percentage of responses with the "Agree" stance towards previous offensive vs. safe comments in Figure 3. Surprisingly, we find that humans are more likely to agree with preceding offensive comments (41.62%) compared to safe comments (12.89%). Further analysis in Appendix E shows this is a consistent phenomenon based on an automated analysis of 5 million threads written over six months. We hypothesize that the higher proportion of agreement observed in response to offensive comments may be explained by the hesitancy of Reddit users to engage with offensive comments unless they agree. This may bias the set of respondents towards those who align with the offensive statement, essentially creating an echo chamber (Cinelli et al., 2021; Soliman et al., 2019). Regardless of the cause, this behavior is also reflected in dialogue models trained on public Reddit threads. In our human-annotated dataset, both DGPT and GPT-3 are almost two times more likely to agree with a previous offensive comment than with a safe comment. Further analysis using our automatic toxicity and stance classifiers is presented in Table 3.
Target-Group Distribution. In Figure 4, we visualize the distribution of target group frequencies. We see that Reddit user responses in threads (i.e., comments) are offensive towards both demographic groups (women, feminists, religious folks, LGBTQ folks, etc.) and specific individuals (celebrity, Reddit user). This mirrors the discrimination that people report facing in real life (RWJF, 2017). In contrast, dialogue model responses are more offensive towards individuals and women. On average, they respond more with personal attacks directed towards individuals as opposed to offending a certain demographic. We show some qualitative examples from our dataset in Figure 5.
Figure 4 (caption fragment): Target groups are organized in decreasing frequency in each decagon, starting clockwise from the top-right corner.
We identify profanity in offensive responses using Toxicity Triggers (Zhou et al., 2021), a lexicon of 378 "bad" words, phrases, and regular expressions. We find that only 3.35% of DGPT offensive responses contain profanity, compared to 39.59% of GPT-3 and 66.47% of Reddit users' offensive responses. Thus, filtering training instances containing offensive phrases reduces profanity in DGPT responses (Zhang et al., 2020). However, this filtering doesn't eradicate the model's offensive behavior.

Offensive Language and Stance Classification
We now investigate the predictability of Offensive Language (Offensive) and Stance (Stance) in conversations that include generated responses. Given a thread, $T = (u_1, u_2, \ldots, u_k)$, we predict Offensive labels $o_i \in \{0, 1\}$ for each utterance $u_i$, $i \le k$, and Stance labels $s_{i \leftarrow j} \in \{\text{Neutral}, \text{Agree}, \text{Disagree}\}$ for every pair of utterances $(u_i, u_j)$, $i < j \le k$.
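To make the prediction targets concrete, the sketch below enumerates the utterance pairs $(u_i, u_j)$ with $i < j \le k$ that receive a Stance label. The helper name is hypothetical; indices are 1-based to match the notation above.

```python
from itertools import combinations
from typing import List, Tuple


def stance_pairs(k: int) -> List[Tuple[int, int]]:
    """All 1-based index pairs (i, j) with i < j <= k for a thread of k utterances."""
    return list(combinations(range(1, k + 1), 2))
```

A thread with k utterances therefore yields k Offensive labels and k(k-1)/2 Stance labels; the adjacent pairs are those with j = i + 1.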

Model Architectures
In both classification tasks, we experiment with three model architectures: NBOW, BERT, and DGPT. In their formulation, ⊕ is the concatenation operator and ⊙ is element-wise multiplication.

Loss Functions
The standard cross-entropy loss function is used for the Offensive task. However, because Stance has an imbalanced class distribution (about 1:10 for Agree and 1:40 for Disagree), we use weighted cross-entropy (wCE) with weights (1, 100, 100) for {Neutral, Agree, Disagree}, respectively. We also experiment with Class-Balanced Focal Loss, CB_foc (Cui et al., 2019).
Formally, let $C = \{\text{Neutral}, \text{Agree}, \text{Disagree}\}$ and let $\hat{s} = (z_0, z_1, z_2)$ represent the unnormalized scores assigned by the model for each stance label. The CB_foc loss is then defined in terms of the correct stance label $y$, the number of instances $n_y$ with label $y$, and $p_m = \mathrm{sigmoid}(z_m)$; its focal term reduces the relative loss for well-classified instances. In our experiments, the hyperparameters β and γ are set to 0.9999 and 1.0, respectively.

Figure 5: Examples of dialogue model generated offensive personal attacks without explicit bad words (panels: offensive to celebrity; offensive to democrats and republicans).
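As a hedged illustration of the two Stance losses, the sketch below implements a per-example weighted cross-entropy and the class-balanced focal loss in the form we understand from Cui et al. (2019): $-\frac{1-\beta}{1-\beta^{n_y}} \sum_{m \in C} (1-p_m)^{\gamma} \log p_m$, where $p_m$ is the sigmoid of $z_m$ for the correct class and of $-z_m$ otherwise. The sign flip for non-target classes and the function names are assumptions taken from that formulation, not details confirmed by the text here.

```python
import math
from typing import Sequence


def _log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_cross_entropy(
    z: Sequence[float], y: int, weights: Sequence[float]
) -> float:
    """Softmax cross-entropy for one example, scaled by the weight of class y."""
    m = max(z)
    log_norm = m + math.log(sum(math.exp(v - m) for v in z))
    return -weights[y] * (z[y] - log_norm)


def cb_focal_loss(
    z: Sequence[float],
    y: int,
    class_counts: Sequence[int],
    beta: float = 0.9999,
    gamma: float = 1.0,
) -> float:
    """Class-balanced focal loss for one example (assumed Cui et al. form)."""
    # Re-weighting by the inverse "effective number" of samples of class y.
    cb_weight = (1.0 - beta) / (1.0 - beta ** class_counts[y])
    total = 0.0
    for m_idx, z_m in enumerate(z):
        # Assumption: logits of non-target classes are negated before the sigmoid.
        signed = z_m if m_idx == y else -z_m
        log_p = _log_sigmoid(signed)
        p = math.exp(log_p)
        # The focal factor (1 - p)^gamma shrinks the loss of confident predictions.
        total += (1.0 - p) ** gamma * log_p
    return -cb_weight * total
```

With weights (1, 100, 100), a misclassified Agree or Disagree example contributes 100 times the loss of an equally misclassified Neutral example, which is how wCE counteracts the class imbalance.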

Evaluation
We divide TOXICHAT into train, dev, and test sets using a 70-15-15 ratio. Identifying offensive reply utterances ($u_i$, $i \ge 2$) is challenging since it may require understanding the entire thread context. Hence, we evaluate the Offensive task using the offensive label F1 score for (1) all utterances, (2) first utterances, and (3) reply utterances in the thread. For the Stance task, we present per-class F1 as well as macro-F1 scores for all utterance pairs. We also report these metrics for adjacent pairs of utterances, i.e., for pairs $(u_i, u_{i+1})$, which are easier to predict. Hyperparameters and implementation details are presented in Appendix D.
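For reference, per-class and macro-averaged F1 can be computed as in the following sketch. It uses the standard definitions (F1 = 2TP / (2TP + FP + FN), macro-F1 as the unweighted mean over classes); the function names are hypothetical, and scoring a class with no gold or predicted instances as 0 is a convention chosen here.

```python
from typing import Dict, Sequence


def per_class_f1(
    gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, float]:
    """F1 score of each label, treating that label as the positive class."""
    scores = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred, labels)
    return sum(scores.values()) / len(labels)
```

Because macro-F1 weights the rare Agree and Disagree classes equally with Neutral, a classifier that predicts Neutral everywhere scores poorly on it despite high accuracy.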

Results and Analysis
We present the test set evaluation results of the Stance and Offensive tasks in Tables 1 and 2, respectively. The dev set evaluation metrics, presented in Tables 6 and 7 in the Appendix, show similar trends. The DGPT model with full thread context outperforms the BERT and NBOW models, which lack the global context.
For the Offensive task, the DGPT classifier achieves a higher F1 for detecting offensiveness in the first utterance (first-u F1) compared to BERT. This suggests that pretraining on in-domain Reddit comments improves performance. Augmenting our training set with SBIC data shows further improvement in all the metrics. However, even the best model achieves only 0.714 F1 on all utterances, showing that the task is challenging. Classification models perform worse on dialogue model responses within our dataset, as they can be incoherent but distributionally similar to natural language. To corroborate, the best model, DGPT+, gets 0.673 F1 on GPT-3 responses and 0.489 F1 on DGPT responses.
Stance classification models struggle to perform well, as evidenced by low F1 scores on detecting the 'Agree' and 'Disagree' stances. As found in prior work on stance detection (Yu et al., 2020), stance alignment is challenging because it is contextual, nuanced, and doesn't need high word overlap to convey implicit agreement or disagreement. For instance, a sarcastically worded question, like "Oh really?", can also show indirect disagreement. Training with weighted cross-entropy loss (wCE) boosts the performance of the DGPT classifier, yielding the highest 'Agree' label F1. However, its performance on Disagree classification is still poor. This issue is mitigated by training the DGPT classifier with class-balanced focal loss (CB_foc), which achieves the highest overall macro-F1.
Our data analysis confirms that dialogue models can generate some contextually offensive language. To steer the generation away from offensive content, we experiment with some preliminary strategies using controlled text generation (CTG). We consider the following three control attributes: (1) Offensive, to control safe or offensive response generation; (2) Stance, to control agreeing or neutral response generation towards the immediately preceding comment; and (3) Both Offensive and Stance, to control response generation with both control types.
To train CTG models, we need conversations with their last response labeled with control attributes. Therefore, we extract 5 million comment threads, similar to §3, and retrieve offensiveness and stance predictions using our best DGPT model-based Offensive and Stance classifiers (§5.4). To minimize classification errors, we use high-precision predictions by selecting appropriate thresholds for different classification probabilities. For each thread, we retain the Offensive prediction of the last utterance and the Stance prediction between the last two utterances.
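The high-precision labeling step can be pictured as a per-class probability threshold, with everything below the threshold left unlabeled. The sketch below is an assumption-laden illustration: the function name is hypothetical and the threshold values in the usage note are placeholders, since the thresholds actually selected are not given in this text.

```python
from typing import Dict, Optional


def high_precision_label(
    probs: Dict[str, float], thresholds: Dict[str, float]
) -> Optional[str]:
    """Return the most likely label if it clears its threshold, else None."""
    label = max(probs, key=probs.get)
    return label if probs[label] >= thresholds[label] else None
```

For instance, with placeholder thresholds of 0.7 for every class, `high_precision_label({"Neutral": 0.2, "Agree": 0.75, "Disagree": 0.05}, {...})` would keep the Agree label, whereas a prediction whose top probability is 0.5 would be discarded from the CTG training data.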
For all 3 proposed control experiments, we first create samples of L ≈ 250,000 high-precision classifier-labeled threads in the format $\{(x_i, ct_i, y_i)\}_{i=1}^{L}$. Here $x_i$ is the thread without the last utterance, $ct_i$ is the classifier-labeled control token, and $y_i$ is the last utterance, or response, to $x_i$. We discard 'Disagree' stance responses, as we only found about 10,000 high-precision disagreeing responses. Our final sample contains about 100,000 offensive responses and 75,000 agreeing responses. We further divide each control dataset of size L in a 95-5 ratio to get train and dev splits.

Modeling, Training and Testing Details
We use CTG techniques that were found effective in reducing toxicity in language models by Gehman et al. (2020): domain-adaptive pretraining (DAPT; Gururangan et al., 2020) and attribute conditioning (ATCON). For each CTG experiment, we fine-tune DialoGPT-medium on the train split for 3 epochs and tune hyperparameters using dev set perplexity.
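One plausible way to turn the labeled triples (thread, control token, response) into fine-tuning text is sketched below. Everything specific here is an assumption made for illustration: the control token strings such as `[SAFE]`, the placement of the control token between the context and the response (mirroring the order of the triple), and the reading of domain-adaptive pretraining as fine-tuning only on examples with the desired attribute. The `<|endoftext|>` separator is the end-of-text token DialoGPT uses between turns.

```python
from typing import Iterable, List, NamedTuple, Set


class Example(NamedTuple):
    context: str   # x_i: thread without its last utterance
    control: str   # ct_i: classifier-derived control token (hypothetical strings)
    response: str  # y_i: last utterance of the thread


def atcon_text(ex: Example, eos: str = "<|endoftext|>") -> str:
    """Attribute conditioning: expose the control token to the model as text."""
    return f"{ex.context}{eos}{ex.control}{ex.response}{eos}"


def dapt_subset(examples: Iterable[Example], wanted: Set[str]) -> List[Example]:
    """DAPT-style data selection: keep only examples with a desired attribute."""
    return [ex for ex in examples if ex.control in wanted]
```

Under this sketch, an attribute-conditioned model is steered at test time by supplying the desired control token after the context, whereas a DAPT-style model needs no token because it has only been fine-tuned on the desired kind of response.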
Our goal is to test the conversation models in offensive contexts, where they have a propensity to agree with offensive comments, hence, we sample a test set of 500 threads where the last utterance is offensive. Using this test set, our CTG models are compared against DGPT-medium, GPT-3, and Blender in both automatic and human evaluations.

Automatic Evaluation
An ideal dialogue model should have diverse, engaging and safe responses. Thus, we evaluate the responses generated by all the candidate conversation models using the following automatic metrics:
Distinct-1,2 is the ratio of unique unigrams and bigrams to the total.
% Bad is the percentage of generated responses containing profane words or phrases identified by Toxicity Triggers (Zhou et al., 2021, similar to §4).
% Off is the percentage of responses predicted offensive by the DGPT+ Offensive classifier.
% Agree and % Neutral are the percentages of generated responses predicted as agree or neutral, respectively, by the DGPT (CB_foc) Stance classifier. (Footnote 13: We predict the most likely class in automatic evaluation instead of the high-precision threshold prediction, which was used to generate fine-tuning data for controllable text generation.)
Table 3 contains the results from our automatic evaluations on 500 offensive test threads. Pretrained dialogue models DGPT and GPT-3 generate ≈ 30% and ≈ 41% offensive responses when tested in offensive contexts. On the other hand, fine-tuning dialogue models on safe conversations reduces their offensive behavior, as seen with Blender bot and the DAPT safe control responses. However, additional fine-tuning on safe conversations alone doesn't eliminate offensive behavior. Surprisingly, the Blender and DAPT safe control models both show higher agreement in offensive contexts than the DGPT baseline. Fine-tuning on both 'neutral' and 'safe' responses, as in the case of the DAPT neutral stance control model, simultaneously reduces the agreement while generating less offensive responses. The ATCON both control model also outperforms the DGPT baseline in the % Off and % Agree metrics, but with smaller margins than the DAPT neutral stance control model. Finally, our evaluation of Reddit user responses (last row in Table 3) also finds them to be highly offensive and agreeing in offensive contexts (see footnote 14).
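The Distinct-1,2 metric can be computed as in the sketch below. Two details are assumptions, since they are not spelled out in the text: tokens are obtained by whitespace splitting, and n-grams are pooled over all responses of a model before taking the ratio.

```python
from typing import Iterable


def distinct_n(responses: Iterable[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams, pooled over all responses."""
    total = 0
    unique = set()
    for response in responses:
        tokens = response.split()  # assumption: whitespace tokenization
        grams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

A model that keeps producing the same generic reply has few unique n-grams relative to the total and therefore a low Distinct score, which is why the metric serves as a rough proxy for diversity.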

Human evaluation
To validate the findings of our automatic evaluation presented above, we conduct an in-house human evaluation of 4 models: the DGPT baseline, Blender bot, DAPT neutral stance control, and ATCON both control. We exclude GPT-3 from this evaluation as we don't have access to its model parameters and can't fine-tune it for CTG. For every model response, we investigate its plausibility {Yes, No}, stance towards the last comment in the thread {Agree, Disagree, Neutral}, and offensiveness {Yes, No}. We recruit two annotators to evaluate model responses for a sample of 250 offensive test threads. The Cohen's Kappa and pairwise agreement for the two annotators are κ = 0.40 and 77.9% for plausibility, κ = 0.74 and 87.1% for stance, and κ = 0.76 and 92.3% for offensiveness. We resolve disagreements between annotators using a third in-house adjudicator. The results of the evaluation are presented in Table 4. According to the human evaluation, the DAPT model produces the fewest 'agree' responses and the most 'neutral' responses but is slightly more offensive than Facebook's Blender chatbot. Blender is the least offensive but most agreeing among all evaluated models. This implies that our offensive and stance classifiers don't generalize well to unseen dialogue model responses (Blender bot responses weren't present in the classifier training data). Other discrepancies between the human and automatic evaluations suggest that our stance classifier overestimates the 'neutral' stance and underestimates the 'agree' stance. After some manual investigation, we observe that the Blender chatbot mostly generates benign empathetic responses but agrees a lot in offensive contexts by using sentence starters like "I know right? ..." (examples in Figure 9).
Footnote 14: The test threads used to evaluate dialogue models didn't have a follow-up Reddit user response. Hence, we collect a different set of 500 offensive threads with a final user response.
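The two agreement statistics reported for the pair of annotators can be computed as follows. This sketch uses the standard definition of Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected from each annotator's label frequencies; the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def pairwise_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Fraction of items on which the two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    observed = pairwise_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if both annotators labeled independently at random
    # according to their own label frequencies.
    expected = sum(
        count_a[label] * count_b[label] for label in set(count_a) | set(count_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa can be much lower than raw agreement when one label dominates, which is consistent with plausibility showing 77.9% pairwise agreement but only κ = 0.40.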
The Blender chatbot also outperforms the CTG models in terms of plausibility, likely due to its larger model size. Similar to the finding of Gehman et al. (2020), the ATCON model is only slightly less offensive than the DGPT baseline and doesn't reduce the agreement rate. Therefore, we find fine-tuning on safe and neutral conversations, i.e., DAPT, to be the most effective technique for reducing offensive behavior in chatbots, but it is still far from perfect.

Related Work
Identifying Toxicity -Most works on identifying toxic language looked at isolated social media posts or comments while ignoring the context (Davidson et al., 2017;Xu et al., 2012;Zampieri et al., 2019;Rosenthal et al., 2020;Kumar et al., 2018;Garibo i Orts, 2019;Ousidhoum et al., 2019;Breitfeller et al., 2019;Sap et al., 2020;Hada et al., 2021;Barikeri et al., 2021). These methods are ill-equipped in conversational settings where responses can be contextually offensive.  Xu et al. (2020) train chatbots to avoid sensitive discussions by changing the topic of the conversation. In contrast, we tackle contextual offensive language by fine-tuning models to generate neutral and safe responses in offensive contexts.

Conclusion
To better understand the contextual nature of offensive language, we study the stance of human and model responses in offensive conversations. We create TOXICHAT, a corpus of 2,000 Reddit conversations augmented with responses generated by two dialogue models and crowd-annotated with targeted-offensive language and stance attributes. Classifiers trained on our corpus are capable of automatically evaluating conversations with contextually offensive language. Our analyses consistently find that Reddit users agree much more with offensive contexts. This trend could be explained by the tendency of social-media users to form echo chambers (Cinelli et al., 2021; Soliman et al., 2019). Consequently, dialogue models learn to mimic this behavior and agree more frequently in offensive contexts. However, fine-tuning dialogue models on cleaner training data with desirable conversational properties (safe and neutral responses with DAPT) can mitigate this issue to some extent. To further strengthen dialogue safety, future research on the detection of offensive context (Dinan et al., 2019a; Zhang et al., 2018a) and the subsequent generation of non-provocative counter-speech (Chung et al., 2019) is crucial.

Societal and Ethical Considerations
This paper tackles issues of safety of neural models, and specifically it attempts to understand how dialogue systems can help combat social biases and help make conversations more civil (Dinan et al., 2019a; Xu et al., 2020). For this purpose, we crowd-annotate a dataset of offensive conversations from publicly available Reddit conversations enriched with automatically generated responses. This study was conducted under the approval of the Institutional Review Board (IRB) of Georgia Institute of Technology. We paid crowd workers on Amazon's Mechanical Turk platform $0.8 per HIT and gave extra bonuses to annotators with high annotation quality. We estimate that the hourly pay of crowd workers was $12.26. The in-house annotators were paid $13 per hour. Finally, we note that classifiers trained on our dataset are fallible and should be used with careful consideration (Sap et al., 2019; Dixon et al., 2018).

A Data Preprocessing
As a data cleaning step, we replaced all URLs in the threads with a special token. We also limited the posts to ≤ 70 words and comments to ≤ 50 words. Only posts containing textual data were allowed.
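A minimal sketch of these cleaning steps is given below. Several details are assumptions, because the text does not specify them: the special token is written as `<URL>`, URLs are matched with a simple regular expression, words are counted by whitespace splitting, and the length limits are applied as a filter (discarding over-long threads) rather than by truncation.

```python
import re
from typing import List

URL_RE = re.compile(r"https?://\S+|www\.\S+")
URL_TOKEN = "<URL>"  # hypothetical name for the special token


def replace_urls(text: str) -> str:
    """Replace every URL-like span with the special token."""
    return URL_RE.sub(URL_TOKEN, text)


def keep_thread(
    post: str, comments: List[str], max_post: int = 70, max_comment: int = 50
) -> bool:
    """Keep threads with a textual post whose post and comments fit the limits."""
    if not post.strip():
        return False  # only posts containing textual data are allowed
    if len(post.split()) > max_post:
        return False
    return all(len(c.split()) <= max_comment for c in comments)
```

If truncation was intended instead of filtering, `keep_thread` would be replaced by a helper that keeps only the first 70 or 50 words of each text.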

C Comparison with SemEval-2017
We compare TOXICHAT with SemEval-2017 Challenge Task 8, a corpus of stance in Twitter threads discussing rumors. Specifically, we chart the word, sentence and label distribution of threads in both datasets in Table 5. Our corpus is bigger, with more and longer sentences on average. The threads in our corpus are longer and have more stance labels. Unlike SemEval-2017, which only annotates stance with respect to the first comment in the thread, we annotate the stance of all pairs of utterances.

D Model Implementation Details
We conduct our experiments of §5 using the huggingface transformers (Wolf et al., 2020) and pytorch libraries. All models are fine-tuned or trained using the Adam optimizer (Kingma and Ba, 2015) with learning rate $2 \times 10^{-5}$. We use 300d GloVe embeddings (Pennington et al., 2014) for the NBOW model, which is trained for 30 epochs. BERT and DGPT models are fine-tuned for 12 epochs. The DGPT model fine-tuned with class-balanced focal loss (CB_foc) for the Stance task performed better with learning rate $5 \times 10^{-5}$ and 16 epochs. The checkpoint with the best all-utterance F1 on the dev set is selected for models of the Offensive task, while the checkpoint with the best all-stance-pairs macro-F1 is selected for the Stance task. All experiments are done on a single Nvidia RTX 2080 Ti GPU.
Table 6: Dev set Offensive F1 scores for all utterances, first utterances and reply utterances in all threads. DGPT+ indicates the DGPT model trained on our dataset augmented with instances from SBIC (Sap et al., 2020).

E Classifier Analysis on Reddit
We make predictions using our best Offensive and Stance classifiers on the 5M Reddit threads downloaded for the controlled text generation (CTG) experiments (§6). Using the Offensive predictions, we identify the Offensive (and Safe) comments in the threads using P(Offensive) ≥ 0.7 (and P(Safe) ≥ 0.7). For each offensive and safe comment, we plot the distribution of its reply comment stance labels in Figure 6. Across the six months of data that we analyzed, our classifiers consistently found that Reddit users agree 3× more with offensive contexts than with safe ones. Moreover, our classifiers find more high-precision stance labels in safe contexts (only ≈ 9% ambiguous) compared to offensive contexts (≈ 27% ambiguous).
Figure 6: Monthly distribution of Stance classifier labels on responses to offensive vs. safe Reddit user comments. For Agree, Disagree and Neutral labels, we only use high-precision predictions. The predictions with low precision are labeled as Ambiguous in the figure. Reddit users consistently agree 3× more with offensive contexts than with safe ones.
Figure 7: List of all the target groups segmented into categories for better readability. "None" is also an option.