Generate, Prune, Select: A Pipeline for Counterspeech Generation against Online Hate Speech

Countermeasures that effectively fight the ever-increasing volume of hate speech online without curbing freedom of speech are of great social interest. Natural Language Generation (NLG) is uniquely capable of developing scalable solutions. However, off-the-shelf NLG methods are primarily sequence-to-sequence neural models, and they are limited in that they generate commonplace, repetitive and safe responses regardless of the hate speech (e.g., "Please refrain from using such language.") or irrelevant responses, making them ineffective for de-escalating hateful conversations. In this paper, we design a three-module pipeline approach to effectively improve the diversity and relevance of generated counterspeech. Our proposed pipeline first generates diverse counterspeech candidates with a generative model, then filters out ungrammatical candidates using a BERT model, and finally selects the most relevant counterspeech response using a novel retrieval-based method. Extensive experiments on three representative datasets demonstrate the efficacy of our approach in generating diverse and relevant counterspeech.


Introduction
Hate speech is any form of expression through which speakers intend to vilify, humiliate, or incite hatred against a group or a class of persons on the basis of some characteristics, including race, religion, skin color, sexual identity, gender identity, ethnicity, disability, or national origin (Ward, 1997; Nockleby, 2000). Its ever-growing presence on the Internet makes it a problem of significant societal concern (Williams, 2019); effective countermeasures call for not blocking freedom of speech by means of censorship or active moderation (Gagliardone et al., 2015; Strossen, 2018). A very promising countermeasure is counterspeech: a response that provides non-negative feedback through fact-bound arguments and broader perspectives to mitigate hate speech and foster a more harmonious conversation on social platforms (Schieb and Preuss, 2016; Munger, 2017; Mathew et al., 2018; Shin and Kim, 2018). Counterspeech as a measure to combat abusive language online is also promoted in active campaigns such as "Get The Trolls Out". 1

Table 1: A hate speech instance with an expert-written counterspeech, a commonplace response, and an irrelevant response.
Hate Speech: I am done with Islam and isis. All Muslims should be sent to their homeland. Britain will be better without their violence and ideology.
Expert: I agree that ISIS is an evil aberration, but to extend this to include up to 3 million people just in the UK is just plain silly.
Commonplace: Hate speech is not tolerated. Please review our user policies. Thank you for your cooperation.
Not relevant: Use of the r-word is unacceptable as it demeans and insults people with disabilities.

What makes an effective counterspeech? Informed by psychosocial and linguistic studies on counterspeech (Mathew et al., 2019b) and the large number of effective counterspeech examples created by crowdsourcing (Qian et al., 2019) and by experts (Chung et al., 2019), we identify that effective counterspeech should be diverse and relevant to the hate speech instance. Diversity is the requirement that a collection of counterspeech should not consist largely of commonplace, repetitive and safe responses made without regard to the target or type of hate speech (e.g., "Please refrain from using such language."). Relevance refers to the property that counterspeech should directly address and target the central aspects of the hate speech, enabling coherent conversations rather than irrelevant or off-topic ones (e.g., the hate speech instance targets an ethnic group, while the counterspeech talks about people with disabilities). Comparative examples are shown in Table 1, where we list some counterspeech that lack diversity or relevance.
While NLG systems (in particular, sequence-to-sequence models) offer much promise for generating text at scale (Sutskever et al., 2014; Zhu et al., 2018; Lewis et al., 2020), the quality of their outputs is modest in the context of the requirements identified above. Indeed, Qian et al. (2019), the only existing quality work on counterspeech generation, has highlighted their limitations: the responses are largely commonplace and sometimes irrelevant. These limitations apply more broadly to general conversational language generation tasks, arising primarily from the intrinsic end-to-end training nature of a single sequence-to-sequence architecture (Sordoni et al., 2015; Li et al., 2016; Serban et al., 2017; Jiang and de Rijke, 2018). Model refinements have addressed these limitations individually: improved diversity (Li et al., 2016; Xu et al., 2018) or improved relevance (Gao et al., 2019). However, combining these improvements into a single model is not straightforward. Such is the goal of this paper.

We tackle the problem from an entirely novel angle by proposing a three-module pipeline approach, Generate, Prune, Select (denoted as "GPS"), to ensure the generated sentences adhere to the required properties of diversity and relevance. First, the Candidate Generation module generates a large number of diverse response candidates using a generative model; the resulting large candidate pool accounts for improved diversity. Second, the Candidate Pruning module prunes ungrammatical candidates from the pool. Last, from the pruned candidate pool, the Response Selection module selects the most relevant counterspeech for a given hate speech instance by a novel retrieval-based response selection method.
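The three-module flow can be sketched end to end. The function names and the toy grammaticality and relevance scorers below are illustrative stand-ins, not the paper's actual components (which use a VAE, a fine-tuned BERT classifier, and a learned embedding mapping, respectively):

```python
def generate_candidates(counterspeech_pool, n_samples=0):
    """Stand-in for Candidate Generation: a real system would enlarge the
    pool with n_samples sentences decoded from a trained generative model."""
    return list(counterspeech_pool)

def is_grammatical(sentence):
    """Toy stand-in for the BERT grammaticality classifier (Candidate Pruning)."""
    return sentence.endswith(".")

def relevance(hate_speech, candidate):
    """Toy stand-in for retrieval-based Response Selection:
    Jaccard word overlap as a crude relevance proxy."""
    h = set(hate_speech.lower().split())
    c = set(candidate.lower().split())
    return len(h & c) / (len(h | c) or 1)

def gps(hate_speech, counterspeech_pool):
    candidates = generate_candidates(counterspeech_pool)              # Generate
    candidates = [c for c in candidates if is_grammatical(c)]         # Prune
    return max(candidates, key=lambda c: relevance(hate_speech, c))  # Select
```

Any of the three stand-ins can be swapped for the actual module without changing the pipeline's shape.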
We demonstrate the efficacy of GPS, the first pipeline approach for counterspeech generation, by a systematic comparison with other competitive NLG approaches in generating diverse and relevant counterspeech. We derive new state-of-the-art results on three benchmark datasets by showing improved diversity and relevance using both automatic and human evaluations.

Proposed Model
We assume access to a corpus of labeled pairs of conversations D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i is a hate speech instance and y_i is the appropriate counterspeech as decided by experts or by crowdsourcing. The goal is to learn a model that takes as input a hate speech instance x and outputs a counterspeech y. A motivating example is shown in Table 1. Most importantly, we aim at generating diverse and relevant counterspeech. We present an overview of the model in Figure 1 and describe each module in detail below.

Candidate Generation
The main goal of this module is to create a diverse candidate pool for counterspeech selection. We extract all available counterspeech instances Y = [y_1, y_2, ..., y_n] from the training dataset and enlarge the counterspeech pool with a generative model. Specifically, we utilize an RNN-based variational autoencoder (Bowman et al., 2016) that incorporates the global distributed latent representations of all sentences to generate candidates. Both the encoder and the decoder have two layers with 512 nodes each, and we use two highway network layers (Srivastava et al., 2015) to facilitate robust training. Like all other generative models, it aims to maximize the lower bound of the likelihood L of generating the training data Y:

L(θ; y) = −KL(q_θ(z|y) || p(z)) + E_{q_θ(z|y)}[log p_θ(y|z)],

where θ denotes all parameters of the generative model, z is a latent variable having a Gaussian distribution with a diagonal covariance matrix, p denotes the prior distribution, q denotes the posterior distribution, and KL denotes the KL-divergence (Kullback and Leibler, 1951). In the training process, we apply the KL annealing technique (Bowman et al., 2016) to prevent the undesirable stable-equilibrium problem (i.e., the first term of the likelihood function, KL(q_θ(z|y) || p(z)), becomes zero). Upon completion of training, we generate candidates by simply decoding from noise sampled from a standard Gaussian distribution (i.e., z ∼ N(0, I)). As demonstrated by Bowman et al. (2016) (and as inferred from our own experiments described in Section 3), the generative model not only captures holistic properties of sentences such as style, topic, and high-level syntactic features, but also produces diverse candidates.
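KL annealing multiplies the KL term by a weight that rises from near 0 to near 1 over training, so the decoder first learns to reconstruct before the KL penalty pushes the posterior toward the prior. A minimal sketch of one such schedule (the sigmoid shape and its hyperparameters are assumptions; Bowman et al. (2016) do not fix them for this task):

```python
import math

def kl_weight(step, total_steps, k=10.0):
    """Sigmoid KL-annealing schedule: returns a weight in (0, 1) that
    grows smoothly with training progress; k controls the steepness."""
    x = step / total_steps  # training progress in [0, 1]
    return 1.0 / (1.0 + math.exp(-k * (x - 0.5)))

# Schematically, the per-example training loss becomes:
#   loss = reconstruction_nll + kl_weight(step, total_steps) * KL(q(z|y) || p(z))
```

Early in training the KL term contributes almost nothing; by the end the full variational objective is optimized.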

Candidate Pruning
Though candidates generated by such an RNN-based variational autoencoder are diverse, they are not always grammatical, as pointed out by Bowman et al. (2016). Therefore, in this module, we prune the candidate list and retain only the grammatical ones. Toward this, we train a grammaticality classifier on the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2018), a dataset of 10,657 English sentences from linguistics publications labeled as grammatical or ungrammatical. We select BERT (Devlin et al., 2019) as the classification model and fine-tune it on the CoLA dataset. We choose BERT to best capture both syntactic and contextual information, and select the 'bert-base-cased' model for its better computational efficiency.

Response Selection
We now have a collection of diverse and grammatical counterspeech responses. Finally, we aim to select the most relevant response to a given hate speech instance.
Taking into consideration the limited training instances that are realistically available (Chung et al., 2019;Qian et al., 2019), and inspired by the recent success of pretrained models (Devlin et al., 2019), we innovate on a pretrained response selection model for task-oriented dialogue systems (Henderson et al., 2019) and perform fine-tuning on our dataset. Henderson et al. (2019) proposed two response selection methods, but we find that neither of them is well-suited for our task.
1. Train a response selection classifier with the negative sampling technique: It relies on randomly drawing other candidates from the candidate pool as negative examples. However, in our task, one hate speech instance usually has multiple appropriate counterspeech instances. For example, given the hate speech in Table 1, there are many other instances that can work as quality counterspeech, such as "You cannot blame all people for the actions of a few. Banning something altogether will not solve anything." or "Does prohibition of anything ever work? I thought religious tolerance was one of our 'British values'?". Therefore, many wrongly chosen negative examples may negatively impact the inductive bias of the response selection classifier.
2. Select by cosine similarity: We point out that the embeddings of the input (hate speech) and the responses (counterspeech candidates) do not share the same latent vector space; therefore, the learned embeddings and their cosine similarities may not fully serve the purpose of relating the response to the input.
Therefore, instead of adopting the two available methods directly, we improve on the second one by fusing the latent spaces of the input and the responses, inspired by Gao et al. (2019). Specifically, we propose to learn a linear embedding mapping from the latent space of the responses to the latent space of the input, and then select the best response by cosine similarity. Mathematically, we use e_x to denote the input embedding and e_y to denote the response embedding. We aim to learn a linear mapping from e_y to e_y′, where e_y′ = (W + B·I) · e_y, W and B are learnable parameters, and I is the identity matrix. We learn the mapping such that the sum of the cosine similarities between e_x and e_y′ over the training data is maximized. By way of this transformation, e_y′ maps the vector space of the responses to that of the input, and thus allows the pretrained model to effectively utilize the discriminative power of the sentence embeddings. We empirically observe that the linear mapping works well and leave more advanced mapping techniques for future work.
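The mapping-then-select step can be sketched numerically. The paper learns W and B by maximizing summed cosine similarity; as a simplified stand-in, the sketch below fits the linear map by ridge-regularized least squares (a closed-form assumption, not the paper's optimizer) and then selects by cosine similarity in the fused space:

```python
import numpy as np

def fit_linear_map(E_y, E_x, reg=1e-3):
    """Fit a linear map M so that E_y @ M approximates E_x, i.e. response
    embeddings are carried into the input (hate speech) embedding space.
    Ridge-regularized least squares stands in for the cosine objective."""
    d = E_y.shape[1]
    return np.linalg.solve(E_y.T @ E_y + reg * np.eye(d), E_y.T @ E_x)

def select_response(e_x, E_cand, M):
    """Map all candidate embeddings, then return the index of the candidate
    whose mapped vector is most cosine-similar to the input embedding."""
    mapped = E_cand @ M
    sims = (mapped @ e_x) / (
        np.linalg.norm(mapped, axis=1) * np.linalg.norm(e_x) + 1e-9)
    return int(np.argmax(sims))
```

With the two spaces fused this way, cosine similarity becomes a meaningful relevance score between a hate speech embedding and each candidate.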

Empirical Evaluation
In this section, we empirically evaluate the performance of our proposed approach and a set of baseline models.

Experimental Setup
Datasets: We use the benchmark datasets collected by Qian et al. (2019), which are fully-labeled hate speech intervention datasets collected from Reddit and Gab, comprising 5,257 and 14,614 hate speech instances respectively. We use the filtered conversation setting of Qian et al. (2019), which includes only the posts labeled as hate speech and discards other non-hateful conversations. Besides, we use the English-language portion of the CONAN dataset (Chung et al., 2019), which contains counterspeech for 408 hate speech instances, written by experts trained in countering hatred. The Reddit, Gab and CONAN datasets have on average 2.66, 2.86 and 9.47 ground-truth counterspeech per hate speech instance respectively.

Training Data: Since each hate speech instance can have multiple ground-truth counterspeech, we follow Qian et al. (2019) to disaggregate the counterspeech and construct a (hate speech, counterspeech) pair for each ground-truth counterspeech in each dataset. Given a counterspeech dataset, we randomly choose 70% of the (hate speech, counterspeech) pairs for model training, 15% for cross-validation and the remaining 15% for testing.

Baselines: We compare our proposed approach with the following competitive baseline models:
1. Seq2Seq (Sutskever et al., 2014; Cho et al., 2014) is a widely used neural model for language generation. We use 2 bidirectional Gated Recurrent Unit (GRU) layers for the encoder and 2 GRU layers followed by a 3-layer neural network as the decoder.
2. Maximum Mutual Information (MMI) (Li et al., 2016) is a diversity-promoting approach for neural conversation models. We implement the MMI-bidi model (Li et al., 2016) and adopt incremental learning (Ranzato et al., 2016) to facilitate robust training.
3. SpaceFusion (Gao et al., 2019) optimizes both diversity and relevance by introducing a fused latent space, where the direction and distance from the predicted response vector roughly match the relevance and diversity, respectively. We align the direction parameter with the ground-truth counterspeech. To better exercise the diversity power, we randomly choose the distance parameter at each generation.
4. BART (Lewis et al., 2020) is a state-of-the-art pre-trained sequence-to-sequence model for language generation. It has a standard Transformer-based neural machine translation architecture, which can be seen as generalizing BERT (Devlin et al., 2019), GPT (Radford et al., 2018) and many other pretraining schemes. We fine-tune the BART model on our training data.
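The 70/15/15 split of disaggregated pairs described above can be sketched as follows; the seed and shuffling procedure are assumptions added for reproducibility, not details given in the paper:

```python
import random

def split_pairs(pairs, seed=0, train=0.70, val=0.15):
    """Randomly split (hate speech, counterspeech) pairs into
    train / validation / test portions (default 70% / 15% / 15%)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train = round(train * n)
    n_val = round(val * n)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])
```

Because the split operates on disaggregated pairs, the same hate speech instance may appear in more than one portion with different ground-truth counterspeech, matching the setting of Qian et al. (2019).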
We compare with Seq2Seq since it was initially proposed and used by Qian et al. (2019). 2 We select MMI, SpaceFusion and BART as baselines because they are state-of-the-art models in promoting diversity, optimizing both diversity and relevance, and generating quality language, respectively.

Evaluation
We evaluate all model outputs along three dimensions: diversity, relevance and language quality. Diversity refers to vocabulary richness, variety in expression and the extent to which a response is dissimilar from the rest in a generated collection of responses. Relevance captures the extent to which the counterspeech addresses the central aspect of the hateful message and makes a coherent conversation towards mitigating the hate speech. A low relevance score means that the counterspeech is irrelevant to the hate speech or off-topic (e.g., the hate speech talks about LGBTQ whereas the counterspeech is related to religious beliefs). Language quality measures whether the generated responses are grammatical, fluent and readable.

Automatic Evaluation
We evaluate diversity by distinct n-grams (Dist-n) (Li et al., 2016), Entropy (Ent-n) (Zhang et al., 2018) and Self-BLEU (Zhu et al., 2018). For relevance, we compare 1) the generated response with the ground-truth counterspeech, by BLEU (Papineni et al., 2002) and ROUGE (Lin and Hovy, 2003; Lin, 2004) for syntactic similarity, and by MoverScore (Zhao et al., 2019) and BERTScore (Zhang et al., 2020a) for semantic similarity; and 2) the generated response with the hate speech, by BM25 (Manning et al., 2008), a relevance estimation function widely used in information retrieval. We adopt GRUEN (Zhu and Bhat, 2020) to evaluate the language quality. Note that larger scores indicate better quality, except for Self-BLEU.
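Dist-n is simple to compute over a collection of responses: the number of distinct n-grams divided by the total n-gram count. A minimal sketch (whitespace tokenization is a simplifying assumption):

```python
def dist_n(responses, n):
    """Dist-n (Li et al., 2016): distinct n-grams / total n-grams over a
    collection of responses; higher means richer vocabulary."""
    total = 0
    distinct = set()
    for r in responses:
        toks = r.split()
        grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
        total += len(grams)
        distinct.update(grams)
    return len(distinct) / total if total else 0.0
```

A collection of identical responses scores low (repeated n-grams inflate the denominator), while a collection with no repeated n-grams scores 1.0.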

Human Evaluation
Following Qian et al. (2019), we ask human annotators to perform "winner selection" instead of numeric ratings. We randomly sampled 100 hate speech instances from each dataset's test collection, along with the corresponding model-generated counterspeech outputs from Seq2Seq, MMI, BART, and our model. 3 Two human annotators (native English speakers who are sensitive to online hate speech) were presented with one hate speech instance and the four generated outputs, and were asked to evaluate the outputs for relevance and language quality separately. As for diversity, the annotators were presented with 10 responses from each of the four models at a time, and were asked to select the model with the best diversity. For each of the three dimensions, the best output received a 1 (ties were allowed, i.e., multiple outputs could receive a 1) and the rest received a 0.

Results
We obtain the results of our model (denoted as "GPS") by generating 30k, 30k and 40k candidates for the CONAN, Reddit and Gab datasets respectively. 4 After the Candidate Pruning module, we retain only the grammatical candidates. The results by automatic metrics and human evaluation metrics are presented in Table 2 and Table 3 respectively. Overall, GPS has the best diversity, with significant margins over the baselines. For relevance, GPS has slightly better performance on BLEU, ROUGE, MoverScore and BERTScore, and much better performance on BM25. This implies that the counterspeech generated by GPS is more related to the hate speech and therefore makes for more coherent conversations. Examples of counterspeech generated by GPS are presented in Table 4. We find that GPS is able to generate diverse and relevant responses rather than merely commonplace ones, such as "Please refrain from using such language". Comparative case studies for the different baseline models are shown in Appendix A.4. Therefore, we conclude that GPS has the best diversity and relevance compared to the baselines. Besides, GPS has language quality comparable to the best baseline model, BART.

Table 4: Examples of counterspeech generated by GPS.
(a) GPS: You cannot blame all people for actions of a few. Banning something altogether will not solve anything.
(b) Hate speech: The world would be a better place without Muslims. They are only killing and raping our children. GPS: The world would actually be a very dark place without Muslims who contribute a lot to our society. What about our Muslim doctors, scientists, philanthropists, actors, job-creators?
(c) Hate speech: What is the point in listening to black women? They are all retarded. GPS: The point of listening to black women, and women in general, is that they could broaden and enlighten your view of the world.
Among these baselines, BART is the strongest, with much better relevance and language quality. Yet BART still suffers from the diversity issue, as discussed in Section 4.3. SpaceFusion has very poor results overall, though a manual inspection of the latent-space fusion visualization suggests otherwise. One explanation is that SpaceFusion, with substantially more parameters than the Seq2Seq model, may not have had sufficient training instances for its optimal performance. In their own experiments, Gao et al. (2019) demonstrate that SpaceFusion worked well on two datasets with 0.2M and 7.3M conversations, which is at least one to two orders of magnitude larger than our datasets. If provided with more training data, SpaceFusion could possibly be a strong candidate too. In comparison, though BART is an even more complicated model with 139M parameters, it was pre-trained on the BooksCorpus dataset (Zhu et al., 2015), with over 7,000 unique unpublished books, and has the benefit of being fine-tunable.

Ablation Study
We compare with the following ablations of GPS and show the results in the ablation table. G-BART has almost the same performance as GPS; therefore, we select the RNN-based variational autoencoder for candidate generation for its better computational efficiency. Compared with the full model, though P-no has slightly better performance on diversity, it performs poorly on both relevance and language quality. The three ablation methods for response selection have similar performance: they are comparable to GPS on diversity and language quality, but yield worse results on relevance.
The ablation study demonstrates the significance of the Candidate Pruning module and our proposed Response Selection method. It also implies that diversity, language quality and relevance are improved by the Candidate Generation module, the Candidate Pruning module, and the Response Selection module respectively.

Generation vs. Selection
This section studies the relationship between the Candidate Generation module and the Response Selection module. The more candidates we generate, the more diversity the model can potentially gain. However, one might expect the selection model to suffer from a very large candidate pool, resulting in poor relevance. Empirically, as shown in Figure 3, we find that once the number of generated candidates passes a threshold, the diversity (i.e., the blue line) almost converges. Besides, we also find that relevance is not compromised and remains relatively stable even as more candidates are generated beyond the threshold. Therefore, for efficient computation, we select the number of candidates at the "elbow" point based on performance on the validation dataset. (In Figure 3, Dist-2 is scaled by 10 for better visualization.)

Explicit Relevance (BM25) vs. Diversity
Based on the reasoning that models with better BM25 scores should specifically address the central aspect of the hate speech and thus produce dissimilar responses for different hate speech instances, we hypothesize that models with better BM25 should generate more diverse responses. Therefore, we present scatter plots of BM25 and diversity scores for all five models (in Section 3.1) on all three datasets together in Figure 4, resulting in 15 data points per subfigure. We find that BM25 and diversity have a reasonably strong correlation (Pearson's correlation scores are 0.47 and -0.60 for Ent-1 and Self-BLEU-1 respectively).
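The reported correlations use the standard sample Pearson formula over the 15 (model, dataset) points; a minimal sketch:

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two score lists,
    e.g. BM25 scores vs. a diversity metric across (model, dataset) points."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Note the signs line up with the metrics' directions: Ent-1 increases with diversity (positive correlation with BM25), while Self-BLEU-1 decreases with diversity (negative correlation).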

Related Work
We focus on three areas related to the problem of hate speech and its countermeasures: (i) psychosocial analysis, (ii) automatic counterspeech generation, and, more broadly, (iii) conversational language generation.

Psychosocial Analysis of Counterspeech
Effectiveness of Counterspeech: There is significant research interest in understanding the effectiveness of counterspeech in fighting hatred and de-escalating conversations, as evidenced by a growing number of recent studies (Schieb and Preuss, 2016; Munger, 2017; Mathew et al., 2018). Munger (2017) found that subjects who were educated by high-follower white males significantly reduced their use of racist slurs on Twitter. Schieb and Preuss (2016) studied counterspeech on Facebook via a simulation and concluded that counterspeech could have a considerable impact on a given audience, and that the impact was a function of the proportion of hate speakers in the audience. In a subsequent study, Mathew et al. (2018) recorded the case of a user who, after seeing the counterspeech posted to her hateful messages on Twitter, openly apologized for her actions. Beyond academia, some organizations have also set out to promote countermeasures via campaigns such as the No Hate Speech Movement 5 and the Facebook counterspeech campaign 6. Therefore, Benesch (2014) and Mathew et al. (2019b) suggest that counterspeech can be regarded as one of the most promising and "constitutionally preferred" approaches to hate speech. In addition, counterspeech can be likened to the effect of prosocial active bystanders in face-to-face bullying scenarios, where bystander intervention (speaking on behalf of the victim) has been found to successfully abate victimization most of the time (O'Connell et al., 1999; Craig et al., 2000). By analyzing hate speech and counterspeech accounts on Twitter, Mathew et al. (2018) found that hate tweets by verified accounts were much more viral than tweets by non-verified accounts. Mathew et al. (2019a) studied how hate speech spreads in online social media. More recently, Sap et al. (2020) studied pragmatic formalisms to capture ways in which people express social biases and power differentials in language, permitting a broader computational framework for processing hate speech.

Counterspeech Generation
Though the effectiveness of counterspeech is well-motivated from both psychosocial and linguistic perspectives, the limits of manual counterspeech generation at scale have prompted automatic generation of counterspeech, an area that has received little attention to date. The first key challenge in this direction is the creation of reliable counterspeech datasets of high quality. Mathew et al. (2019b) contributed one of the earliest counterspeech datasets. Tekiroglu et al. (2020) proposed an approach to collect counterspeech responses in a more effective manner, but have not yet released a quality dataset. In our work, we conduct experiments on all the publicly available datasets to date (Chung et al., 2019; Qian et al., 2019), to the best of our knowledge.
Research on NLG algorithms for counterspeech generation is still in its infancy. Qian et al. (2019) made the only initial attempt and proposed the use of three neural models to generate counterspeech. However, they only experimented with the most basic model architectures (e.g., Seq2Seq) to prove the feasibility of the task, and leave the performance improvement for future work. In our work, we extend their results by studying more advanced architectures, identifying principal dimensions of effective counterspeech, and proposing a novel pipeline to better solve the problem. To the best of our knowledge, this paper represents the first successful pipeline model for counterspeech generation.
From the technical perspective, our work shares some high-level similarities with Tekiroglu et al. (2020), since we both use generative models to generate candidates. However, we would like to highlight that our essential goals are different: Tekiroglu et al. (2020) aim to collect quality data by enabling language models and studying human annotation strategies, while we aim to generate counterspeech for a given hate speech instance.

Conversational Language Generation
Counterspeech generation is broadly related to conversational language generation, where most of the best-performing approaches are based on neural models trained in a sequence-to-sequence manner (See et al., 2019a). Despite the good performance of these models, one of their widely acknowledged intrinsic drawbacks is the generation of safe and commonplace responses (Sordoni et al., 2015) due to an improper objective function (Li et al., 2016), lack of model variability (Serban et al., 2017; Zhao et al., 2017), a weak conditional signal (Tao et al., 2018), and model over-confidence (Jiang and de Rijke, 2018). This tendency has prompted the study of methods that improve diversity and has resulted in a wide variety of solutions, such as optimizing a different loss function (Li et al., 2016; Zhang et al., 2018), varying the latent space (Shao et al., 2019; Gao et al., 2019), utilizing adversarial learning (Xu et al., 2018; Shetty et al., 2017; Shi et al., 2018), and leveraging non-conversational information (Wu et al., 2020; Su et al., 2020; Tu et al., 2019). Our work differs from all of the above in that we adopt a pipeline model which promotes diversity by generating a variety of candidates. As such, it does not have the aforementioned intrinsic drawback of a sequence-to-sequence model.

Conclusion and Future Work
We proposed a three-module pipeline -Generate, Prune, Select for counterspeech generation against online hate speech. Empirical evaluation on three datasets demonstrates that our model is effective in producing diverse and relevant counterspeech.
Future work could include the following two directions: 1) stylistic counterspeech generation: Mathew et al. (2019b) find that different counterspeech styles/strategies may be needed for different hate speech topics; therefore, it would be interesting to develop new techniques to generate the most effective style of counterspeech for each hate topic. We think this could be a natural extension to our proposed model, since we can utilize a style classifier in the Candidate Pruning module. 2) system deployment: studying the real social impact of automatic counterspeech generation in reducing online hate speech via system deployment and actual activity monitoring can directly inform research in this area. Reproducibility: Our code is available at https://github.com/WanzhengZhu/GPS.

Ethical Considerations
We recognize that studying counterspeech generation necessarily requires us to confront online content that may be offensive or disturbing. However, deliberate avoidance does not eliminate such problems (Sap et al., 2020). Since the effectiveness of counterspeech has already been widely studied (see Section 4.1), our work makes a positive step towards automating the process, which could potentially educate hate speakers and mitigate hate speech online. Besides, the automation could help reduce the amount of human work and therefore the potential harm to human moderators (Barrett, 2020; Zhu et al., 2021). In addition, the collective analysis of large corpora of hate speech and counterspeech can also be insightful for educating people on consciously or unconsciously reducing the usage of hate speech in their language. Risks in deployment: The deployment of counterspeech generation (e.g., de los Riscos and D'Haro, 2020) should be undertaken only after paying attention to several ethical aspects, some of which we list below.
• Social and racial bias (Sap et al., 2020): Does the model have any pragmatic implications which project unwanted social or racial biases and stereotypes onto online users?
• Fairness (Mitchell et al., 2019; Corbett-Davies et al., 2017): Can the model ensure fairness for different demographic groups or speakers of different forms/dialects/vernaculars of English?
• Failure cases: Are there any failure cases which could further incite more aggressive hate speech? It is crucial to ensure that counterspeech deployment does not escalate a given hateful situation.
• Evaluation metrics (Corbett-Davies et al., 2017): The present study improves upon prior work with more comprehensive evaluations of diversity, relevance and language quality. However, these three criteria may not be sufficient for deployment in a realistic setting, and there may be additional criteria associated with effectiveness.
• Potential nefarious side effects and misuse potential (Lau et al., 2020): How can we ensure that our model is not misused for other, unwanted purposes?
Given the limited scope of the present study, we call for attention to these aspects by way of well-designed experiments before deploying counterspeech generation bots.
Regulatory standpoint on the present study: The Institutional Review Board (IRB) gave us clear feedback on what is considered human research and is thus subject to IRB review. Analyses relying on user-generated content do not constitute human-subject research, and are thus not the purview of the IRB, as long as 1) the data analyzed are posted on public fora and were not the result of direct interaction between the researchers and the people posting, 2) there are no private identifiers or personally identifiable information associated with the data, and 3) the research does not correlate different public sources of data to infer private data. 7 All of these conditions apply to the present study. Additionally, the hate speech and counterspeech instances were secondary data, previously collected by Qian et al. (2019) and Chung et al. (2019), and the annotators in our study were evaluating the quality of the generated sentences only.

Risks in annotation:
The data we use in this paper were posted on publicly accessible websites and do not contain any personally identifiable information (i.e., no real names, email addresses, IP addresses, etc.). The annotators were undergraduate assistants in the lab who received research credit for their annotation and were blind to the systems they were annotating. They were warned about the offensive content before they read the data, and were informed that they could quit the task at any time if they were uncomfortable with the content.

A.1.1 Diversity
We measure distinct n-grams (Dist-n) (Li et al., 2016), Entropy (Ent-n) (Zhang et al., 2018), and Self-BLEU (Zhu et al., 2018) for diversity. Dist-n reflects vocabulary diversity by dividing the number of unique n-grams by the total number of n-grams in the model output. One limitation of Dist-n is that it fails to accommodate the frequency differences among n-grams. To account for these differences, we also use the Entropy metric (Zhang et al., 2018), which reflects how even the empirical n-gram distribution is.
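As a concrete illustration, both metrics can be computed in a few lines. This is a minimal sketch, not the authors' implementation; whitespace tokenization is a simplifying assumption.

```python
from collections import Counter
from math import log

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dist_n(responses, n):
    """Dist-n: unique n-grams divided by total n-grams over all responses."""
    all_grams = [g for r in responses for g in ngrams(r.split(), n)]
    return len(set(all_grams)) / len(all_grams) if all_grams else 0.0

def ent_n(responses, n):
    """Ent-n: Shannon entropy of the empirical n-gram distribution.
    Higher values mean the distribution is more even."""
    counts = Counter(g for r in responses for g in ngrams(r.split(), n))
    total = sum(counts.values())
    return -sum(c / total * log(c / total) for c in counts.values())
```

Note that Ent-n distinguishes cases Dist-n cannot: two outputs with the same number of unique n-grams score differently if one reuses a few n-grams heavily.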
Though Dist-n and Ent-n evaluate vocabulary diversity well, they fail to capture inter-response diversity. For instance, they favor responses with diverse n-grams even when those responses are highly similar to the rest. Therefore, to account for inter-response diversity, we use Self-BLEU (Zhu et al., 2018), which evaluates how closely one response resembles the rest of a generated collection. Self-BLEU treats each generated sentence in turn as the hypothesis and the remaining generated sentences as the references, and computes a BLEU score for every generated sentence. The smaller the Self-BLEU, the better the diversity.

A.1.2 Relevance
Most existing works measure relevance implicitly with BLEU and ROUGE, metrics that evaluate syntactic similarity between the ground truth and the generated output. They assume that the ground truth is highly relevant to the conversational input (in our task, the hate speech) and that, therefore, the "closer" the generated output is to the ground truth, the more relevant it is to the hate speech instance.
Explicit relevance evaluation (i.e., relatedness between the conversational input and the generated output) has been studied in only a few existing works. For instance, See et al. (2019b) and Zhang et al. (2020b) ask human annotators to evaluate relevance explicitly. Other work proposes HIT-Q and HIT-R, two hit-rate-based metrics that require hand-crafted rules. Among automatic metrics, Gao et al. (2019) propose "Precision" to measure relevance. However, we consider "Precision" inappropriate in our setting, because it measures only the relationship between the generated output and the ground truth, not the relationship between the generated output and the conversational input.
Since there is no consensus on which automatic metric best serves the purpose of explicit relevance, we select BM25 (Manning et al., 2008), a relevance estimation function widely used in information retrieval. In addition, we follow existing works and evaluate implicit relevance by measuring BLEU and ROUGE for syntactic similarity, and MoverScore and BERTScore for semantic similarity.
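For illustration, a basic Okapi-style BM25 scorer looks like the following. This is a sketch under common default parameters (k1 = 1.5, b = 0.75) with a +1-smoothed IDF; the exact variant and parameters used in the paper are not specified here.

```python
from collections import Counter
from math import log

def bm25_score(query, document, corpus, k1=1.5, b=0.75):
    """Okapi BM25 relevance of `document` to `query`.

    `corpus` is the list of texts used to estimate IDF and the average
    document length. All texts are whitespace-tokenized.
    """
    doc = document.split()
    docs = [d.split() for d in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n_docs = len(docs)
    tf = Counter(doc)
    score = 0.0
    for term in query.split():
        df = sum(1 for d in docs if term in d)           # document frequency
        idf = log((n_docs - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = tf[term]                                     # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

In our setting, the hate speech instance plays the role of the query and each candidate counterspeech plays the role of the document, so a higher score indicates a candidate that is lexically more responsive to the hate speech.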

A.1.3 Language Quality
GRUEN (Zhu and Bhat, 2020) is the only existing open-source unsupervised metric that measures the language quality of generated text. It requires no reference to compare against and has been shown to correlate well with human annotations on a variety of language generation tasks.

A.2 Robustness Analysis
To see how the robustness of the baseline neural models changes with the number of training epochs, we plot relevance and diversity, measured by automatic metrics, against the number of epochs for Seq2Seq and MMI in Figure 5. In each sub-figure, the middle line indicates relevance while the other two lines indicate diversity.
We note that diversity increases with the number of epochs until it converges at about 100 epochs. Surprisingly, relevance spikes in the first few training epochs and then converges to a lower score at about 50 epochs. We inspected the results around the spike and observed that the model learns to produce only a few generic, repetitive counterspeech responses (e.g., "Hi there, please refrain from using derogatory comments in the thread. They are hurtful and unwanted. If you continue, Admin will be alerted.") for all hate speech. Such generic responses, though they yield high relevance scores (e.g., BLEU and ROUGE), are ineffective due to their lack of diversity. With more training epochs, the models learn to produce more diverse responses at the cost of reduced BLEU and ROUGE.
Note that all results in this paper (e.g., Table 2) are reported after both relevance and diversity stabilize (i.e., 100 epochs of training). Qian et al. (2019) report higher BLEU and ROUGE scores for the Seq2Seq model than our results in Table 2, and we suspect that their reported results were obtained with only a few epochs of training.

A.3 Efficiency Comparison
We implemented all models in Python 3.7 and conducted all experiments on a machine with twenty 2.9 GHz Intel Core i7 CPUs and one GeForce GTX 1080 Ti GPU. We report the average training time over the three datasets. Seq2Seq: 4.2 hours; MMI: 7.8 hours; BART: 7.1 hours; SpaceFusion: 16.2 hours (running on the CPUs only); GPS: 4.0 hours. We observe that our model requires lower or similar training cost compared to the baselines.

A.4 Case Studies
Table 5 presents case studies of the responses generated by the different models. In cases (a) and (b), both BART and our model produce reasonable responses, whereas Seq2Seq and MMI produce only nonsense. In cases (c)-(e), Seq2Seq, MMI, and BART generate generic, safe responses, while our model directly targets the offensive words (e.g., "twat", "fairy gay faggot") in the hate speech, and in case (c) even shows understanding of and a kind warning to the hate speaker. Our model may therefore make hate speakers feel their voices have been heard, and is closer to a human-like moderator. Moreover, we find that BART sometimes identifies the wrong hate words: in case (d), the hate word is "twat" while BART refers to "troll" in its response; in case (e), the hate word is "fairy gay faggot" while BART refers to "kike". Such incorrect referrals could irritate hate speakers and make them even more offensive.

A.5 Human Annotation Guidelines
Table 6 presents the human annotation guidelines and examples for the three dimensions. The inter-annotator reliability scores are 0.50, 0.46, and 0.36 for diversity, relevance, and language quality, respectively.