Towards Knowledge-Grounded Counter Narrative Generation for Hate Speech

Tackling online hatred with informed textual responses, called counter narratives, has recently been brought under the spotlight. Accordingly, a research line has emerged on automatically generating counter narratives, in order to facilitate direct intervention in hate discussions and to prevent hate content from spreading further. Still, current neural approaches tend to produce generic/repetitive responses and lack grounded and up-to-date evidence such as facts, statistics, or examples. Moreover, these models can create plausible but not necessarily true arguments. In this paper we present the first complete knowledge-bound counter narrative generation pipeline, grounded in an external knowledge repository that can provide more informative content to fight online hatred. Together with our approach, we present a series of experiments showing its feasibility for producing suitable and informative counter narratives in in-domain and cross-domain settings.


Introduction
Standard approaches for online hate mitigation generally rely on content moderation, ranging from deletion of hate content and suspension of user accounts to shadow banning. However, these approaches may draw limits on free speech and diverse opinions. An alternative approach is to directly intervene in the conversation with counter narratives. A Counter Narrative (CN) is a non-negative response to a Hate Speech (HS), targeting and contradicting extreme statements with fact-bound arguments or alternative viewpoints (Benesch, 2014; Schieb and Preuss, 2016). Such a strategy seeks to de-escalate the conversation, disengage from hateful sentiment, and encourage mutual understanding through an exchange of opinions. Many Non-Governmental Organizations (NGOs) specialized in hate countering are already adopting this approach by training operators to compose counter narratives. According to NGO guidelines, proper CNs should also include credible evidence.

HS: The world would be a better place without Muslims. They are only killing and raping our children.
CN1: Evidence for this? This is not true. How can you say this about an entire faith?
CN2: On the contrary, most children abuse is operated by people they know: a relative, family friends, sports coach, someone in a position of trust and authority. Besides, Muslims help people - A Muslim woman rushed to help the victims of a triple stabbing in Manchester on New Year's Eve.
CN3: You are truly one stupid backwards thinking idiot to comment on Muslims like that.

Table 1: A hate speech example with several possible counter narratives.

In Table 1, we present an HS along with several CNs. Although CN1 and CN2 are both appropriate responses, not all appropriate CNs are equally effective (Silverman et al., 2016; Tuck and Silverman, 2016). CN2 is expected to be more effective since it is tailored to the HS and demonstrates contextualized and persuasive statements with supporting facts, knowledge, and logical reasoning (Habernal and Gurevych, 2016), rather than a simple generic request for evidence as in CN1. Conversely, CN3 is a natural but inappropriate and aggressive response.
In this context, scaling manual responses to massive amounts of online hatred is an undeniably challenging task for NGO operators. Recently, studies have started to focus on data and strategies for the automatic generation of counter narratives in an end-to-end fashion (Chung et al., 2019; Qian et al., 2019; Tekiroglu et al., 2020; Chung et al., 2020). However, these seminal studies, based on training models using only the HS-CN data samples, do not directly address the properties required for proper and effective CNs, such as providing credible evidence and knowledge with facts, statistics, or examples. While such models possess knowledge to a certain extent through the underlying generative language models (LMs) such as GPT-2 (Radford et al., 2019), this knowledge (e.g., events or news) becomes outdated over time. Furthermore, such models can create plausible but not necessarily true arguments, a problem known as content hallucination (Zellers et al., 2019; Solaiman et al., 2019). A hallucinated example of a fact-bound CN, generated by a GPT-2 model fine-tuned as done by Tekiroglu et al. (2020), is provided in Table 2. Therefore, in this paper we investigate the generation of knowledge-bound counter narratives, which, to the best of our knowledge, has not been studied before. We hypothesize that knowledge infusion can not only help generate informative CNs but also help handle hate speech from unseen domains (i.e., hate targets) without further training (e.g., train on Islamophobia and test on Antisemitism).
HS: Muslims are invading our country
CN: Actually, there were 16,938,000 Muslims in the UK in 2016, so if you exclude London, that is actually increasing by 2%, which doesn't seem very significant.

Table 2: Hallucinated CN generated by GPT-2 that is fluent and credible (according to the Office for National Statistics, the Muslim population is just above 3M).
To this end, we explore methodologies for generating informative CNs using external knowledge. In particular, we propose an extension of knowledge-grounded generation approaches by adopting an intermediate step in which we generate keyphrases for retrieving the needed knowledge. We first train a counter narrative keyphrase generator, then employ the generated keyphrases to select relevant knowledge sentences. Finally, pre-trained LMs are fine-tuned on the relevant knowledge sentences, together with the HS input, to produce knowledge-augmented CNs. Our extensive experiments on CN generation, including both automatic and expert evaluation, demonstrate that the presented approach produces more specific and tailored responses in both in-domain and zero-shot cross-domain configurations, as compared to other approaches, such as standard LMs simply fine-tuned for the task without the use of external knowledge.
As our main contribution, we show that: (i) external knowledge can boost informative CN generation, (ii) keyphrase generation improves the quality of the retrieved documents, (iii) silver knowledge is usable for the task when no gold knowledge is available, and (iv) knowledge-bound models are advantageous for zero-shot cross-domain generation, especially (v) when knowledge is injected into large pre-trained LMs.

Related Work
In this section we review three main research topics relevant to fighting hatred online: (i) studies on CN effectiveness in hate countering, (ii) counter-argument generation, and (iii) knowledge-guided generation.
Hate countering. Employing counter narratives has been shown to be an effective strategy for hatred intervention on social media platforms. Studies have focused on identifying successful counter narratives (Benesch et al., 2016a,b), evaluating their efficacy (Schieb and Preuss, 2016; Silverman et al., 2016; Ernst et al., 2017; Munger, 2017), and analyzing the characteristics of counter speakers' accounts (Mathew et al., 2018). In particular, by analyzing conversations from Twitter, Wright et al. (2017) show that some arguments among strangers induce favorable changes in discourse and attitudes.

Counter-argument generation shares similar objectives with CN generation, i.e., producing the opposite or an alternate stance to a statement, but the latter faces peculiar difficulties such as the absence in HS of explicit or well-structured 'arguments' (e.g., "Islam is a disease") and the limited amount of data available for training. Studies usually focus on domains with large discussions, e.g., politics (Hua and Wang, 2018) and economy (Le et al., 2018). The closest work to ours is counter-argument generation with external knowledge augmentation by Hua et al. (2019). Our approach differs from theirs in three aspects: (i) we explore generating queries to extract knowledge for grounding the CN, (ii) pre-trained generative models are utilized to leverage the knowledge provided, and (iii) our approach requires less manipulation of the knowledge.

Knowledge-guided generation. There is a growing interest in exploiting external knowledge to generate informative responses for applications such as dialog systems (He et al., 2017; Young et al., 2018) and question answering (Das et al., 2017; Saha et al., 2019). Previous approaches inject knowledge through topic phrases (Fan et al., 2019), structured knowledge graphs (Zhou et al., 2018), and unstructured texts (Dinan et al., 2019; Hua et al., 2019).

HS-CN Dataset
To the best of our knowledge, there is no high-quality hate speech/counter narrative dataset available yet in which CNs are explicitly paired with relevant knowledge. Since constructing such a dataset of decent size would be too costly and out of the scope of the present paper 2 , we resort to a "reverse-engineering" strategy, automatically pairing relevant knowledge with an already existing high-quality CN dataset. We chose CONAN (Chung et al., 2019), a dataset niche-sourced to expert NGO operators, which offers high-quality CNs and the best and most diverse material among existing CN datasets (Tekiroglu et al., 2020).

Architecture
Our architecture, illustrated in Figure 1, consists of a knowledge retrieval module that retrieves sentence-level relevant knowledge and a generation module that generates a counter narrative. Specifically, the knowledge retrieval module first prepares variants of a query Q for a given hate speech HS using two strategies: query extraction (Q_hs) and automatic query generation (Q_gen). The obtained queries are then employed to search for relevant knowledge articles via a search engine. Finally, a sentence selector filters and ranks the most relevant sentences from the retrieved articles as the relevant knowledge (KN). For the counter narrative generation module, we fine-tune several LMs that take an HS and the ranked knowledge sentences KN as input and output a corresponding counter narrative.
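The end-to-end flow can be sketched as follows. All function names and the toy scoring heuristics below are ours for illustration only; the actual system uses Keyphrase Digger for extraction, Solr/BM25 for retrieval, ROUGE-L-based sentence selection, and fine-tuned LMs for generation.

```python
# Illustrative sketch of the two-module pipeline (names and heuristics are ours).

def extract_keyphrases(text):
    # Stand-in for a keyphrase extractor such as Keyphrase Digger:
    # naively keep longer content words.
    return [w.strip(".,!?").lower() for w in text.split() if len(w) > 5]

def retrieve_articles(query, index, k=25):
    # Stand-in for BM25 retrieval over the article index: rank articles
    # by the number of query terms they contain, keep the top k.
    overlap = lambda art: sum(q in art.lower() for q in query)
    return sorted(index, key=overlap, reverse=True)[:k]

def select_sentences(query, articles, k=5):
    # Stand-in for ROUGE-L-based sentence selection ("silver knowledge"):
    # split articles into sentences and keep the k best-overlapping ones.
    sents = [s.strip() for a in articles for s in a.split(".") if s.strip()]
    score = lambda s: len(set(query) & set(s.lower().split()))
    return sorted(sents, key=score, reverse=True)[:k]

def generate_cn(hs, knowledge, lm):
    # The generation module conditions a fine-tuned LM on HS + KN.
    return lm(hs + " <hs_end> " + " ".join(knowledge) + " <kn_end>")
```

In the real pipeline, the `lm` callable would be a fine-tuned GPT-2 or XNLG model rather than a toy function.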
2 Obtaining access to a pool of trained NGO operators is very complicated, furthermore keeping track of their search activity and the material they used during CN production would require long and complex data collection sessions that might span several months.

Knowledge Retrieval Module
The knowledge retrieval module in the architecture incorporates a knowledge repository, a query construction sub-module, and a knowledge sentence selection sub-module.

Knowledge Repository
Previous approaches to introducing external knowledge for dialog generation have exploited both unstructured and structured knowledge. Since no structured knowledge is available for the hate speech domain, we rely on unstructured textual knowledge in the form of articles, which allows the knowledge repository to be updated easily. Considering that the proliferation of HS is also triggered by target-related events (e.g., terrorist attacks), being able to update the knowledge, such as news articles, lets us produce proper CNs that contain the latest statistics or evidence from current events.
We include Newsroom (Grusky et al., 2018) and WikiText-103 (Merity et al., 2017) in our knowledge repository. WikiText-103 is a large collection of 28,595 full Wikipedia articles covering over 103 million words. Newsroom consists of 1.3 million articles extracted from major news publications between 1998 and 2017, featuring over 6.9 million words.

Query Construction
To construct comprehensive and proper queries for searching relevant knowledge for the data pairs, we applied two strategies: (i) query extraction and (ii) query generation. In both strategies, the query is composed of keyphrases, which can be defined as the important and topical phrases of a text (Turney, 2000).
Query extraction. We extracted keyphrases from the CONAN dataset using Keyphrase Digger (Moretti et al., 2015), a multilingual keyphrase extraction system that uses statistical measures and linguistic information and has proven to be one of the best systems in unsupervised settings. Following the knowledge retrieval strategy based on the input argument by Hua et al. (2019) for counter-argument generation, we first obtained the HS keyphrases to construct the initial query Q_hs. However, HSs from CONAN mostly contain hateful and simplistic phrases, in comparison to the input arguments used by Hua et al. (2019), which can be rich in content. Therefore, in the HS-CN scenario, we hypothesize that the keyphrases from Q_hs alone would not be sufficient for relevant knowledge search, especially for mapping the knowledge onto the training data.

Figure 1: Architecture of knowledge-grounded generation with extracted (green solid arrow) and generated (dotted arrow) queries (topical phrases) that are exploited to retrieve relevant knowledge. The extracted knowledge sentences, together with the input HS, are fed to CN generation. We give an example of the generative approach.
To this end, we also extracted keyphrases from the CN together with the HS, to increase the possibility that the retrieved knowledge sentences contain pieces of information found in the ground truth. Hence, the second query Q_hs∪cn contains CN keyphrases for relevancy to the target CN and HS keyphrases for preserving the hate context. We investigated the effects of various keyphrase query configurations in terms of HS relevancy, and Q_hs∪cn proved to be the best configuration (see Appendix A.1 for details).
Query generation. Since the best query configuration Q_hs∪cn is not available at test time, we need a way to obtain keyphrases that serve as CN cues for searching knowledge sentences during CN generation. To this end, we built a query generation model that takes an HS as input and outputs a comma-separated list of CN keyphrases, which is then used as Q_gen. Our aim is to approximate Q_hs∪cn with Q_hs∪gen at test time.
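The train- vs. test-time query construction can be sketched as below; the function name and deduplication details are ours, and the keyphrases would come from Keyphrase Digger or the keyphrase generator in practice.

```python
# Sketch of query construction. At training time the gold CN is available,
# so Q_hs∪cn can be built; at test time the CN keyphrases are replaced by
# generated ones, giving Q_hs∪gen.
def make_query(hs_keyphrases, cn_keyphrases=None, generated_keyphrases=None):
    extra = cn_keyphrases if cn_keyphrases is not None else (generated_keyphrases or [])
    seen, query = set(), []
    for kp in hs_keyphrases + extra:  # HS keyphrases preserve the hate context
        if kp.lower() not in seen:
            seen.add(kp.lower())
            query.append(kp)
    return query
```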
The model is trained using the Transformer (Vaswani et al., 2017) architecture, as it has obtained state-of-the-art performance on generation tasks (Dinan et al., 2019; Ghazvininejad et al., 2018). For the training data, we used the CONAN dataset and discarded the CNs shorter than 10 words, since they are usually generic, poor in argumentative content, and cannot support a meaningful search (e.g., "No they are not - prove this?", "What does that even mean?", "Any evidence?"). Accordingly, we kept 4038/1257/1257 instances for the train/dev/test sets. The train set includes the pairs marked as original in the dataset and all pairs translated from French and Italian; the dev set consists of one paraphrase of each original HS and its CNs; and the test set contains the rest of the paraphrased HSs. The training inputs are represented as HS [HS end token] KP [KP end token], where KP is the list of keyphrases extracted from the gold CN.
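One way to serialize such training instances is sketched below; the literal token strings are placeholders for the paper's "[HS end token]" / "[KP end token]" special symbols.

```python
# Placeholders for the special end tokens used in the input representation.
HS_END, KP_END = "[HS_end]", "[KP_end]"

def keep_pair(cn, min_words=10):
    # Discard generic CNs shorter than 10 words (e.g., "Any evidence?"),
    # which cannot support a meaningful knowledge search.
    return len(cn.split()) >= min_words

def format_instance(hs, cn_keyphrases):
    # Input representation: HS [HS end token] KP [KP end token],
    # where KP is a comma-separated list of gold-CN keyphrases.
    return f"{hs} {HS_END} {', '.join(cn_keyphrases)} {KP_END}"
```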
The model was trained following the configuration of the base model in Vaswani et al. (2017): 6 transformer layers, 8 attention heads, embedding size of 512, hidden size of 2048, dropout rate of 0.1, and batch size of 64, for 100 epochs. Training took around 7 hours. All experiments in this paper were conducted on an Nvidia Tesla V100 GPU. For decoding, we used nucleus sampling (Holtzman et al., 2020) with a p value of 0.9.
We report keyphrase generation results in terms of BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) against the keyphrases extracted from the gold test CNs. We obtained a score of 0.162 for BLEU-2 and 0.353 for ROUGE-L. Although both scores may seem low, this is due to the open-ended nature of the set of possible CN keyphrases for a given HS. Example queries for a single pair, extracted from its HS, extracted from its CN, and generated with the keyphrase generation model, are shown in Table 3.
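For reference, a simplified single-reference sentence-level BLEU-2 (geometric mean of clipped unigram and bigram precisions with a brevity penalty) can be computed as below; the paper's exact evaluation script may differ.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu2(reference, hypothesis):
    # Simplified single-reference sentence BLEU-2: clipped n-gram precisions
    # for n = 1, 2, combined by geometric mean, times a brevity penalty.
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in (1, 2):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())
        precisions.append(overlap / max(sum(h.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.sqrt(precisions[0] * precisions[1])
```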

Knowledge Sentence Selection
We use Solr to index the articles and retrieve those relevant to a given query, based on the similarity between the articles and the query computed with BM25 (Robertson et al., 1995). Once the queries have been obtained, either through extraction or generation, they are submitted to Solr to retrieve the 25 top-ranked articles. Next, we used spaCy sentence segmentation 6 to split each article into sentences. Similar to Zhang et al. (2020), given a query Q we score each sentence x_i in the set of retrieved articles D independently, using ROUGE-L F1 (Lin, 2004) as in Equation 1:

score(x_i) = ROUGE-L_F1(Q, x_i)    (1)

Table 3: Examples of KN retrieved using queries extracted from the HS (Q_hs), generated (Q_gen), and created from both HS and CN keyphrases (Q_hs∪cn).
In the final step, we distilled the knowledge by keeping the 5 knowledge sentences with the highest scores across the 25 top-ranked articles. Rather than a more stringent filtering, this setting was chosen to ensure a better variety of source articles and corresponding distilled sentences. We refer to such automatically associated sentences as "silver knowledge".
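The selection step can be sketched directly: score each candidate sentence by ROUGE-L F1 against the query (as in Equation 1) and keep the top k (here k = 5). The ROUGE-L implementation below is a standard LCS-based one written out for clarity.

```python
def lcs_len(a, b):
    # Length of the longest common subsequence between token lists a and b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(query_tokens, sent_tokens):
    # ROUGE-L F1 between a query and a candidate sentence (Equation 1).
    lcs = lcs_len(query_tokens, sent_tokens)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(sent_tokens), lcs / len(query_tokens)
    return 2 * p * r / (p + r)

def select_top_sentences(query, sentences, k=5):
    # Keep the k best-scoring sentences as "silver knowledge".
    q = query.lower().split()
    return sorted(sentences, key=lambda s: -rouge_l_f1(q, s.lower().split()))[:k]
```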

Counter Narrative Generation Module
Large pre-trained LMs require a smaller amount of high-quality data to be fine-tuned on downstream tasks while providing strong performance, and they already store a large amount of factual and commonsense knowledge from their training data (Petroni et al., 2019). In this respect, we built the following models: (1) GPT-2_KN, obtained by fine-tuning GPT-2 on CONAN data paired with KN; (2) GPT-2_KN,MT, obtained by fine-tuning GPT-2_KN in a multi-task learning fashion to learn to distinguish CNs from HSs as next utterances; (3) XNLG (Chi et al., 2020), chosen for its ability to copy information to the output (in our case, the retrieved KN to be copied into the CN). We expect all three models to attend over the HS and the retrieved KN and to look for the relevant snippets to be recovered while generating a CN.

Models
The training HS-CN pairs are represented as HS [HS end token] KN [KN end token] CN [CN end token]. Each model is trained with Q_hs∪cn and then tested on Q_hs, Q_gen, and Q_hs∪gen. We also tested the models with Q_hs∪cn to define an oracle scenario with an upper-bound performance when the data can only be paired with silver knowledge.

6 https://spacy.io/universe/project/spacy-sentence-segmenter
GPT-2_KN. We fine-tuned the GPT-2 medium model for 3 epochs with a batch size of 2048 tokens. We used the Adam optimizer with a learning rate of 5e-5. At inference time, responses were generated with nucleus sampling (p = 0.9), conditioned on the HS and the corresponding KN.
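Nucleus sampling keeps the smallest set of most probable tokens whose cumulative mass reaches p, renormalizes, and samples from it. A minimal stand-alone sketch is below; real decoding would apply this at every generation step over the LM's vocabulary distribution.

```python
import random

def nucleus_sample(probs, p=0.9, rng=None):
    # Nucleus (top-p) sampling (Holtzman et al., 2020): truncate the
    # distribution to the smallest high-probability prefix with mass >= p.
    rng = rng or random.Random(0)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # Sample from the kept nucleus, proportionally to the original probs.
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```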
GPT-2_KN,MT. Since we noticed that GPT-2 occasionally produces responses containing fragments of abusive language, we combined the language modeling objective with a next-sentence prediction objective, fine-tuning GPT-2 in a multi-task setting inspired by Wolf et al. (2018). Next-sentence prediction adds a linear classification layer on top of the last layer of the transformer language model and applies a cross-entropy loss to distinguish a proper next response to the input HS from 2 distractors randomly selected from the HSs. We used the Adam optimizer with a learning rate of 5e-5, empirically fine-tuned for 1 epoch, and applied the same sampling strategy as for GPT-2_KN.
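The combined objective can be sketched as the sum of the LM loss and a cross-entropy next-utterance classification loss over one true CN and two distractors; the weighting coefficient is illustrative, as the paper does not report its exact value.

```python
import math

def cross_entropy(logits, gold):
    # Numerically stable cross-entropy over candidate scores
    # (1 true CN + 2 distractors).
    m = max(logits)
    logz = m + math.log(sum(math.exp(l - m) for l in logits))
    return logz - logits[gold]

def multitask_loss(lm_loss, next_utt_logits, gold_index, cls_weight=1.0):
    # Language-modeling loss plus the next-utterance classification loss;
    # cls_weight is an assumed hyper-parameter.
    return lm_loss + cls_weight * cross_entropy(next_utt_logits, gold_index)
```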
XNLG is a pre-trained Transformer-based language model trained on Wikipedia dumps with two objectives relevant to our task: obtaining contextual representations and recovering a given input. We fine-tuned XNLG for counter narrative generation on all layers with a batch size of 10 for 100 epochs. We used the Adam optimizer with a learning rate of 1e-4. We tokenized the entire dataset, removed accents, and applied the same BPE codes used by Chi et al. (2020). For KN and CN we kept the first 256 tokens, while limiting the HS to 70 tokens, the maximum length of a hate speech in the dataset. We experimented with various decoding methods and adopted beam search with a beam width of 3 as the best-performing setting (details in Appendix A.2).
Baselines used for comparison are: (1) a non-pretrained Transformer without knowledge, using the same hyper-parameters as the keyphrase generation model; (2) GPT-2 without knowledge, following the same configuration as GPT-2_KN; (3) Candela (Hua et al., 2019), an LSTM-based state-of-the-art knowledge-driven architecture for argument generation. Since CONAN is relatively small, we hypothesize that pre-training on data from a similar task (argument generation) can be beneficial for generalization and for porting knowledge. Thus, we first pre-trained the Candela architecture on the argument generation dataset of Hua et al. (2019), following the configuration described in the paper. We then fine-tuned the model for 20 epochs on CONAN with KN retrieved using Q_hs, as in the original Candela setting.

Results for the Silver Knowledge Test Set
We report BLEU-2 (B-2) and ROUGE-L (R-L) scores for all proposed models and baselines in Table 4, on the test split of CONAN that we automatically paired with silver knowledge using the various queries. We also measure each model's capability to produce responses that are novel with respect to the training data, using Jaccard similarity (Wang, 2018), and diverse for a given input, using repetition rate (RR) (Cettolo et al., 2014).
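One plausible reading of the novelty metric is 1 minus the maximum Jaccard word-overlap between a generated CN and any training CN; the sketch below follows that reading (RR follows Cettolo et al. (2014) and is not reproduced here).

```python
# Illustrative novelty computation; the exact formulation in Wang (2018)
# may aggregate differently.

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def novelty(generated_cn, training_cns):
    # 1 minus the highest word-overlap with any training CN:
    # 0.0 = memorized, 1.0 = entirely novel wording.
    return 1.0 - max(jaccard(generated_cn, t) for t in training_cns)
```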
Among our models, GPT-2_KN yields the highest B-2, and XNLG the highest novelty, diversity, and R-L. The notably improved novelty achieved by the knowledge-grounded models indicates the benefit of adding knowledge when producing CNs, in comparison to the baselines, particularly the Transformer baseline (TRF). On the other hand, the quantitative performance of XNLG does not reflect its true performance in terms of quality. A quick glance at the output CNs showed that XNLG copies almost everything from the KN to the output instead of generating a proper CN, which inflates the novelty and diversity scores. The issue can easily be observed from the average number of words and sentences in the XNLG output in comparison to the outputs of the other models in Table 4. GPT-2_KN,MT falls behind among our models in terms of RR, B-2, and R-L, while still providing competitive novelty. As for Candela, while it obtains performances similar to our models in terms of R-L and B-2, its generation is repetitive and less novel.
As for testing with different query types, Q_hs∪gen induces more novel responses than Q_hs∪cn and Q_hs. While XNLG yields the highest novelty with Q_hs (0.824), this can again be explained by the problem of copying the whole KN, which is more varied due to the less restrictive search using only the HS.
The oracle query Q_hs∪cn, in which we deliberately provide the best possible knowledge through keyphrases drawn also from the gold CN, yields the best R-L scores among the query variations of the knowledge-grounded models. Among all models, Q_hs∪cn also leads to the best B-2 through GPT-2_KN and the best R-L through XNLG, as anticipated. Finally, Q_hs∪gen outperforms Q_hs and Q_gen on most metrics, hinting at the advantage of using generated queries together with the hate context for silver-knowledge retrieval.
We have also conducted complementary experiments taking into consideration the design choices and the various phenomena in our study. Since in our test set, in line with CONAN, an HS can be paired with more than one CN, Q_hs would retrieve the same KN for all target CNs of the same input HS. In contrast, we obtain a different set of KN with the queries Q_gen, Q_hs∪gen, and Q_hs∪cn for each target CN. Therefore, in Appendix A.4 we also report an evaluation on unique HS-CN pairs, where a single target CN has been randomly chosen for each HS, for all query types. Finally, to simulate the Candela configuration (which uses only Q_hs) with the other models as well, we ran an additional set of experiments using Q_hs to retrieve the knowledge for the training samples. The results are reported in Appendix A.3.

Results for the Gold Knowledge Test Set
To isolate the effect of the knowledge retrieval strategies from the knowledge-grounded generation performance, we conducted a second evaluation on a newly crafted test set paired with gold-standard knowledge. In this evaluation, in addition to the stereotypical islamophobic in-domain (i.e., in-target) scenario, we also explore the effect of knowledge infusion on cross-domain (i.e., cross-target) CN generation in a zero-shot setting. We hypothesize that a system trained to make use of substandard silver knowledge to generate proper CNs for a given context could be robust to cross-domain zero-shot conditions. Therefore, we organized a data collection session with an operator expert in writing CNs. In this session, 50 islamophobic HSs randomly sampled from CONAN and 144 new cross-target HSs (covering misogyny, antisemitism, racism, and homophobia) were provided along with the knowledge retrieved by Q_hs∪cn queries. The expert was tasked with composing a suitable CN using the corresponding knowledge as much as possible. Thus, we could obtain a gold test set 10 in which the input knowledge can certainly be found in the CNs.
We tested all models with gold knowledge on in-domain and cross-domain test cases. Results are given in Table 5. For the in-domain scenario, as anticipated, the knowledge-grounded models yield better B-2 and R-L performances in comparison to the silver knowledge test setting. Especially given the striking jump in the performance of GPT-2_KN, we can confirm the proper infusion of the given knowledge into the generated CNs. As for the cross-domain tests, GPT-2_KN still outperforms the baselines, while the performance of all models (except XNLG) drops due to events unseen during training. All GPT-2 variations present better diversity performances in the cross-domain setting as compared to both the in-domain and silver-knowledge settings. Regardless of domain, XNLG yields fallaciously high scores due to its extensive copying. A cross-target generation from the models can be seen in Table 7. More examples of in-/cross-domain generations from all the models are provided in Appendix A.6.

10 We release the gold test set at https://github.com/marcoguerini/CONAN.
Human evaluation. We further resort to human evaluation to assess the final generation quality of each model. For this reason we perform human evaluation of generation using gold knowledge, to rule out the effect of possible noise in the knowledge that may result from the retrieval process.
Our models were evaluated by 3 expert operators from the NGO Stop Hate UK. The annotators are experienced, and specifically trained, in reading hateful content and writing CNs for online hate countering 11 . They were instructed to assess all generated pairs in the gold knowledge test sets in terms of suitableness to the HS, informativeness, and intra-coherence of the CN regardless of the HS. Each score is on a scale from 1 (the least) to 5 (the most). To avoid possible bias and hints towards specific models, we normalized the pairs (e.g., lowercasing and spacing between words and punctuation) and divided them into 3 partitions of randomized files for the experts (see Appendix A.5 for the annotation instructions). Each expert was given 388 pairs, resulting in a total of 1164 pairs for evaluation. To avoid excessive workload, annotators were allowed to complete the task over multiple sessions at their convenience.
Results are reported in Table 6. We also computed Kendall's Tau-b (Kendall, 1938) to measure the annotators' agreement on the model ranking for each aspect. The high correlations indicate strong concordance among the annotators (threshold Tau-b > 0.35). Regardless of domain, annotators consider XNLG generations the most informative and GPT-2_KN generations the most suitable. TRF yields reasonable suitableness and coherence since it tends to memorize the training CNs, almost behaving like a retrieval system over human responses. However, such behavior can be fatal in cross-domain settings. Candela fails to generate suitable cross-domain CNs despite preserving intra-CN coherence. While GPT-2 and GPT-2_KN generations are found almost equally coherent, the lower suitableness and informativeness of the GPT-2 output (2.26 and 1.92) in the cross-domain setting, as compared to GPT-2_KN (2.51 and 2.29), supports grounding CNs in knowledge.

Discussion
Our findings suggest that a large pre-trained LM with knowledge injection is preferable, as it alleviates the demand for gold data and improves in-/cross-domain generation. GPT-2_KN outperforming GPT-2, which becomes clearer with every increase in the quality of the provided KN (i.e., from silver Q_gen to silver Q_hs∪gen to gold in-domain), indicates that our model trained on silver knowledge learns to grasp the KN relevant to the given HS instead of copying at random. As for GPT-2_KN,MT, since the model is trained to distinguish CN from HS, certain typical counter responses are prone to appear. Through a manual analysis, we confirmed that its responses tend to be simpler (e.g., negation: "terrorism is not a religion.", "It's not true...") than those of the other GPT-2 variants. Future work can explore alternative multi-task learning frameworks to improve the generation. From a practical perspective, we can conclude that fine-tuning a pre-trained language model for conditional generation, utilizing silver-knowledge sentences retrieved through a query generation model, is the most promising approach for knowledge-bound CN generation. In fact, (i) among the various models we tried, GPT-2 with knowledge infusion produces the most informative and suitable output; (ii) at deployment time, using the input context together with generated keyphrases stands out, with more novel and diverse CNs; and (iii) for new hate targets in a zero-shot scenario, the generation model can produce coherent and suitable CNs without retraining, provided a reasonable set of input knowledge is given. Moreover, this approach can be applied to other scenarios, such as dialogue response generation, since it does not require any extra data collection or annotation.

11 The compensation for the annotation work met EU regulations.

Conclusion
Online hate speech intervention is a challenging problem, and research on counter narrative generation is still in its infancy. In this work, we have proposed methods for improving counter narrative generation to fight hatred online, incorporating external knowledge retrieved through extracted and generated keyphrases. To this end, a dataset of hate-speech/counter-narrative pairs was augmented with relevant knowledge to train systems that are able to produce suitable and informative arguments. Our experiments on in-/cross-domain generation indicate that the generated responses can meet these desiderata.

HS: Jews are different from non-Jews not because of their beliefs, but because of their physical nature. "Jewishness" and malevolence is inherited through blood.
Human: Today one of the biggest concerns is the reinvention of clinical concepts of race, based on inaccurate generalizations. The notion that the superiority of the white race was part of the natural order was deviously reinforced by the rise of modern "scientific" racism in the late 19th century. Despite the strong objections of many of our contemporaries, this idea of a "jewish race" is still going around.
GPT-2: Muslims are not different from non-Muslims because of their faith.
GPT-2_KN: I think that the notion of race is a generalization based on ignorance.
GPT-2_KN,MT: All that is wrong is that we cannot accept that violence is necessary.
XNLG: today , one of the biggest concerns is the reinvention of clinical concepts of race , based on inaccurate generalizations about the prejudice to certain physical characteristics and civilizations . despite the strong objections of many contemporary contemporaries , the notion of fixed " race " -packages of physical and behavioral characteristics

Table 7: Cross-target generation examples from the models for an antisemitic HS.
As future work, we plan to test other query generation approaches (e.g., exploiting human-crafted queries in an interactive setup or using pre-trained LMs) to improve knowledge selection, and to test other architectures for the final generation step.

Acknowledgments
This work was partly supported by the HATEMETER project within the EU Rights, Equality and Citizenship Programme 2014-2020. We are deeply grateful to Stop Hate UK and its volunteers for their help and effort in evaluating the output of our systems.

Ethical Considerations
While we believe that counter narratives are a better tool than content moderation for fighting hate speech (e.g., they do not hinder freedom of speech), the automatic generation of CNs should still be approached with care. Since this work presents a methodology for knowledge-bound counter narrative production through a neural approach, several ethical consequences should be considered.
First, neural models may still produce substandard counter narratives containing abusive language or negative content. To mitigate this issue, possible solutions include (1) integrating into the pipeline a classifier or a human reviewer for validation and possible post-editing (Tekiroglu et al., 2020), (2) detoxification techniques for controllable generation methods (Gehman et al., 2020), and (3) discarding undesirable content from the corpora used for training (Raffel et al., 2020), even if the appropriate criteria for this purpose are still under investigation.
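As an illustration of mitigation (1), the following minimal sketch shows how a validation step could route generated CNs to a human reviewer. The keyword blocklist and helper names are purely hypothetical placeholders; a real deployment would substitute a trained abusive-language classifier for the naive string matching used here.

```python
# Hypothetical post-generation safety gate: candidate CNs that trigger the
# check are flagged for human review instead of being released directly.
BLOCKLIST = {"idiot", "stupid", "worthless"}  # illustrative terms only

def needs_human_review(candidate_cn: str, blocklist=BLOCKLIST) -> bool:
    """Return True if the generated CN should be routed to a reviewer."""
    text = candidate_cn.lower()
    return any(term in text for term in blocklist)

def split_candidates(candidates):
    """Partition generated CNs into auto-approved and flagged lists."""
    approved, flagged = [], []
    for cn in candidates:
        (flagged if needs_human_review(cn) else approved).append(cn)
    return approved, flagged
```

In practice the boolean check would be replaced by a classifier score compared against a tuned threshold, but the pipeline shape stays the same.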
Second, while our approach reduces the risks of content hallucination, an additional step, where the accuracy of the generated text is checked against the provided knowledge (Nie et al., 2019;Dušek and Kasner, 2020), would provide further robustness to the system. Third, natural language generation models may still induce unintended social biases. This issue can be moderated by measuring/promoting fairness in models and data employed (Blodgett et al., 2020), and designing bias triggers (Sheng et al., 2020) or regularization methods (Bordia and Bowman, 2019;Corbett-Davies et al., 2017) for controllable bias.
To sum up, while some additional automated techniques may help in maintaining generation quality, human evaluation should always be considered as the foremost solution, at least for delicate tasks such as 'real' hate countering on social media platforms. For this reason we advocate that generation systems should be used as a suggestion tool for NGO operators, to make their countering work more effective. In this way there is always a "human moderator" taking the final decision (Chung et al., 2019).

A.1 Analysis on Keyphrase Extraction Configurations
We conducted a preliminary manual analysis to investigate the effects of various keyphrase extraction configurations. We randomly sampled 48 hate speech and counter narrative pairs from the CONAN dataset and extracted the keyphrases. Then, we retrieved the KN (see Section 5.3) with the queries Q hs and Q hs∪cn . We also inspected the condition with keyphrases only from the CN, i.e., Q cn . For each sample and condition, annotators assigned a score for relevance to the hate speech on a scale of 1 to 5, where 1 means no relevance and 5 perfect relevance. As a result, we noticed that Q cn is the worst condition, i.e., non-optimal, with an average score of 2.30. The analysis shows that it causes the loss of context related to the HS, bringing in information mostly from an entirely different topic. For instance, especially when CNs are rather generic, often no lexical hint related to the topic of Islamophobia can be found (e.g., "Do you have any proof?"). Indeed, Q hs yields a better average score (3.46), since it provides more context to search with. However, as expected, the best score (3.77) was obtained with Q hs∪cn , i.e., optimal, verifying our hypothesis of utilizing both HS and CN keyphrases for training.
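The three query configurations compared above can be sketched as follows. The keyphrase extractor here is a trivial, order-preserving content-word filter used purely for illustration; the actual extractor and the stopword list are assumptions, not the paper's implementation.

```python
import re

# Placeholder stopword list for the toy extractor below.
STOPWORDS = {"the", "a", "an", "is", "are", "do", "you", "have", "any", "of"}

def extract_keyphrases(text: str) -> list:
    """Toy keyphrase extractor: lowercase content words, first-seen order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    seen, phrases = set(), []
    for tok in tokens:
        if tok not in STOPWORDS and tok not in seen:
            seen.add(tok)
            phrases.append(tok)
    return phrases

def build_queries(hs: str, cn: str) -> dict:
    """Return the three query conditions: Q_hs, Q_cn, and their union."""
    q_hs = extract_keyphrases(hs)
    q_cn = extract_keyphrases(cn)
    # Q_hs∪cn: HS keyphrases first, then CN keyphrases not already present.
    q_union = q_hs + [k for k in q_cn if k not in q_hs]
    return {"Q_hs": q_hs, "Q_cn": q_cn, "Q_hs_cn": q_union}
```

The sketch makes the failure mode of Q_cn visible: for a generic CN such as "Do you have any proof?", almost no content words survive, so the retrieval query carries no topical signal.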

A.2 Preliminary Analysis on Decoding Methods for XNLG
To find a suitable decoding method for our task, we generated CNs with 3 candidate settings: beam search with a beam width of 3, and top-k sampling with k = 8 and k = 10. For each setting we utilized KN retrieved with both non-optimal (Q cn ) and optimal (Q hs∪cn ) queries. We then sampled 120 HS-CN pairs and presented them to three experts in CN writing, who evaluated the generations on a scale from 1 (worst) to 5 (best) in terms of suitableness and informativeness. Suitableness measures whether the generated CN is relevant to the HS, and informativeness evaluates the amount of information (e.g., statistics and facts) enclosed in the CN. The results reported in Table 8 reveal a clear difference between beam search and top-k sampling regardless of the KN being optimal or non-optimal. In a manual investigation, we observed that both beam search and top-k sampling can generally copy some pieces of information from the given KN, while top-k tends to replace parts of the text with only slightly relevant and uncommon words. Hence, copying the right knowledge pieces through the decoding strategy is a key factor, rather than diverging from the knowledge solely for the sake of lexical diversity. Based on these results, we adopt beam search with a beam width of 3, which proved the most suitable and informative, as the decoding method in our experiments.
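For reference, the top-k decoding step compared above can be illustrated with a minimal, self-contained sketch over a single logits vector. This is an assumption-level toy (real decoders operate over sequences, and beam search with width 3 tracks the 3 best partial hypotheses); it only shows the filter-then-sample mechanism that introduces the lexical divergence observed in the manual investigation.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample one token id from the k highest-scoring entries of `logits`.

    Keeps the k largest logits, renormalises them with a softmax, and
    samples from the resulting distribution; all other tokens get zero
    probability.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for idx, e in zip(top, exps):
        acc += e / total
        if r <= acc:
            return idx
    return top[-1]  # guard against floating-point underflow
```

With k = 1 this reduces to greedy decoding; larger k (8 or 10 in our comparison) admits lower-probability, sometimes only loosely relevant, words into the output.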

A.3 CN Generation with Q hs
In this section we report the CN generation results of our knowledge-bound models trained and tested with Q hs . We applied the same hyperparameter configurations as for the models trained with Q hs∪cn described in Section 6.1. The results are given in Table 9. In contrast to the baselines (i.e., models without knowledge and Candela), all models obtained higher novelty with Q hs . The repetition rate, on the other hand, does not improve, since the models exploit the same knowledge for multiple test samples due to repeated HSs paired with different CNs in the test set.
We also observed that for GPT-2 KN and GPT-2 KN,M T the generation with Q hs is more repetitive and less novel compared to the generation applying queries Q hs∪gen (as shown in Table 4). This result demonstrates the viability and necessity of using generated queries, as potential CN prompts, along with HS context.
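The novelty scores discussed in this section can be sketched with a simple n-gram definition: the fraction of a generated CN's n-grams that never occur in the training data. This is an illustrative formulation under our own assumptions (n = 4, whitespace tokenization); the metric actually reported in the tables may be computed differently.

```python
def ngrams(tokens, n):
    """Set of contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated: str, training_texts, n: int = 4) -> float:
    """Fraction of the generated text's n-grams unseen in the training corpus.

    Returns 0.0 for texts too short to contain any n-gram.
    """
    gen = ngrams(generated.split(), n)
    if not gen:
        return 0.0
    train = set()
    for text in training_texts:
        train |= ngrams(text.split(), n)
    return len(gen - train) / len(gen)
```

Under this definition, a CN copied verbatim from training data scores 0.0 and a fully unseen CN scores 1.0, which matches the intuition behind the comparisons in Tables 4 and 9.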

A.4 Unique HS Test Set Analysis
Since one HS can be paired with different CNs in the test set, we further conducted an evaluation on a unique set, keeping each unique HS and one randomly selected CN among its CNs. The unique HS set allows a fairer comparison among query configurations, especially for Q hs with models employing beam search (i.e., XNLG and Candela). The results are given in Table 10. For XNLG and GPT-2 KN,M T , we observed an increase in novelty and diversity with Q hs and Q gen on the unique HS set over the whole test set. As for GPT-2 KN , while diversity improves for all query configurations, novelty does not increase with Q hs . For Candela, novelty increases but diversity does not improve.
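The construction of the unique HS set described above can be sketched as follows; the function name and fixed seed are illustrative choices, not the paper's code.

```python
import random

def unique_hs_set(pairs, seed=0):
    """Keep each unique HS once, paired with one randomly selected CN.

    `pairs` is a list of (hs, cn) tuples; returns one (hs, cn) per HS.
    A fixed seed keeps the selection reproducible across runs.
    """
    rng = random.Random(seed)
    cns_by_hs = {}
    for hs, cn in pairs:
        cns_by_hs.setdefault(hs, []).append(cn)
    return [(hs, rng.choice(cns)) for hs, cns in cns_by_hs.items()]
```

Deduplicating the HSs removes the advantage that repeated inputs give to deterministic decoders, since beam search produces the same CN for every copy of an HS.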

A.5 Human Evaluation Instruction
Three experts were tasked with evaluating the CN generations. We briefed the annotators on the aim of the evaluation, and then explained what common and appropriate CNs look like, along with 10 examples of HS-CN pairs. The annotators were instructed to rely on their intuition when evaluating the generations. Each generation is evaluated along 3 aspects: (i) suitableness: how suitable the counter narrative under evaluation is for its hate speech, i.e., the coherence and relevance of a counter narrative with regard to the hate speech; (ii) informativeness: how informative/specific the counter narrative is as a response to its hate speech; (iii) intra-coherence: how coherent the counter narrative under evaluation is as a response, i.e., the internal coherence of a counter narrative regardless of its hate speech. In Table 11 we give the annotation examples with the score explanations provided to the annotators.
HS: Muslims do not want to assimilate. They demand that everyone shows respect for Islam, but they don't respect our laws. One day they will oppress us with the Sharia and polygamy.

CN: Do you really believe that they are a problem? I have a job.
Evaluation: suit.: 3, info.: 1, cohe.: 1
Note: The first sentence is relevant but not the second one. There is no coherence between the 2 sentences.

CN: As far as they do not force us to follow sharia laws, we should not be concerned with them following their laws.
Evaluation: suit.: 5, info.: 3, cohe.: 5
Note: The CN is an appropriate and coherent response, but not providing a substantial amount of information or facts.

CN: Muslims should not be forced to assimilate, since it is not right and no one wants that. And polygamy is illegal and forbidden in UK and Muslims actually respect this ban.
Evaluation: suit.: 5, info.: 5, cohe.: 5
Note: The whole CN is perfectly suitable for the given HS. It is highly informative with the facts and internally coherent.