Human-in-the-Loop for Data Collection: a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech

Undermining the impact of hateful content with informed and non-aggressive responses, called counter narratives, has emerged as a possible solution for having healthier online communities. Thus, some NLP studies have started addressing the task of counter narrative generation. Although such studies have made an effort to build hate speech / counter narrative (HS/CN) datasets for neural generation, they fall short in quality, in quantity, or in both. In this paper, we propose a novel human-in-the-loop data collection methodology in which a generative language model is refined iteratively by using its own data from the previous loops to generate new training samples that experts review and/or post-edit. Our experiments comprised several loops including diverse dynamic variations. Results show that the methodology is scalable and facilitates diverse, novel, and cost-effective data collection. To our knowledge, the resulting dataset is the only expert-based multi-target HS/CN dataset available to the community.


Introduction
The proliferation of online hatred has become an alarming issue (Williams, 2019), threatening not only the well-being of target individuals and groups, but also of society as a whole. While authorities establish regulations and policies, social media platforms take actions against hate speech mostly through moderation activities, such as content removal, account suspension, or shadowbanning, at the risk of hindering freedom of expression. Meanwhile, Non-Governmental Organizations are qualifying volunteers for responding to online hate to promote human dignity and understanding in society. Such responses, i.e., Counter-Narratives (CN), are non-aggressive textual feedback using credible evidence, factual arguments, and alternative viewpoints, and are considered an effective strategy (Benesch, 2014; Schieb and Preuss, 2016) to confront hate speech while respecting human rights (Kiritchenko et al., 2020).
However, the vast amount of online hate speech makes an effective manual intervention impossible, which motivates a line of NLP research focusing on semi- or fully automated CN generation solutions¹. In recent years, several CN collection strategies and datasets have been proposed, addressing the data-hungry nature of current state-of-the-art generation technologies (Mathew et al., 2018; Qian et al., 2019; Chung et al., 2019).
Considering the shortcomings of the existing collection strategies (which grant either quality or quantity, but not both), we present an approach to produce high-quality CNs for multiple hate targets while reducing the need for expert intervention. To this end, we build on previous hybrid data collection strategies, aiming to increase efficiency while maintaining the requirements of data quality, novelty and diversity. In particular, we start from the work by Tekiroglu et al. (2020), which uses an author-reviewer framework in which the author, a generative language model, is tasked with generating HS/CN pairs while a pool of human reviewers filter and possibly post-edit the produced output. In the present work we propose to further reduce the data collection effort by closing the pipeline and feeding the post-edited output back to the language model in order to regularly update it and improve the quality of the generated pairs. Our experiments comprised two sessions, spanning a period of 6 months. In the first session we set up a 'simple' human-in-the-loop (HITL henceforth) procedure and iterated it several times, measuring at each loop the performance of the whole framework according to relevant metrics. In the second session we ran several additional loops in which we tested different strategies (i.e. author configurations) to improve the data collection according to the given metrics. Findings show that the HITL framework is scalable, making it possible to obtain datasets that are adequate in terms of diversity, novelty, and quantity. Moreover, this framework improves on previous hybrid data collection strategies, reducing at each loop the post-editing effort of the human reviewers or the number of discarded examples (session one). On the other hand, with dynamic adaptation, possible unwanted behaviors or flaws of the data collection can be handled at each loop by simply varying the author configuration (session two). The final dataset contains 5000 HS/CN pairs in English, covering multiple hate targets in terms of race, religion, country of origin, sexual orientation, disability, or gender. To the best of our knowledge, this is the first multi-target expert-based HS/CN dataset constructed through a semi-automatic mechanism; it can be downloaded at the following link: https://github.com/marcoguerini/CONAN.
¹ In our view the generation process can be fully automatic, but generation systems need human supervision and should not be fully autonomous, at least for delicate tasks such as hate countering on social media platforms. For this reason we advocate that generation systems should be used as a suggesting tool for NGO operators, to make their countering work more effective. In this way there is always a "human moderator" taking the final decision (Chung et al., 2019). Furthermore, this approach is also in line with de Lima Salge and Berente (2017)'s ethical framework, since this "suggesting tool" configuration grants compliance with their rules.

Related Work
With regard to hatred countering, we will focus on three research aspects relevant for the present work, i.e. (i) publicly available datasets for detection, (ii) publicly available datasets for countering, (iii) approaches for hybrid data collection.
Hate detection datasets. Several datasets for hate detection have been presented, most of which rely on material collected from social media platforms (SMPs), such as Twitter (Waseem and Hovy, 2016; Waseem, 2016; Ross et al., 2017), Facebook (Kumar et al., 2018), WhatsApp (Sprugnoli et al., 2018), and forums (de Gibert et al., 2018). While the above datasets focus on a classification task, Mathew et al. (2020) released a dataset annotated with rationales to improve hate speech interpretability and Sap et al. (2020) proposed the Social Bias Inference Corpus (SBIC) annotated with the description of the biases implicitly present in the language. For a more extensive review, we refer the reader to Poletto et al. (2020) and Vidgen and Derczynski (2020).
Hate countering datasets. While several social studies have shown that counter-narratives are effective in hate countering (Benesch, 2014; Silverman et al., 2016; Schieb and Preuss, 2016; Stroud and Cox, 2018; Mathew et al., 2019), only a few works have focused on data collection for CN generation. Mathew et al. (2018) focus on crawling, following the intuition that CNs can be found on SMPs as responses to hateful expressions. Qian et al. (2019) propose a crowdsourcing methodology where (non-expert) crowd-workers are instructed to write responses to hate content collected from SMPs. The study by Chung et al. (2019) also relies on outsourcing CN writing, but via nichesourcing, using NGO operators who are experts in CN production.
Hybrid models for data collection. Given the data-hungry nature of current NLP technologies, one line of research has recently focused on advanced hybrid models for data collection. Wallace et al. (2019) proposed using model interpretation to guide humans in the creation of adversarial examples for factoid question-answering systems. Dinan et al. (2019) and  perform a data collection with HITL for detecting offensive language. In both studies, the dynamic procedure is shown to be successful in reducing model error rate across rounds.  point out that the HITL approach has multiple advantages over static data collection: design flaws can be addressed during the construction of the dataset and annotators' work is optimized, since it is guided by the feedback from the model. Finally, Tekiroglu et al. (2020) propose a hybrid approach where an LM is trained on a seed dataset of HS/CN pairs to generate new pairs that are then validated and post-edited by annotators.

Methodology
In Figure 1 we present the pipeline of our methodology. Following the idea presented by Tekiroglu et al. (2020), we have an author module built using the GPT-2 language model (Radford et al., 2019) and fine-tuned on a seed dataset of HS/CN pairs. The author produces novel HS/CN candidates while the reviewer(s) filter and possibly post-edit them. We iterate this data collection several times: at each loop, the reviewed examples are added to the training data and the author is fine-tuned from scratch again on all available data. In the following sections we describe the main elements used in our procedures.
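The loop described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: `fine_tune`, `generate`, and `review` are hypothetical callables standing in for GPT-2 fine-tuning, sampling from the author, and the human review step (which returns the accepted, possibly post-edited pair, or `None` when the pair is discarded).

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (hate speech, counter narrative)


def run_hitl(
    seed: List[Pair],
    n_loops: int,
    target_per_loop: int,
    fine_tune: Callable[[List[Pair]], object],      # hypothetical trainer
    generate: Callable[[object], List[Pair]],       # hypothetical sampler
    review: Callable[[Pair], Optional[Pair]],       # hypothetical human step
) -> List[List[Pair]]:
    """Return the list of dataset versions, starting with the seed."""
    versions = [list(seed)]
    for _ in range(n_loops):
        # The author is re-trained from scratch on all data collected so far.
        training_data = [pair for version in versions for pair in version]
        author = fine_tune(training_data)
        accepted: List[Pair] = []
        # Keep administering generated pairs until the target is reached.
        while len(accepted) < target_per_loop:
            for candidate in generate(author):
                result = review(candidate)
                if result is not None:
                    accepted.append(result)
                if len(accepted) == target_per_loop:
                    break
        versions.append(accepted)
    return versions
```

Note that this sketch would not terminate if the reviewers rejected every candidate; a real setup would presumably cap the number of generation rounds.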

Seed dataset
To start the process, we built a seed dataset of 880 HS/CN pairs by nichesourcing its collection to 20 experts from two different NGOs. We named this dataset V1. The methodology for collecting V1 closely replicates the one presented by Chung et al. (2019). In particular, we first created a list of prototypical hate texts, with the help of an NGO expert, for the following hate targets: DISABLED, JEWS, OVERWEIGHT, LGBT+, MUSLIM, WOMEN, PEOPLE OF COLOR, ROMANI, MIGRANTS. We then prepared two online data collection forms: in the first, NGO operators were asked to respond to examples selected from the prototypical hate text list; in the second, they were asked to write their own HS/CN pairs. This data collection session lasted roughly one month.

Sessions
Our experiments were run in two separate and subsequent sessions, meant to explore different aspects of the HITL approach.
In the first session, after using V1 for the initial fine-tuning of GPT-2, we iterated the data collection 4 times, keeping the author-reviewer configuration as close as possible to the original one presented by Tekiroglu et al. (2020). Loops are numbered sequentially as V2...Vn. At each loop, we acquired 500 examples of accepted and possibly post-edited HS/CN pairs². To obtain a new set of 500 pairs (Vi) we fine-tuned GPT-2 every time from scratch using V1...Vi−1 as training data and administered the generated samples to reviewers until the target number was reached. In total we iterated the procedure 4 times, reaching V5 for a total of 3000 pairs.
² The only exception is V2, which accounts for 620 pairs so as to reach a round total of 1500 examples.
In the second session, we tested several alternative author configurations to ameliorate some unwanted behaviors/trends that emerged during the first session. We ran 4 additional data collection loops, this time in parallel (i.e. all starting from the V5 dataset) rather than sequentially. For each loop, represented as V6,{config name}, we collected 500 HS/CN pairs, reaching a total of 5000 examples.

Author Models
In our experiments all models are variants of the author (GPT-2), obtained by changing the way it is fine-tuned or conditioned. For consistency, each model is trained using the same hyperparameter configuration. In particular, we used the GPT-2 medium model, fine-tuned for 3 epochs with a batch size of 1024 tokens and a learning rate of 2e-5. For training, each pair is represented as <|startofhs|> HS <|endofhs|> <|startofcn|> CN <|endofcn|>. At generation time, Nucleus Sampling (Holtzman et al., 2019) is used with a p value of 0.9. For the standard configurations we use only <|startofhs|> for conditioning. Given an HS tag, the models produce a chunk of text, which is a list of HS/CN pairs. These pairs are then cleaned of the special tokens and administered to the reviewers for evaluation and possible post-editing.
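To make the pair representation concrete, the following sketch serializes a pair for training and recovers well-formed pairs from a generated chunk. It assumes the special tokens are spelled exactly as `<|startofhs|>`, `<|endofhs|>`, `<|startofcn|>`, and `<|endofcn|>`; the function names are illustrative and not taken from the original implementation.

```python
import re
from typing import List, Tuple

HS_START, HS_END = "<|startofhs|>", "<|endofhs|>"
CN_START, CN_END = "<|startofcn|>", "<|endofcn|>"


def serialize_pair(hs: str, cn: str) -> str:
    """Wrap an HS/CN pair in the special tokens used for fine-tuning."""
    return f"{HS_START} {hs} {HS_END} {CN_START} {cn} {CN_END}"


# A pair counts only if both parts are enclosed in the proper tags.
_PAIR_RE = re.compile(
    re.escape(HS_START) + r"(.*?)" + re.escape(HS_END) + r"\s*"
    + re.escape(CN_START) + r"(.*?)" + re.escape(CN_END),
    re.DOTALL,
)


def parse_chunk(chunk: str) -> List[Tuple[str, str]]:
    """Extract well-formed (HS, CN) pairs, stripped of special tokens."""
    return [(hs.strip(), cn.strip()) for hs, cn in _PAIR_RE.findall(chunk)]
```

Incomplete pairs at the end of a chunk (for example, a start tag with no matching end tag) are simply skipped by the pattern.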

Reviewers
We recruited 3 annotators, from a pool of internship students, as reviewers over a period of 18 weeks to filter and post-edit the generated pairs after an extensive training procedure.
Training. Annotators underwent training for 2 weeks, so that they became "experts" on HS/CN post-editing. The training included: (i) reading and discussing NGO guidelines and public documentation describing the activity of CN writing for hate countering, (ii) reading all V1 pairs to better comprehend the attributes of counter narratives, (iii) reading a sample of 100 HS/CN pairs that had been post-edited by an expert to see concrete examples of post-editing activity, (iv) performing a practice session of CN post-editing and discussing it with an expert NGO operator.
Instructions. We adapted the reviewing instructions from Tekiroglu et al. (2020). In particular, for each pair, we asked the operators: (a) to approve it without any modification if it was a valid pair, (b) to modify it if the pair was not perfect but easily amendable, (c) to discard the pair, regardless of HS quality, if the CN was completely irrelevant or did not follow the NGO's guidelines, (d) to check the veracity of any facts or statistics in the CN, to avoid possible LM hallucination effects. We further instructed the annotators to provide a hate target label for each accepted pair. The labels were useful both for analysis and for the subsequent label-based generation strategies present in V6. In Table 1 we give an example of GPT-2 output and its post-edited version.
HS: Transgenders should rape our children
CN: This is not true. Maybe they are worried because of the rise in hate crimes, incidents of which are down to 28 percent, since 2014.
HSpe: Transgenders want to rape our children.
CNpe: This is not true. Maybe you should be worried about the rise in hate crimes against queers, incidents of which are almost doubled since 2014.
TARGET: LGBT+
Table 1: An HS/CN example generated by GPT-2 and the post-edited version with hate target annotation.
Mitigation procedure. We applied an adapted version of the guidelines by Vidgen et al. (2019) to safeguard the annotators' well-being against the risk of harmful consequences of working with abusive content (present in the HSs and possibly in generated, not well-formed CNs). To this end we first made sure that annotators understood the prosocial aspects of the research and explained to them the purpose of their annotation activity in detail. Then we instructed the annotators to work no more than 2-3 hours per day and to take regular breaks, adjusting their workload as needed. Finally, we held meetings and collected feedback from the annotators on a weekly basis to let possible problems or distress emerge. This procedure was repeated throughout the whole data collection campaign.

Metrics
To understand the 'diachronic' behavior of our HITL methodology across iterations, the following metrics have been computed at the end of each loop over the newly obtained pairs.
Imbalance Degree measures the difference between a perfectly balanced distribution of the hate target categories and the actual unbalanced datasets; we use Imbalance Degree (ID) since it is specifically devoted to the multi-class scenario (Ortigosa-Hernández et al., 2017). Datasets that are balanced over multiple hate targets could allow building more representative CN generation models.
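As an illustration of how ID can be computed, the sketch below follows our reading of the definition by Ortigosa-Hernández et al. (2017): the distance between the observed class distribution and the uniform one, normalized by the distance of the most extreme distribution with the same number of minority classes, plus the number of minority classes minus one. The choice of Euclidean distance is an assumption made here for illustration; the text does not state which distance function was used.

```python
import math
from typing import Sequence


def imbalance_degree(counts: Sequence[int]) -> float:
    """Imbalance Degree of a multi-class distribution (Euclidean variant)."""
    k = len(counts)
    total = sum(counts)
    if k < 2 or total == 0:
        raise ValueError("need at least two classes and one example")
    zeta = [c / total for c in counts]   # observed class distribution
    uniform = [1.0 / k] * k              # perfectly balanced distribution
    # Minority classes are those with less than a 1/k share of the data.
    m = sum(1 for c in counts if c * k < total)
    if m == 0:
        return 0.0
    # Most imbalanced distribution with exactly m minority classes:
    # m empty classes, k-m-1 classes at 1/k, one class with the rest.
    iota = [0.0] * m + [1.0 / k] * (k - m - 1) + [1.0 - (k - m - 1) / k]
    return math.dist(zeta, uniform) / math.dist(iota, uniform) + (m - 1)
```

Under this definition a balanced dataset scores 0, and the score grows towards the number of classes minus one as more targets become under-represented.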
Acceptance Rate is the percentage of pairs accepted by the reviewers (either untouched or post-edited) over the total number they scrutinised. It represents an overall estimate of the ability of the framework to produce reasonable-quality material.
HTER is originally a measure of post-editing effort for sentence-level translations (Specia and Farzindar, 2010). We adapted it to measure the reviewers' effort in terms of the average number of edits over the accepted pairs. An upper-bound threshold value of 0.4 is used to account for easily post-editable pairs (Turchi et al., 2013).
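A minimal sketch of the computation, splitting the accepted pairs into untouched and modified ones; the function name and the counts in the usage note are illustrative only.

```python
from typing import Dict


def acceptance_rates(untouched: int, modified: int, discarded: int) -> Dict[str, float]:
    """Percentages of reviewed pairs that were accepted as-is or post-edited."""
    total = untouched + modified + discarded
    if total == 0:
        raise ValueError("no pairs were reviewed")
    return {
        "untouched": 100.0 * untouched / total,
        "modified": 100.0 * modified / total,
        "accepted": 100.0 * (untouched + modified) / total,
    }
```

For example, `acceptance_rates(10, 50, 40)` reports that 60% of the reviewed pairs were accepted, 10% of them without any edit.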
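The sketch below approximates HTER with a word-level edit distance between the generated text and its post-edited version, normalized by the length of the post-edited text. It is a simplification: the full TER computation also counts block shifts as single edits, which is omitted here, and whitespace tokenization is an assumption.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Levenshtein distance over tokens (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # delete h
                           cur[j - 1] + 1,           # insert r
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]


def approx_hter(generated: str, post_edited: str) -> float:
    """Edits needed to turn the generated text into the post-edited one,
    divided by the number of tokens in the post-edited text."""
    hyp, ref = generated.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited text is empty")
    return word_edit_distance(hyp, ref) / len(ref)
```

An untouched pair yields 0, and values below 0.4 would fall under the threshold for easily post-editable pairs mentioned above.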
Novelty measures how different two collections of texts are from each other, and it is grounded on Jaccard similarity. We used it to compute the originality present in Vi with respect to the training data collected in previous loops (Dziri et al., 2019; Wang and Wan, 2018).
Repetition Rate measures intra-corpus quality in terms of language diversity by considering the rate of non-singleton n-gram types it contains (Cettolo et al., 2014; Bertoldi et al., 2013). We use it to measure the ability of the framework to provide diverse and varied examples. Repetition Rate (RR) has the advantage of being independent of corpus size, so it can be used to directly compare different versions of our dataset.
Vocabulary Expansion is a measure we introduce to serve two main objectives: (i) quantifying the contribution of the author and the reviewers, by focusing on new tokens that appeared at each loop (e.g. the term "peace" was introduced for the first time by annotators in V2), (ii) quantifying the presence of cross-fertilization, i.e. tokens that appear for the first time in version Vn for a particular target, but were already present in a version antecedent to Vn for the other targets (e.g. the term "peace" for the target JEWS appears at V4 but it was already present for the target MUSLIM in V2). The algorithm for computing Vocabulary Expansion is described in Appendix A.1.
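One common way to ground novelty on Jaccard similarity, following the general form used by Wang and Wan (2018), is to score each new text by one minus its highest similarity to any training text and then average. The sketch below assumes this form, lowercased whitespace tokenization, and token sets rather than n-grams; the exact variant used in our experiments may differ.

```python
from typing import Sequence


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def novelty(new_texts: Sequence[str], training_texts: Sequence[str]) -> float:
    """Average of (1 - max Jaccard similarity to any training text)."""
    if not new_texts or not training_texts:
        raise ValueError("both collections must be non-empty")
    scores = [
        1.0 - max(jaccard(text, ref) for ref in training_texts)
        for text in new_texts
    ]
    return sum(scores) / len(scores)
```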
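As a rough illustration, RR can be computed as the geometric mean, over n-gram orders 1 to 4, of the fraction of n-gram types that occur more than once. The original formulation (Cettolo et al., 2014) accumulates these counts over a sliding window of fixed size to make the score independent of corpus size; the sketch below omits the sliding window and works on a single token sequence, so it is a simplification.

```python
from collections import Counter
from typing import Sequence


def repetition_rate(tokens: Sequence[str], max_n: int = 4) -> float:
    """Geometric mean of the rates of non-singleton n-gram types, n = 1..max_n."""
    product = 1.0
    for n in range(1, max_n + 1):
        counts = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        if not counts:          # text shorter than n tokens
            return 0.0
        non_singletons = sum(1 for c in counts.values() if c > 1)
        product *= non_singletons / len(counts)
    return product ** (1.0 / max_n)
```

The function returns a fraction between 0 and 1; multiplying by 100 gives a percentage. Higher values indicate more repeated n-grams and therefore lower lexical diversity.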

Session One
In session one, all the versions of the dataset V2...V5 are generated using GPT-2 Vi, where the fine-tuning is performed on all previous versions of the dataset V1...Vi−1, as explained earlier.
To produce HS/CN pairs, the author conditioning is performed using only <|startofhs|> tag and collecting all the generated material provided that each pair is encapsulated with the proper tags.
For the analysis, we computed the metrics described in Section 4 on the HS/CN pairs obtained in each loop using micro-averaging (in Appendix A.4, Table 5 we report all results in detail). To isolate the possible effect of target-class imbalance, macro averages were also calculated; similarly, to account for element-wise differences we calculated micro averages for HS and CN sets separately³.
Discussion. Considering our objective of collecting quality material in an efficient way, we first focus on the ratio of accepted pairs and the post-editing effort in each loop. As shown in Figure 2, the percentage of accepted pairs tends to increase across the loops, both for the pairs that are post-edited ("modified"), from 35.8 in V2 to 50.1 in V5, and for the ones accepted without post-editing ("untouched"), from 1.5 in V2 to 10.9 in V5. At the same time, the average post-editing effort of the reviewers tends to decrease across the versions, as depicted in Figure 3. To ensure that the decrease in HTER is not due to the increasing ratio of untouched pairs to the total number of accepted pairs, we computed the HTER for the modified pairs alone. Consistently with the overall trend, HTER for modified pairs also declines, indicating that the data collection loops succeeded not only in reducing the reviewer effort, but also in improving the quality of the generated material to be post-edited. Notably, after V3 the HTER falls below the 0.4 acceptability threshold defined by Turchi et al. (2013) for the AMT scenario (Figure 3). In view of this analysis, we can conclude that the efficiency of data collection is increased by HITL as compared to a static approach that does not retrain the author module (which can be represented by V2).
Regarding the quality metric Repetition Rate (Figure 3), it increases from V2 on, signifying a decrease in the lexical diversity of the generated data. Moreover, we observed a consistent trend for the scores of the second quality metric, i.e. Novelty (Figure 4). Similar to diversity, the novelty of the collected data also decreases across the versions, regardless of the dataset against which the novelty is computed. In particular, the change in the cumulative novelty shows how the vocabulary becomes less and less enrichable as the loop number increases, indicating a possible saturation point where novel material is highly difficult to obtain. Finally, the distribution of hate targets also worsens in terms of ID, which increases from a score of 2.2 in V1 to 4.5 in V5 (see Figure 2), with some targets becoming predominant while others slowly disappear. More details on each target distribution per loop are given in Appendix A.2, Figure 11.
As for pair length, throughout the loops we found that "untouched" pairs are usually shorter (30.7 tokens on average) than the other accepted pairs (37.3 tokens on average before post-editing). During the discussion sessions, annotators reported that the "untouched" pairs are not only shorter but also somewhat stereotypical, adding little novelty to the overall dataset (e.g. "you cannot say this about an entire religion", "It's unfair to say this about an entire religion").
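The difference between the two averaging schemes can be illustrated with a ratio metric such as the acceptance rate. In this sketch the micro average pools all pairs regardless of target, while the macro average gives each target equal weight; the data layout and the numbers in the usage note are purely illustrative.

```python
from typing import Dict, Tuple


def micro_macro(per_target: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """per_target maps a target label to (accepted, reviewed) counts."""
    total_accepted = sum(a for a, _ in per_target.values())
    total_reviewed = sum(r for _, r in per_target.values())
    micro = total_accepted / total_reviewed   # pooled over all pairs
    macro = sum(a / r for a, r in per_target.values()) / len(per_target)
    return micro, macro
```

With `{"A": (90, 100), "B": (1, 10)}` the micro average is about 0.83 while the macro average is 0.5, which shows how a dominant target can mask poor results on a rare one.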

Session Two
Given the problems that emerged during the loops of the first session (i.e. higher efficiency but lower quality at each loop), we organized an additional session to test several parallel methodologies to ameliorate them. The V6 configurations are described below.
V6,SBF: The model GPT-2 V5 is conditioned with novel offensive speeches extracted from the SBIC corpus (Sap et al., 2020). We chose this resource since: (i) it contains several thousand social media posts containing biases and stereotypes spanning the same target categories as our study, (ii) for each post it provides an 'implied statement' that closely resembles a 'prototypical hate speech' on which we trained our system. We sampled the same number of 'implied statements' for each target that maps to our labels⁴ among the ones annotated with 'the intent behind the statement was to offend' and/or 'the post could be offensive to someone'. We provide the statements as conditions by appending them to <|startofhs|>.
V6,LAB: The model is conditioned by specifying which hate target it should focus on. In this configuration, we trained a variant of GPT-2 V5 that takes into account the target label, and modified the original representation of our training data accordingly. In particular we accommodate hate target information within the starting token: <|startofhs: target label|>. In Table 4 in the Appendix we provide the mapping we used.
V6,ARG: We fine-tuned GPT-2 on a dataset of argumentative pairs collected from Kialo⁵, an online debate platform for constructive and rational discussions among peers that has been exploited recently by the NLP community (Durmus et al., 2019a,b; Scialom et al., 2020). Each discussion in Kialo is represented as a tree of arguments in which a child node is connected to its parent via a "pro" or "con" relation. Extracting all the claims connected by a "con" relation, we obtained a dataset of 128178 argument pairs covering a broader domain as compared to HS/CN pairs. We then fine-tuned GPT-2 for 1 epoch over the argumentation dataset with the standard hyperparameters. Preliminary experiments showed that the best strategy was to represent these pairs with the same format as ours to facilitate transfer of task characteristics and argumentative knowledge. Then this model was again fine-tuned using the standard V1...V5 data. At inference time, conditioning was performed using lists of unique HSs from the V1...V5 data.
V6,MIX: The last model is obtained by blending the three previous versions together, i.e. first fine-tuning on the Kialo dataset, then fine-tuning using the target label notation on V1...V5 data, and finally conditioning using SBIC offensive speeches.
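The extraction of "con" pairs from a discussion tree can be sketched as a simple recursive traversal. The nested-dictionary layout used here (a `"text"` field and a `"children"` list of `(relation, node)` tuples) is a hypothetical representation chosen for illustration; it is not the format in which Kialo discussions are actually distributed.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]  # hypothetical: {"text": str, "children": [(relation, Node), ...]}


def con_pairs(node: Node) -> List[Tuple[str, str]]:
    """Collect (parent claim, child claim) pairs linked by a "con" relation."""
    pairs: List[Tuple[str, str]] = []
    for relation, child in node.get("children", []):
        if relation == "con":
            pairs.append((node["text"], child["text"]))
        # Descend regardless of relation: deeper nodes may hold "con" links.
        pairs.extend(con_pairs(child))
    return pairs
```

Each extracted pair can then be serialized in the same tagged format as the HS/CN pairs, with the parent claim in place of the HS and the opposing claim in place of the CN.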

Bearing in mind the problems that emerged during Session One, our first goal in Session Two was to balance the dataset with respect to the hate targets (i.e. reducing the ID score). To this end the conditioning always takes into account the hate target label (with respect to 7 targets: JEWS, LGBT+, MUSLIM, WOMEN, DISABLED, PEOPLE OF COLOR, MIGRANTS) either explicitly as in V6,LAB or V6,MIX, or implicitly as in V6,SBF and V6,ARG. In addition, to better balance the number of pairs for each target, we administered only the first 5 pairs of each generated chunk to the reviewers.
Discussion. All the applied methodologies allow for a better balancing of data in terms of hate targets, yielding an average ID score of 2.3 for the V6 configurations in comparison to the ID score of 4.5 for V5⁶. As shown in Figure 5 (left), all V6 configurations have a slightly higher acceptance rate than V5⁷. Thus introducing novel material or data representation in fine-tuning stages has no strong perturbation effect. Second, and more interestingly, we observe a significant variation in the ratio of untouched and modified pairs to all the reviewed samples: for all V6 approaches, while there is a strong decrease in the ratio of untouched pairs (Figure 5, right), there is a significant increase in those modified (see Figure 5, left). In other words, these models were able to produce a higher amount of suitable, albeit imperfect, pairs. In particular, comparing V6 configurations we can observe that for the untouched pairs the highest acceptance rate is achieved via V6,ARG with 6.37% accepted pairs, whereas for the modified pairs V6,MIX yields the highest percentage, with 66.15% of the pairs accepted.
Concerning the reviewers' effort, we see that the overall HTER increases for all V6 approaches (Figure 6, left). Considering that we had a lower number of untouched and a higher number of modified pairs, this was expected, and if we turn to the HTER of modified pairs alone we see that there is a smaller difference between V5 and V6 HTER. Even more interestingly, the HTER scores of all V6 configurations, even if higher than V5, are still below the acceptability threshold value of 0.4 defined earlier. Going into detail, amongst the V6 configurations, HTER reaches its lowest value in V6,ARG, for both the modified and untouched pairs: since it was conditioned using gold HS material, this result is expected. As opposed to the other models, V6,LAB is conditioned only with a label representation and not with actual HSs. This negatively affected the post-editing effort, as we can notice a higher HTER for this configuration. Moreover, V6,LAB has a smaller amount of untouched pairs, so we expected HTER to spike.
With regard to data quality (see Figure 7), we see that all V6 strategies succeed in increasing the novelty both with respect to V5 and expected V6 (the dashed line), except for V6,ARG, possibly due to its conditioning with HSs from V1...V5. Therefore, we also computed the novelty for the CN set alone to discard the effect of HS on the metric. In this setting, all V6 configurations reach a novelty between 0.741 and 0.745, as compared to a CN novelty in V5 of 0.737 (as in Appendix A.3). The effect of gold HS conditioning in V6,ARG can also be spotted in the lowest HTER results in Figure 6. The highest increase in novelty is recorded for V6,MIX, reaching a score of 0.76; novelty scores computed with respect to V5 and V1 also confirm the result.
[...] the plots, using a linear regression model over V1...V5.
All V6 configurations succeeded in reaching an RR lower than both V5 and expected V6 (the dashed line). It is interesting that V6,LAB has the highest RR among the V6 configurations, possibly because it was not built using any external knowledge, but only with a different label representation. On the other hand, the V6,ARG configuration, for which an initial argumentation fine-tuning has been performed, has the lowest RR (5.474). From this analysis we can conclude that V6 configurations are better at producing sub-optimal material but worse at producing perfect material. Still, the general quality of the pairs (in terms of novelty and RR) in Session Two is much higher than before, exhibiting the desired behavior for which these strategies were introduced.
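The "expected V6" value used as a baseline can be understood as an extrapolation of the trend observed over the previous versions. The sketch below fits an ordinary least-squares line to a sequence of per-version scores and evaluates it at the next position. This is an assumed reading of how the dashed line is obtained, and the function is illustrative; no values from the paper are used.

```python
from typing import Sequence


def expected_next(values: Sequence[float]) -> float:
    """Fit y = slope * x + intercept over x = 1..n and predict y at x = n + 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope * (n + 1) + intercept
```

A V6 configuration can then be compared against this expected value: a novelty score above it, or an RR below it, indicates an improvement over simply continuing the Session One procedure.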

Vocabulary analysis.
We report vocabulary expansion findings in Figure 8. For each loop V2...V5 the average percentage of new words injected into the dataset by the author model (GPT-2) is higher than the average percentage of new words inserted by the three reviewers during post-editing. Both trend-lines, even if slightly decreasing, are not converging, implying that fine-tuned GPT-2 is not reaching a "saturation point" and is continuously adding new material. This trend is in line with the decrease in novelty. On the other hand, the instructions asked for minimal post-editing, so the reviewers have less opportunity to inject new material than the author, and the decrease is consistent with the decreasing HTER.
As for the percentage of words generated by the author model pertaining to the same target, we see an increasing trend throughout the generations due to the cumulative nature of the metric. Still, the presence of words first observed in other targets throughout the loops shows that the cross-fertilization phenomenon generated by GPT-2 persists. This desired feature shows the ability of GPT-2 to learn the 'domain' vocabulary of other targets and to use it effectively for the target under analysis, in a way that is approved by the human reviewers⁸. Finally, we can remark that V6 versions are able to increase both the injection of novel words from GPT-2 and its cross-fertilization ability.

Qualitative Analysis
During our exploratory experiments and the discussion sessions with the annotators, several interesting subjects have emerged, which can initiate future work.
HS: "We should kill all the jews."
CN: "There are many alternatives to removing jews, such as converting them to another religion (e.g. Buddhism)."
HS: "All homosexuals are trying to undermine the family by trying to impose their sexual preferences."
CN: "Thus far, people of various sexual orientations have attempted suicide twice as often as heterosexuals. Among youth, those that seriously contemplate suicide are at almost three times the rate of heterosexual. LGB youth are almost five times as likely to have attempted suicide compared to heterosexual youth. Tell me honestly: who do you think is imposing sexual orientation to whom?"
HS: "I don't like muggleborns, they are likely to do violence."
CN: "We do not say that muggleborns are less likely to commit crimes. We are saying that they are almost certainly not the case."
Table 2
⁸ Even though we opted for a distinction in terms of source (same target, other target) for the 'not-new' words generated by the author module, we chose not to have the same distinction for the reviewers' 'not-new' words, since we cannot assess whether the reviewer was aware of the presence of a word in previous versions of the dataset.
Argumentation and Counter Narratives. In order to obtain even more novelty in the produced pairs, the V6,ARG model could be used without fine-tuning on the HS/CN dataset, under the assumption that a counter argument is the same as a counter narrative. Still, the ability to argue on a variety of topics is not enough to provide a meaningful CN when prompted with an HS. A CN also presupposes values, so, for example, a logically valid argument is not necessarily an acceptable CN, as the first example in Table 2 shows (produced by GPT-2 fine-tuned only on Kialo arguments).
New arguments or new paraphrases. One question that emerged is whether GPT-2 is able to produce novel arguments or is just a very sophisticated paraphrasing tool. During the discussion sessions with annotators and also by manual analysis, we could find CNs that contained genuinely novel arguments, which were not present in the training data but produced by GPT-2. In the second example in Table 2, the novel argument consists in overturning the "imposing the homosexual agenda" argument by providing data on "suicidal attempts among homosexual youth".
Novel hate targets and general knowledge. GPT-2 proved able to generate HS/CN pairs also for unseen targets, including intersectional ones (e.g. "black women"). Still, the lack of "commonsense knowledge" can produce funny results that are beyond the scope of hallucination (Zellers et al., 2019; Solaiman et al., 2019), such as the third example in Table 2, where GPT-2 addresses muggleborns (a target of hate in the Harry Potter books).

Conclusions
In this paper we presented a novel HITL methodology for data collection based on an author-reviewer framework. This methodology combines an LM and a set of human reviewers, where the LM is refined iteratively using data from previous loops that have been validated by experts. Experiments show that as loops are iterated, efficiency in data collection increases (acceptance rate and HTER metrics), while the dataset quality decreases in terms of novelty and diversity metrics. For this reason, we experimented with additional dynamic loop adaptations that are able to increase the overall quality of the dataset without significantly hindering efficiency.

A.1 Vocabulary expansion algorithm
The pseudo-code for the vocabulary expansion metric described in Section 4 can be found in Algorithm 1. For each version and target, we define the following two sets of words:

VOCAB_pe: words from the post-edited pairs
VOCAB_gen: words from the generated pairs

A word is considered novel when it is not present in the collective vocabulary of the previous versions: VOCAB(V_1,...,i−1).
Algorithm 1: Vocabulary expansion

for each target do
    for each version V_i do
        for each word w in V_i do
            if w in VOCAB_pe and w in VOCAB_gen then
                author_w ← w
                if author_w in VOCAB(V_1,...,i−1) then
                    if author_w in same_target_VOCAB then
                        same_target_author_w ← author_w
                    else
                        other_target_author_w ← author_w
                else
                    novel_author_w ← author_w
            else
                reviewer_w ← w
                if reviewer_w in VOCAB(V_1,...,i−1) then
                    not_novel_reviewer_w ← reviewer_w
                else
                    novel_reviewer_w ← reviewer_w

Each word is assigned to one of the following sets: Author-novel, Author-same-target, Author-other-target, Reviewer-novel, Reviewer-not-novel. Considering the size in terms of words of each set, we calculate the percentages for each target and version, so that we are able to obtain the vocabulary expansion scores as macro average percentages.
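As an illustration, the procedure in Algorithm 1 could be sketched in Python roughly as follows. This is a minimal sketch under several assumptions that are not stated in the paper: the vocabularies are already tokenized into sets of words, the loop over words runs over the post-edited vocabulary, and the collective vocabulary of previous versions is built from the post-edited pairs of all targets. The function names `classify_words` and `vocabulary_expansion` and the input layout are hypothetical.

```python
CATEGORIES = (
    "author_novel", "author_same_target", "author_other_target",
    "reviewer_novel", "reviewer_not_novel",
)


def classify_words(vocab_pe, vocab_gen, prev_same_target, prev_all):
    """Assign each post-edited word of one (version, target) cell to a set.

    Assumption: a word present in both the generated and the post-edited
    vocabulary is attributed to the author (the LM); any other post-edited
    word is attributed to the reviewer.
    """
    sets = {name: set() for name in CATEGORIES}
    for w in vocab_pe:
        if w in vocab_gen:
            if w in prev_all:
                if w in prev_same_target:
                    key = "author_same_target"
                else:
                    key = "author_other_target"
            else:
                key = "author_novel"
        else:
            key = "reviewer_not_novel" if w in prev_all else "reviewer_novel"
        sets[key].add(w)
    return sets


def vocabulary_expansion(versions):
    """Return one dict of macro-averaged percentages per version.

    `versions` is a list ordered by loop; each item maps a target name to a
    pair (vocab_pe, vocab_gen) of word sets (hypothetical layout).
    """
    prev_by_target = {}  # target -> words seen in earlier versions
    scores = []
    for cells in versions:
        # Collective vocabulary of all previous versions, over all targets.
        prev_all = set().union(*prev_by_target.values())
        per_target = []
        for target, (vocab_pe, vocab_gen) in cells.items():
            sets = classify_words(
                vocab_pe, vocab_gen,
                prev_by_target.get(target, set()), prev_all,
            )
            total = sum(len(s) for s in sets.values())
            if total:
                per_target.append(
                    {k: 100.0 * len(s) / total for k, s in sets.items()}
                )
        # Macro average: mean of the per-target percentages.
        if per_target:
            scores.append(
                {k: sum(p[k] for p in per_target) / len(per_target)
                 for k in CATEGORIES}
            )
        else:
            scores.append({k: 0.0 for k in CATEGORIES})
        # Update the history only after the whole version is processed.
        for target, (vocab_pe, _) in cells.items():
            prev_by_target.setdefault(target, set()).update(vocab_pe)
    return scores
```

Updating the history only after a version has been fully processed keeps words from different targets within the same version from counting as "previous" for each other, which is one possible reading of VOCAB(V_1,...,i−1).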

A.2 Additional material for Session One
In this section, we present the most interesting results that we have obtained by analysing only the HS or the CN sets.
While HTER calculated on CN alone shows a clear decreasing trend (Figure 9, left), the results for HS alone are less consistent, yielding higher scores for V 3 and V 4. This can mostly be explained by the annotators' different approaches to post-editing the HSs, which include the possibility of rewriting an HS entirely when needed. On the other hand, the decreasing trend of HTER for HS starting from V 3, resulting in a lower score in V 5 than the one calculated on CN only, could be due to the increasing frequency of prototypical HSs. This interpretation is supported by the higher RR scores for HSs as compared to CNs, which grow faster for the former than for the latter (Figure 9, right). Moreover, the increasing number of prototypical HSs contributes to the novelty scores for HSs being lower than those of CNs and decreasing more rapidly (Figure 10).

Figure 11 shows the target distribution at each loop of Session One, and Table 3 displays the frequencies of targets in the final dataset. The MUSLIMS target covers a significant percentage of the generations in every loop and accounts for more than half of the pairs in V 5. In fact, this is expected to cause even more imbalanced productions in the next loops. The JEWS, MIGRANTS and DISABLED targets diminish over the loops, while the other targets can be considered stable.

A.3 Additional material for Session Two
Concerning Session Two, the results for CNs are in line with the conclusions drawn in the paper for HS/CN pairs. The same holds for HSs, the only exception being the cumulative novelty of V 6,ARG HSs, as can be seen in Figure 13 and in Table 6. As explained earlier in Section 6, this effect is due to the use of hate speeches from the training set for conditioning GPT-2. This result also corresponds to HSs from V 6,ARG having a lower HTER (Figure 12) and a higher RR (Figure 14).

A.4 Tables
Table 5 displays the main results calculated on the HS/CN pairs. Table 6 shows the results calculated on HS only and CN only, respectively.

Table 4: Label mapping for V 6,SBF. Starred items are considered as "other targets" in Figure 11.