HARALD: Augmenting Hate Speech Data Sets with Real Data

The successful completion of the hate speech detection task hinges upon the availability of rich and variable labeled data, which is hard to obtain. In this work, we present a new approach for data augmentation that uses as input real unlabeled data, which is carefully selected from online platforms where invited hate speech is abundant. We show that by harvesting and processing this data automatically, one can augment existing manually labeled datasets to improve the performance of hate speech classification models. We observed an improvement in F1-score ranging from 2.7% to 9.5%, depending on the task (in- or cross-domain) and the model used.


Introduction
Hate speech detection (offensive, abusive, toxic) is of interest to academic researchers in a variety of domains, including computer science (Machova et al., 2020) and sociology (Davidson et al., 2017). It is also of interest to online social platforms that wish to maintain certain standards of discourse or are obliged to do so by law in some countries.
Hate speech is commonly defined as "any communication that disparages a target group of people based on some characteristics such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristic" (Nockleby, 2000).
Detecting hate speech may be difficult because the manifestation of speech as hate speech depends on a non-trivial interaction between various circumstances such as the topic, the context, the timing, outside events, and the identity of the speaker and recipient (Schmidt and Wiegand, 2017).
Hate speech detection is typically approached as a supervised classification task, in which lexical and other features (e.g., word embeddings) are used to train a classifier (Schmidt and Wiegand, 2017; Spertus, 1997; Razavi et al., 2010b).
To that end, access to labeled corpora is essential. Since hate speech has many facets and there are few (if any) universal "gold-standard" datasets, authors usually collect and label their own data. The size of the collected corpora varies considerably, ranging from around 100 labeled comments (Dinakar et al., 2012) to several thousand (Van Hee et al., 2015; Djuric et al., 2015).
Collecting and annotating hate-speech data is challenging and extremely time-consuming for two main reasons. First, there are far fewer hateful than benign comments in randomly sampled data. Second, to manually annotate a dataset, either expert annotators or crowd-sourcing services, such as Amazon Mechanical Turk, are employed. While crowd-sourcing has obvious advantages for this task, the annotation quality of non-expert annotators was demonstrated to be poorer than that of experts (Nobata et al., 2016; Ross et al., 2017).
There have been several attempts to deal with these problems. Waseem and Hovy (2016a) proposed to select the text to be annotated by looking for topics that are likely to contain a higher degree of hate speech. They collected 136,052 tweets, of which about 10% were annotated (16,914 tweets), and about a third of those were labeled as hate speech. While this increased the proportion of hate speech posts, it focuses the resulting dataset on specific topics and thus hinders generalizability to other domains (Wiegand et al., 2019).
Another possible solution to the aforementioned two challenges may be found in data augmentation; this avenue is intensively developed in, for example, computer vision but is "relatively under-explored" in NLP, where the generation of effective augmented examples is "less obvious" (Feng et al., 2021).

Our Contribution and Method
This paper presents a new data augmentation pipeline for offensive/hate speech data called HARALD, which stands for Hate Augmentation with ReAL Data. Unlike common data augmentation methods that generate synthetic data (using GANs or other generative models), HARALD outputs an endless stream of relevant real data written by a huge number of authors, rich with various stylistic, grammatical, and semantic forms. Our method hinges upon the existence of online platforms where people are explicitly asked to be abusive. One such platform is the subreddit r/RoastMe/, in which users upload a picture and ask their peers to "roast them", with the intention of developing a thicker skin by withstanding the abusive speech (the logo of the subreddit is "the thicker the skin, the better the roast"). See the appendix for an excerpt from RoastMe.
To validate HARALD's usefulness, we conducted the following experiment, inspired by the cross-domain evaluation of Wiegand et al. (2019). We harvested a total of 3,700 messages from r/RoastMe/ and assigned each message a hate score, namely the output of the last GELU layer of a pre-trained BERT model for hate speech detection (Caselli et al., 2020). We then sorted the messages by this score and took the top 1,000 as the positive class of the RoastMe (RM) dataset. Next, we selected six well-known datasets of hate/offensive speech and fine-tuned the BERT-base-uncased model (Devlin et al., 2018) on each of them separately (see Section 4 for details on the datasets). We tested the cross-domain performance of each of the six models on the other five datasets. We then repeated this experiment, this time adding a further fine-tuning step with the RM dataset. We also conducted an in-domain (cross-validation) test. All details are in Section 5.
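The cross-domain protocol described above can be sketched as a loop over ordered (train, test) dataset pairs. This is only a minimal sketch under stated assumptions, not the authors' code: `fine_tune` and `macro_f1` are hypothetical callables standing in for BERT fine-tuning and evaluation, and passing `rm` is assumed to trigger the additional fine-tuning step on the RM dataset.

```python
from itertools import permutations
from typing import Callable, Dict, Optional, Tuple


def cross_domain_scores(
    datasets: Dict[str, object],
    fine_tune: Callable[[object, Optional[object]], object],
    macro_f1: Callable[[object, object], float],
    rm: Optional[object] = None,
) -> Dict[Tuple[str, str], float]:
    """Score every ordered (train, test) pair of distinct datasets.

    fine_tune(train_data, rm) is a hypothetical helper: it fine-tunes a model
    on train_data and, when rm is not None, adds a second fine-tuning step on
    the RM data. rm=None corresponds to the baseline run without augmentation.
    """
    # One model per training dataset, reused for all of its test datasets.
    models = {name: fine_tune(data, rm) for name, data in datasets.items()}
    # permutations(..., 2) yields all ordered pairs with train != test,
    # i.e. 6 * 5 = 30 pairs for six datasets.
    return {
        (train, test): macro_f1(models[train], datasets[test])
        for train, test in permutations(datasets, 2)
    }
```

Running the function once with `rm=None` and once with the RM data gives the two sets of scores whose difference is reported as the effect of augmentation. The in-domain entries are deliberately left out here, since they require cross-validation rather than a single train/test pass.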
We observed an improvement in macro F1-score ranging from 2.7% to 9.5%, depending on the task (in- or cross-domain) and on the model that was used (see Table 3). For the dataset of Waseem and Hovy (2016a), we obtained a 4.1% improvement when using RM in the in-domain task, improving the F1 score from 0.74 to 0.77. For comparison, the GAN-based pipeline of Cao and Lee (2020) improved the F1 score from 0.77 to 0.78 on the same dataset (1.2% improvement).
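The percentages quoted above appear to be relative gains over the baseline F1 score. The text does not state the formula, so the following is an assumption, but it is consistent with the reported move from 0.74 to 0.77 being a 4.1% improvement.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative F1 gain in percent, measured against the baseline score."""
    return 100.0 * (after - before) / before


# Waseem in-domain example from the text: 0.74 -> 0.77.
print(round(relative_improvement(0.74, 0.77), 1))  # 4.1
```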

Related Work
Data augmentation methods have been explored to address the challenge of dataset imbalance in supervised classification tasks. Noise injection or attribute modification techniques were commonly applied to generate synthetic data for image and sound classification tasks (Shorten and Khoshgoftaar, 2019; Tran et al., 2017; Salamon and Bello, 2017). However, such techniques do not extend to text due to the categorical nature of words and the sequential nature of text.
Very few works have explored data augmentation in hate speech detection. Rizos et al. (2019), and similarly Ibrahim et al. (2018), explored various data augmentation techniques for hate speech: substituting words, swapping word positions, and neural generation using an RNN (Sutskever et al., 2011).
Each of these methods has its limitations. It is challenging to find suitable semantically similar words on fast-evolving social media platforms, and swapping word positions may harm the coherence of the sentence.
Cao and Lee (2020) propose a GAN methodology, HateGAN, to augment two datasets. They train LSTM and CNN models on the augmented datasets and show a 5% improvement in F1 score. They also show that HateGAN outperforms the method of Rizos et al. (2019).
In (Dixon et al., 2018), real non-toxic text was harvested in a manner similar to ours, but for the task of mitigating unintended biases in text classification. One has to note, though, that most online text is non-toxic, so automatically harvesting toxic text and automatically harvesting non-toxic text are by no means equivalent tasks.
Our work differs from these works in several key aspects. (1) HARALD produces real rather than synthetic data, the distribution of which is different from that of the dataset to be augmented. Previous work generates synthetic data from the existing dataset and makes a point that the data has the same distribution as the data to be augmented. (2) We train SOTA hate speech classification models based on BERT, while weaker models such as LSTM or CNN were used in previous work. (3) HARALD improves performance on a more challenging task: cross-domain prediction. We surmise that the fact that RM has a different distribution than the original dataset plays a key role in improving the prediction results. (4) We evaluate HARALD on six different datasets, while previous work used a maximum of three.
Finally, let us discuss the subreddit r/RoastMe. RoastMe presents an intriguing case of how alternative norms can emerge in online communities, allowing behaviors that are otherwise condemned as inappropriate to be reframed as acceptable. In this community, users post photos of themselves with the explicit expectation of being mocked or ridiculed by others. RoastMe is not alone; similar subreddits include r/ToastMe and r/Judgemeplease. The norms and values of the RoastMe community were studied, for example, in (Kasunic and Kaufman, 2018; Allison et al., 2019). In (Sodhi et al., 2021), RoastMe was used for the task of style transfer, rephrasing slurs as compliments and vice versa.

Data
We turn to describe the RM dataset and the other six datasets that we augmented using RM in order to evaluate the performance of HARALD.All datasets appear in the project's GitHub page (Ilan and Vilenchik, 2022).
The RoastMe (RM) dataset. That paper overall supports our thesis (to quote, "r/RoastMe, a comedy-focused subreddit of the parent site reddit.com, wherein members post photos of themselves to be ridiculed by other members; the site generally encourages harsh and offensive forms of humor in these interpersonal exchanges").
We harvested 3,700 comments from r/RoastMe using the PRAW API. We removed comments with fewer than three words and stripped links, emojis, stop words, and punctuation marks from the rest, leaving us with 3,500 comments. We then used HateBERT (Caselli et al., 2020), further fine-tuned on the Kaggle dataset (see below), to assign each RM comment a hate score (the output of the last GELU layer). We sorted the comments in descending order of hate score and took the top 1,000 as the positive class of the RM dataset.
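The cleaning and ranking steps above can be sketched as follows. This is a rough illustration rather than the authors' implementation: the stop-word list is a tiny placeholder, emoji removal is approximated by dropping all non-ASCII characters, the three-word filter is assumed to be applied before cleaning, and `hate_score` is a hypothetical callable standing in for the fine-tuned HateBERT scorer.

```python
import re
import string
from typing import Callable, Iterable, List

# Tiny illustrative stop-word list (assumption); the paper does not say
# which list was used.
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "your", "and", "of", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(comment: str) -> str:
    """Strip links, non-ASCII symbols, punctuation marks and stop words."""
    text = URL_RE.sub(" ", comment.lower())
    # Crude stand-in for emoji removal: drop every non-ASCII character.
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(PUNCT_TABLE)
    return " ".join(w for w in text.split() if w not in STOP_WORDS)


def build_positive_class(
    comments: Iterable[str],
    hate_score: Callable[[str], float],
    k: int = 1000,
    min_words: int = 3,
) -> List[str]:
    """Drop very short comments, clean the rest, keep the k highest-scoring."""
    cleaned = [clean(c) for c in comments if len(c.split()) >= min_words]
    cleaned = [c for c in cleaned if c]  # discard comments emptied by cleaning
    return sorted(cleaned, key=hate_score, reverse=True)[:k]
```

The same ranking with `reverse=False` would give the least hateful comments, which is how a low-scoring negative class could be selected with the same scorer.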
The RoastMe dataset also contains a negative class to keep the train and test sets balanced after augmentation. We sampled 3,500 non-offensive Reddit comments from the dataset of Qian et al. (2019) (see below), ranked them using the same BERT model, and took the 1,000 least hateful.
For the cross-domain experiment, we used the following five datasets, also used in (Wiegand et al., 2019), plus the dataset of (Qian et al., 2019). The datasets were cleaned in the same manner as RM. The five datasets are imbalanced to different degrees. To control for the effect of dataset imbalance on the results of the cross-domain test, we down-sampled the negative class to match the positive class. Due to computational limitations, we also down-sampled the positive class in the larger sets.
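The down-sampling described above can be sketched as below. The function name, the optional per-class cap, and the fixed seed are illustrative assumptions; the paper only states that the negative class was matched to the positive class and that the larger sets were further reduced.

```python
import random
from typing import List, Optional, Sequence, Tuple


def balance_by_downsampling(
    positives: Sequence[str],
    negatives: Sequence[str],
    max_per_class: Optional[int] = None,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Randomly down-sample both classes to a common size.

    The common size is the smaller class size, optionally capped by
    max_per_class (used here to mimic the reduction of the larger datasets).
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    if max_per_class is not None:
        n = min(n, max_per_class)
    return rng.sample(list(positives), n), rng.sample(list(negatives), n)
```

For example, a call with `max_per_class=5000` would reproduce the "5,000 comments from each class" setup used for the larger datasets, provided both classes contain at least that many comments.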
The Kaggle dataset (Kaggle, 2014) contains 312,737 Wikipedia comments, 22,468 of them offensive, labeled with five hate-speech labels (e.g., toxic, abusive). We treat a comment as hate speech (the positive class) if at least one of the five labels is true. We randomly sampled 5,000 comments from each class to form our Kaggle dataset.
The Razavi dataset (Razavi et al., 2010b) contains 1,525 messages, 1,038 non-offensive and 482 "flame", that is, offensive texts. We down-sampled the non-offensive class to match the size of the offensive class, giving a total of 964 comments. The data itself is available at (Razavi et al., 2010a).
The Waseem dataset (Waseem and Hovy, 2016a) contains 16,907 tweets, 1,970 labelled with racism, 3,379 with sexism, and all the rest (11,559) non-offensive. The online data (Waseem and Hovy, 2016b) contains only tweet ids and labels. We used Twitter's API to recover the text of 795 offensive tweets (sexism and racism) and 3,699 non-offensive tweets. We then down-sampled the non-offensive class to match the size of the offensive class, giving us a total of 1,590 tweets.
The Kumar dataset (Kumar et al., 2018) consists of 15,000 Facebook posts and comments, of which 3,419 are tagged as overtly aggressive, 5,296 as covertly aggressive, and 6,285 as non-aggressive. We randomly sampled 5,000 aggressive (overtly and covertly) and 5,000 non-aggressive comments. The authors shared the data privately after we filled out an online application form.
The Offensive Reddit dataset (Qian et al., 2019) consists of 5,020 conversations in which offensive comments are tagged. We sampled 3,230 offensive comments. For the negative class, we sampled 3,230 comments from the political classification task (Washam, 2019) and from comments that we harvested from subreddits about fitness and food.
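Since each source dataset uses its own label scheme, all of them have to be mapped to a single binary positive class before training. A minimal sketch of that mapping is given below; the label strings and function names are hypothetical placeholders, as the exact field names in the original files are not given in the text.

```python
from typing import Collection, Mapping

# Hypothetical label names; the real files may use different strings.
WASEEM_POSITIVE = {"racism", "sexism"}
KUMAR_POSITIVE = {"overtly_aggressive", "covertly_aggressive"}


def kaggle_is_positive(labels: Mapping[str, bool]) -> int:
    """Kaggle: positive if at least one of the five labels is true."""
    return int(any(labels.values()))


def to_binary(label: str, positive_labels: Collection[str]) -> int:
    """Single-label datasets: 1 if the label belongs to the positive set."""
    return int(label in positive_labels)
```

With this mapping, a Waseem tweet tagged as sexism and a Kumar comment tagged as covertly aggressive both end up in the same positive class as a Kaggle comment with any true label.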

Evaluation
We evaluated the quality of our pipeline by augmenting the six hate/abusive speech datasets described in Section 4.
Table 3: Summary of Tables 1 and 2, obtained by averaging over the last column in the Cross-Domain (CD) setting, and over the diagonal in the In-Domain (ID) setting. Improvement in % when using RM is in parentheses.
Due to limited computational resources (we had one GeForce RTX 3090 GPU), we could not train BERT on the entire dataset. Therefore, we broke the large datasets into four random parts (each containing about 1,500-2,000 comments). The figures appearing in the table are the average over these four folds. In each fold, we used 80% of the data for training and 20% for validation. For the two small datasets, Waseem and Razavi, we used the entire dataset, repeating the train (80%) and validation (20%) split four times, each time on a randomly sampled 80% of the dataset.
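The chunking scheme above can be sketched with two small helpers. This is an illustrative reading of the description, with assumed function names and seeds: large datasets are shuffled and cut into four disjoint parts, and each part (or, for small datasets, each repeated draw over the whole set) is split 80/20.

```python
import random
from typing import List, Sequence, Tuple


def random_parts(items: Sequence, n_parts: int = 4, seed: int = 0) -> List[list]:
    """Shuffle the items and cut them into n_parts disjoint random parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Striding keeps part sizes within one item of each other.
    return [shuffled[i::n_parts] for i in range(n_parts)]


def train_val_split(
    items: Sequence, train_fraction: float = 0.8, seed: int = 0
) -> Tuple[list, list]:
    """Random split into training and validation portions."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

For a small dataset, calling `train_val_split` four times with different seeds corresponds to the repeated random 80/20 splits; the reported figure is then the average score over the four runs.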
Experiment 2 is identical to Experiment 1, but this time the starting point is the basic pre-trained HateBERT model from (Caselli et al., 2020), fine-tuned on each of the datasets. The results of Experiment 2 are in Table 2.
We also tested the in-domain prediction task in both experiments using 4-fold cross-validation. The results are on the diagonal of Tables 1 and 2.
Table 3 summarizes the results of the two tables and shows that augmenting the dataset using RM yielded an overall improvement ranging from 2.7% to 9.5%, depending on the setting (in-domain or cross-domain) and on the initial BERT model. This should be compared to the 5% in-domain average improvement in (Cao and Lee, 2020). Table 4 gives the same summary for the imbalanced datasets and shows higher scores and improvements, especially for the cross-domain task (7.9% for BERT and 5.4% for HateBERT).
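The way a train-by-test score matrix collapses into the two summary numbers can be sketched as follows. It assumes that the "last column" of Tables 1 and 2 holds, for each training dataset, the mean score over the other test datasets; this is an interpretation, and the matrix used in the example is made up purely for illustration and is not taken from the paper.

```python
from statistics import mean
from typing import List, Tuple


def summarize(f1: List[List[float]]) -> Tuple[float, float]:
    """Return (in-domain mean, cross-domain mean) of a square F1 matrix.

    f1[i][j] is the score of the model trained on dataset i and tested on
    dataset j. The diagonal holds in-domain scores; the off-diagonal entries
    of each row are averaged first (the assumed "last column"), then those
    row averages are averaged again.
    """
    n = len(f1)
    in_domain = mean(f1[i][i] for i in range(n))
    row_cd = [mean(f1[i][j] for j in range(n) if j != i) for i in range(n)]
    return in_domain, mean(row_cd)


# Made-up toy matrix, for illustration only.
toy = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
print(summarize(toy))  # (1.0, 0.5)
```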
Experiments 3 and 4 are identical to Experiments 1 and 2, respectively. The only difference is that now all datasets D_i and D_j were imbalanced to a 30-70 ratio (30% hate speech) to reflect a more realistic scenario in which the positive class is in the minority. For lack of space, we only give the summary of the results in Table 4. Compared to the balanced setting, we notice that the improvement in the imbalanced setting when using RM is larger in most cases.
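One way to obtain the 30-70 ratio is to subsample both classes to the largest pair of sizes that respects the target proportion. The sketch below is an assumption about how this could be done (the paper does not say which class was reduced); it uses integer percentages to avoid floating-point rounding surprises.

```python
import random
from typing import List, Sequence, Tuple


def subsample_to_ratio(
    positives: Sequence[str],
    negatives: Sequence[str],
    pos_percent: int = 30,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Subsample so that roughly pos_percent % of the result is positive."""
    rng = random.Random(seed)
    neg_percent = 100 - pos_percent
    # Largest positive count for which enough negatives are available.
    n_pos = min(len(positives), len(negatives) * pos_percent // neg_percent)
    n_neg = n_pos * neg_percent // pos_percent
    return rng.sample(list(positives), n_pos), rng.sample(list(negatives), n_neg)
```

Starting from a balanced set of 1,000 comments per class, this keeps 428 positive and 998 negative comments, i.e. a positive share of about 30%.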
All differences in Tables 1 and 2 between F1 scores after and before augmentation, except one case (in Experiment 4), were tested using a paired t-test and came out significant.
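For reference, the paired t-statistic can be computed from matched before/after scores as sketched below. The text does not say what the pairs are; this sketch assumes they are the per-fold F1 scores of one train/test combination, and the numbers in the example are made up for illustration only.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(before: Sequence[float], after: Sequence[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)) for the paired differences d.

    Raises ZeroDivisionError if all differences are identical.
    """
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Made-up per-fold F1 scores (four folds), for illustration only.
before = [0.70, 0.72, 0.68, 0.71]
after = [0.74, 0.75, 0.73, 0.74]
t = paired_t_statistic(before, after)
# With n - 1 = 3 degrees of freedom, the two-sided 5% critical value is 3.182.
print(t > 3.182)  # True
```

In practice, a library routine such as `scipy.stats.ttest_rel` would return the p-value directly instead of requiring a lookup of the critical value.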

Discussion
In this work, we have shown that invited abusive speech, which is written humorously, is useful for data augmentation. Our work suggests that humans can produce actual hate speech even without the appropriate psychological conditioning of the brain (such as anger, hate, antagonism, etc.). This may hint at some universal properties of hate speech that do not depend entirely on certain emotional states of the mind. We leave this last thought as a gateway to further multidisciplinary psycholinguistic research, which may shed more light on the phenomenon of hate speech and how to identify it better using automatic tools.
In trying to get a deeper insight into how exactly RM helped in the cross-domain test, we identified two meta-classes of datasets: randomly sampled datasets with boosting of abusive comments (for example, Kaggle and Founta) and datasets that were selected by topics or keywords assumed to be associated with hate and offensive speech (Waseem and Kumar).
We found that the first group was characterized more by a direct and clear offensive style, while the second group was characterized by a more indirect and fuzzier offensive style. The highest rate of improvement due to RM augmentation occurred when we trained on a dataset from the second group and tested on a dataset from the first group (improving both the false positive rate and the false negative rate). For example, when training on Kumar and testing on Kaggle, we observed an improvement of 25% following RM augmentation. We surmise that the improvement arises because RM contains, by its nature, clear and direct offensive comments that supply the part missing from the original dataset. When training and testing on datasets from the first group, RM mainly contributed to reducing the FN rate, perhaps because it "bridged" the gap between the two distributions with its rich and diverse content.
Finally, another angle that our work did not attend to is that of unintended biases (Dixon et al., 2018). If one agrees that unintended biases impair generalizability, then our cross-domain improvement results support the premise that augmenting with RoastMe decreased such biases. This point deserves separate in-depth exploration.

Limitations
In our cross-domain evaluation, we did not have sufficient compute power to train the classifier on the entire dataset. We therefore broke the data into chunks, training and testing on random chunks each time. This is similar to a cross-validation procedure, which is not necessary in a cross-domain experiment. It may be that the results will change slightly when using the entire dataset.
We did not check the usefulness of other invited hate speech platforms. There are r/toastme/, r/Rateme/, and probably other platforms where abusive speech is the norm. Therefore, we cannot say whether such invited hate speech is useful in general or whether we simply got lucky with RoastMe. We surmise that the latter is not the case.
Finally, we ensured that all the datasets in our experiment were balanced (or imbalanced, but to the same degree). We did not check the usefulness of data augmentation using RM for differently imbalanced train and test datasets.

Figure 1: An example of two roasts (highlighted in yellow) from our RoastMe dataset and the photo they were directed at. We can see that the first comment is more overtly offensive and the second is more covertly offensive. This illustrates the diversity of the roasts, which may have been the key to the improvement of the classification model.

Table 1: Experiment 1: Cross-domain (CD) and In-Domain (ID) macro-F1 score for BERT-base-uncased fine-tuned on the train dataset (row) and tested on the test dataset (column).

Table 3: Summary of Tables