Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation

The automatic detection of hate speech online is an active research area in NLP. Most studies to date are based on social media datasets, which are then used to train hate speech detection models. However, data creation processes contain their own biases, and models inherently learn these dataset-specific biases. In this paper, we perform a large-scale cross-dataset comparison in which we fine-tune language models on different hate speech detection datasets. This analysis shows that some datasets are more generalisable than others when used as training data. Crucially, our experiments show how combining hate speech detection datasets can contribute to the development of robust hate speech detection models. This robustness holds even when controlling for data size and comparing against the best individual datasets.


Introduction
Social media has led to a new form of communication that has changed how people interact across the world. With the emergence of this medium, hateful conduct has also found a place to propagate online. From more obscure online communities such as 4chan (Knuttila, 2011) and Telegram rooms (Walther and McCoy, 2021) to mainstream social media platforms such as Facebook (Del Vigna et al., 2017) and Twitter (Udanor and Anyanwu, 2019), the spread of hate speech is an ongoing issue.
Hate speech detection is a complex problem that has received a lot of attention from the Natural Language Processing (NLP) community. It shares many challenges with other social media problems (emotion detection, offensive language detection, etc.), such as an increasing amount of user-generated content that is unstructured (Elsayed et al., 2019) and constantly evolving (Ebadi et al., 2021), and the need for efficient large-scale solutions. When dealing with hate speech in particular, one has to consider the sensitivity of the topics, their wide range (e.g. sexism, sexual orientation, racism), and their evolution through time and location (Matamoros-Fernández and Farkas, 2021). Understanding the extent of the problem and tracking hate speech online through automatic techniques can therefore be part of the solution to this ongoing challenge. One way to contribute to this goal is to improve both current hate speech detection models and, crucially, the data used to train them.
The contributions of this paper are threefold. First, we summarise and unify existing hate speech detection datasets from social media, in particular Twitter. Second, we analyse the performance of language models trained on all datasets, and highlight deficiencies in generalisation across datasets, including an evaluation on a new independently-constructed dataset. Finally, as a practical added value stemming from this paper, we share the best models trained on the unification of all datasets, providing a relatively small hate speech detection model that is generalisable across datasets.

Content Warning: The article contains examples of hateful and abusive language. The first vowel in hateful slurs, vulgar words, and profane language in general is replaced with an asterisk (*).

Related Work
Identifying hate speech in social media is an increasingly important research topic in NLP. It is often framed as a classification task (binary or multiclass), and over the years various machine learning approaches and information sources have been utilised (Mullah and Zainon, 2021; Ali et al., 2022; Khanday et al., 2022; del Valle-Cano et al., 2023). A common issue of supervised approaches lies not necessarily with their architecture, but with the existing hate speech datasets that are available to train supervised models. It is often the case that the datasets are focused on specific target groups (Grimminger and Klinger, 2021), constructed using specific keyword search terms (Waseem and Hovy, 2016; Zampieri et al., 2019), or have particular class distributions (Basile et al., 2019) that lead to a training process that may or may not generalise. For instance, Florio et al. (2020) analysed the temporal aspect of hate speech and demonstrated how brittle hate speech models are when evaluated on different time periods. Recent work has also shown that there is a need to both focus on the available resources and expand them, in order to develop robust hate speech classifiers that can be applied in various contexts and time periods (Bourgeade et al., 2023; Bose et al., 2022).
In this paper, we perform a large-scale evaluation to analyse how generalisable supervised models are depending on the underlying training set. Then, we propose to mitigate the relative lack of generalisation by using datasets from various sources and time periods, aiming to offer a more robust solution.

Data
In this section, we describe the data used in our experiments. First, we describe existing hate speech datasets in Section 3.1. Then, we unify those datasets and provide statistics of the final data in Section 3.2.

Hate Speech datasets
In total, we collected 13 datasets related to hate speech in social media. The selected datasets are diverse both in content, covering different kinds of hate speech, and in their temporal coverage.
Measuring hate speech (MHS) MHS (Kennedy et al., 2020; Sachdeva et al., 2022) consists of 39,565 manually annotated social media (YouTube, Reddit, Twitter) comments. The coders were asked to annotate each entry on 10 different attributes, such as the presence of sentiment, respect, insults and others, and also to indicate the target of the comment (e.g. age, disability). The authors use Rasch measurement theory (Rasch, 1960) to aggregate the annotators' ratings into a continuous value that indicates the hate score of the comment.
Call me sexist, but (CMS) This dataset of 6,325 entries (Samory et al., 2021) focuses on the aspect of sexism and includes social psychology scales and tweets extracted by utilising the "Call me sexist, but" phrase. The authors also include two other sexism datasets (Jha and Mamidi, 2017; Waseem and Hovy, 2016) which they re-annotate. Each entry is annotated by five coders and is labelled based on its content (e.g. sexist, maybe-sexist) and phrasing (e.g. civil, uncivil).
Hate Towards the Political Opponent (HTPO) HTPO (Grimminger and Klinger, 2021) is a collection of 3,000 tweets related to the 2020 US presidential election. The tweets were extracted using a set of keywords linked to the presidential and vice-presidential candidates, and each tweet is annotated for stance detection (in favor of/against the candidate) and for whether it contains hateful language.
HateX HateX (Mathew et al., 2021) is a collection of 20,148 posts from Twitter and Gab extracted by utilising relevant hate lexicons. For each entry, three annotators are asked to indicate: (1) whether it contains hate speech, offensive speech, or neither; (2) the target group of the post (e.g. Arab, Homosexual); and (3) the reasons for the label assigned.
Offense The Offense dataset (Zampieri et al., 2019) contains 14,100 tweets extracted by utilising a set of keywords, categorised on three levels: (1) offensive or non-offensive; (2) targeted or untargeted insult; (3) targeted at an individual, a group, or other.

Automated Hate Speech Detection (AHSD)
In this dataset (Davidson et al., 2017), the authors utilise a set of keywords to extract 24,783 tweets which are manually labelled as either hate speech, offensive but not hate speech, or neither offensive nor hate speech.
Hateful Symbols or Hateful People? (HSHP) This is a collection (Waseem and Hovy, 2016) of 16,000 tweets extracted based on keywords related to sexism and racism. The tweets are annotated by three different annotators on whether they contain racism, sexism, or neither.

Are You a Racist or Am I Seeing Things? (AYR) This dataset (Waseem, 2016) is an extension of Hateful Symbols or Hateful People? and adds "both" (sexism and racism) as a potential label. Overlapping tweets were not considered.
HatE HatE (Basile et al., 2019) consists of English and Spanish tweets (19,600 in total) that are labelled on whether they contain hate speech or not.The tweets in this dataset focus on hate speech towards two groups: (1) immigrants and (2) women.
HASOC This dataset (Mandl et al., 2019) contains 17,657 tweets in Hindi, German and English which are annotated on three levels: (1) whether they contain hate-offensive content or not; (2) in the case of hate-offensive tweets, whether a post contains hate, offensive, or profane content/words; (3) on the nature of the insult (targeted or un-targeted).
Detecting East Asian Prejudice on Social Media (DEAP) This is a collection of 20,000 tweets (Vidgen et al., 2020) focused on East Asian prejudice, e.g. Sinophobia, in relation to the COVID-19 pandemic. The annotators were asked to label each entry based on five different categories (hostility, criticism, counter speech, discussion, non-related) and also to indicate the target of the entry (e.g. Hong Kongers, China).
Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior (LSC) The dataset (Founta et al., 2018) consists of 80,000 tweets extracted using a boosted random sample technique. Each tweet is labelled as either offensive, abusive, hateful, aggressive, cyberbullying, or normal.

Unification
Even though all of the collected datasets revolve around hate speech, there are major differences among them in terms of both format and content. We unify the datasets by standardising their format and combining the available content into two settings: (1) binary hate speech classification and (2) a multiclass classification task that includes the target group. We note that, in cases where the original annotation results were provided, we decided to assign a label if at least two of the coders agree on it, not necessarily the majority of them. This approach can lead to a more realistic dataset and contribute to creating more robust systems (Antypas et al., 2022; Mohammad et al., 2018).
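The two-coder agreement rule above can be sketched as follows; this is a minimal illustration with hypothetical function and label names, not the authors' released code:

```python
from collections import Counter

def aggregate_label(annotations, min_votes=2):
    """Return the most frequent label if at least `min_votes` coders
    agree on it, otherwise None (no label assigned)."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label if votes >= min_votes else None

# With five coders, a label backed by two of them is kept even when
# it falls short of an absolute majority.
label = aggregate_label(["hate", "hate", "not-hate", "offensive", "other"])
```

Note that with five coders, a label supported by only two of them is retained as long as no other label is more frequent, which is precisely the relaxation of majority voting described above.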

Initial preprocessing
For each dataset collected, a simple preprocessing pipeline is applied. Firstly, any non-Twitter content is removed; despite the similarities between the content shared on various social media platforms (e.g. internet slang, emojis), Twitter displays unique characteristics, such as the concept of retweets and shorter texts, which differentiate it from other platforms such as Reddit or YouTube (Smith et al., 2012). Moreover, as our main focus is hate speech in the English language, we exclude any non-English subset of tweets and also verify the language using a fastText-based language identifier (Bojanowski et al., 2017). Finally, considering that some datasets in this study utilise similar keywords to extract tweets, we remove near-duplicate entries to avoid any overlap between them. This is accomplished by applying a normalisation step in which entries that are considered duplicates based on their lemmatised form are discarded. All URLs and mentions are also removed.
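A rough sketch of this pipeline is shown below. It is our own simplification: plain lowercasing stands in for the lemmatisation step, and the fastText language check is omitted to keep the example self-contained.

```python
import re

def normalise(text: str) -> str:
    """Strip URLs and @-mentions, lowercase, and collapse whitespace.
    (A simplified stand-in for lemmatisation-based normalisation.)"""
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"@\w+", "", text)           # remove user mentions
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(tweets):
    """Keep only the first occurrence of each normalised form."""
    seen, kept = set(), []
    for tweet in tweets:
        key = normalise(tweet)
        if key not in seen:
            seen.add(key)
            kept.append(tweet)
    return kept
```

In this sketch two tweets that differ only in URLs, mentions, or casing collapse to the same key and are treated as near-duplicates.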
As a final note, three of the datasets (HSHP, AYR, LSC) had to be rehydrated through the Twitter API, since only their tweet IDs and labels were publicly available. Unfortunately, a significant number of tweets (≈10,000) were no longer available through the API.

Binary Setting
The majority of the datasets collected are either already set up as a binary hate classification task, in which case no further preprocessing is applied (HTPO), or offer a more fine-grained classification of hate speech (e.g. HateX, CMS), in which case we consider all "hate" subclasses as one. In general, when a dataset focuses on a specific type of hate speech (e.g. sexism), we map it to hate speech. Notable exceptions are: (1) The MHS dataset, where a continuous hate score is provided, which we transform into a binary class according to the mapping proposed by the original authors.
(2) Datasets that consist of offensive speech but also provide information about the target of the tweet. In these cases (Offense), we consider only entries that are classified as offensive and target a group of people rather than individuals. Our assumption is that offensive language towards a group of people is highly likely to target protected characteristics and can thus be classified as hate speech.
(3) Finally, only entries classified as hate speech were considered in datasets where there is a clear distinction between hate, offensive, or profane speech (LSC, AHSD, HASOC). All data labelled as normal or not-hateful are also included as not-hate speech.

Multiclass Setting
Having established our binary setting, we aggregated the available datasets aiming to construct a more detailed hate speech classification task. As an initial step, all available hate speech subclasses were considered. However, this led to a very detailed but sparse hate taxonomy, with 44 different hate speech categories but only a few entries for some of the classes (e.g. the "economic" category with only four tweets present). Aiming to create an easy-to-use and extendable data resource, several categories were grouped together. All classes related to ethnicity (e.g. Arab, Hispanic) or immigration were grouped under racism, while religious categories (e.g. Muslim, Christian) were considered separately. Categories related to sexuality and sexual orientation (e.g. heterosexual, homosexual) were also grouped into one class, and tweets with topics regarding gender (men, women) constitute the sexism class. Finally, all entries labelled as "not-hate" speech were also included. To keep our dataset relatively balanced, we also ignored classes that constitute less than 1% of the total hate speech data. Overall, the proposed multiclass setting consists of 7 classes: Racism, Sexism, Disability, Sexual orientation, Religion, Other, and Not-Hate. It is worth noting that tweets falling under the Other class do not belong to any of the other five hate speech classes.
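The grouping can be pictured as a mapping from fine-grained source categories to the seven final classes. The mapping below is a small illustrative excerpt, not the paper's full 44-category table, and the fallback to "other" is our simplification (the paper instead drops classes below the 1% threshold):

```python
# Illustrative excerpt of a fine-grained -> coarse category mapping;
# the category names on the left are examples from the text above.
CATEGORY_MAP = {
    "arab": "racism",
    "hispanic": "racism",
    "immigration": "racism",
    "muslim": "religion",
    "christian": "religion",
    "homosexual": "sexual_orientation",
    "heterosexual": "sexual_orientation",
    "women": "sexism",
    "men": "sexism",
    "disability": "disability",
}

def coarse_label(fine_label: str) -> str:
    """Map a fine-grained hate category to one of the final classes,
    falling back to "other" for anything unmapped (our simplification)."""
    return CATEGORY_MAP.get(fine_label.lower(), "other")
```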

Statistics and Data Splits
In total, we collected 83,230 tweets from 13 different datasets (Table 1), of which only 33% are classified as hate speech. This unified dataset may seem imbalanced, but it is commonly assumed that only around 1% of the content shared on social media contains hate speech (Pereira-Kohatsu et al., 2019). When considering the multiclass setting, the hate speech percentage decreases even further, with only 26% of tweets labelled as a form of hate speech, and with the religion class being the least popular at only 709 entries.
The data in both settings (binary & multiclass) are divided into train and test sets using a stratified split to ensure class balance between the splits (Table 2). In general, for each dataset present, we allocate 70% as training data, 10% as validation, and 20% as test data. The exceptions to this approach are datasets where the authors provide a pre-existing data split, which we use.
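A minimal sketch of the 70/10/20 stratified split is shown below, assuming one shuffles within each class so that class proportions are preserved across splits (the paper's exact splitting procedure may differ in detail):

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, ratios=(0.7, 0.1, 0.2), seed=42):
    """Split examples into train/validation/test sets, preserving the
    per-class proportions by splitting each class independently."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in zip(examples, labels):
        by_class[label].append(example)
    train, val, test = [], [], []
    for items in by_class.values():
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

Because each class is partitioned separately, a 30%/70% label distribution in the full data carries over to each of the three splits.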

Evaluation
We present our main experimental results comparing various language models trained on single datasets and on the unified dataset presented in the previous section.

Experimental Setting
Models. For our experiments we rely on four language models of a similar size, two of them general-purpose and the other two specialised on social media: BERT-base (Devlin et al., 2019) and RoBERTa-base (Liu et al., 2019) as general-purpose models; and BERTweet (Nguyen et al., 2020) and TimeLMs-21 (Loureiro et al., 2022) as language models specialised on social media, and particularly Twitter. There is an important difference between BERTweet and TimeLMs-21: while BERTweet was trained from scratch, TimeLMs-21 used the RoBERTa-base checkpoint as initialisation and then continued training on a Twitter corpus. An SVM classifier is also utilised as a baseline model.
Settings. Aiming to investigate the effect of a larger and more diverse hate speech training corpus on various types of hate speech, we perform an evaluation on both the binary and multiclass settings described in Section 3.2. Specifically, for the binary setting we fine-tune the selected models first on each individual dataset, and then on the unified dataset. For the multiclass setting, we considered the unified dataset and the HateX dataset, which includes data for all classes. In total, we fine-tuned 54 different binary and 8 multiclass models.3

3 The MMHS dataset was used only for the training/evaluation of the unified dataset as it lacks the not-hate class.

Training. The implementations provided by Hugging Face (Wolf et al., 2020) are used to train and evaluate all language models, while we utilise Ray Tune (Liaw et al., 2018) along with HyperOpt (Bergstra et al., 2022) and Adaptive Successive Halving (Li et al., 2018) to optimise the learning rate, warmup steps, number of epochs, and batch size hyper-parameters of each model.

Evaluation metrics. The macro-averaged F1 score is reported and used to compare the performance of the different models. Macro-F1 is commonly used in similar tasks (Basile et al., 2019; Zampieri et al., 2020) as it weighs performance on each class equally, providing a more complete view of each model's performance.
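As a reference for the evaluation metric, macro-F1 averages the per-class F1 scores with equal weight per class, regardless of class frequency. The sketch below is a plain-Python rendering, equivalent in spirit to scikit-learn's f1_score with average="macro":

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the
    unweighted mean, so rare classes count as much as frequent ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

A classifier that always predicts the majority class scores 0 on every minority class, which drags macro-F1 down far more than it would accuracy or micro-F1.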

Datasets
For training and evaluation, we use the splits described in Section 3.2.4. As described above, each language model is trained on each dataset-specific training set independently, and on the combination of all dataset-specific training sets. The results on the combination of all datasets are averaged across each dataset-specific test set (AVG), i.e., each dataset is given the same weight irrespective of its size. In addition to the datasets presented in Section 3.1, we constructed an independent test set (Indep) to test the robustness of models beyond existing datasets.
Independent test set (Indep). This dataset was built by utilising a set of keywords related to International Women's Day and the International Day Against Homophobia, Transphobia and Biphobia, and extracting tweets from the respective days of 2022. These tweets were then manually annotated by an expert. In total, 200 tweets were annotated as hateful, not-hateful, or "NA" in cases where the annotator was not sure whether a tweet contained hate speech or not. The Indep test set consists of 151 non-hate and 20 hate tweets and, due to its nature (specific content & expert annotation), can be leveraged to perform a targeted evaluation of models trained on similar and unrelated data. While we acknowledge the limitations of the Indep test set (i.e., the relatively small number of tweets and the single annotator), our aim is to use these tweets, collected using relatively simple guidelines 5 , to test the overall generalisation ability of our models and how it aligns with what people think of hate speech.
5 Annotator guidelines are available in Appendix A.

Binary Setting
Table 3 displays the macro-F1 scores achieved by the models across all test sets when fine-tuned: (1) on all available datasets (All), (2) on the single dataset that yields the best overall performing model, and (3) on a balanced sample of the unified dataset of the same size as (2). When looking at the average performance of (1) and (2), it is clear that all models perform considerably better overall when utilising the combined data. This increased performance may not be achieved across all the datasets tested, but it does provide evidence that the relatively limited scope of the individual datasets hinders the potential capabilities of our models. An even starker contrast is observed when considering the performance difference on the DEAP subset, which deals with a less common type of hate speech (prejudice towards East Asian people): even the best performing single-dataset model achieves barely 19.79% F1, compared to 49.27% F1 for the worst combined classifier (BERT HTPO vs BERT All).
To further explore the importance of the size and diversity of the training data, we train and evaluate our models in an additional setting. Considering the sample size of the best performing dataset for each model, an equally sized training set is extracted from all available data while enforcing a balanced distribution between hate and not-hate tweets (All*). We also make sure to sample proportionally across the available datasets. The results (Table 3) reveal the significance of a diverse dataset for the models' performance. All models tested perform on average better when trained on the newly created subsets (All*) than the respective models trained only on the best performing individual dataset. Interestingly, this setting also achieves the best overall scores on the Indep. set, which reinforces the importance of balancing the data. Nonetheless, all the transformer models still achieve their best score when trained on all the combined datasets (All), which suggests that even for these models, the amount of available training data remains an important factor in their performance.
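The All* construction can be approximated as follows. This sketch only enforces the 50/50 hate vs not-hate balance and omits the proportional sampling across source datasets described above; function and label names are ours.

```python
import random

def balanced_sample(entries, n_total, seed=0):
    """Draw a sample of n_total (text, label) pairs with an equal
    number of "hate" and "not-hate" entries."""
    rng = random.Random(seed)
    per_label = n_total // 2
    sample = []
    for target_label in ("hate", "not-hate"):
        pool = [e for e in entries if e[1] == target_label]
        sample += rng.sample(pool, per_label)
    rng.shuffle(sample)
    return sample
```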

Multiclass Setting
Similarly to our binary setting, utilising the combined datasets in the multiclass setting enhances the models' performance. As can be observed from Table 4, all the models struggle to perform at a satisfactory level when trained on the HateX subset only. In particular, when looking at the "disability" class, none of the models manage to classify any of the entries correctly. This occurs even though "disability" entries exist in the HateX training subset, albeit in a limited number (21). This behaviour suggests that even when information about a class is available in the training data, language models may fail to distinguish and utilise it. Imbalanced datasets are a common challenge in machine learning applications. The issue is also present in hate speech, in this case exacerbated by the nature of the problem (including a potentially large overlap of features between classes) and the lack of available resources.

Analysis
In this section, we dissect the results presented in the previous section by performing a cross-dataset comparison and a qualitative error analysis.

Cross-dataset Analysis
Figure 1 presents a cross-dataset comparison of the language models used for the evaluation. The heatmap presents the results of the models fine-tuned and tested on all dataset pair combinations. All models evaluated tend to perform better when they are trained and tested on the same subset (the diagonal of the heatmaps). Even when we evaluate models on similar subsets, they tend to display a deterioration in performance. For example, both the CMS and AYR datasets deal with sexism, but the models trained only on CMS perform poorly when evaluated on AYR (e.g. BERTweet-CMS achieves 87% F1 on CMS, but only 52% on AYR). Finally, it is observable again that the models trained on the combined datasets (column "all") display the best overall performance and attain consistently high results on each individual test set. When analysing the difficulty of each individual dataset when used as a test set, DEAP is clearly the most challenging one overall. This may be due to the scope of the dataset, dealing with East Asian prejudice during the COVID-19 pandemic, which is probably not well captured in the rest of the datasets. When used as training sets, none of the individual datasets is widely generalisable, with the results of the models fine-tuned on them being over 10 points lower than when fine-tuned on the unified dataset in all cases.

Qualitative Error Analysis
Aiming to better understand the models' results, we perform a qualitative analysis focusing on entries misclassified by our best performing model, TimeLMs-All.
Multiclass. When considering the multiclass setting, common errors are tweets that have been labelled as hateful, e.g. "U right, probably some old n*gga named Clyde" is labelled as racism and "@user @user she not a historian a jihadi is the correct term" as religion, but the model classifies them as not-hate. However, depending on the context, and without access to additional information (the author or target of the tweet), these entries may not actually be hateful.

Binary. In the binary setting, the model seems to struggle with entries such as "Meanwhile in Spain.. #stopimmigration" and "This is outrageous. Congress should be fired on the spot. #BuildThatWall #stopwastingmytaxdollars", where both entries are classified as hate but are labelled as not-hate. Similarly to the previous case, classifying such tweets without additional context is a difficult task. While these tweets have hateful undertones, they may not necessarily be hate speech when considered in their broader context.
Finally, when looking at the classification errors of TimeLMs-AYR (trained only on sexist and racist tweets), the need for diverse training data becomes apparent. For example, TimeLMs-AYR fails to classify as hate speech the tweets "@user that r*tarded guy should not be a reporter" and "I'm going to sell my iPhone and both my Macs, I don't support f*ggots.", in contrast to TimeLMs-All, which classifies both tweets correctly as hateful.

Conclusion
In this paper, we presented a large-scale analysis of hate speech detection systems based on language models. In particular, our goal was to show the divergences across datasets and the importance of having access to a diverse and complete training set. Our results show how the combination of datasets makes for a robust model that performs competitively across all datasets. This is not a surprising finding given the size of the corresponding training sets, but the considerable gap (e.g. 70.7% vs 61.0% Macro-F1 for the best-performing TimeLMs-21 model) shows that models trained on single datasets have considerable room for improvement. Moreover, even when controlling for data size, training a model on a diverse set instead of a single dataset leads to better overall results.
As future work, we are planning to extend this analysis beyond English, in line with previous multilingual approaches (Ousidhoum et al., 2019; Chiril et al., 2019; Bigoulaeva et al., 2021), and beyond masked language models by including, among others, generative and instruction-tuned language models. In addition to the extensive binary-level evaluation, recognising the target group is a challenging area of research. While we provided some encouraging results in Section 4.3.2, they could be expanded with a unified taxonomy.

Ethics Statement
Our work aims to contribute to and extend research on hate speech detection in social media, in particular on Twitter. We believe that our efforts contribute to addressing the ongoing concerns about the spread of hate speech on social media.
We acknowledge the importance of the ACM Code of Ethics and are committed to following its guidelines. Our current work uses publicly available tweets under open licences and does not infringe any of the rules of Twitter's API. Moreover, given that our task involves user-generated content, we are committed to respecting the privacy of users by replacing each user mention in the texts with a placeholder.

Limitations
In this paper, we have focused on existing datasets and a unification stemming from their features. The decisions taken in this unification, particularly in the selection of datasets and target groups, may influence the results of the paper.
We have focused on social media (particularly Twitter) and on the English language. While there has been extensive work on this medium and language, the conclusions that we can draw from this study may be limited, as the detection of hate speech involves other areas, domains, and languages. In general, we studied a particular aspect of hate speech detection which may or may not be generalisable.
Finally, due to computational limitations, all our experiments are based on base-sized language models. It is likely that larger models, while exhibiting similar behaviours, would lead to better results overall.

A Annotation Guidelines
In the following we present the guidelines provided to the annotator for the independent test set (Section 4.2).
• labelled as "0" ("not-hate-speech") if it does not contain hate speech as defined above.
• labelled "NA" if the coder is not sure whether the tweet contains hate speech or not.
The annotation should be based only on the text content of the tweet. This means that the coder should not follow any URL/media links if present.

Figure 1 :
Figure 1: Macro-averaged F1 score for each dataset/model combination. The X axis indicates the dataset on which the model was trained, while the Y axis indicates the test set used to evaluate it. AVERAGE indicates the result obtained by averaging across all datasets, and all represents the aggregated training set including all datasets.

Table 1 :
Distribution of tweets gathered across hate speech datasets, including those where the target information is available (multiclass).

Table 3 :
Macro-averaged F1 scores across all hate speech test sets and our manually annotated set (Indep). For each model, the table includes: (1) the performance of the model trained on all the datasets (All); (2) the performance of the model when trained on a balanced sample of all datasets of the same size as the best single-dataset baseline (All*); and (3) the best overall performing model trained on a single dataset (BERTweet: MHS, TimeLMs: AYR, RoBERTa: AYR, BERT: HTPO, SVM: MHS). The best result for each dataset and model is bolded.

Table 4 :
F1 score for each class in the multiclass setting when trained on all the datasets (All) and when trained only on HateX. The macro-averaged F1 (AVG) is also reported.

Table 5 :
Best hyper-parameters for each of the models trained on the combined datasets for the binary and multiclass settings.