Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study

With the ever-growing presence of social media platforms comes the increased spread of harmful content and the need for robust hate speech detection systems. Such systems easily overfit to specific targets and keywords, and evaluating them without considering distribution shifts that might occur between train and test data overestimates their benefit. We challenge hate speech models via new train-test splits of existing datasets that rely on the clustering of models’ hidden representations. We present two split variants (Subset-Sum-Split and Closest-Split) that, when applied to two datasets using four pretrained models, reveal how models catastrophically fail on blind spots in the latent space. This result generalises when developing a split with one model and evaluating it on another. Our analysis suggests that there is no clear surface-level property of the data split that correlates with the decreased performance, which underscores that task difficulty is not always humanly interpretable. We recommend incorporating latent feature-based splits in model development and release two splits via the GenBench benchmark.


Introduction
Developing generalisable hate speech detection systems is of utmost importance due to the environment in which they are deployed. Social media usage is rapidly increasing, and the detection of harmful content is challenged by non-standard language use, implicitly expressed hatred, a lack of consensus on what constitutes hateful content, and a lack of high-quality training data (Yin and Zubiaga, 2021a). When developing hate speech detection models in the lab, it is therefore vital to simulate evaluation scenarios requiring models to generalise outside the training context. 'In the wild', NLP models may encounter text from different periods (Lazaridou et al., 2021), authors (Huang and Paul, 2019) or dialects (Ziems et al., 2022), including unseen words (Elangovan et al., 2021) and words whose spelling changed or was obfuscated (Serra et al., 2017). Performing successfully on such data despite distributional changes is called out-of-distribution (o.o.d.) generalisation.

Figure 1: A UMAP projection of BERT's representations, showing the proposed train-test split, which is constructed by grouping clusters in the latent space.
How can the ability to generalise best be measured? Despite recent work illustrating that i.i.d. testing does not adequately reflect models' generalisability (e.g. Søgaard et al., 2021), evaluation using randomly sampled test sets is still the status quo (Rajpurkar et al., 2016; Wang et al., 2018, 2019; Muennighoff et al., 2023). Potentially, this is because obtaining and annotating new data is expensive, and it is hard to define what o.o.d. data is (Arora et al., 2021). For humans, properties like input length (Varis and Bojar, 2021) or spelling mistakes (Ebrahimi et al., 2018) might determine difficulty, but this need not be the same for models. Evaluating models using a notion of model-dependent difficulty is gaining some traction (e.g. Godbole and Jia, 2022) but still remains largely unexplored.
Contributing to that line of work, we propose a method that reuses existing datasets but splits them in a new way by relying on models' latent features.
We cluster hidden representations using k-means and distribute clusters over the train and test set to create a data split. An illustrative example of such a split is shown in Fig. 1. We present two variants (SUBSET-SUM-SPLIT and CLOSEST-SPLIT). While this method is in principle applicable to any classification problem, we experiment with four language models and two hate speech datasets (that include Reddit, Twitter and Gab data). The results suggest that these splits approximate worst-case performance. Models fail catastrophically on the new test sets, while their performance on independent test data is on par with other systems trained on i.i.d. training sets. The difficulty is relatively stable across different models. We analyse the data splits through correlation analyses, and do not find one clear surface-level property of the data split to be predictive of split difficulty. This underscores that model-based difficulty can be quite elusive. We release two of our data splits for inclusion in the GenBench benchmark.
The remainder of this work is structured as follows: Section 2 elaborates on related work, followed by the introduction of the hate speech datasets (Section 3) and the proposed splitting method (Section 4). Section 5 presents model evaluation results, Section 6 analyses the splits in detail, and we conclude in Section 7. The GenBench eval card can be found in Appendix A.

Related Work
This section discusses related work on o.o.d. generalisation evaluation (Section 2.1), followed by a discussion of why generalisation is a persisting challenge in hate speech detection (Section 2.2).

Generalisation evaluation
It is now well-established within NLP that models with high or even human-like scores (e.g. Chowdhery et al., 2022) on i.i.d. splits do not generalise as robustly as the results would suggest. This has been demonstrated using synthetic data (i.a. Lake and Baroni, 2018; McCoy et al., 2019; Kim and Linzen, 2020) and for natural language tasks (i.a. Sinha et al., 2021; Søgaard et al., 2021; Razeghi et al., 2022). Alternative methods of evaluation have become more prominent, such as testing with different domains (e.g. Tan et al., 2019; Kamath et al., 2020; Yang et al., 2022) and adversarial testing, using both human-written (Kiela et al., 2021) and automatically generated adversarial examples (e.g. Zhang et al., 2020; Chen et al., 2019; Gururangan et al., 2018; Ebrahimi et al., 2018).
However, these types of evaluation require collecting or creating new data points, which is not always feasible for datasets that have been in use for years. Re-splitting existing datasets in a non-i.i.d. manner makes more efficient use of existing datasets, and, accordingly, new data splits have been developed that typically use a feature of the input or the output to separate train from test examples. Splits that rely on the input use, for example, word overlap (Elangovan et al., 2021), linguistic structures (Søgaard, 2020), the timestamp (Lazaridou et al., 2021), or the context of words in the data (Keysers et al., 2019) to generate a split. Similarly, Broscheit et al. (2022) maximise the Wasserstein distances of train and test examples. Alternatively, one can evaluate generalisation using output-based non-i.i.d. splits: Naik et al. (2018) analyse the predictions of a model to find challenging phenomena, and Godbole and Jia (2022) re-split a dataset based on the predicted log-likelihood for each example.
The splitting method we propose relies neither on the discrete input tokens nor the output, but instead uses the internal representations of finetuned models.

Hate speech detection
With the rise of social media platforms, hate speech detection gained traction as a computational task (Jahan and Oussalah, 2023), leading to a wide range of benchmark datasets. Most of these datasets rely on data from social media platforms, such as Reddit (Qian et al., 2019; Vidgen et al., 2021), Twitter (ElSherief et al., 2021), Gab (Qian et al., 2019; Mathew et al., 2020), or Stormfront (de Gibert et al., 2018). This work is restricted to hate speech classification using a Reddit dataset (Qian et al., 2019) and a Twitter and Gab dataset (Mathew et al., 2020), which we will elaborate on in Section 3.
Augmenting datasets or evaluating whether a model overfits to particular users or data sources requires annotated data.However, these characteristics are often unavailable due to privacy requirements or because the annotations were not included in the dataset release.Therefore, this work aims to find a data split that can evaluate generalisation without such annotations, relying instead only on a model's internal representations.

Data
We develop and evaluate our splitting method using the following two hate speech datasets.

Reddit
We use a widely used topic-generic Reddit dataset, proposed by Qian et al. (2019). The dataset includes 22,317 examples, each labelled as either hate (23.5%) or noHate (76.5%). The dataset was collected from ten different subreddits by retrieving potential hate speech posts using hate keywords taken from ElSherief et al. (2018). The hate keywords correspond roughly to the following categories: archaic, class, disability, ethnicity, gender, nationality, religion, and sexual orientation. The data is structured in conversations that consist of at most 20 comments by the same or different authors. These comments were manually annotated with hate or noHate, with each annotator assigned five conversations.

HateXplain
The second dataset is HateXplain (Mathew et al., 2020), which is also topic-generic and widely used. It contains 20,148 examples from Twitter and Gab. Posts from the combined collection were filtered based on a lexicon of hate keywords and phrases by Davidson et al. (2017); Mathew et al. (2019); Ousidhoum et al. (2019). The selected posts were then manually annotated. HateXplain examples are labelled as either hateful (31%), offensive (29%) or normal (40%), as proposed by Davidson et al. (2017). Offensive speech differs from hate speech in that it uses offensive terms without directing them against any person or group in particular. All offensive and hate examples are annotated with the community that they target. These communities include, among others, Africans, Jewish People, Homosexuals and Women, and we use them for further analysis of our data splits in Section 6.

Methodology
Our proposed splitting strategy, for which we introduce two variants, is detailed in Section 4.1. We evaluate our splits through comparisons to a random splitting baseline and on external test sets. We discuss the corresponding experimental setups in Section 4.2.

Constructing Data Splits
The construction of the data splits involves three steps, which are depicted in Fig. 2. In step 1, the method extracts the latent representations of inputs from a language model that was finetuned on the task using one of the hate speech datasets mentioned above. In step 2, the data is clustered based on these representations, and clusters are assigned to either the train or the test set. In step 3, language models are then trained and evaluated on this new split. In addition to the obtained test set, the language models are also evaluated on independent test data that was set aside for this purpose. The key idea behind the approach is that language models implicitly capture salient features of the input in their hidden representations, where inputs with similar properties are close together (Thompson and Mimno, 2020; Grootendorst, 2022). Assigning clusters to the train and test set thus accomplishes separation based on latent features, and by finetuning we ensure that the clusters separate examples based on task-specific features.
Obtaining Hidden Representations We finetune a language model for the given task, using the independent test data as validation set to optimise hyperparameters. We then obtain latent representations for each input example, leveraging the representation of the [CLS] token after the final layer as a representation of the input, as is commonly done (e.g. May et al., 2019; Qiao et al., 2019).
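In code, extracting the [CLS] representation amounts to selecting position 0 of the final-layer hidden states. A minimal numpy sketch of that slice (with a Hugging Face-style encoder, the same indexing would be applied to `outputs.last_hidden_state`; the toy batch here is purely illustrative):

```python
import numpy as np

def cls_representations(last_hidden_states: np.ndarray) -> np.ndarray:
    """Select the [CLS] vector (position 0) from each sequence.

    `last_hidden_states` has shape (batch, seq_len, hidden_dim), as
    returned by encoder models such as BERT.
    """
    return last_hidden_states[:, 0, :]

# Toy batch: 2 sequences, 4 tokens each, hidden size 3.
batch = np.arange(24, dtype=float).reshape(2, 4, 3)
reps = cls_representations(batch)  # shape (2, 3)
```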
Since distance metrics fail to accurately capture the concept of proximity for high-dimensional data (Beyer et al., 1999; Aggarwal et al., 2001) and tend to overly rely on individual dimensions (Timkey and van Schijndel, 2021), we conduct experiments with both low-dimensional and full-dimensional representations. To this end, we either project the full representations into d_U-dimensional spaces using UMAP post-training (McInnes et al., 2020), or obtain d_B-dimensional representations by introducing a bottleneck in the model between the last hidden layer and the classification layer. The bottleneck is a linear layer that compresses the hidden representations, forcing the model to encode the most salient latent features into a low-dimensional space before classifying the examples.
Clustering and Splitting the Data Each representation from step 1 gives the position of an input example in the latent space. The examples are clustered in this space using the k-means algorithm (Lloyd, 1982).
Hyperparameters of the k-means clustering can be found in Table 3. After clustering, each cluster is assigned to either the train or the test set, subject to two constraints: the test set has a fixed size (we choose 10% of the data), and the train and test sets must have equal class distributions. Without equal class distributions, it would be unclear whether changes in performance are due to the increased difficulty of the test set or to changes in label imbalance. A partition of the dataset that fulfils these constraints will be referred to as the target in this work.
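The two constraints can be sketched as a simple validity check on a candidate test set: it must cover the desired fraction of the data and reproduce the overall class distribution. The tolerance below is a hypothetical choice, as the paper does not specify one:

```python
from collections import Counter

def satisfies_target(labels, test_idx, test_ratio=0.10, tol=0.02):
    """Check the two split constraints: a fixed test fraction and
    (approximately) equal class distributions in train and test."""
    n = len(labels)
    test_idx = set(test_idx)
    # Constraint 1: test set covers roughly `test_ratio` of the data.
    if abs(len(test_idx) / n - test_ratio) > tol:
        return False
    # Constraint 2: per-class ratios in the test set match the overall ones.
    test_counts = Counter(labels[i] for i in test_idx)
    for cls, total in Counter(labels).items():
        overall = total / n
        in_test = test_counts.get(cls, 0) / len(test_idx)
        if abs(in_test - overall) > tol:
            return False
    return True

labels = ["hate"] * 20 + ["noHate"] * 80
good = list(range(2)) + list(range(20, 28))   # 2 hate + 8 noHate = 10%
bad = list(range(10))                         # 10% of the data, but all hate
```

A candidate like `good` preserves the 20/80 label balance, whereas `bad` has the right size but a skewed class distribution and is rejected.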
To reach the target test set, two algorithms, SUBSET-SUM-SPLIT and CLOSEST-SPLIT, are designed to decide how to split the clusters.Both algorithms lead to an under-representation of parts of the latent space in the model's training set, but whilst SUBSET-SUM-SPLIT might under-represent smaller, potentially distant pockets of the latent space, CLOSEST-SPLIT under-represents a single connected region.The algorithms are explained in detail below.
Method 1: SUBSET-SUM-SPLIT The constraints on the class and test ratios explained above, and the additional constraint of keeping whole clusters together, can be described by the Subset Sum Problem (Kellerer et al., 2004). In this setting, the Subset Sum Problem is modified into a multidimensional one: the multidimensional target consists of the number of desired test examples for each class in the dataset. The task is then to select a subset of the clusters such that the number of examples for each class sums up to the desired target. To improve the chances of reaching the desired target, the Subset Sum Problem is solved for k = 3 to k = 50 clusters, and the solution closest to the desired target using the smallest k is taken as the test set. If the closest solution does not match the exact target sum, examples from another randomly selected cluster are used to complete the test set. Note that the clusters in the test set do not necessarily lie close to each other in the latent space, as this is not a constraint for this algorithm.
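A brute-force sketch of this multidimensional subset selection, taking per-cluster class counts as input. This is illustrative only: exhaustive search is feasible for small numbers of clusters, and the authors' exact solver is not specified here.

```python
from itertools import combinations

def subset_sum_split(cluster_counts, target):
    """Pick the subset of clusters whose summed per-class counts lie
    closest (in L1 distance) to `target`.

    `cluster_counts`: one tuple of class counts per cluster, e.g.
    (n_hate, n_noHate). `target`: desired per-class test-set counts.
    """
    best, best_dist = None, float("inf")
    ids = range(len(cluster_counts))
    for r in range(1, len(cluster_counts) + 1):
        for subset in combinations(ids, r):
            sums = tuple(sum(cluster_counts[i][c] for i in subset)
                         for c in range(len(target)))
            dist = sum(abs(s - t) for s, t in zip(sums, target))
            if dist < best_dist:
                best, best_dist = subset, dist
    return best, best_dist

counts = [(5, 10), (3, 7), (2, 3)]
subset, dist = subset_sum_split(counts, (5, 10))    # cluster 0 alone
subset2, dist2 = subset_sum_split(counts, (8, 17))  # clusters 0 and 1
```

When the best distance is non-zero, the remainder would be topped up with examples from a randomly selected cluster, as described above.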
Method 2: CLOSEST-SPLIT In contrast to the SUBSET-SUM-SPLIT, the CLOSEST-SPLIT aims to put as much distance as possible between the train and test clusters. This leads to an even bigger under-representation of parts of the latent space in the training set. Once the clusters have been computed, their centroids are calculated. The cluster that lies farthest away from all the other clusters is identified and added to the test set. If the size of the farthest cluster exceeds the target test set size, the next farthest cluster is taken instead. Cosine similarity between cluster centroids is used as the distance measure. Then nearest-neighbour clustering with the cluster centroids is performed, as long as the size of the test set does not exceed the target size. When this nearest-neighbour clustering is finished, individual examples that are closest to one of the test set centroids are added to the test set until the target size is reached. As for the SUBSET-SUM-SPLIT, the algorithm is performed for k = 3 to k = 50 clusters. k is selected such that the number of individual examples added is minimised.
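The cluster-level part of this procedure can be sketched as follows, assuming centroids and cluster sizes are given; the final top-up with individual examples is omitted. This is an illustrative sketch, not the authors' exact implementation:

```python
import numpy as np

def closest_split(centroids, sizes, target_size):
    """Greedy cluster-level sketch of CLOSEST-SPLIT.

    Seed the test set with the centroid farthest (by cosine distance)
    from all others, then repeatedly add the nearest remaining cluster
    while the test set stays within `target_size`.
    """
    C = np.asarray(centroids, dtype=float)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    dist = 1.0 - C @ C.T                      # pairwise cosine distances
    order = np.argsort(-dist.sum(axis=1))     # farthest-first candidates
    seed = next(int(i) for i in order if sizes[i] <= target_size)
    test, total = [seed], sizes[seed]
    remaining = set(range(len(sizes))) - {seed}
    while remaining:
        # Nearest remaining cluster to any centroid already in the test set.
        nxt = min(remaining, key=lambda j: min(dist[j, t] for t in test))
        if total + sizes[nxt] > target_size:
            break
        test.append(nxt)
        total += sizes[nxt]
        remaining.remove(nxt)
    return test, total

# Four toy centroids: two close together, one orthogonal, one opposite.
test_clusters, n = closest_split(
    [[1, 0], [0.99, 0.14], [0, 1], [-1, 0]], [5, 5, 5, 5], 10)
```

On this toy input the isolated centroid `[-1, 0]` seeds the test set and its nearest neighbour `[0, 1]` is added, after which the size budget is exhausted.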
Model Evaluation Having obtained data splits based on four language models and hidden dimensions of different sizes, the first way of evaluating models is by finetuning the language models on their respective SUBSET-SUM-SPLIT and CLOSEST-SPLIT. The hyperparameters used for finetuning are listed in Table 4, Appendix B, and we estimate d_U and d_B by varying their values for the Reddit dataset. We compare the results obtained with the proposed data splits to a baseline split, which takes the same examples but splits them randomly while maintaining class proportions. Random splits are generated using three different seeds, and the proposed data splits are obtained with three different clustering seeds. For each data split involved, the models are trained with three seeds that determine the classifier's initialisation and the presentation order of the data. The results are averaged accordingly.
The evaluation metrics are accuracy and F1-scores. For the Reddit dataset, the F1-score is the score of the hate class, whereas for HateXplain, the F1-score is macro-averaged over the three classes.
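A minimal sketch of the two F1 variants (hate-class F1 for Reddit, macro-averaged F1 for HateXplain); in practice a library such as scikit-learn would compute the same scores:

```python
def f1_per_class(y_true, y_pred, cls):
    """F1-score of a single class from precision and recall."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)

y_true = ["hate", "hate", "noHate", "noHate"]
y_pred = ["hate", "noHate", "noHate", "noHate"]
hate_f1 = f1_per_class(y_true, y_pred, "hate")  # 2/3
avg_f1 = macro_f1(y_true, y_pred)               # (2/3 + 0.8) / 2
```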
To better understand the robustness of the results, we perform an additional set of experiments on the most challenging data splits observed, to answer the following questions:

1. Is split difficulty driven by the input or by task-specific latent features? For the Reddit data, we split the dataset based on task-agnostic hidden representations obtained from pretrained models to analyse whether task-specific representations (i.e. representations finetuned on the task) are needed to create challenging data splits.

2. Do models trained on new splits perform on par with conventional models on independent data? Using HateXplain, we test the finetuned models on the independent test data that was set aside earlier to ensure that the newly obtained train data is still informative enough for test data sampled according to the original distribution.

3. Is the difficulty of the data splits model-independent? We also examine whether a split obtained by the hidden representations of a specific model is also challenging for other models, using HateXplain data.

Results
We now turn to evaluating models' performance on our newly proposed splits.

Performance on Challenging Splits
We compare the performance of models trained on a random split to models trained on the CLOSEST-SPLIT and SUBSET-SUM-SPLIT. The random split performances are presented in Table 1. For the binary Reddit dataset, performance on random splits is high for all four models, with F1-scores for the hate class of around 82%. The performance on the three-way HateXplain dataset is comparably lower, with macro F1-scores of around 65%. For both datasets, these results are on par with (or surpass) baselines from prior work, upon which we elaborate in Appendix D.1. In addition to varying the dimensionalities, we consider using the models' pretrained representations (without further finetuning) to examine whether the latent features must be task-specific to challenge our models. Task-specific representations are, indeed, vital, as is shown in Fig. 8, Appendix D.2.

New Data Splits Reveal Catastrophic Failure
Both SUBSET-SUM-SPLIT and CLOSEST-SPLIT lead to an under-representation of parts of the latent space in the model's training set, and we hypothesised that this leads to a challenging data split. Indeed, the empirical results show significant performance drops when training models on these splits in comparison to random splits.

Fig. 3a shows the performance drops for the Reddit dataset. For the SUBSET-SUM-SPLIT, F1-scores for the hate class drop significantly for all four models, but with a high variation between different cluster seeds. For the CLOSEST-SPLIT, test set performance drops even further and more consistently, without much variation between cluster seeds: F1-scores for the hate class are mostly between 0 and 25%. Fig. 3b displays performances for HateXplain, which similarly shows a drop in performance for SUBSET-SUM-SPLIT and CLOSEST-SPLIT. CLOSEST-SPLIT leads to F1-scores that are on par with or below random guessing, resulting from drops of around 36%.

Overall, the CLOSEST-SPLIT is more challenging than the SUBSET-SUM-SPLIT. Moreover, the bottleneck-based splits generally lead to the most stable results, i.e., the variance between different cluster seeds is the lowest. In some cases performance drops below the random guessing baseline; this happens when a model fails to predict some class completely, defaulting instead to one of the other classes. In summary, the new splits lead to drastic performance drops for both datasets and across all four models.

Independent Test Set Performance
We now take the most challenging split observed (CLOSEST-SPLIT with d_B = 50) and further analyse the behaviour of models trained on this split for the HateXplain dataset, which is the most widely used dataset as well as the most challenging one.
From the results in Section 5.1 it is clear that CLOSEST-SPLIT reveals weaknesses in these models, since the models struggle to generalise to the split's test data. The question remains whether the test set obtained by the new splitting method is harder, or whether the method leads to very simple or perhaps even incomplete training sets, thereby preventing the models from learning the task. To this end, we evaluate the models trained on the training data obtained from a CLOSEST-SPLIT on the 10% independent test data that was set aside earlier (Section 4.1). The results show that models achieve similar performance on the independent test data as the models trained and tested on random data, strengthening the hypothesis that CLOSEST-SPLIT training data is informative enough to learn the task. Results for these experiments are reported in Fig. 4.

Cross-Model Generalisation
The previous results have shown that CLOSEST-SPLIT leads to challenging test sets. To show the robustness of these splits, we also examine whether these test sets are generally difficult or only for the model used to develop the split, i.e. we examine cross-model generalisation. The results of the cross-model evaluations can be seen in Fig. 5. They show that data splits developed using one model are indeed also challenging for other models, although the personalised splits are slightly more challenging. These results do not only strengthen the robustness of the challenging data split, but also have practical implications: the data-splitting pipeline only needs to be carried out with one model, and multiple models can be assessed and compared with the same split. (The validation accuracy for the models trained on CLOSEST-SPLIT is for most splits around 5 points higher than the accuracy on the validation set of the random data split, i.e. the models perform normally during training as suggested by the validation data.)

Analysis
The performance of models deteriorates heavily when using the proposed splits. This section analyses the generated splits: first examining the surface-level properties of the resulting train and test sets, and then taking a closer look at two specific splits by visualising the data points in the train and test sets. Additionally, an analysis of the topics in the train and test sets can be found in Appendix E.2.

Correlation Analysis: Relating Splits' Features to Performance Drop
For the most challenging split variant, CLOSEST-SPLIT, we investigate the correlation between performance drops compared to the random splits (including three random splits with a drop of 0) and surface-level properties of the data split. The implementation of these properties is explained in detail in Appendix E.1. We firstly consider task-agnostic features: 1) the unigram overlap between the train and test set, 2) the input length in the test set, and 3) the number of rare words in the test set. Secondly, task-specific properties are computed: 1) the number of under-represented hate keywords from the lists used by the datasets' creators (see Section 3), 2) the number of under-represented target communities retrieved from the HateXplain annotations, and 3) a quantification of the distributional shift of data sources (Twitter and Gab are present in HateXplain) in the train and test set, using the Kullback-Leibler divergence of token distributions (Kullback and Leibler, 1951).
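The last property can be sketched as a KL divergence between the unigram token distributions of two text collections. The epsilon smoothing below is an assumption, as the exact smoothing scheme is not specified:

```python
import math
from collections import Counter

def kl_divergence(tokens_p, tokens_q, eps=1e-9):
    """KL(P || Q) between unigram distributions of two token lists.

    `eps` avoids log(0) for tokens unseen in the second collection;
    the smoothing choice here is illustrative.
    """
    vocab = set(tokens_p) | set(tokens_q)
    p_counts, q_counts = Counter(tokens_p), Counter(tokens_q)
    p_total, q_total = len(tokens_p), len(tokens_q)
    kl = 0.0
    for w in vocab:
        p = p_counts[w] / p_total
        q = max(q_counts[w] / q_total, eps)
        if p > 0:
            kl += p * math.log(p / q)
    return kl
```

Identical token distributions give a divergence of zero; any shift between, say, train-set and test-set token frequencies yields a positive value.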
Table 2 presents the results of this analysis. For the Reddit dataset, the only significant correlation (bold) is the number of under-represented keyword categories in the training data. Task-agnostic features do not correlate with the decreased performance of models on the CLOSEST-SPLIT for the Reddit data. In contrast, for the HateXplain dataset, task-agnostic features do play a role: the biggest (negative) correlation can be observed for the unigram overlap (bold): the higher the unigram overlap between train and test set, the closer the performance is to the random split F1-score.

Another smaller correlation exists concerning the number of rare words in the test set: the more rare words, the more challenging the split. Similar to the Reddit dataset, a significant, albeit weak, correlation exists between the decreased performance and the number of keyword categories that are under-represented in the training data.

Taken together, these results suggest that the properties associated with performance drops differ from dataset to dataset. This implies that CLOSEST-SPLIT cannot easily be replicated based on task-specific or task-agnostic features. Using latent representations instead helps uncover weaknesses in models that are otherwise not easily identified.

Visualisation of Hidden Representations
We now take a closer look at two specific data splits for the HateXplain dataset by visualising their hidden representations. For this analysis, we select the CLOSEST-SPLITS obtained with representations with d_B = 50 for BERT and RoBERTa, which are more commonly used than HateBERT or BERT-medium. We make these splits available via the GenBench Collaborative Benchmarking Task. As Fig. 6 illustrates, the model overfits its decision boundaries to train set-specific features and, therefore, fails to predict the correct classes in the test set. Developing models using CLOSEST-SPLIT in addition to random splits might thus lead to models that are more robust to such overfitting.

Conclusion
Hate speech detection systems are prone to overfitting to specific targets of hate speech and specific keywords in the input, complicating the detection of more implicit hatred and harming the generalisability to unseen demographics. Yet, in addition to those known and interpretable vulnerabilities, systems may have less obvious weaknesses. The data splitting method we developed aims to highlight those. Our splitting method is based on the clustering of internal representations of finetuned models, thus making the splits task- and dataset-specific. We proposed two variants (SUBSET-SUM-SPLIT and CLOSEST-SPLIT) that differ in how they assign clusters to the train and test set.
The latter variant, in particular, led to consistent catastrophic drops in test set performance, when compared to a random split.Moreover, while each split was developed using the hidden representations from a specific model, we identified that this result generalises when developing the split using one model, and evaluating it using another.The analyses of the resulting data splits showed that the properties of the train and test sets differ from dataset to dataset.Since no property clearly correlates with decreased model performance for both datasets, CLOSEST-SPLIT cannot be easily replicated based on data splits' surface-level properties, and using latent representations is crucial to reveal the weaknesses we observed in the models.
We encourage future work to consider evaluations using the CLOSEST-SPLITS we release for HateXplain, in order to develop more robust systems, but also emphasise that even though our results were specific to hate speech detection, the methodology can be more widely applied. To challenge models beyond i.i.d. evaluation, we do not need costly data annotations. Instead, we can start by relying on systems' latent features to simulate train-test distribution shifts.

Limitations
We identify three main limitations of our work: 1. The scope of our work: the splitting methodology we developed can be applied to a wide range of tasks, but we only experimented with hate speech detection. Future work is required to confirm the method's wider applicability. Moreover, even though we aim to use the challenging split to improve generalisation, we have not yet made efforts in this direction.

2. Generality of conclusions: We experimented with a limited set of model architectures, all of which resemble one another in terms of their structure and the (pre-)training data used. Different models or training techniques could lead to less challenging splits, or splits with significantly different properties. At the same time, we did demonstrate that the split's difficulty is not model-specific (see Section 5.3), and observed that under variation of random seeds CLOSEST-SPLIT consistently leads to performance drops across four models and two datasets.
3. Naturalness of the experimental setup: we created an artificially partitioned data split and have no guarantee that the generalisation challenges that language models encounter when deployed in real-world scenarios resemble our splits. However, given that our approach simulated a worst-case scenario, demonstrated by catastrophic failure in performance, we are hopeful that models that are more robust to our train-test shift are also more robust to real-world variations in test data.

Ethics Statement
By its very nature, hate speech detection involves working closely with hurtful and offensive content. This can be difficult for researchers. However, considering the severe consequences when hate speech models fail on unseen data and people are confronted with harmful content, it is all the more important to improve the generalisation ability of models and protect others. While our work intends to contribute to generalisation evaluation in a positive way, we do not recommend using our data splits as representative of generalisation behaviour 'in the wild', but recommend them for academic research instead. While standard and random splits often overestimate real-world performance, our splits are likely to underestimate it, and can in this way reveal real weaknesses. Our splits are designed to improve academic research on the robustness of language models and contribute to improving the generalisation ability for NLP tasks.
Prior to conducting work with potentially harmful hate speech data, this project obtained approval from the Research Ethics committee at the authors' local institution.

Figure 2 :
Figure 2: Overview of the proposed splitting method.

Figure 3 :
Figure 3: Performance of models trained on the SUBSET-SUM-SPLIT and CLOSEST-SPLIT. Error bars show the standard error between cluster seeds. Horizontal lines indicate performance for models trained and tested on a random split.

Figure 4 :
Figure 4: Performance of models trained on training data determined by the CLOSEST-SPLIT and evaluated on the test data of the CLOSEST-SPLIT and on independent test data (HateXplain dataset). Horizontal lines indicate performance for models trained and tested on a random split. Error bars show the standard error between cluster seeds.

Figure 5 :
Figure 5: F1-scores for HateXplain on a CLOSEST-SPLIT (d_B = 50). Comparison of models trained on the data split obtained with their respective hidden representations (diagonal) and on data splits obtained from representations of other models.

Figure 6 :
Figure 6: Hidden representations for tertiary classification using the CLOSEST-SPLIT for the HateXplain dataset.
The CLOSEST-SPLIT assigns clusters of hidden representations that are spatially close to the test set. While the clustering is conducted on high-dimensional representations, a 2-dimensional projection by UMAP (McInnes et al., 2020) can give an intuition about why these data splits are challenging. Fig. 6a shows RoBERTa's representations for the HateXplain dataset. A decision boundary can be observed, with mostly offensive examples on the left, noHate examples in the middle and hate examples on the right. Based on this illustration, the CLOSEST-SPLIT picks a pocket of (mixed) examples between the noHate (dark blue) and hate (dark green) regions to be the test set. This is mirrored in the F1-scores of the different classes. The hate test examples lie closest to the corresponding region, and the F1-score is the highest at 47.0. Similarly, for the noHate class, the F1-score is relatively high at 38.28. The offensive class, with test examples farther away, only has an F1-score of 11.88. The same phenomenon can be observed for a BERT-based CLOSEST-SPLIT (Fig. 6b).

Table 2 :
Pearson correlation between data split properties and models' F1-score drops in comparison to random splits. Correlations with a p-value < 0.05 are marked with *. Some analysis methods are dataset-specific and cannot be computed for both datasets.