Towards Weakly-Supervised Hate Speech Classification Across Datasets

As pointed out by several scholars, current research on hate speech (HS) recognition is characterized by unsystematic data creation strategies and diverging annotation schemata. Consequently, supervised-learning models tend to generalize poorly to datasets they were not trained on, and the performance of models trained on datasets labeled with different HS taxonomies cannot be compared. To ease this problem, we propose applying extremely weak supervision, which relies only on the class name rather than on class samples from the annotated data. We demonstrate the effectiveness of a state-of-the-art weakly-supervised text classification model in various in-dataset and cross-dataset settings. Furthermore, we conduct an in-depth quantitative and qualitative analysis of the sources of poor generalizability of HS classification models.


Introduction
Due to a growing concern about its impact on society, hate speech (HS) recognition has recently received much attention from the NLP research community (Bilewicz and Soral, 2020). A large number of proposals on how to address HS as a supervised classification task have been put forward; see, among others, Waseem and Hovy (2016), Waseem (2016), and Poletto et al. (2021). Several shared tasks have also been organized (Basile et al., 2019; Caselli et al., 2020).
However, while Transformer models such as BERT (Devlin et al., 2019) achieved impressive performance on various benchmark datasets (Swamy et al., 2019), recent work demonstrated that state-of-the-art HS classification models generalize poorly to datasets other than the ones they were trained on (Fortuna et al., 2020, 2021; Yin and Zubiaga, 2021), even when the datasets come from the same data source, e.g., Twitter. This casts doubt on what we have achieved in the HS classification task. Fortuna et al. (2022) identify three main challenges related to HS classification:
1. The definitional challenge: while the interpretation of what constitutes HS highly depends on the cultural and social norms of its creator (Talat et al., 2022), HS research favours a universal definition.
2. The annotation challenge: due to the subjective nature of HS, the annotation often depends on the context, the social bias of the annotators, and their familiarity with the topic (Wiegand et al., 2019), such that annotators with different backgrounds tend to provide diverging annotations (Waseem, 2016; Olteanu et al., 2018), especially when not only the presence of HS is to be annotated but also its category and the group it targets (Basile et al., 2019).
3. The learning and evaluation challenge: the common evaluation practice for HS classification models assumes that the distributions of the training data and the data to which the model is applied are identical, which is not the case in reality; real-world HS data is relatively rare, while the strategies applied for the creation of HS datasets favor explicit HS expressions (Sap et al., 2020; Yin and Zubiaga, 2021), e.g., search with explicit target keywords (Waseem and Hovy, 2016; Basile et al., 2019).
In order to address these challenges, we propose the use of extremely weak supervision, which uses category names as the only supervision signal (Meng et al., 2020; Wang et al., 2021). Extremely weak supervision does not presuppose any definition of HS that would guide the annotation, such that when the interpretation of what is to be considered HS is modified, we can retrain the model on the same dataset without re-annotation. Furthermore, when the data distribution changes, the model can learn from unlabeled data and adapt to the new domain.
Our contributions can be summarized as follows:
• We apply extremely weak supervision to HS classification and achieve promising performance compared to fully-supervised and weakly-supervised baselines.
• We perform cross-dataset classification under different settings and yield insights on the transferability of HS datasets and models.
• We conduct an in-depth analysis and highlight the potential and limitations of weak supervision for HS classification.

Related Work
Since our goal is to advance research on HS classification, we focus, in what follows, on related work in this area and refrain from discussing the application of weakly-supervised models to other problems.
Standardizing different hate speech (HS) taxonomies across datasets is a first step towards cross-dataset analysis and experiments. To this end, Fortuna et al. (2020) created a category mapping among six publicly available HS datasets. Furthermore, they measured the data similarity of categories in an intra- and inter-dataset manner and reported the performance of a public HS classification API on different datasets and categories.
Other previous work on cross-dataset HS classification followed similar experimental settings, training a supervised classifier on the training set of each dataset and reporting the performance on the corresponding test set and on test sets from other datasets. For instance, Karan and Šnajder (2018) trained linear SVM models on 9 different HS datasets. They showed that models performed considerably worse on out-of-domain datasets. They further performed domain adaptation using the FEDA framework (Daumé III, 2007) and demonstrated that having at least some in-domain data is crucial for achieving good performance. Similarly, Swamy et al. (2019) compared linear SVM, LSTM, and BERT models trained on different datasets. They reported that some pairs of datasets perform well on each other, likely due to a high degree of overlap. They also claimed that a more balanced class ratio is essential for a dataset's generalizability. Fortuna et al. (2021) conducted a large-scale cross-dataset experiment, training a total of 1,698 classifiers using different algorithms, datasets, and other experimental setups. They demonstrated that generalizability depends not only on the dataset but also on the model: Transformer-based models have a better potential to generalize to other datasets, likely thanks to the wealth of data they observed during pretraining. Furthermore, they built a random forest classifier to predict generalizability based on human-engineered dataset features. The experiment revealed that to achieve cross-dataset generalization, a model must first perform well in the intra-dataset scenario; in addition, inconsistency in class definitions hampers generalizability.

Wiegand et al. (2019) and Arango et al. (2019) studied the impact of data bias on the generalizability of HS models, finding that popular benchmark datasets possess several sources of bias, such as bias towards explicit HS expressions, topic bias, and author bias. Classification results dropped significantly when the bias was reduced. They therefore proposed cross-dataset classification as a way to evaluate models' performance in a more realistic setting. Gao et al. (2017) argued that the low frequency of online HS impedes obtaining a wide-coverage HS detection dataset. To this end, they proposed a two-path bootstrapping approach involving an explicit slur term learner and an LSTM (Hochreiter and Schmidhuber, 1997) classifier. The slur term learner is initialized with a list of hand-engineered seed slur terms and applied to an unlabeled dataset to automatically label hateful posts, which are then used to train the classifier. The slur term learner and the classifier are trained iteratively in a co-training manner (Blum and Mitchell, 1998).
A distinct approach was proposed by Talat et al. (2018), who utilized multi-task learning (MTL) to enhance domain robustness. They trained a classifier on three distinct sets of annotations: Waseem and Hovy (2016), Waseem (2016), and Davidson et al. (2017). While MTL helps to prevent overfitting and may provide auxiliary fine-grained predictions, it requires annotating a dataset using different taxonomies, granularities, or aspects.
Our approach is most similar to that of Jin et al. (2022), who also applied weakly-supervised learning on a target-domain dataset. However, their approach requires mining a list of 30 high-quality keywords for each category from a large labeled source-domain dataset. Moreover, they assume that the source and target datasets are labeled using the same HS taxonomy.

Weakly-Supervised HS Classification
In this section, we briefly introduce the basics of weakly supervised text classification and then discuss the cross-dataset classification we aim for.

Preliminaries: Weakly Supervised Text Classification
Weakly-supervised text classification eliminates the need for a large labeled dataset (Meng et al., 2018; Mekala and Shang, 2020). Instead, it trains classifiers using a handful of labeled seed words and unlabeled documents. While the human annotation effort is significantly reduced, weakly-supervised classification methods are sensitive to the choice of seed words, and the process of nominating high-quality seed words is non-trivial (Jin et al., 2021).
More recently, Meng et al. (2020) and Wang et al. (2021) explored extremely weak supervision, where the model is given only the category name instead of manually curated seed words. Extremely weak supervision is well suited for hate speech detection because we may not know all the aspects of hate speech for a particular category or target group, or what a user may interpret as an HS statement that falls into a specific category. On top of that, extremely weak supervision often performs semantic expansion on the unlabeled dataset and automatically augments the category representation with new aspects (in the form of seed words).
We choose X-Class (Wang et al., 2021) as the primary weakly-supervised classification method; it matches or outperforms previous state-of-the-art weakly-supervised methods on 7 benchmark datasets. X-Class first estimates category representations by incorporating words similar to each category. It represents each word by its contextualized word embedding averaged across the entire dataset. It then iteratively adds to each category the word whose embedding has the highest cosine similarity to the category representation. The category representation is updated as a weighted average of the expanded keywords. Specifically, they assume a Zipf's-law distribution (Powers, 1998) and weight the j-th keyword by 1/j:

$x_\ell = \frac{\sum_{j=1}^{J_\ell} s_{\kappa_{\ell,j}} / j}{\sum_{j=1}^{J_\ell} 1/j},$

where $\kappa_{\ell,j}$ is the j-th keyword of category $\ell$ and $s_{\kappa_{\ell,j}}$ is its average contextualized embedding. X-Class also performs a consistency check and stops adding new words if a category's nearest words have changed.
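The Zipf-weighted aggregation above can be sketched in a few lines (a minimal numpy illustration, not the authors' code; embeddings are toy values):

```python
import numpy as np

def category_representation(keyword_embeddings):
    """Aggregate a ranked list of keyword embeddings into a single
    category representation, weighting the j-th keyword by 1/j
    (Zipf's-law weighting, as in X-Class)."""
    E = np.asarray(keyword_embeddings, dtype=float)  # shape (J, dim)
    weights = 1.0 / np.arange(1, len(E) + 1)         # 1, 1/2, 1/3, ...
    return (weights[:, None] * E).sum(axis=0) / weights.sum()

# Example: three ranked keywords; earlier keywords dominate the average.
emb = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
rep = category_representation(emb)
```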
Then, X-Class derives document i's category-oriented document representation d_i by weighting each word in the document based on its similarity to the category representations. Afterward, it clusters the documents using a Gaussian Mixture Model (GMM) (Duda and Hart, 1973), initializing the cluster centroids with the category representations. Finally, the most confident pseudo-labeled documents from each cluster are used to train a text classifier.
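The clustering and confidence-based selection step can be illustrated with scikit-learn's GaussianMixture on toy data; the `means_init` argument mirrors the centroid initialization described above (data, dimensions, and the 50% confidence cutoff are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy category-oriented document representations: two categories scattered
# around hypothetical category representations (stand-ins for real embeddings).
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
docs = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in centers])

# Initialize the mixture components at the category representations so that
# cluster k stays aligned with category k.
gmm = GaussianMixture(n_components=2, means_init=centers, random_state=0)
labels = gmm.fit_predict(docs)                    # pseudo-label per document
confidence = gmm.predict_proba(docs).max(axis=1)  # per-document confidence

# Keep only the most confident pseudo-labels to train the final classifier.
keep = confidence >= np.quantile(confidence, 0.5)
```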
In our initial experiments, we observed that while GMM generally improves the pseudo-labeling, the accuracy for some low-frequency categories tends to drop sharply. This is likely because GMM works as a global density estimator: data of the more frequent categories may "attract" more weight and cause the representation of a low-frequency category to diverge too far from its initial category representation. To address this problem, we introduce an additional representation-based prediction, which assigns document i to the category whose representation has the highest cosine similarity to its document representation d_i:

$\ell^{rep}_i = \arg\max_\ell \; \cos(d_i, x_\ell).$

We denote GMM's category assignment for document i as $\ell^{gmm}_i$. Instead of pseudo-labeling the most confident documents based on GMM only, we take the subset of confident documents to which GMM and the representation-based prediction assign the same label ($\ell^{gmm}_i = \ell^{rep}_i$). This ensures that the document is sufficiently close to the original category representation. We denote this modified version as X-Class_Agree.
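The agreement filter can be sketched as follows (a minimal numpy illustration of the X-Class_Agree idea; function and variable names are ours):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def agree_filter(doc_reps, category_reps, gmm_labels):
    """Keep only documents where the GMM assignment agrees with the
    nearest-category-representation assignment (by cosine similarity)."""
    rep_labels = np.array([
        int(np.argmax([cosine(d, c) for c in category_reps]))
        for d in doc_reps
    ])
    return np.flatnonzero(rep_labels == np.asarray(gmm_labels))

doc_reps = np.array([[1.0, 0.1], [0.1, 1.0], [1.0, 0.9]])
category_reps = np.array([[1.0, 0.0], [0.0, 1.0]])
# Document 2 is closest to category 0, but GMM put it in cluster 1,
# so it is filtered out; documents 0 and 1 survive.
idx = agree_filter(doc_reps, category_reps, gmm_labels=[0, 1, 1])
```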

Cross-Dataset Classification
In this work, we study cross-dataset classification, where we do not have any document labels in the target dataset. A dataset is characterized by its documents (and their underlying topics and word distributions) and its taxonomy (list of categories). Given a single HS dataset with its corresponding categories, we can straightforwardly apply X-Class using the category names and the unlabeled documents. However, both the data distribution and the taxonomy may differ when we experiment across datasets. There are three different cases for the relation between the taxonomies of the source and target datasets:
• 1-to-1: The target taxonomy is identical to the source taxonomy or a subset of it.
• N-to-1: The target taxonomy differs from the source taxonomy, but each target category can be mapped to one or more source categories.
• N-to-N: The target taxonomy differs from the source taxonomy, and some target categories cannot be mapped to any source category.
Supervised learning can be applied in the first two cases: we can create a category mapping from the target categories to the source categories, then use this mapping either to post-process the model predictions (converting predicted source categories to target categories) or to relabel the dataset using the target taxonomy and retrain the model. However, in the last case, we cannot directly apply supervised learning without further data collection and annotation, because we lack labeled data for at least some categories. In contrast, weakly-supervised methods do not require labeled documents and can readily utilize unlabeled documents in the target dataset to capture the underlying distribution. Furthermore, even when applied to a completely unseen dataset, they can "relabel" the source dataset using the target taxonomy and bootstrap a classifier.
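The post-processing route in the 1-to-1 and N-to-1 cases reduces to a simple lookup (the mapping follows Table 1; function and variable names are ours):

```python
# Illustrative mapping from fine-grained categories (e.g., SBIC target
# groups) to coarse categories (e.g., the Waseem taxonomy), as in Table 1.
FINE_TO_COARSE = {
    "Women": "Sexist", "LGBT": "Sexist",
    "Black": "Racist", "Jewish": "Racist",
    "Muslim": "Racist", "Asian": "Racist",
}

def map_predictions(fine_predictions):
    """Post-process: convert a model's fine-grained predictions into the
    coarse target taxonomy instead of retraining the model."""
    return [FINE_TO_COARSE[p] for p in fine_predictions]

coarse = map_predictions(["Women", "Muslim", "LGBT"])
```

Note that this only works when every predicted category has an entry in the mapping, which is exactly what fails in the N-to-N case.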

Datasets
We conduct experiments on two popular HS datasets that differ with respect to data source and taxonomy of HS categories: the Waseem dataset and the SBIC dataset. The Waseem dataset (Waseem and Hovy, 2016) contains 5,355 tweets with sexist and racist content. The dataset was annotated by the authors (inter-annotator agreement κ = 0.84) and reviewed by a domain expert (a gender studies student who is a non-activist feminist). The SBIC dataset (Sap et al., 2020) contains 44,671 posts collected from different domains: Reddit, Twitter, and hate sites. It was annotated by crowdsourced workers on Amazon Mechanical Turk. A small portion of the data (1,816 posts) originally comes from the Waseem dataset; we exclude these posts to avoid overlap between the two datasets. The SBIC dataset does not define a taxonomy of HS categories; instead, annotators indicate the target group with free-text answers. We select the six most frequent target groups that can be mapped to the categories in the Waseem dataset. While our proposed weakly-supervised learning method does not depend on category mapping, we select the SBIC categories that can be mapped in order to compare with supervised learning baselines. Table 1 shows this category mapping.

Table 1: Category mapping between the two datasets.

Sexist → Women; LGBT
Racist → Black; Jewish; Muslim; Asian

We use the original train/dev/test split (75%/12.5%/12.5%) of the SBIC dataset and randomly split the Waseem dataset 90%/10% into training and test sets. We apply standard preprocessing following Barbieri et al. (2020), including user mention anonymization and removal of website links and emoji.
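The preprocessing step might look roughly like this (a sketch in the spirit of Barbieri et al. (2020); the exact placeholder tokens and emoji handling are our assumptions, not the paper's code):

```python
import re

def preprocess(tweet):
    """Normalize a tweet: anonymize user mentions, strip links and emoji,
    and collapse whitespace."""
    tweet = re.sub(r"@\w+", "@user", tweet)                # anonymize mentions
    tweet = re.sub(r"https?://\S+", "", tweet)             # remove links
    tweet = re.sub(r"[\U0001F300-\U0001FAFF]", "", tweet)  # drop common emoji
    return " ".join(tweet.split())                         # normalize spaces

clean = preprocess("@alice check this https://t.co/xyz now")
```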

Compared Methods
We compare X-Class with two representative supervised learning baselines, which are trained on the full labeled training dataset:
• Support Vector Machine (SVM) (Cortes and Vapnik, 1995): We use scikit-learn's linear SGD classifier with default hyper-parameters and tf-idf weighting.
• BERT (Devlin et al., 2019): We fine-tune the bert-base-uncased checkpoint using the same hyper-parameters as for the final classifier in X-Class (detailed in Section 4.3).
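A minimal sketch of the linear baseline (toy data, not from either dataset; scikit-learn's SGDClassifier with hinge loss is a linear SVM trained by SGD):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# tf-idf features + linear classifier trained with SGD (hinge loss = linear SVM).
clf = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(loss="hinge", random_state=0),
)

# Tiny illustrative training data.
texts = ["women should stay quiet", "girls cannot drive",
         "they hate that religion", "attack on that ethnic group"] * 5
labels = ["Sexist", "Sexist", "Racist", "Racist"] * 5
clf.fit(texts, labels)
pred = clf.predict(["women cannot drive"])
```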
We also compare the performance of our model with the following baselines, which do not require any document labeling:
• keyword voting using only the category names, and a variant using the keywords expanded by X-Class;
• WESTCLASS (Meng et al., 2018);
• LOT-Class (Meng et al., 2020);
• zero-shot PET (described in Appendix B).

Experiment Settings
We use the official implementation of X-Class. The bert-base-uncased checkpoint is used to calculate the document representations and to fine-tune the final classifier; the maximum number of keywords per category is set to 100; and the 50% most confident pseudo-labeled documents from each category are used to train the final classifier.
To facilitate a fair comparison with the supervised learning methods, we reimplemented the final classifier fine-tuning step using the HuggingFace Transformers trainer, performed minimal manual hyper-parameter tuning (learning_rate=2e-5; num_epochs=6; weight_decay=0.05) on the SBIC dev set, and applied the resulting values to both datasets. We set max_length and batch_size to 64.
We merged the following original target groups in the SBIC corpus into "LGBT folks": "gay men", "lesbian women, gay men", "lesbian women", "trans women, trans men", "trans women". Table 3 presents the category names used by the models. We use the original category name except for "LGBT", which does not occur in the dataset; instead, we use "gay", the most frequently targeted subgroup in the dataset. As shown in Appendix A, X-Class expands to keywords representing other subgroups in the LGBT community.

Results of the Experiments
We report accuracy and macro P/R/F1 scores to quantify each method's performance.
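Macro-averaged scores give every category equal weight regardless of its frequency, which matters here because the category distribution is skewed. A plain-Python sketch of the computation (names are ours):

```python
def macro_prf1(gold, pred, categories):
    """Macro-averaged precision/recall/F1: compute P, R, F1 per category,
    then average, so rare categories count as much as frequent ones."""
    ps, rs, fs = [], [], []
    for c in categories:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    n = len(categories)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

gold = ["Sexist", "Sexist", "Racist", "Racist"]
pred = ["Sexist", "Racist", "Racist", "Racist"]
p, r, f1 = macro_prf1(gold, pred, ["Sexist", "Racist"])
```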
In-Dataset Classification. We first validate the efficacy of the methods in the standard in-dataset setting, with corresponding training and test datasets. Table 4 displays the results.
As expected, BERT outperformed SVM among the supervised-learning baselines on both datasets. Interestingly, keyword voting using only the category name achieved high precision on the SBIC dataset. However, its recall is much lower than that of X-Class, due to variation of expression within the same category. Using X-Class keywords improved keyword voting's recall by 3.5% and 5.4% on the two datasets. However, precision dropped significantly on the SBIC dataset, likely due to noisier keywords.
WESTCLASS performs better than the keyword voting baselines on the Waseem dataset, primarily due to its high recall on the "Racist" category. This demonstrates the advantage of semantic representations in neural models. However, its performance pales on the SBIC dataset, revealing its weakness in handling more complex cases involving class imbalance and overlap, as discussed in Wang et al. (2021) and Jin et al. (2022). LOT-Class shows a similar trend but performs worse on both datasets. We analyze the pseudo-labeling accuracy of the weakly-supervised baselines and X-Class in Appendix C. Comparing X-Class and X-Class_Agree, we see that our modification consistently improved performance.

Cross-Dataset Classification. We conduct cross-dataset classification using the strongest supervised and weakly-supervised models and show the results in Table 5. Note that for the "Waseem → SBIC" setting, we cannot create a category mapping, since the target dataset has more fine-grained categories. Therefore, supervised methods and X-Class variants that use category mapping to post-process the predictions are not applicable.
When we train BERT and X-Class using only source-dataset documents, both perform worse on the target dataset than the in-dataset results in Table 4. The performance drop is smaller for "SBIC → Waseem", likely because the SBIC dataset contains representative posts for the Waseem categories. Surprisingly, retraining the models using the target taxonomy does not outperform post-processing with the category mapping. However, when a category mapping is unavailable (as in the "Waseem → SBIC" case), retraining a weakly-supervised classifier using the target taxonomy is the only option for cross-dataset classification without manually annotating more data.

An advantage of weakly-supervised methods is that they can utilize unlabeled documents from the target dataset when these are available. Although X-Class_Agree still underperforms BERT when both are trained on the source dataset in the "SBIC → Waseem" experiment, it surpasses BERT by 3% in both accuracy and macro F1 score when trained on unlabeled target-dataset documents (which is equivalent to the in-dataset setting, i.e., the X-Class_Agree row in Table 4).
Again, X-Class_Agree outperforms X-Class in all cases. In what follows, we use X-Class to refer to X-Class_Agree for brevity.

Analysis: What Makes Cross-Dataset Classification Challenging?
As shown in Table 5, X-Class's performance dropped significantly in the "Waseem → SBIC" cross-dataset setting compared to using the SBIC training set.In this section, we try to uncover the causes of the performance drop.
We first plot the per-category F1 score in Figure 1. The cross-dataset model achieved performance comparable to the in-dataset model for four categories {Jewish, Muslim, Women, Black}. However, it failed on the two categories {Asian, LGBT}.

Relevant unlabeled documents. Although the Waseem dataset is labeled using a more coarse-grained taxonomy, it may contain documents relevant to some (but not all) fine-grained SBIC categories. Weak supervision usually pseudo-labels the unlabeled dataset to train a final classifier. Therefore, it will likely fail when documents related to a particular category are absent from the unlabeled dataset. We count the frequency of documents containing each category name in both datasets and present the results in Table 6. The "Asian" category (from the SBIC dataset) is severely under-represented in the Waseem dataset: the word "Asian" occurs only 4 times, always in the context of "Asian women/girls".
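The document count described above can be approximated with a simple substring check (a rough proxy, and deliberately surface-level: note, for instance, that "Jewish" does not match "Jews"):

```python
def category_name_coverage(documents, category_names):
    """Count, for each category name, how many documents mention it --
    a quick proxy for whether the unlabeled corpus contains material the
    semantic expansion can build on (cf. Table 6)."""
    return {
        name: sum(name.lower() in doc.lower() for doc in documents)
        for name in category_names
    }

docs = ["asian women are ...", "jews control ...", "muslim ban now"]
counts = category_name_coverage(docs, ["Asian", "Jewish", "Muslim"])
```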

Waseem and Hovy (2016) conducted a lexical analysis and showed that their "Sexist" category is highly skewed towards women, and their "Racist" category towards Muslims and Jews. Coincidentally, these categories also perform best in the "Waseem → SBIC" setting.

Category understanding. Jin et al. (2021) argued that weakly-supervised classification and keyword mining are intrinsically related. Failure to identify relevant keywords harms the category representation and, thus, the classification accuracy. Appendix A presents the full list of keywords X-Class added to the category representations in both the in-dataset and cross-dataset settings.
A general observation is that X-Class tends to include fewer keywords in its category representation in the cross-dataset setting.Recall that it stops adding keywords once the consistency check is violated.We hypothesize that the mismatch between the dataset and the taxonomy caused the mined keywords to be noisier and more likely to fail the consistency check.
The four categories that perform best in both in-dataset and cross-dataset settings also contain better-quality keywords. In contrast, the "Asian" category's keywords in the cross-dataset setting are entirely off-topic, due to the term's rare occurrence and its collocation with words like "women" or "girls". The "LGBT" category contains many vulgar keywords with sexual references, which caused it to be confused with the "Women" category.

Class definition vs. dataset. Previous studies tried to explain why HS classification models generalize poorly across datasets, the most frequently cited reason being the lack of a standardized definition of hate speech (Waseem and Hovy, 2016; Fortuna et al., 2020, 2021). Fortuna et al. (2020) are among the few studies that re-annotated a dataset, providing quantitative analysis or comparing the models' performance. However, such studies apply only to a single dataset. Moreover, the annotation is usually a one-shot effort, influenced by multiple factors related to the annotation task setup and the annotators' knowledge. There is no way to attribute the performance drop separately to incompatible class definitions and to the data distribution.
In weakly-supervised models, we can interpret the category representation (and associated keywords) as the class definition.Therefore, the class definition for the same taxonomy may differ depending on the dataset used to derive the category representation.Furthermore, we can approximate annotating a dataset with a different class definition by altering the category representation.
We designed an ablation study that trains X-Class models using different combinations of datasets and class definitions. In Table 7, we present the results of three configurations (all experiments use the target taxonomy, and all documents are unlabeled):
1) using source-dataset documents and category representations derived from the source dataset ("X-Class_Agree retrain" in Table 5);
2) using source-dataset documents and category representations derived from the target dataset;
3) using target-dataset documents and category representations derived from the target dataset ("X-Class_Agree" in Table 4).

X-Class's cross-dataset performance substantially improved when provided with the category representation derived from the target dataset. (Its average recall in the "Waseem → SBIC" experiment nevertheless decreased sharply, mainly because the category representation for "Asian" is far from any document representation: the Waseem dataset does not contain documents related to "Asian", so the model never predicted that category.) Only one factor is altered (either the category representation or the unlabeled training dataset) between the rows in Table 7. Therefore, we can conclude that the performance difference between rows #1 and #2 is due to different class definitions, while the difference between rows #2 and #3 is due to different data distributions.


Conclusions and Future Work

We applied extremely weakly-supervised methods to HS classification. We analyzed the transferability of HS classification models through comprehensive in-dataset and cross-dataset experiments and confirmed that weakly-supervised classification has several advantages over the traditional supervised classification paradigm. First, we can apply the algorithm across HS datasets and domains with taxonomies that cannot be standardized via category mapping. Second, weakly-supervised models can readily utilize unlabeled documents in the target domain and do not suffer from domain mismatch. Lastly, weak supervision allows us to "re-annotate" a labeled dataset using a different class definition to facilitate cross-dataset comparison, which was previously possible only at the cost of expensive manual annotation.
The presented work is only the beginning of applying weak supervision to HS detection. We could utilize richer category representations than bags of keywords. However, such representations should be derived in an unsupervised or weakly-supervised manner to avoid depending on manually labeled datasets. A promising approach in this direction is Shvets et al. (2021), which extracts HS targets and aspects relying on open-domain concept extraction.
Lastly, we can study how well the model can generalize to previously unknown categories, a more challenging task often known as zero-shot classification (Yin et al., 2019) or open-world classification (Shu et al., 2017).

Limitations
This study utilizes a monolingual pre-trained language model (PLM) for English (bert-base-uncased). Although the weakly-supervised classification methods are not limited to a particular language, we have not explored applying the method to other languages. Social media language use may differ significantly from the data used to train the PLM. Moreover, the presence of code-switching (Dogruöz et al., 2021) may also degrade a monolingual PLM's performance. We explored a RoBERTa checkpoint continually trained on 60M English tweets (Barbieri et al., 2020). However, it did not yield better performance than BERT. We have not investigated whether this is due to the training regime or the dataset.
Moreover, in this work, we focus on classifying hate speech (HS) categories/target groups rather than HS detection (detecting whether a post contains hate speech at all). To perform both detection and classification, we could either combine our method with an HS detection model in a pipeline or use an adaptation of weakly-supervised text classification that incorporates an "Others" category, such as Li et al. (2018) or Li et al. (2021).
Due to limited space, we prioritized in-depth analysis over comprehensive evaluation. Therefore, we selected only two datasets (and two-way cross-dataset classification). We are working in parallel on extending this work to a longer-form journal article covering more datasets and experimental results.
Recent work on large language models (LLMs) demonstrated that when the parameters scale to a certain level, language models exhibit drastically increased zero-shot classification performance (Zhao et al., 2023). We reported the performance of a moderately-sized bert-large-uncased zero-shot model because of limited computational resources and lack of access to commercial APIs. Larger language models will likely perform much better than this baseline.
Lastly, understanding HS sometimes requires cultural understanding or background knowledge. It may be difficult to determine the presence and category of HS when a post is taken out of context. For example, many "Sexist" posts in the Waseem dataset are tweets related to the Australian TV show My Kitchen Rules (MKR); the following tweet is labeled as "Sexist": "Everyone else, despite our commentary, has fought hard too. It's not just you, Kat. #mkr"

Ethics Considerations
Although weak supervision requires only unlabeled documents, we demonstrated that the model may fail when the training dataset does not contain data related to a particular category or target group. This is especially concerning because the target groups are often minorities and under-represented. Therefore, we recommend against simply "throwing" a weakly-supervised algorithm at a dataset and hoping the model will work. Instead, we should evaluate a model thoroughly before applying it in the real world, for instance by manually examining the model's predictions, behaviorally testing the model with a checklist (Ribeiro et al., 2020), or conducting unsupervised error estimation (Jin et al., 2021).
A Category Representation Keywords

Table 9 shows the list of keywords in X-Class's category representations in the in-dataset setting (using the unlabeled documents and the list of categories from the same dataset). Table 10 shows the corresponding lists in the cross-dataset setting (using the unlabeled Waseem documents to induce category representations for the SBIC categories, and vice versa).

B Reproducibility
Table 11 presents the hyper-parameters and their corresponding values to facilitate reproducing our results.
We use the bert-large-uncased model from HuggingFace as the base pre-trained language model for the zero-shot PET baseline. PET combines a pattern (or prompt/instruction) with the input text and prompts the model to predict the mask token. Unlike open-ended prompting, PET uses a list of hand-crafted verbalizers (candidate tokens). It classifies documents by assigning the category whose associated verbalizer receives the highest predicted probability. PET-style classification is especially beneficial for smaller PLMs, which do not possess a strong instruction-following capability (Schick and Schütze, 2021b; Ouyang et al., 2022).
We hand-crafted patterns and verbalizers based on our understanding of the tasks (without fine-tuning).
For the Waseem dataset, we use the pattern "<text> This hate speech is based on <mask>" (verbalizers: gender/race), and for the SBIC dataset "<text> The target group of this hate speech is <mask>" (verbalizers: women/black/Jews/gay/Muslims/Asian).
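PET's verbalizer-based decision reduces to an argmax over the masked-LM logits at the verbalizer token positions. A minimal sketch with mock logits (in practice the logits come from a model such as bert-large-uncased via HuggingFace Transformers; the token ids below are hypothetical):

```python
import numpy as np

def pet_classify(mask_logits, verbalizer_ids, categories):
    """PET-style classification: given the masked-LM logits at the <mask>
    position, pick the category whose verbalizer token scores highest."""
    scores = mask_logits[verbalizer_ids]
    return categories[int(np.argmax(scores))]

# Mock vocabulary logits at the <mask> position (10-token toy vocabulary).
mask_logits = np.array([0.1, 2.5, 0.3, 4.0, 0.2, 0.0, 1.1, 0.4, 0.9, 0.7])
# Hypothetical token ids of the verbalizers "gender" and "race".
verbalizer_ids = np.array([3, 6])
label = pet_classify(mask_logits, verbalizer_ids, ["gender", "race"])
```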

C Pseudo-Labeling
Being able to accurately pseudo-label documents is crucial to the success of weak supervision. We report the accuracy of pseudo-labeling by various weakly-supervised methods in Table 8, using the category mapping in Table 1 to derive the gold labels for the "SBIC → Waseem" setting. We omit the "Waseem → SBIC" setting because we do not have gold labels.
We can see that the accuracy of the pseudo-labeled documents is consistent with the models' performance on the test dataset (Table 4). Moreover, LOTClass and X-Class use the same underlying pre-trained language model (bert-base-uncased) in their final classifier, while WESTCLASS uses a more traditional convolutional neural network architecture (Kim, 2014). The data pseudo-labeled by X-Class is substantially more accurate than that of the two baselines on both datasets. Comparing Table 8 and Table 4, we can observe that pseudo-labeling accuracy has a more significant impact on the final classifier's accuracy than the model architecture does.
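The accuracy computation itself is straightforward; the following minimal sketch applies a category mapping before comparing against gold labels. The mapping and labels shown are hypothetical toy values, not the actual mapping from Table 1.

```python
def pseudo_label_accuracy(pseudo, gold, category_map=None):
    """Fraction of pseudo-labels matching the gold labels, after
    optionally mapping predicted categories (e.g. SBIC -> Waseem)."""
    correct = 0
    for pred, true in zip(pseudo, gold):
        if category_map is not None:
            pred = category_map.get(pred, pred)
        correct += (pred == true)
    return correct / len(gold)

# Hypothetical SBIC -> Waseem mapping and toy label sequences.
mapping = {"Women": "Sexist", "Black": "Racist"}
acc = pseudo_label_accuracy(["Women", "Black", "Women"],
                            ["Sexist", "Racist", "Racist"],
                            mapping)  # 2 of 3 correct
```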
We provide randomly sampled documents pseudo-labeled by X-Class in Table 12 (in-dataset) and Table 13 (cross-dataset). In general, the SBIC dataset contains more diverse and nuanced data. The Waseem dataset, on the other hand, sometimes contains trivial slurs like "... I'm not sexist ...". The samples in the cross-dataset setting reveal that X-Class tends to wrongly categorize original "Sexist" posts in the Waseem dataset (which mainly target women) as "LGBT" and "Asian".

[Table 13 excerpt: pseudo-labeled posts for the Asian, Jewish, LGBT, and Muslim categories, each marked ✓ or ✗; rendered with Table 13.]
Figure 1: Comparing the cross-dataset and in-dataset F1 score of X-Class on the SBIC dataset.

[Table 12 excerpt: pseudo-labeled SBIC posts for the Women, Black, Jewish, LGBT, and Muslim categories, each marked ✓ or ✗; rendered with Table 12.]

Table 1: Category mapping between the Waseem and SBIC datasets.
Table 2 presents the distribution of the posts in the two datasets.

Table 2: Distribution of the documents per dataset, with posts that contain no word after post-processing removed. The average number of words per post is 17.1 in the Waseem dataset and 20.0 in the SBIC dataset.

• Majority class: Always predict the most frequent category in the training dataset.
• Keyword voting (category name): Assign the category whose category name occurs most frequently in the document. Fall back to the majority-class prediction if there is a tie or none of the keywords appears.
• Keyword voting (X-Class keywords): Same as above, but use the expanded keywords in X-Class's category representation and their associated weights. Assign the category that receives the highest score.
• Zero-shot PET (Schick and Schütze, 2021a): Prompting a pre-trained BERT model using hand-crafted patterns and verbalizers to classify documents. We provide details of this baseline in Appendix B.
• WESTCLASS (Meng et al., 2018): CNN-based neural text classifier. It first generates pseudo-documents with a generative model seeded with user-provided keywords for pre-training, then conducts self-training to bootstrap from unlabeled documents. We use
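The two keyword-voting baselines can be sketched as follows. This is a minimal illustration under our reading of the baselines above; the keyword weights in `weighted_vote` are hypothetical stand-ins for X-Class's category-representation weights.

```python
from collections import Counter

def keyword_vote(tokens, categories, majority_class):
    """Keyword voting (category name): pick the category whose name
    occurs most often in the document; fall back to the majority
    class on a tie or when no category name appears."""
    counts = Counter(t.lower() for t in tokens)
    scores = {c: counts[c.lower()] for c in categories}
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    return winners[0] if best > 0 and len(winners) == 1 else majority_class

def weighted_vote(tokens, keyword_weights, majority_class):
    """Keyword voting (expanded keywords): sum the weights of each
    category's keywords that occur in the document and assign the
    highest-scoring category."""
    counts = Counter(t.lower() for t in tokens)
    scores = {c: sum(w * counts[kw] for kw, w in kws.items())
              for c, kws in keyword_weights.items()}
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    return winners[0] if best > 0 and len(winners) == 1 else majority_class
```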

Table 3: Seed words used for each category and their frequency in the training dataset. We manually curated the seed words from X-Class's category representation and selected the top-3 ranked keywords to train WESTCLASS.

Table 4: In-dataset performance of various models. We highlight the best performances of supervised and weakly-supervised methods in bold.

Table 5: Cross-dataset performance of BERT and X-Class. Both models are trained using source dataset documents and tested on the target dataset. We highlight the best performances of supervised and weakly-supervised methods in bold.

Table 6: Frequency of each category name appearing in the Waseem and SBIC training datasets.

Table 7: Cross-dataset performance of X-Class using different unlabeled datasets and category representations.

Table 8: Pseudo-labeled dataset accuracy calculated against the gold-standard labels. The default method is X-Class unless otherwise specified. For the "SBIC → Waseem" setting, we use the category mapping in Table 1.

[Table 9 excerpt: X-Class's expanded keyword lists for the gender-related and "Black" categories; rendered with Table 9.]

Table 9: Full list of keywords in X-Class's category representation mined in the in-dataset setting.

Table 10: Full list of keywords in X-Class's category representation mined in the cross-dataset setting.

Table 11: Full list of hyper-parameters. The first block contains hyper-parameters related to X-Class; the second block contains hyper-parameters related to classifier fine-tuning. * denotes values set based on data analysis. ‡ denotes values set by manual hyper-parameter tuning. † denotes values chosen based on our experience but not tuned. All remaining parameters use the defaults from the X-Class repository.

[Table 12 excerpt: pseudo-labeled Waseem posts for the Sexist and Racist categories, each marked ✓ or ✗; rendered with Table 12.]

Table 12: Randomly sampled pseudo-labeled examples for each category in the in-dataset setting.

[Table 13 excerpt: cross-dataset pseudo-labeled posts for the Racist, Women, and Black categories, each marked ✓ or ✗; rendered with Table 13.]

Table 13: Randomly sampled pseudo-labeled examples for each category in the cross-dataset setting.

[Table 13 excerpt: pseudo-labeled posts for the Muslim and Asian categories, each marked ✓ or ✗; rendered with Table 13.]