Mitigating Biases in Hate Speech Detection from A Causal Perspective



Introduction
Over the past few years, hate speech on social media has grown significantly in different forms. This causes serious consequences and societal impacts on victims of all demographics (Mathew et al., 2021), largely affects their mental health (Saha et al., 2019), and even triggers real-world hate crimes (Relia et al., 2019). As a result, automated detection of hate speech has become especially important, and numerous research studies have been conducted to characterize and detect such hateful or toxic content (Zhou et al., 2021; Lahnala et al., 2022; Kotarcic et al., 2022).
However, current hate speech detection approaches often fail to be robust and generalizable, even to slightly different datasets for the same task, which largely prevents them from being applied in real-world applications (Vidgen et al., 2019). This may be because current models suffer from spurious correlations between training data and labels (e.g., hateful) (Ramponi and Tonelli, 2022), which can lead to biased treatment of vulnerable and minority groups such as African American Vernacular English speakers and may exacerbate racism (Harris et al., 2022a). One example is shown in Figure 1. The frequent co-occurrence of identity words and the hateful label in the training set can bias detectors (e.g., fine-tuned BERT) into making false predictions during inference. As a result, there is a great need to comprehensively understand and identify the spurious correlations in hate speech detection and further mitigate the bias they cause.
A growing amount of recent work has examined spurious correlations in hate speech detection (Zhou et al., 2021; Kennedy et al., 2020; Sap et al., 2022), and found that a large number of tokens that target minority groups are highly correlated with the hateful label (Bender et al., 2021). As a result, methods such as masking or removing such tokens with the help of human annotations (Ramponi and Tonelli, 2022) have been proposed to mitigate these spurious correlations, together with fine-tuning large language models to generate artifacts that augment the training data (Wullach et al., 2021; Hartvigsen et al., 2022). Despite these successes, such approaches usually focus on manually identifying token-level spurious patterns in one specific dataset (Bender et al., 2021), while neglecting spurious correlations beyond tokens (i.e., at the sentence level, shown in the right part of Figure 1(b)) and lacking comprehensive studies across different datasets and domains. Moreover, these mitigations are often costly and time-consuming, as they require human annotation and fine-tuning large language models to generate large amounts of data. Thus, a systematic study is needed to first automatically identify spurious correlations in a given hate speech dataset and then effectively and efficiently mitigate them.
To fill in this gap, we first conduct a comprehensive analysis to automatically discover spurious correlations at both the token level and the sentence level on 9 hate speech datasets. Specifically, we use mutual information with domain knowledge to identify token-level spurious correlations such as identity words, and leverage context-free grammar to investigate highly correlated sentence-level grammar patterns. We further propose a novel metric called Relative Spuriousness (RS) to verify the spuriousness of the discovered correlations based on their influence on model predictions. Our analysis shows that token-level spurious patterns are usually more general and exist in almost all datasets, while sentence-level spurious patterns are more dataset-specific. We further study how spurious correlations cause model bias from a causal perspective (as shown in Figure 2), identifying two confounders: one affecting the distribution between the PLM's pretraining data and vulnerable identities, and another influencing the distribution between vulnerable identities and their context in hate speech datasets. To mitigate these two biases, we propose Multi-Task Intervention (MTI) and Data-Specific Intervention (DSI). MTI tries to mitigate biases from the pre-training corpus through training auxiliary tasks, while DSI focuses on eliminating biases originating from the limited data and consists of a framework that automatically detects, validates, and mitigates biases through a counterfactual generator. Experiments conducted on 9 hate speech datasets and out-of-domain challenge sets demonstrate the effectiveness and robustness of our proposal.
To summarize, our contributions are: (1) We automatically identify spurious correlations and comprehensively analyze them in hate speech detection at both the token level and the sentence level across 9 datasets. (2) We introduce a novel metric, Relative Spuriousness, to evaluate the spuriousness of identified correlations at the sentence level. (3) We study the bias caused by spurious correlations from a causal perspective. (4) We propose two strategies, MTI and DSI, to mitigate the biases and show consistent improvements on 9 datasets.

Related Work
Debias Hate Speech Detection. Recent works (Yin and Zubiaga, 2021; Wiegand et al., 2019; Kennedy et al., 2020; Ma et al., 2020; Gehman et al., 2020; Sap et al., 2020; Dreier et al., 2022; Stanovsky et al., 2019; Sridhar and Yang, 2022; Thakur et al., 2023; Ziems et al., 2023) have studied the generalizability and biases of hate speech detection (Talat et al., 2018; AlKhamissi et al., 2022; Röttger et al., 2022; Bianchi et al., 2022). For instance, prior work found that existing hate speech detection models are biased against African American Vernacular English speakers (Harris et al., 2022b; Sap et al., 2019) and that certain identity words are highly correlated with hateful labels (Bender et al., 2021; ElSherief et al., 2021a). A group of data augmentation methods (Sen et al., 2021; Sen et al., 2022; Hartvigsen et al., 2022) has been proposed to mitigate these biases. For instance, Wullach et al. propose a simple data augmentation method using a fine-tuned GPT-2 model to generate 100k hate and non-hate examples. Ramponi and Tonelli find highly correlated tokens, manually categorize them into different groups, and further mask or remove these annotated spurious artifacts. However, human annotation is not practical in large-scale or multi-platform scenarios. Different from these prior works, which mainly focus on data-specific biases and mitigation methods, our method automatically finds these shortcuts and proposes Multi-Task Intervention at the model level.

Spurious Correlation in NLP
Increasing attention has been paid to spurious correlations in NLP tasks (Tu et al., 2020; McCoy et al., 2019; Jia and Liang, 2017; Niu et al., 2020; Wang and Culotta, 2020; Vaidya et al., 2020; Clark et al., 2020; Du et al., 2021). One line of work evaluates models' robustness to pre-defined shortcuts by proposing challenge test datasets (McCoy et al., 2019; Rosenman et al., 2020; Röttger et al., 2021) or by finding salient words (Simonyan et al., 2013; Pezeshkpour et al., 2022; Shrikumar et al., 2017; Han et al., 2020). Another line of work intends to verify the spuriousness of extracted features. Gardner et al., 2021 suggest that all correlations between labels and low-level features are spurious. Eisenstein, 2022 claims that such correlations naturally appear in the majority of classification tasks and that domain knowledge is required to identify which correlations may be harmful. Joshi et al. define spuriousness based on the sufficiency of a feature and counterfactual intervention. However, their definition relies on an unbiased classifier to model the probability of sufficiency, which may not be practical in real applications. Besides, they do not use this metric to find spurious artifacts; instead, they use domain knowledge to pre-define the potential features. Furthermore, previous analyses (Zhang et al., 2018) of biases' influence on model predictions are limited. To this end, we propose a novel metric named Relative Spuriousness (RS), which considers a bias's influence on the model's decision-making.

Methods
This section describes how we identify, understand, and mitigate spurious correlations in hate speech (HS) detection.

Identifying Spurious Correlations
Current HS models often suffer from spurious correlations (Ramponi and Tonelli, 2022; Bose et al., 2022; Wang et al., 2022): they tend to rely on prediction patterns that hold for the majority of examples but do not hold in general. This might cause the models to be biased in applications (Ramponi and Tonelli, 2022; Gardner et al., 2021). In this section, we identify potential spurious correlations at two levels across multiple datasets.

Token-level Spurious Correlations
One specific label (e.g., hateful) might be highly correlated with certain tokens (e.g., "women") in the datasets (Wang et al., 2022; Ramponi and Tonelli, 2022). As a result, models might learn to make biased predictions based only on those token-level correlations while neglecting the whole semantic meaning. To identify such token-level spurious correlations, we utilize a token-level PMI-based search method (Ramponi and Tonelli, 2022) to find biased words in HS. Specifically, following Ramponi and Tonelli, 2022, we compute the PMI, using the equation described in Appendix A.1, between every word and the hateful label. After that, we select the most highly correlated words for further investigation across the 9 datasets. This leads to a large set of identity-related words (over 80, shown in Figure 1 as an example) that are highly correlated with the hateful label while the words themselves are neutral. We also observe that such identity words are often common across different datasets. Thus, we treat these identity words as general token-level spurious correlations.
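As a concrete illustration, a minimal sketch of this token-level scoring is given below. It uses the standard PMI definition rather than the exact variant of Ramponi and Tonelli, 2022 described in Appendix A.1, and the dataset loader in the usage comment is hypothetical.

```python
# Minimal sketch: rank words by PMI with the hateful label (standard PMI,
# document-level occurrence counts; an assumption, not the exact Appendix A.1 variant).
import math
from collections import Counter

def token_pmi(texts, labels, target_label="hateful", min_count=5):
    """Return {word: PMI(word, target_label)} over parallel lists of texts and labels."""
    n_docs = len(texts)
    word_counts = Counter()    # number of documents containing the word
    joint_counts = Counter()   # number of hateful documents containing the word
    label_count = sum(1 for y in labels if y == target_label)

    for text, y in zip(texts, labels):
        for w in set(text.lower().split()):
            word_counts[w] += 1
            if y == target_label:
                joint_counts[w] += 1

    p_label = label_count / n_docs
    scores = {}
    for w, c in word_counts.items():
        if c < min_count or joint_counts[w] == 0:
            continue
        p_word = c / n_docs
        p_joint = joint_counts[w] / n_docs
        scores[w] = math.log(p_joint / (p_word * p_label))
    return scores

# Usage (hypothetical loader): inspect the top-PMI words for identity terms.
# texts, labels = load_hs_dataset(...)
# top = sorted(token_pmi(texts, labels).items(), key=lambda kv: -kv[1])[:50]
```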

Sentence-level Spurious Correlations
Beyond token-level identity words, certain sentence-level patterns might also be highly correlated with specific labels and might therefore be spurious. For example, on a platform with political discussions, patterns such as "The wall should" and "The wall is" are usually connected with discontent about the US border wall, and a large number of such posts are hateful; the grammar patterns themselves, however, are completely neutral. To automatically discover such sentence-level spurious correlations, following Friedman et al., 2022, we induce a grammar for the HS training data and obtain maximum-likelihood parse trees in an unsupervised manner. Specifically, we use a probabilistic context-free grammar (PCFG), which contains a distinguished start symbol S, terminal symbols V (words), non-terminal symbols N, and rules of the form A → B ∈ R, where A ∈ N ∪ {S} and B ∈ N ∪ V. We use the same parameterization and training methods as Kim et al., 2019. We then compute the mutual information between grammar patterns (non-terminal roots) and labels and keep the highly correlated ones, where patterns are defined in terms of grammar subtrees. Applying this method to the 9 hate speech datasets, we find that different datasets have different highly correlated patterns (see Section 4.5.1 for examples). As a result, such patterns can be considered potential data-specific spurious correlations.
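The sketch below illustrates only the pattern-scoring step, assuming the grammar subtree patterns of each sentence have already been extracted from its maximum-likelihood parse (e.g., with a compound PCFG following Kim et al., 2019); the function and variable names are illustrative.

```python
# Minimal sketch: score candidate grammar patterns by the mutual information
# between their per-sentence occurrence indicator and the binary label.
from collections import Counter
from sklearn.metrics import mutual_info_score

def pattern_mutual_information(sentence_patterns, labels, min_count=20):
    """sentence_patterns: list of sets of pattern ids (one set per sentence).
    labels: list of 0/1 labels (1 = hateful). Returns {pattern: MI}."""
    counts = Counter(p for pats in sentence_patterns for p in pats)
    scores = {}
    for pattern, c in counts.items():
        if c < min_count:
            continue
        indicator = [int(pattern in pats) for pats in sentence_patterns]
        scores[pattern] = mutual_info_score(indicator, labels)
    return scores

# The highest-MI patterns become candidate sentence-level correlations and are
# then validated with the RS metric defined in the next subsection.
```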

Spuriousness Validation
After discovering these sentence-level spurious correlation candidates, we further validate their spuriousness to identify genuine spurious correlations. We define Relative Spuriousness (RS) based on the necessity of a feature's influence on the model's prediction.
Our definition relies on two properties of spurious features: a feature's presence significantly impacts the prediction of a specific label for a biased model, while ideally it is unimportant for an unbiased model.
Definition 1 (Relative Spuriousness (RS)) The RS of a feature x_i for the label y is:

RS(x_i, y) = D_l(x_i, y) − D_g(x_i, y).

Here P_l indicates the local probability, which we model with the softmax output of the HS detector trained on a single HS dataset, and P_g represents the global probability, which we model with the average response of the models trained on every HS dataset (note that we do not use a single model trained on all datasets because of class-imbalance problems). D_l and D_g denote the local and global probability difference between HS examples with and without the feature x_i, respectively. Intuitively, a spurious feature has a great impact on the local probability and almost no impact on the global probability; it therefore has a large D_l(x_i, y) and a small D_g(x_i, y), resulting in a higher RS. Compared with prior work (Joshi et al., 2022), we consider a feature's relative influence on the model's prediction both locally and globally, filling the gap of considering features other than x_i, which makes RS a reasonable and practical metric for finding spurious features. We select the features whose RS is higher than a certain threshold as data-specific biases. The following sections investigate the origin of the above biases and propose two intervention approaches to mitigate them.
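For concreteness, a minimal sketch of how RS can be estimated under this definition is given below. The helpers local_prob, global_probs, and remove_feature are hypothetical placeholders for the trained per-dataset detectors and the feature-removal (masking or deletion) operation, not part of the original method description.

```python
# Minimal sketch of the RS estimate: average local probability shift minus
# average global probability shift when the candidate feature is removed.
def relative_spuriousness(texts_with_feature, remove_feature, local_prob, global_probs):
    """texts_with_feature: examples containing the candidate feature.
    local_prob(text): P(hateful | text) from the detector trained on this dataset.
    global_probs: list of such callables, one per HS dataset."""
    d_local, d_global = 0.0, 0.0
    for text in texts_with_feature:
        counterfactual = remove_feature(text)
        # Local difference: dataset-specific detector.
        d_local += local_prob(text) - local_prob(counterfactual)
        # Global difference: average response of detectors trained on each dataset.
        d_global += (sum(p(text) for p in global_probs) / len(global_probs)
                     - sum(p(counterfactual) for p in global_probs) / len(global_probs))
    n = len(texts_with_feature)
    return d_local / n - d_global / n

# Features whose RS exceeds a chosen threshold are kept as data-specific biases.
```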

Understanding Spurious Correlations
With the identified spurious correlations in HS, it is of great need to study how they are generated and how models might suffer from them by making biased predictions. In this section, we visualize the relations between spurious correlations and the resulting bias from a causal perspective (Feder et al., 2021; Hardt et al., 2016; Kilbertus et al., 2017; Vig et al., 2020). Specifically, we use the Structural Causal Model (SCM) (Pearl et al., 2000), a conceptual model that describes the causal mechanisms of a system, to recognize confounders, i.e., variables that influence both the dependent and independent variables and thereby cause spurious correlations. We use directed acyclic graphs (DAGs) G = {V, f} to describe the causal relationships between variables, where V refers to the variable nodes and the edges denote causal relation functions f. The SCM for HS detection is depicted in Figure 2a, where identities I and the corresponding context C constitute the input data X.
The Confounders. Two confounders G1 and G2 break the independence of the generation distributions of their correlated variables, causing spurious correlations. The two confounders are as follows.
• G1: the confounder that affects the distributions of both P and the vulnerable identities I. For example, the imbalanced appearance frequencies of people of different races in the pretraining data of pretrained language models may constitute this confounder.
• G2: the confounder that influences the distribution of vulnerable objectives I and their context C. The source of the training data may compose this confounder. For instance, de Gibert et al. 2018 collect data from Stormfront, a white supremacist forum, where mentions of people of color (POC) usually occur in hateful contexts.

Mitigating the Bias in HS
Based on the above two confounders that cause spurious correlations, we then propose two intervention techniques to mitigate the bias in HS.

Multi-Task Intervention (MTI)
The motivation for MTI is twofold. Firstly, MTI tries to mitigate the spurious correlations between the PLM's pretraining corpus P and vulnerable objectives I (caused by G1) in order to resolve the two kinds of biases in Section 3.1; by training with MTI objectives on a wide range of HS, we expect that the original distributional connection between P and I can be altered. Secondly, most HS is composed of unregulated sentences, and MTI helps the PLM obtain more robust representations for HS. Specifically, we introduce two auxiliary tasks on the PLM to alleviate the bias from the pre-training corpus: (1) Masked Language Modeling (MLM) (Devlin et al., 2019): we randomly mask 15% of the tokens in hateful sentences from the 9 HS datasets and let the PLM predict the masked tokens with a language modeling objective. (2) Multi-Task Learning (MTL): we define different tasks as different hate speech datasets D_i = {(x_i^j, y_i^j)}_{j=1}^{n_i} and train on all the datasets {D_i}_{i=1}^{N} (see Section 4.1 for details on the datasets) simultaneously, using the same PLM and different classification heads.
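A minimal sketch of the MTL component is shown below, assuming a standard Hugging Face Transformers setup; the class name and the way batches are routed are illustrative choices rather than the paper's reference implementation.

```python
# Minimal sketch: one shared PLM encoder with a separate classification head
# per HS dataset (task). Loss weighting and MLM interleaving are simplified.
import torch.nn as nn
from transformers import AutoModel

class MultiTaskHSDetector(nn.Module):
    """Shared PLM encoder with one classification head per HS dataset."""

    def __init__(self, model_name="bert-base-cased", num_datasets=9, num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # shared PLM
        hidden = self.encoder.config.hidden_size
        # One head per dataset; all tasks share the encoder parameters.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, num_labels) for _ in range(num_datasets)]
        )

    def forward(self, input_ids, attention_mask, task_id):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] representation
        return self.heads[task_id](cls)
```

During training, batches from dataset i are routed to head i, while MLM batches on hateful sentences update the shared encoder with the standard masked-language-modeling objective.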

Data-Specific Intervention (DSI)
To mitigate the influence of G2, and inspired by recent progress in counterfactual learning (Zeng et al., 2020; Sen et al., 2021; Eisenstein, 2022; Garg et al., 2019; Davani et al., 2021), we propose a counterfactual generator for DSI. Recent works (Mishra et al., 2022; Ouyang et al., 2022) show that powerful Large Language Models (LLMs) (Brown et al., 2020; Touvron et al., 2023; OpenAI, 2023b) with instruction prompts can achieve outstanding performance on many NLP tasks. Therefore, we use off-the-shelf GPT-3 with instructions as the backbone of the counterfactual generator. The overall framework of DSI is illustrated in Figure 3. After we find spurious features using the method described in Section 3.1.2, we employ different instructional prompts for general biases (identity) and data-specific biases. The details on prompt design can be found in Appendix A.4. In practice, we find that the counterfactual data may not always retain the label specified in the prompt. Thus we further conduct majority voting with the models trained on different HS datasets to make sure the generated counterfactual data are non-hateful, which further controls the quality of the generated data. We then train the HS detector with the original training data plus the counterfactual ones.
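The sketch below illustrates the generate-then-filter step of DSI under a few assumptions: the prompt wording is a placeholder (the actual instructions are given in Appendix A.4), and llm_generate and detectors are hypothetical handles to the off-the-shelf LLM and the per-dataset HS detectors.

```python
# Minimal sketch of the DSI generate-then-filter step: prompt an LLM for a
# non-hateful counterfactual, then keep it only if a majority of per-dataset
# HS detectors also label it non-hateful.
def generate_counterfactuals(sentences, spurious_feature, llm_generate, detectors, min_votes=None):
    min_votes = min_votes or (len(detectors) // 2 + 1)
    kept = []
    for sent in sentences:
        # Placeholder prompt, not the paper's exact instruction from Appendix A.4.
        prompt = (
            "Rewrite the following sentence so that it keeps the phrase "
            f"'{spurious_feature}' but is clearly non-hateful:\n{sent}"
        )
        candidate = llm_generate(prompt)
        # Majority vote across detectors trained on different HS datasets.
        votes = sum(1 for d in detectors if d(candidate) == "non-hateful")
        if votes >= min_votes:
            kept.append((candidate, "non-hateful"))
    return kept

# The kept counterfactuals are added to the original training data before
# retraining the HS detector.
```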

Main results
The main experiment results are shown in Tables 1 and 2, respectively. As shown in Table 1, we observe that LLMs' performances on different datasets vary greatly. Besides, few-shot LLMs cannot consistently outperform zero-shot ones, which differs from other tasks. These results indicate that off-the-shelf LLMs are not always good HS detectors, which is consistent with the results in (Ziems et al., 2023). One reason may be that the grammatical structure of HS does not conform to standard grammatical norms; for example, authors of online HS hardly use complete sentences. As a result, HS datasets contain a large proportion of such unregulated sentences, which may confuse the off-the-shelf GPT models. Among the LLMs we use, text-davinci-003 performs the best while ChatGPT has the worst performance. As shown in Table 2, we find that token-level DA methods can even impair the overall performance, indicating that these methods are not as effective as in other tasks. On the other hand, the three hidden-level DA methods do have a positive impact on performance. Previous debias methods such as artifact removal or masking also boost performance, but the improvement is subtle.

Generalization Analysis
Joshi et al. claim that building a "challenge set" to see whether interventions on the input cause model predictions to vary as expected is a standard method of testing a model's robustness. Hence, to verify the robustness of our method, we construct an OOD challenge set using non-hateful counterfactual data generated by GPT-3 and hateful data from CONAN (Bonaldi et al., 2022). Details on the OOD datasets can be found in Appendix A.3. Given MTL's strong performance in Table 1, we evaluate the other baselines on top of MTL, as shown in Table 3. Our proposed DSI outperforms the other methods by a larger margin than in Table 2, indicating our method's robustness and strong generalization ability.

Deep Dive of Sentence-level Biases
This part takes a deep dive into these sentence-level biases via qualitative analyses and visualization of their relative spuriousness distributions.

Potential Data-Specific Biases
After conducting grammar induction and investigating 180 grammar patterns, we observe the following categories with high PMI with the hateful label: (1) Absolute expression (8.8%): A large number of HS sentences contain absolute statements, where patterns like all the, all other, any other, etc. frequently occur.
In addition to the aforementioned types, there are other types of spurious patterns without cohesive themes in HS (see Table 6 for some examples). We also find that different datasets have different patterns. However, not all of the above patterns are spurious. The following section illustrates the RS distribution of these data-specific biases.

Bias Distribution Analysis
For potential data-specific biases, we use the RS described in Section 3.1.2 to validate their spuriousness. We visualize the distribution of these biases based on our proposed RS in Figure 4 in the Appendix. We find that most data-specific patterns' RS is positive, indicating that these highly correlated artifacts do make the model more likely to predict a particular label. However, the experiment results show that most of their RS values are less than 0.2. As a result, most data-specific biases have minimal impact on the HS detector's output. This is because most of these biased patterns do not contain hateful semantics: although they are statistically highly correlated with a specific label, the PLM can ignore them to some extent during inference.

Case Study of Token-Level Biases
To further analyze our proposal's effectiveness on identity biases, we analyze the errors made by the baseline model that are solved by our approach. We conduct the following case study on one of the 9 datasets used in this work, the white supremacist forum dataset (de Gibert et al., 2018). We randomly sample four false positive sentences from the test set where the baseline model makes mistakes. As shown in Table 4, the baseline model cannot correctly identify some complex non-hateful examples containing identity words. For example, sentences containing ethnicity-related words (e.g., jews, Blacks, hebrews, and Arabs) confuse the baseline detector. Besides, gender-related words (e.g., women) may also make the model more likely to classify a sentence as hateful. However, these sentences are non-hateful given their neutral contexts. Inspired by saliency methods (Simonyan et al., 2013; Sundararajan et al., 2017; Smilkov et al., 2017; Balkir et al., 2022), we compute the probability differences between the original sentences and identity-masked sentences to better interpret the model predictions. The average differences over the above samples are 0.76 for the baseline detector and 0.04 for our proposal, indicating that the impact of these identity words has been considerably reduced by our approach.
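A minimal sketch of this probability-difference analysis is shown below; predict_hateful_prob is a hypothetical handle to a trained detector, and replacing the identity word with a mask token is one simple instantiation of the masking step.

```python
# Minimal sketch: compare P(hateful) on the original sentence and on a copy
# with the identity word masked; a large positive difference means the
# identity word is driving the prediction.
def identity_saliency(sentence, identity_word, predict_hateful_prob, mask_token="[MASK]"):
    masked = sentence.replace(identity_word, mask_token)
    return predict_hateful_prob(sentence) - predict_hateful_prob(masked)
```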

Conclusion
This work investigates biases in hate speech detection at the lexical and sentence levels. Apart from the statistical correlation between artifacts and a specific label, we analyze the relative spuriousness of a feature based on its impact on local and global models, and find that most highly correlated pattern features do not have high RS. We then analyze the generation process of HS biases from a causal view, identify two confounders that cause the biases, and propose Multi-Task Intervention at the model level and Data-Specific Intervention at the data level to mitigate them. Noticeable performance improvements on nine HS datasets and a label-balanced challenge set indicate the effectiveness and robustness of our approach.

Limitations
Our work is subject to a few limitations. First, our experiments are limited to English datasets. However, hate speech in many other languages is also frequently found on social media platforms, and there are still many challenges in languages other than English, especially minority languages.
A thorough examination of our methods' effectiveness in languages other than English is necessary, which we leave as future work. Second, we use BERT-base-cased as the PLM backbone for most HS detectors following Ramponi and Tonelli, 2022, and we add GPT-3, ChatGPT and text-davinci-003 as baselines. The robustness of other PLMs of different scales or architectures to various biases still needs to be verified. Moreover, we only examine and mitigate biases in explicit HS datasets following previous works. However, biases in implicit HS (ElSherief et al., 2021b) are also important, and we leave them as future work. In addition, our proposed RS and mitigation methods are applicable to other text classification tasks, which we also leave as future work. Besides, diverse prompt designs for counterfactual generators need to be validated in future work. In this work, we only focus on single-turn HS detection following previous works, without considering user contexts (Yu et al., 2022), which is another practical setting. Finally, we find that some data is falsely annotated or annotated under criteria that differ across datasets. We do not modify them for fair comparison. However, it is necessary to unify the criteria for HS and correct problematic annotations. For token-level biases, although a large number of them are identity-related, there are still other highly correlated tokens; we do not investigate them and leave this as future work.

Ethics Statement
Our proposals in this research aim at mitigating biases and the accurate detection of HS. The main datasets utilized in the study are open-access and publicly available. The offensive terms and identity-related words included as examples are mainly used to help researchers analyze the models more effectively. We do not involve annotators in this process given the sensitive nature of this work, and to reduce the exposure of human participants to hateful content. We also add a content warning at the beginning of this paper to warn readers.
Figure 1: (a) Word cloud of frequent words in 9 widely used hate speech datasets and an example of how spurious correlations lead the model to make mistakes. The size of a word denotes its frequency of occurrence. (b) An example from Mandl et al., 2019 indicates that both token-level and sentence-level biases can introduce spurious correlations.

Figure 2:
SCM (Structural Causal Model) for HS detection with vulnerable identities. We omit the unmeasured variable U* for each variable for brevity. P → L: using corpus P to pretrain the language model L. (I, C) → X: vulnerable objectives I (e.g., identity words) and their contexts C constitute the input data X. (L, X) → Y: the language model L takes the data X as input and outputs the prediction Y. (a) Two confounders G1 and G2 bias the generation distributions of the pretraining corpus and the HS data. (b) MTI intervenes on the bias from G1, where P0 represents the training corpus of the auxiliary tasks. (c) DSI further intervenes on the bias from G2, where C0 denotes the newly generated context. To analyze grammar-pattern biases, we only need to change I; the other parts of the SCM remain the same.

Figure 3:
Our proposed pipeline (DSI) for finding spurious features and mitigating these biases. The blue part indicates the framework for token-level biases (identity). The yellow part represents the framework for data-specific biases. ⊗ indicates intersection. For all biases, we use off-the-shelf GPT-3 as the counterfactual generator, followed by a majority-voting filter. We use the counterfactual data along with the original training data to train the HS detector.

Table 1:
Experiment results (macro-F1 score) of two GPT baselines and BERT finetune with two MTI methods on 9 HS datasets. All Finetune results are averaged over three runs using three random seeds. GPT-3's performances vary greatly across datasets. MTL has an 8.61 F1 improvement on average compared to Finetune. † means our method.

Table 2:
… methods along with two token-level debias methods and DSI on 9 HS datasets. All these methods are based on MTI. All results are averaged over three runs using three random seeds. We conduct the significance test following Berg-Kirkpatrick et al., 2012, and the average estimated p-value is 0.003 (< 0.01) between DSI and SOTA baselines, demonstrating the significant differences.

Table 3:
Experiment results of different methods on the challenge set. Our proposed DSI achieves a more noticeable improvement over the other baselines in the OOD analysis.
3 Implementation details can be found in Appendix A.6.2

Table 4:
Case study on the white supremacist forum dataset. The identity words are highlighted. Before the intervention, the model makes mistakes on these non-hateful examples. Our proposed intervention can address such problems.

Examples | Before | After
This is going to stir up a lot of yellow hatred for the hebrews. | hateful | non-hateful
I wonder if one of these guys is a Jew. | hateful | non-hateful
What you say about Blacks, Arabs and Jews are LIES. | hateful | non-hateful
Are they talking about the African women with huge lip plates? | hateful | non-hateful