Improving Counterfactual Generation for Fair Hate Speech Detection

Bias mitigation approaches reduce models’ dependence on sensitive features of data, such as social group tokens (SGTs), resulting in equal predictions across the sensitive features. In hate speech detection, however, equalizing model predictions may ignore important differences among targeted social groups, as hate speech can contain stereotypical language specific to each SGT. Here, to take the specific language about each SGT into account, we rely on counterfactual fairness and equalize predictions among counterfactuals, generated by changing the SGTs. Our method evaluates the similarity in sentence likelihoods (via pre-trained language models) among counterfactuals, to treat SGTs equally only within interchangeable contexts. By applying logit pairing to equalize outcomes on the restricted set of counterfactuals for each instance, we improve fairness metrics while preserving model performance on hate speech detection.


Introduction
Hate speech classifiers have high false-positive error rates in documents mentioning specific social group tokens (SGTs; e.g., "Asian", "Jew"), due in part to the high prevalence of SGTs in instances of hate speech (Wiegand et al., 2019; Mehrabi et al., 2019). When propagated into social media content moderation, this unintended bias (Dixon et al., 2018) leads to unfair outcomes, e.g., mislabeling mentions of protected social groups as hate speech.
For prediction tasks in which SGTs do not play any special role (e.g., in sentiment analysis), unintended bias can be reduced by optimizing group-level fairness metrics such as equality of odds, which statistically equalizes model performance across all social groups (Hardt et al., 2016; Dwork et al., 2012). However, this assumption does not hold in hate speech detection, where SGTs provide key information for the task (see Fig. 1). Instead, bias mitigation in hate speech detection benefits from relying on individual-level fairness metrics such as counterfactual fairness, which assess the output variation resulting from changing the SGT in individual sentences (Garg et al., 2019; Kusner et al., 2017). Derived from causal reasoning, a counterfactual applies the slightest change to the actual world to assess the consequences in a similar world (Stalnaker, 1968; Lewis, 1973).
Accordingly, existing approaches for reducing bias in hate speech detection using counterfactual fairness learn robust models whose outputs are not affected by changing the SGT in the input (Garg et al., 2019). However, a drawback of such approaches is the lack of semantic analysis of the input to identify whether changing the SGT leads to a small enough change that preserves the hate speech label (Kasirzadeh and Smart, 2021). For instance, in the hateful statement "mexicans should go back to their sh*thole countries", substituting "mexicans" with "women" changes the hate speech label, while using "Hispanics" should preserve the output. Here, we aim to create counterfactuals that maximally preserve the sentence and disregard counterfactuals that violate the requirement for being the "closest possible world" (Fig. 1).
To this end, we develop a counterfactual generation method which filters candidate counterfactuals based on their difference in likelihood from the actual sentence, estimated by a pre-trained language model with known stereotypical correlations (Sheng et al., 2019). Intuitively, our method provides outputs that are robust with regard to the context and are not causally dependent on the presence of specific SGTs. This use of sentence likelihood is inspired by Nadeem et al. (2020) as it captures the similarity of an SGT and its surrounding words to prevent unlikely SGT substitutions. As a result, only counterfactuals with equal or higher likelihoods compared with the original ("closest possible worlds") are used during training. To enforce robust outputs for similar counterfactuals, we apply logit pairing (Kannan et al., 2018) on outputs for sentence-counterfactual pairs, adding their average differences to the classification loss. Our method (1) effectively identifies semantically similar counterfactuals and (2) improves fairness metrics while preserving classification performance, compared with other strategies for generating counterfactuals.

Related Work
Unintended bias in classification is defined as differing model performance on subsets of datasets that contain particular SGTs (Dixon et al., 2018; Mehrabi et al., 2019). To mitigate this bias, data augmentation approaches are proposed to create balanced labels for each SGT or to prevent biases from propagating to the learned model (Dixon et al., 2018; Zhao et al., 2018; Park et al., 2018). Other approaches apply regularization of post-hoc token importance (Kennedy et al., 2020b), or adversarial learning for generating fair representations (Madras et al., 2018; Zhang et al., 2018) to minimize the importance of protected features.
By altering sensitive features of the input and assessing the changes in the model prediction, counterfactual fairness (Kusner et al., 2017) seeks causal associations between sensitive features and other data attributes and outputs. Similarly, counterfactual token fairness applies counterfactual fairness to tokens in textual data (Garg et al., 2019).
Counterfactual fairness presupposes that the counterfactuals are close to the original world. However, previous work has yet to quantify this similarity in textual data. Key to our proposed framework is evaluating the semantic similarity between the original and the synthetically generated instances to only consider counterfactuals that convey similar sentiment. Consequently, our method prevents synthetic counterfactuals unlikely to exist in real-world samples which (1) decrease classification accuracy by adding noise into the training process and (2) misdirect fairness evaluation by introducing unexpected criteria.

Method
We propose a method for improving individual fairness in hate speech detection by considering the interchangeable role of SGTs in each specific context. Given an instance x ∈ X and a set of SGTs S, we seek to equalize outputs of a classifier f for x and its counterfactuals x_cf, generated by substituting the SGT mentioned in x.
First, we provide the definition of counterfactual token fairness (CTF), which can be evaluated for a model over a dataset of sentences and their counterfactuals (Sec. 3.1). Next, we specify how counterfactual logit pairing (CLP) regularizes CTF in a classification task (Sec. 3.2). Lastly, we introduce our counterfactual generation method for Assessing Counterfactual Likelihoods (ACL, Sec. 3.3), which is driven by linguistic analysis of stereotype language in sentences.

Counterfactual Token Fairness (CTF)
Given an instance x ∈ X and a set of counterfactuals x_cf, generated by perturbing the mentioned SGTs, the CTF for a classifier f = σ(g(x)) is:
$$\mathrm{CTF}(X, X_{cf}) = \frac{1}{|X|} \sum_{x \in X} \frac{1}{|x_{cf}|} \sum_{x' \in x_{cf}} \left| g(x) - g(x') \right|$$
where g(x) returns the logits for x (Garg et al., 2019). Lower CTF indicates similar (i.e., fairer) outputs for sentences and their counterfactuals.
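As a minimal sketch of how CTF might be computed, assume it is the mean absolute difference between the logit of each sentence and the logits of its counterfactuals, averaged first per sentence and then over the dataset. The names `ctf_gap` and `logit_fn` are hypothetical; `logit_fn` stands in for g and is assumed to return a single scalar logit per sentence.

```python
from typing import Callable, Dict, List


def ctf_gap(
    logit_fn: Callable[[str], float],
    sentences: List[str],
    counterfactuals: Dict[str, List[str]],
) -> float:
    """Mean absolute logit gap between sentences and their counterfactuals.

    Assumption: one scalar logit per sentence; sentences without any
    counterfactual are skipped rather than counted as a zero gap.
    """
    per_sentence = []
    for x in sentences:
        cfs = counterfactuals.get(x, [])
        if not cfs:
            continue
        g_x = logit_fn(x)
        gaps = [abs(g_x - logit_fn(x_cf)) for x_cf in cfs]
        per_sentence.append(sum(gaps) / len(gaps))
    if not per_sentence:
        return 0.0
    return sum(per_sentence) / len(per_sentence)
```

A value of 0 would mean that the classifier assigns identical logits to every sentence and all of its counterfactuals; larger values indicate stronger dependence on the SGT.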

Counterfactual Logit Pairing (CLP)
To reduce CTF while training a hate speech classifier, we apply counterfactual logit pairing (CLP) (Kannan et al., 2018) to all instances and their counterfactuals. CLP penalizes prediction divergence among inputs and their counterfactuals by adding the average absolute difference in logits of the inputs and their counterfactuals to the training loss:
$$\sum_{x \in X} c(f(x), y) + \lambda \sum_{x \in X} \frac{1}{|x_{cf}|} \sum_{x' \in x_{cf}} \left| g(x) - g(x') \right|$$
where c calculates the classification loss for an output f(x) and its correct label y, and λ tunes the influence of the counterfactual fairness loss, the impact of which is discussed in the Appendix.
Table 1: Stereotypical Sentences (from Gab)
Communists and dictators are desperate to get rid of god. His blessing overcomes the fearful evils of this fallen world.
Dumb ass n**** don't realize you actually have to work your ass off on a farm. It doesn't just magically happen now that they've stolen the land from Whites.
Israel and the Islamist conspiracy to deny Jews their land.
Women. lie. about. rape.
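The CLP objective described in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training code used in the paper: binary cross-entropy is assumed as the classification loss c, each instance is assumed to have one scalar logit, and the names `clp_loss`, `bce`, and `cf_logits` are hypothetical.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce(logit: float, label: int) -> float:
    """Binary cross-entropy for one instance (assumed choice for c)."""
    p = sigmoid(logit)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))


def clp_loss(
    logits: List[float],
    labels: List[int],
    cf_logits: List[List[float]],
    lam: float = 0.2,
) -> float:
    """Classification loss plus lam times the average absolute logit gap.

    cf_logits[i] holds the logits of the counterfactuals of instance i;
    an empty list contributes no pairing penalty.
    """
    classification = sum(bce(g, y) for g, y in zip(logits, labels))
    pairing = 0.0
    for g, cfs in zip(logits, cf_logits):
        if cfs:
            pairing += sum(abs(g - g_cf) for g_cf in cfs) / len(cfs)
    return classification + lam * pairing
```

With `lam=0` the objective reduces to the plain classification loss; the default of 0.2 mirrors the coefficient reported in the Appendix.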

Counterfactual Generation
Rather than simplifying the model training by restricting CLP loss to all counterfactuals created by perturbing the SGTs in non-hate sentences (Garg et al., 2019), we identify similar counterfactuals based on likelihood analysis of each sentence. Our aim is to generate counterfactuals that preserve the likelihood of the original sentence.
In stereotypical sentences that target specific social groups, expecting equal outputs when changing the SGT ignores how specific vulnerable groups are targeted in text (Haas, 2012). Quantifying the change in a sentence as a result of perturbing SGTs has already been studied for detecting stereotypical language (Nadeem et al., 2020). Similarly to Nadeem et al., we apply a generative language model (GPT-2; Radford et al., 2019) to evaluate the change in sentence likelihood caused by substituting an SGT. For example, we expect the language model to predict a decrease in likelihood when "Muslim" or "Arab" in a sentence about terrorism is replaced with other SGTs.
Since GPT-2 uses the left context to predict the next word, for each word x_i in the sentence, the likelihood of x_i, P(x_i | x_0, ..., x_{i−1}), is approximated by the softmax of x_i with respect to the vocabulary. Therefore, the log-likelihood of a sentence x = x_0, x_1, ..., x_{n−1} is computed with:
$$\lg P(x) = \sum_{i=0}^{n-1} \lg P(x_i \mid x_0, \ldots, x_{i-1})$$
We identify correct counterfactuals by comparing their log-likelihood to that of the original sentence and create the set of all correct counterfactuals x_cf by including counterfactuals with equal or higher likelihood compared with x:
$$x_{cf} = \{\, x' \mid x' \in \mathrm{substitute}(x, S),\ \lg P(x') \geq \lg P(x) \,\}$$
in which substitute(x, S) creates the set of all perturbed instances by substituting the SGT in x with another SGT from the list of all SGTs S, which in this paper is a list of 77 SGTs (see Appendix), compiled from Dixon et al. (2018) and extended using WordNet synsets (Fellbaum, 2012).
[Table 2. Columns: Rank, # Items, # Choices, Accuracy (mean), Agreement]
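The likelihood-based filter for counterfactuals can be sketched as below. This is a simplified illustration: `log_likelihood` is an assumed callable that stands in for a language model scorer such as GPT-2 (no model is loaded here), whitespace tokenization with exact token matching is an assumption, and all function names and the `GROUP_*` placeholder tokens are hypothetical.

```python
import math
from typing import Callable, List


def sentence_log_likelihood(cond_probs: List[float]) -> float:
    """Sum of log conditional probabilities P(x_i | x_0 .. x_{i-1}).

    Natural log is used here; the filter below does not depend on the base.
    """
    return sum(math.log(p) for p in cond_probs)


def substitute(sentence: str, sgt: str, sgts: List[str]) -> List[str]:
    """All sentences obtained by swapping `sgt` for every other SGT."""
    tokens = sentence.split()
    candidates = []
    for other in sgts:
        if other == sgt:
            continue
        candidates.append(" ".join(other if t == sgt else t for t in tokens))
    return candidates


def acl_counterfactuals(
    sentence: str,
    sgt: str,
    sgts: List[str],
    log_likelihood: Callable[[str], float],
) -> List[str]:
    """Keep candidates whose log-likelihood is at least that of the original."""
    base = log_likelihood(sentence)
    return [c for c in substitute(sentence, sgt, sgts) if log_likelihood(c) >= base]


# Toy scorer with made-up log-likelihoods, for illustration only.
toy_scores = {
    "GROUP_A people live here": -5.0,
    "GROUP_B people live here": -4.0,
    "GROUP_C people live here": -9.0,
    "GROUP_D people live here": -5.0,
}
kept = acl_counterfactuals(
    "GROUP_A people live here",
    "GROUP_A",
    ["GROUP_A", "GROUP_B", "GROUP_C", "GROUP_D"],
    lambda s: toy_scores[s],
)
```

In this toy setting the `GROUP_C` substitution is discarded because it is less likely than the original, while the `GROUP_B` and `GROUP_D` substitutions are retained.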

Experiments
Here, we apply our method for generating counterfactuals (Sec. 3.3) to a large corpus to explore the method's ability to identify similar counterfactuals. Then, we apply CLP (Sec. 3.2) with different strategies for counterfactual generation and compare them to our approach, introduced in Sec. 3.3.

Evaluation of Generated Counterfactuals
Data. We randomly sampled 15 million posts from a corpus of social media posts from Gab (Gaffney, 2018), and selected all English posts that mention one SGT (N ≈ 2M). The log-likelihoods of each post and its candidate counterfactuals were computed. The primary outcome was the original instance's rank in log-likelihood amongst its counterfactuals. A higher rank for the mentioned SGT indicates more stereotypical content in the sentence. We conducted two qualitative analyses with human annotators to evaluate the generated counterfactuals. First, we selected sentences in which the highest ranks were assigned to the original SGTs and asked annotators to predict the mentioned SGT in a fill-in-the-blank test. If our method correctly ranks SGTs based on the context, we expect annotators to predict the original SGTs in such sentences. Then, we randomly selected a set of sentences and evaluated our method on finding the preferable counterfactual among a pair of candidates by comparing its choices to those of the annotators.
Human annotators were drawn from among the authors of the paper, with backgrounds in computer science and social science. All annotators had previous experience with annotating hate speech content. However, they did not have any experience with the exact sentences in the evaluated dataset, given that the sentences were randomly selected from a dataset of 1.8M posts, collected by other researchers cited in the paper.
We preferred expert annotators over novice coders in this specific case, because previous studies have indicated that expert coders perform better in hate speech annotation (Waseem, 2016). Moreover, annotators' cognitive biases and perceived stereotypes can greatly impact their judgments in detecting hate speech (Sap et al., 2019). Therefore, we preferred to have expert annotators with a shared understanding of the definition of stereotypes and hate speech, who are consequently less subjective in their judgments.
Results. In 2.9% of sentences the original SGT achieves the highest ranking. In 86.03% of the posts where the original SGT is ranked second, the top-ranked SGT is from the same social category (e.g., both SGTs referred to race or gender). We randomly selected 500 original posts with highest likelihood among their counterfactuals (Table 1 shows such samples) to qualitatively assess their stereotypicality in a fill-in-the-blank style test with human subjects. Three annotators, on average, identified the correct SGT from 4 random choices for 74.88% of posts. In a second evaluation, given sentences and two counterfactuals, annotators were asked to identify which SGT substitution preserves the hate speech and likelihood of the sentence. On average, annotators agreed with the model's choice in 63.07% of the test items. Table 2 demonstrates accuracy and agreement scores of annotators.
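The rank-based outcome used in this evaluation can be illustrated with a short sketch. It assumes that rank 1 is the highest rank and that ties do not push the original sentence down; the function names are hypothetical.

```python
from typing import List


def original_rank(original_ll: float, counterfactual_lls: List[float]) -> int:
    """1-based rank of the original sentence among its counterfactuals.

    Rank 1 means no counterfactual has a strictly higher log-likelihood.
    """
    return 1 + sum(1 for ll in counterfactual_lls if ll > original_ll)


def share_top_ranked(ranks: List[int]) -> float:
    """Fraction of posts in which the original SGT is ranked first."""
    if not ranks:
        return 0.0
    return sum(1 for r in ranks if r == 1) / len(ranks)
```

Applied to a corpus, `share_top_ranked` would yield the kind of proportion reported above for posts in which the original SGT achieves the highest ranking.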

Fair Hate Speech Detection
We apply our counterfactual generation method to hate speech detection and equalize model outputs for sentences and their similar counterfactuals.
Compared Methods. We fine-tuned BERT (Devlin et al., 2019) classifiers with CLP loss, using five approaches for generating counterfactuals: 1) CLP+ACL applies our approach for Assessing Counterfactual Likelihoods (Sec. 3.3), 2) CLP+NEG considers all counterfactuals for negative instances (Garg et al., 2019), 3) CLP+SG substitutes SGTs from the same social categories (inspired by Sec. 4.1), e.g., it replaces a racial group with other racial groups, 4) CLP+Rand substitutes SGTs with random words, and 5) CLP+GV substitutes SGTs with the ten most similar SGTs based on their GloVe word embeddings (Pennington et al., 2014). As baseline models we consider a vanilla fine-tuned BERT (BERT) and a fine-tuned BERT model that masks the SGTs (MASK).
Data. We trained models on the Gab Hate Corpus (GHC; Kennedy et al., 2020a) and on the annotated corpus of de Gibert et al. (2018) (Storm); both datasets are described in the Appendix.
Evaluation Metrics. We compute CTF on two datasets of counterfactuals.
(1) Similar Counterfactuals (SC; collected from Dixon et al. (2018)) includes synthetic, non-stereotypical instances based on templates (e.g., <You are a ADJ SGT>). In such instances, the sentence is not specific to the SGT, and the model prediction should depend solely on the ADJs, so smaller values of CTF are indicative of a fairer model.
(2) Dissimilar Counterfactuals (DC; from Nadeem et al. (2020)) includes stereotypical sentences and their counterfactuals generated by perturbing SGTs. Since instances are stereotypical, we expect a fair model to ignore all counterfactuals and to yield lower CTF scores.
We also report group fairness metrics (equality of odds). The standard deviations of true positive (TP) and true negative (TN) rates across SGTs are reported for a held-out test set (20% of the dataset) and for instances generated by perturbing the SGTs. The standard deviation of the false positive rate (FPR) across SGTs is also reported for a dataset of non-hateful New York Times sentences. Lower standard deviations indicate higher group fairness.
Results. Table 3 shows the results of these experiments on GHC and Storm. Our model (highlighted in Table 3) for generating counterfactuals enhances CTF while improving or preserving classification performance and group fairness (TP, TN, and FPR) on both datasets. The increase in classification performance demonstrates our method's capability in filtering noisy synthetic samples. These results call for further exploration of when fair models should treat SGTs equally. Rather than expecting equal results over all instances, fair predictions should be based on contextual information embedded in the sentences.
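The group fairness summary described in this section, the spread of true positive and true negative rates across SGTs, can be sketched as follows. The record format, the function name, and the use of the population standard deviation (`statistics.pstdev`) are assumptions made for illustration; the text does not state which estimator was used.

```python
from statistics import pstdev
from typing import Dict, Iterable, Tuple


def rate_std_across_groups(
    records: Iterable[Tuple[str, int, int]],
) -> Tuple[float, float]:
    """Standard deviation of TP and TN rates across SGTs.

    Each record is (sgt, true_label, predicted_label) with labels in {0, 1}.
    Groups lacking positives (or negatives) are left out of the
    corresponding rate.
    """
    counts: Dict[str, Dict[str, int]] = {}
    for sgt, y_true, y_pred in records:
        c = counts.setdefault(sgt, {"tp": 0, "pos": 0, "tn": 0, "neg": 0})
        if y_true == 1:
            c["pos"] += 1
            c["tp"] += int(y_pred == 1)
        else:
            c["neg"] += 1
            c["tn"] += int(y_pred == 0)
    tprs = [c["tp"] / c["pos"] for c in counts.values() if c["pos"]]
    tnrs = [c["tn"] / c["neg"] for c in counts.values() if c["neg"]]
    tpr_std = pstdev(tprs) if tprs else 0.0
    tnr_std = pstdev(tnrs) if tnrs else 0.0
    return tpr_std, tnr_std
```

Both values are 0 when every SGT has the same true positive and true negative rates, which corresponds to equality of odds holding exactly on the evaluated sample.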

Conclusion
Our method treats social groups equally only within interchangeable contexts by applying logit pairing on a restricted set of counterfactuals. We demonstrated that biased pre-trained language models could enhance counterfactual fairness by identifying stereotypical sentences. Our method improved counterfactual token fairness and classification accuracy by filtering unlikely counterfactuals. Future work may explore semantic-based techniques for creating counterfactuals in domains other than hate speech detection, e.g., crime prediction, to better contextualize definitions of social group equality.

Broader Impact Statement
Our paper investigates bias mitigation in hate speech detection. This task is of great sensitivity because of the impact of online hate speech on minority social groups. While most discussions in the field of Ethics of AI focus on equalizing biases against different social groups from pre-trained language models, we make use of this bias to identify stereotypical or conspiratorial hate speech in social media. In doing so, we aim to ensure that hate speech detection models learn these linguistic associations of stereotypes, protecting social groups from rhetoric that explicitly targets them.

A Appendix
All data is uploaded to Dropbox.
For each data point, the runtime of generating 64 counterfactuals along with their perplexity scores was about 1.5 seconds on instances with NVIDIA Tesla P4 Virtual Workstations and NVIDIA GeForce RTX 2080 SUPER and about 2.6 seconds on instances with NVIDIA Tesla K80 GPUs.
Hyperparameters We used the pre-trained GPT-2 model from the transformers library by Hugging Face with 12-layer, 768-hidden, 12-heads, 117M parameters.
Dataset We downloaded the public dump of Gab posts, which contains more than 34 million posts from August 2016 to October 2018. After dropping malformed records and posts with a small number of English tokens (non-English posts), we obtained nearly 15 million posts, referred to as SGT-Gab. Data can be found in the accompanying zip file.

A.3 Study 2
Implementation Details Each of the seven models was trained on 80% of the given dataset (either GHC or Storm), (dataset train.csv file) and tested on the remaining 20% (dataset test.csv file). The models were run on a single NVIDIA GeForce GTX 1080 GPU, where each epoch takes 3 seconds. Models were built in Python 3.6 and TensorFlow-GPU (Abadi et al., 2016).
Data cleaning was performed by applying the BertTokenizer (Wolf et al., 2020), and models were trained by fine-tuning Bert-For-Sequence-Classification initialized with pretrained "bert-base-uncased" with 12-layer, 768-hidden, 12-heads, and 110M parameters (Wolf et al., 2020). The λ coefficient was set to 0.2 for all models so that the counterfactual loss is weighted equally in all models.
Hate Speech Datasets Here we provide detail on the two training datasets from our experiments. The Gab Hate Corpus (GHC; Kennedy et al., 2020a) is an annotated corpus of English social media posts from the far-right network "Gab." Labels were generated by majority vote over all provided annotations of "CV" (Call for Violence) and "HD" (Human Degradation), which are two sub-types of hate speech. The final dataset includes 2254 positive labels of hate among 27557 items. Secondly, de Gibert et al. (2018) provide an annotated English corpus (Storm). We used posts included in "all files" and generated our own train and test subsets. The final dataset includes 1196 positive labels among 10944 items.
For each dataset, the train and test set were split based on maintaining the same ratio of SGTs in both sets. Similarly, in each fold of cross validation 20% of the train set was selected for validation purposes based on maintaining the same ratio of hate labels.
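One way to obtain a split that keeps the same ratio of SGTs in both sets is to sample the test portion separately within each SGT. The sketch below is an assumption about how such a split could be implemented, not a description of the released code; `split_by_sgt`, the `(text, sgt)` item format, and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Item = Tuple[str, str]  # (text, sgt)


def split_by_sgt(
    items: List[Item], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Item], List[Item]]:
    """Split items so each SGT contributes about test_fraction to the test set."""
    by_sgt: Dict[str, List[Item]] = defaultdict(list)
    for text, sgt in items:
        by_sgt[sgt].append((text, sgt))
    rng = random.Random(seed)  # fixed seed for a reproducible split
    train: List[Item] = []
    test: List[Item] = []
    for sgt in sorted(by_sgt):
        group = by_sgt[sgt][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

The same routine, keyed on the hate label instead of the SGT, would give the label-preserving validation split mentioned above.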

Fairness Evaluation Datasets
We used three out-of-domain datasets for evaluating fairness: First, an existing dataset of stereotypes in English ("Dissimilar Counterfactuals"; DC) collected by Nadeem et al. (2020) was applied, which contains two types of stereotypes: intersentence instances consist of a base sentence provided for a target group and a stereotypical sentence generated by annotators for the same group, while intrasentence instances are single sentences annotated as stereotypes. For each sentence, we substitute the target group with all our SGTs, resulting in 25565 samples.
Second, "Similar Counterfactuals" (SC) consists of 77k synthetic English sentences generated by Dixon et al. (2018). After removing sentences with fewer than 4 tokens, we ended up with 3200 sentences.
Third, following Kennedy et al. (2020b) we use a corpus of New York Times (NYT) articles to measure false positive rate. Specifically, for each SGT in our list (see Section A.1), we sampled 500 articles containing a mention of this SGT (and no other SGT mentions). This produced a balanced random sample of SGTs, which are heuristically assumed to have no hate speech (excepting rare occurrences, e.g., quotations).
Evaluation For evaluating the Counterfactual Token Fairness (CTF) among a sentence and the list of its counterfactuals, we computed the cosine similarity of the 2D logits, produced as the output of Bert-For-Sequence-Classification model. We then calculated the average of these similarities to get a CTF value for the sentence and computed the average of CTFs over the dataset.
Analysis of the Regularization Coefficient As mentioned in Section 3.2, the regularization coefficient λ controls the extent to which the counterfactual logit pairing formulation affects the training process. A larger value of λ is expected to increase the importance of bias mitigation while decreasing the essential classification performance. While in our experiments in Section 4.2 we set the same value of λ for all counterfactual pairing approaches, here we discuss λ as it creates a trade-off between classification accuracy and counterfactual token fairness. Figures 2 and 3 demonstrate the effect of λ on the three main approaches evaluated on the GHC and Storm datasets: 1) our approach for counterfactual generation (CLP+ACL), 2) Garg et al. (2019)'s approach, which considers all counterfactuals of non-hate samples (CLP+NEG), and 3) counterfactual generation based on similar social categories (CLP+SG). As the plots show, higher values of λ correspond to lower classification accuracy and lower (more desirable) counterfactual token fairness. These results also show that our proposed method, CLP+ACL, achieves higher accuracy and fairness compared to the other approaches across different values of λ. In our experiments reported in Table 3, λ is set to 0.2 for all approaches that are based on counterfactual pairing (CLP+*). We chose this value because, based on the results in Figures 2 and 3, it demonstrates the effect of the counterfactual pairing loss on improving the fairness metrics while preserving classification accuracy. Future applications of our approach should rely on fine-tuning λ during training.

A.4 Glossary
Unintended bias: When a model is biased with respect to a feature that it was not intended to be (e.g., race in a toxicity classifier).
Group Fairness: Fairness definitions that treat different groups equally (e.g., equality of odds, equality of opportunity).
Individual Fairness: Fairness definitions that ensure similar predictions for similar individuals (e.g., counterfactual fairness).
Equality of Odds: "A predictor Ŷ satisfies equalized odds with respect to protected attribute A and outcome Y, if Ŷ and A are independent conditional on Y. P(Ŷ = 1 | A = 0, Y = y) = P(Ŷ = 1 | A = 1, Y = y), y ∈ {0, 1}" (Hardt et al., 2016).
Equality of Opportunity: "A binary predictor Ŷ satisfies equal opportunity with respect to A and Y if P(Ŷ = 1 | A = 0, Y = 1) = P(Ŷ = 1 | A = 1, Y = 1)" (Hardt et al., 2016).
Counterfactual: Counterfactual conditionals are conditional sentences that assess the outcome under different circumstances. Here we use the definition of counterfactual questions from Garg et al. (2019), "How would the prediction change if the sensitive attribute referenced in the example were different?", with the SGT as the sensitive attribute.
Counterfactual reasoning: The process of drawing inferences from counterfactual conditionals, as compared to regular conditionals.
Stereotype: Stereotyping is a cognitive bias, deeply rooted in human nature (Cuddy et al., 2009) and omnipresent in everyday life through which humans can promptly assess whether an outgroup is a threat or not. Stereotyping, along with other cognitive biases, impacts how individuals create their subjective social reality as a basis for social judgements and behaviors (Greifeneder et al., 2017). Stereotypes are often studied in terms of the associations that automatically influence judgement and behavior when relevant social categories are activated (Greenwald and Banaji, 1995).

Approach | Non-hate Sample | Hate Sample
Garg et al. (2019) | All counterfactuals | No counterfactuals
Issues | Adding noisy synthetic data into the model, since SGTs cannot interchangeably appear in all contexts | Not supporting fairness for specific SGTs with high association with hate speech (Dixon et al., 2018)
Current approach | Counterfactuals with higher likelihood | Counterfactuals with higher likelihood
Improvement | Preventing counterfactuals with lower sentence likelihood, which can be noisy instances | Equalizing outputs for current instances and their more stereotypical counterfactuals