Mitigating Biases in Toxic Language Detection through Invariant Rationalization

Automatic detection of toxic language plays an essential role in protecting social media users, especially minority groups, from verbal abuse. However, biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection. These biases make the learned models unfair and can even exacerbate the marginalization of people. Considering that current debiasing methods for general natural language understanding tasks cannot effectively mitigate the biases in toxicity detectors, we propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns (e.g., identity mentions, dialect) with toxicity labels. We empirically show that our method yields lower false positive rates for both lexical and dialectal attributes than previous debiasing methods.


Introduction
As social media has become more and more popular in recent years, many users, especially members of minority groups, suffer from verbal abuse and assault. To protect these users from online harassment, it is necessary to develop tools that can automatically detect toxic language on social media. Indeed, many toxic language detection (TLD) systems have been proposed over the years based on different models, such as support vector machines (SVM) (Gaydhani et al., 2018), bi-directional long short-term memory (BiLSTM) (Bojkovskỳ and Pikuliak, 2019), logistic regression (Davidson et al., 2017), and fine-tuned BERT (d'Sa et al., 2020).
However, existing TLD systems exhibit some problematic and discriminatory behaviors (Zhou et al., 2021). Experiments show that tweets containing certain surface markers, such as identity terms and expressions in African American English (AAE), are more likely to be classified as hate speech by current TLD systems (Davidson et al., 2017; Xia et al., 2020), although some of them are not actually hateful. This issue is predominantly attributed to biases in the training datasets for TLD models; when models are trained on biased datasets, they inherit these biases and further exacerbate them during the learning process (Zhou et al., 2021). The biases in TLD systems can make opinions from members of minority groups more likely to be removed by online platforms, which may significantly hinder their experience as well as exacerbate the discrimination against them in real life.

* Work is not related to employment at Amazon.
1 The source code is available at https://github.com/voidism/invrat_debias.
So far, many debiasing methods have been developed to mitigate biases in learned models, such as data re-balancing (Dixon et al., 2018), residual fitting (He et al., 2019; Clark et al., 2019), adversarial training (Xia et al., 2020), and data filtering approaches (Bras et al., 2020; Zhou et al., 2021). While most of these methods are successful on other natural language processing (NLP) tasks, their performance on debiasing the TLD task is unsatisfactory (Zhou et al., 2021). A possible reason is that the toxicity of language is more subjective and nuanced than general NLP tasks, which often have unequivocally correct labels (Zhou et al., 2021). Current debiasing techniques reduce the biased behavior of models by correcting the training data or by measuring how difficult examples are to model, thereby preventing models from capturing spurious, non-linguistic correlations between input texts and labels; the nuance of toxicity annotation can make such techniques insufficient for the TLD task.
In this paper, we address this challenge by combining the TLD classifier with a selective rationalization method, which is widely used to interpret the predictions of complex neural networks. Specifically, we use the framework of Invariant Rationalization (INVRAT) (Chang et al., 2020) to rule out the syntactic and semantic patterns in input texts that are highly but spuriously correlated with the toxicity label, and mask such parts during inference. Experimental results show that INVRAT successfully reduces both lexical and dialectal biases in the TLD model with little compromise on overall performance. Our method avoids superficial correlations at the level of syntax and semantics and makes the toxicity detector learn to use generalizable features for prediction, thus effectively reducing the impact of dataset biases and yielding a fair TLD model.

Previous works
Debiasing the TLD Task Researchers have proposed a range of debiasing methods for the TLD task. Some of them mitigate biases by processing the training dataset. For example, Dixon et al. (2018) add non-toxic examples containing the identity terms highly correlated with toxicity to balance their distribution in the training dataset. Park et al. (2018) combine debiased word2vec with gender-swap data augmentation to reduce gender bias in the TLD task. Badjatiya et al. (2019) apply a strategy of replacing bias-sensitive words (BSW) in the training data based on multiple knowledge generalization.
Other researchers pay more attention to modifying the models themselves so that they learn less biased features. Xia et al. (2020) use adversarial training to reduce the tendency of the TLD system to misclassify AAE texts as toxic speech. Mozafari et al. (2020) propose a novel re-weighting mechanism to alleviate racial bias in English tweets. Vaidya et al. (2020) implement a multi-task learning framework with an attention layer to prevent the model from picking up spurious correlations between certain trigger words and toxicity labels.
Debiasing Other NLP Tasks Many methods have been proposed to mitigate biases in NLP tasks other than TLD. Clark et al. (2019) train a robust classifier in an ensemble with a bias-only model so that the robust classifier learns the more generalizable patterns in the training dataset, which are difficult for the naive bias-only model to capture. Bras et al. (2020) develop AFLITE, an iterative greedy algorithm that adversarially filters biases from the training dataset, along with a framework to support it. Utama et al. (2020) introduce a novel approach that regularizes the confidence of models on biased examples, which successfully makes the models perform well on both in-distribution and out-of-distribution data.

Basic Formulation for Rationalization
We propose TLD debiasing based on INVRAT in this paper. The goal of rationalization is to find a subset of the input that 1) suffices to yield the same outcome and 2) is human-interpretable. We normally prefer to find rationales in an unsupervised way because such annotations are lacking in the data. A typical formulation is as follows. Given input-output pairs $(X, Y)$ from a text classification dataset, we use a classifier $f$ to predict the labels $f(X)$. To extract the rationale, an intermediate rationale generator $g$ is introduced to find a rationale $Z = g(X)$, a masked version of $X$ that can be used to predict the output $Y$, i.e., that maximizes the mutual information between $Z$ and $Y$:

$$\max_{g} \; I(Y; Z) \quad \text{s.t.} \quad Z = g(X). \tag{1}$$

A regularization loss $\mathcal{L}_{reg}$ is often applied to keep the rationale sparse and contiguous:

$$\mathcal{L}_{reg} = \lambda_1 \left| \alpha - \frac{1}{N} \sum_{t=1}^{N} z_t \right| + \lambda_2 \sum_{t=2}^{N} \left| z_t - z_{t-1} \right|, \tag{2}$$

where $z_t \in \{0, 1\}$ indicates whether the $t$-th token is kept in the rationale, $N$ is the input length, and $\alpha$ is the target sparsity percentage.

INVRAT (Chang et al., 2020) introduces the idea of environments to rationalization. We assume that the data are collected from different environments with different prior distributions. Across these environments, the predictive power of spuriously correlated features varies, while genuine causal explanations always have invariant predictive power for $Y$. Thus, the desired rationale should satisfy the following invariance constraint:

$$Y \perp E \mid Z, \tag{3}$$

where $E$ is the given environment. We can use a three-player framework to find a solution satisfying this constraint: an environment-agnostic predictor $f_i(Z)$, an environment-aware predictor $f_e(Z, E)$, and a rationale generator $g(X)$. The learning objectives of the two predictors are:

$$\mathcal{L}_i^* = \min_{f_i} \; \mathbb{E} \left[ H(Y, f_i(Z)) \right], \tag{4}$$

$$\mathcal{L}_e^* = \min_{f_e} \; \mathbb{E} \left[ H(Y, f_e(Z, E)) \right], \tag{5}$$

where $H$ is the cross-entropy between the prediction and the ground truth $Y$. In addition to minimizing the invariant prediction loss $\mathcal{L}_i^*$ and the regularization loss $\mathcal{L}_{reg}$, the other objective of the rationale generator is to minimize the gap between $\mathcal{L}_i^*$ and $\mathcal{L}_e^*$, that is:

$$\min_{g} \; \mathcal{L}_i^* + \lambda_{diff} \, \mathrm{ReLU}(\mathcal{L}_i^* - \mathcal{L}_e^*) + \mathcal{L}_{reg}, \tag{6}$$

where the ReLU prevents the penalty once $\mathcal{L}_i^*$ is already lower than $\mathcal{L}_e^*$.
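As a minimal numeric sketch of this formulation (assuming a binary token mask; the λ values and sparsity target α below are illustrative placeholders, not our tuned hyperparameters), the regularizer of Equation (2) and the generator objective of Equation (6) can be written as:

```python
import numpy as np

def rationale_regularizer(z, alpha=0.3, lam1=1.0, lam2=1.0):
    """Sparsity/continuity penalty on a binary rationale mask z (Equation 2):
    keep roughly an alpha fraction of tokens, in contiguous spans."""
    sparsity = abs(z.mean() - alpha)          # deviation from target sparsity
    continuity = np.abs(np.diff(z)).sum()     # number of span boundaries
    return lam1 * sparsity + lam2 * continuity

def generator_objective(L_i, L_e, L_reg, lam_diff=1.0):
    """Rationale generator loss (Equation 6): invariant loss plus the
    ReLU-gated gap to the environment-aware loss plus the regularizer."""
    return L_i + lam_diff * max(L_i - L_e, 0.0) + L_reg
```

Note that the gap term only penalizes the generator while the environment-agnostic loss exceeds the environment-aware one; once the rationale carries no environment-specific signal, the penalty vanishes.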

TLD Dataset and its Biases
We apply INVRAT to debias the TLD task. For clarity, we ground the following description in the specific TLD dataset we experiment on, the Twitter hate speech corpus created by Founta et al. (2018) and modified by Zhou et al. (2021), and we will show how our approach generalizes. The dataset contains 32K toxic and 54K non-toxic tweets. Following Zhou et al. (2021), we focus on two types of biases in the dataset: lexical biases and dialectal biases. Lexical biases comprise spurious correlations of toxic language with attributes including non-offensive minority identity (NOI), offensive minority identity (OI), and offensive non-identity (ONI); dialectal biases relate the African-American English (AAE) attribute directly to toxicity. All these attributes are tagged at the document level. We provide more details on the four attributes (NOI, OI, ONI, and AAE) in Appendix A.

Use INVRAT for Debiasing
We directly use the lexical and dialectal attributes as the environments in INVRAT for debiasing the TLD model. Under these different environments, the predictive power of the spurious correlations between the original input texts X and the output labels Y changes. Thus, in INVRAT, the rationale generator learns to exclude from the rationale Z the biased phrases that are spuriously correlated with the toxicity labels. On the other hand, the predictive power of the genuine linguistic clues is generalizable across environments, so the rationale generator attempts to keep them in the rationale Z.
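Concretely, constructing the environments amounts to grouping training examples by attribute label; a minimal sketch (where `attr_of` is a hypothetical callable returning one attribute tag per example, not a function from our codebase) might look like:

```python
from collections import defaultdict

def split_by_environment(examples, attr_of):
    """Group labeled examples into environment subsets, e.g. the four
    lexical environments (NOI, OI, ONI, none) used to train f_e."""
    envs = defaultdict(list)
    for ex in examples:
        envs[attr_of(ex)].append(ex)  # one environment per attribute tag
    return dict(envs)
```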
Since there is no human labeling for the attributes in the original dataset, we infer the labels following Zhou et al. (2021): for lexical biases, we match X against TOXTRIG, a handcrafted word bank collected for NOI, OI, and ONI; for dialectal biases, we use the topic model from Blodgett et al. (2016) to classify X into four dialects: AAE, white-aligned English (WAE), Hispanic, and other.
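A simplified sketch of the lexical tagging step follows; the word-bank entries below are illustrative placeholders, not the actual hand-curated TOXTRIG lists of Zhou et al. (2021):

```python
import re

# Illustrative stand-in for the TOXTRIG word bank; the real lists are
# hand-curated (Zhou et al., 2021) and far larger.
TOXTRIG = {
    "NOI": {"gay", "female", "muslim"},
    "OI":  {"queer"},
    "ONI": {"darn"},  # placeholder; the real ONI entries are swear words
}

def lexical_attributes(text):
    """Return every lexical attribute whose word bank matches the text,
    or {'none'} if no bank matches."""
    tokens = set(re.findall(r"[\w'#@*]+", text.lower()))
    tags = {attr for attr, bank in TOXTRIG.items() if tokens & bank}
    return tags or {"none"}
```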
We build two debiasing variants with the obtained attribute labels, INVRAT (lexical) and INVRAT (dialect). The former is learned with the compound loss function in Equation (6) and four lexical environment subsets (NOI, OI, ONI, and none of the above); we train the latter with the same loss function but along with four dialectal environments (AAE, WAE, Hispanic, and other). In both variants, the learned f_i(Z) is our environment-agnostic TLD predictor, which classifies toxic language based on generalizable clues. In the INVRAT framework, the environment-aware predictor f_e(Z, E) also needs access to the environment information. We use an additional embedding layer Emb_env to embed the environment id e into an n-dimensional vector Emb_env(e), where n is the input dimension of the pretrained language model. The word embeddings and Emb_env(e) are summed to construct the input representation for f_e.
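A minimal sketch of this input construction (assuming hidden size 768 for RoBERTa-base, and using a randomly initialized table as a stand-in for the learned Emb_env layer):

```python
import numpy as np

n = 768       # input dimension of RoBERTa-base
num_envs = 4  # e.g. the four lexical or four dialectal environments

# Stand-in for the learnable embedding layer Emb_env (trained in practice).
rng = np.random.default_rng(0)
env_table = rng.normal(scale=0.02, size=(num_envs, n))

def env_aware_input(word_embeddings, env_id):
    """Build f_e's input: add Emb_env(e) to every token's word embedding."""
    return word_embeddings + env_table[env_id]  # broadcasts over positions
```

Because the environment vector is simply summed at every position, f_e can condition on E without any change to the transformer architecture itself.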

Experiment Settings
We leverage RoBERTa-base (Liu et al., 2019) as the backbone of our TLD models in the experiments. F1 scores quantify TLD performance, and the false positive rate (FPR) on texts containing a specific attribute quantifies debiasing performance. The positive label is "toxic" and the negative label is "non-toxic" for computing F1 scores. When evaluating models debiased by INVRAT, we use the following strategy to balance F1 and FPR and obtain a stable performance measurement: we first select all checkpoints whose dev-set F1 is within 3% of the best dev-set F1, and then pick, among these, the checkpoint with the lowest dev-set FPR to evaluate on the test set. We describe more training details and the hyperparameters used in Appendix B.
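This checkpoint-selection rule can be sketched as follows (checkpoint records with hypothetical `f1`/`fpr` fields; the field names are assumptions for illustration):

```python
def select_checkpoint(checkpoints, f1_margin=0.03):
    """Among checkpoints whose dev F1 is within `f1_margin` of the best
    dev F1, return the one with the lowest dev FPR."""
    best_f1 = max(c["f1"] for c in checkpoints)
    eligible = [c for c in checkpoints if c["f1"] >= best_f1 - f1_margin]
    return min(eligible, key=lambda c: c["fpr"])
```

The margin keeps the selection from trivially minimizing FPR at the cost of detection quality: a checkpoint that never predicts "toxic" has zero FPR but is excluded by the F1 filter.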

Quantitative Debiasing Results
Table 1: Debiasing results. The top section contains scores from Zhou et al. (2021); the bottom section contains scores of our methods. When FPR is lower, the model is less biased by lexical associations for toxicity. We used RoBERTa-base, while RoBERTa-large is used in Zhou et al. (2021); thus, our Vanilla F1 score is slightly lower than that of Zhou et al. (2021) by 0.5%.

In the left four columns of Table 1, we report the debiasing results for the three lexical attributes. In addition to Vanilla (RoBERTa without debiasing), we include lexical removal, a naive baseline that simply removes all words in TOXTRIG before training and testing. For our INVRAT (lexical/dialect) models, we see a significant reduction in the FPR of NOI, OI, and ONI over Vanilla. Our approach also yields consistent and usually more considerable bias reduction across all three attributes than the ensemble and data filtering baselines discussed in Zhou et al. (2021), none of which improves on more than two attributes (e.g., LMIXIN-ONI reduces bias in ONI but not in the other two; DataMaps-Easy improves in NOI and ONI but has FPR similar to Vanilla in OI). This result suggests that INVRAT can effectively remove the spurious correlation between the words in the three lexical attributes and toxicity. Moreover, our INVRAT debiasing sacrifices little TLD performance, which can otherwise be a concern for debiasing (e.g., the overall performance of LMIXIN). It is worth noting that the lexical removal baseline does not achieve as much bias reduction as our method, and even induces more bias in NOI. We surmise that this weak result arises from the limitation of TOXTRIG: a word bank cannot enumerate all biased words, and other terms can still carry the bias into the model.
We summarize the debiasing results for the dialectal attribute in the rightmost column of Table 1. Compared with the Vanilla model, our method effectively reduces the FPR of AAE, suggesting the consistent benefit of INVRAT in mitigating dialectal biases. Although the results from data relabeling (Zhou et al., 2021) and some data filtering approaches are better than those of INVRAT, these approaches are complementary to ours, and combining them would presumably improve debiasing performance further.

Qualitative Study
We demonstrate how INVRAT removes biases and keeps the detector focused on genuine toxic clues with the example rationales in Table 2.

Table 2: Examples from the test set with the predictions from the vanilla and our models; each prediction is marked as toxic or non-toxic. The underlined words are selected as the rationale by our rationale generator.

Part (a) of Table 2 shows two utterances where both the baseline and our INVRAT-debiased model predict the correct labels. We can see that when toxic terms appear in a sentence, the rationale generator captures them. In part (b), we show three examples where the baseline model incorrectly predicts the sentences as toxic, presumably due to biased but, depending on context, non-toxic words like #sexlife, Shits, and bullshit. Our rationale generator rules out these words and allows the TLD model to focus on the main verbs of the sentences, such as keeps, blame, and have. In part (c), our model fails to produce the true answer while the baseline model predicts correctly. In these two examples, we observe that our rationale generator removes the offensive words, probably because of their small degree of toxicity, even though the annotators marked the sentences as toxic. Part (d) of Table 2 shows another common case: when a sentence can be easily classified as non-toxic, the rationale generator tends not to output any words, and the TLD model outputs the non-toxic label. This is probably because the predictive power of these non-toxic words is not stable across environments (they are variant), so the rationale generator rules them out to keep the rationale clean and invariant.

Conclusion
In this paper, we propose to use INVRAT to effectively reduce the biases in TLD models. By separately using lexical and dialectal attributes as the environments in the INVRAT framework, the rationale generator can learn to keep genuine linguistic clues and rule out spurious correlations. Experimental results show that our method can better mitigate both lexical and dialectal biases without sacrificing much overall accuracy. Furthermore, our method does not rely on complicated data filtering or relabeling processes, so it can be applied to new datasets without much effort, showing its potential for practical scenarios.

A Bias attributes
We follow Zhou et al. (2021) to define four attributes (NOI, OI, ONI, and AAE) that are often falsely associated with toxic language. NOI denotes mentions of minoritized identities (e.g., gay, female, Muslim); OI denotes offensive words about minorities (e.g., queer, n*gga); ONI denotes swear words (e.g., f*ck, sh*t). NOI should not be correlated with toxic language but is often found in hateful speech directed at minorities (Dixon et al., 2018). Although OI and ONI terms can be toxic, in specific contexts they are used simply to convey closeness or emphasize emotion (Dynel, 2012). AAE contains dialectal markers that are commonly used among African Americans. Even though AAE simply signals a cultural identity in the US (Green, 2002), AAE markers are often falsely associated with toxicity, causing content by Black authors to be suppressed more often than content by non-Black authors (Sap et al., 2019).

B Training Details
We use a single NVIDIA TESLA V100 (32G) for each experiment. The average runtime of an experiment for the Vanilla model in Table 1 is 2 hours; an INVRAT experiment in Table 1 needs about 9 hours.
The main hyperparameters are listed in Table 3; more details can be found in our released code. We did not conduct a hyperparameter search but followed all settings in the official implementation of Zhou et al. (2021). One difference is that, because the INVRAT framework needs to run three RoBERTa models at the same time, we chose RoBERTa-base, while Zhou et al. (2021) use RoBERTa-large. As a result, our F1 score for the Vanilla model is about 0.5% lower than the score in Zhou et al. (2021).

Table 3: The main hyperparameters in the experiments. Sparsity percentage is the value of α in the L_reg of Equation (2); sparsity lambda and continuity lambda are λ1 and λ2 in Equation (2); diff lambda is λ_diff in Equation (6).