Challenges in Automated Debiasing for Toxic Language Detection

Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English). Our comprehensive experiments establish that existing methods are limited in their ability to prevent biased behavior in current toxicity detectors. We then propose an automatic, dialect-aware data correction method, as a proof-of-concept. Despite the use of synthetic labels, this method reduces dialectal associations with toxicity. Overall, our findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases.


Introduction
Current hate speech or toxic language detection systems 1 exhibit problematic and discriminatory behavior that causes them to have a disparate negative impact on minority populations (Yasin, 2018; Guynn, 2020; Kim et al., 2020; Dias Oliva et al., 2020). Tweets simply containing a minority identity mention are commonly flagged as toxic by current systems, in contrast to those containing majority identity mentions, as illustrated in Figure 1.
At the core of the issue are dataset biases, i.e., spurious correlations between surface patterns and annotated toxicity labels (§2), which stem from the data creation process (Sap et al., 2019). Previous work has outlined two such biases for hate speech datasets (both shown in Figure 1): lexical bias, which associates toxicity with the presence of certain words (e.g., profanities, identity mentions; Dixon et al., 2018; Dinan et al., 2019), and dialectal bias, where toxicity is correlated with surface markers of African American English (AAE; Davidson et al., 2019; Sap et al., 2019). When trained on biased datasets, models acquire and exacerbate these biases (e.g., flagging text by Black authors as more toxic than text by white authors; Sap et al., 2019; Zhang et al., 2018). Concurrently, there has been elevated interest in developing debiasing methods for standard natural language understanding (NLU) tasks, i.e., methods that aim to decrease over-reliance on spurious correlations in NLU models (Clark et al., 2019; He et al., 2019; Karimi Mahabadi et al., 2020; Bras et al., 2020). This raises a natural question: are current debiasing approaches effective for mitigating biases specific to toxic language detection?

1 We use hate speech and toxic language interchangeably in this work, though their definitions do not perfectly align.

Figure 1: Lexical items and dialect markers cause problematic behavior for toxic language detection systems such as the widely used PerspectiveAPI. In the top two example pairs, statements with minority identity mentions and swear words used inoffensively are flagged as toxic, but majority identity mentions or offensive statements without overt swearing are missed. The bottom pair shows dialect-based racial bias for two inoffensive greetings, where markers of African American English (AAE) trigger the toxicity detector.
In this work, we address the above question by investigating two classes of debiasing approaches to mitigate lexical and dialectal biases: one that employs additional training objectives for bias removal, and another that filters out training instances likely exhibiting spurious biases (§3). Through comprehensive experiments, we show that both approaches face major challenges in mitigating biases from a model trained on a biased dataset (in our case, the dataset from Founta et al., 2018) for toxic language detection. While data filtering results in reduced bias associations in the data, models trained on filtered datasets still pick up on lexical (§4) and dialectal biases (§5). We find that dialectal biases are particularly challenging to address, as has also been shown by Xia et al. (2020). "Debiased" models still disproportionately flag text in certain dialects as toxic. Notably, mitigating dialectal bias through current debiasing methods does not mitigate a model's propensity to label tweets by Black authors as more toxic than tweets by white authors.
We additionally explore an alternative proof-of-concept study: relabeling supposedly toxic training instances whose automatic translations into a majority dialect are deemed non-toxic by the classifier. To this end, we create a synthetic dataset via a few-shot dialect translation system built with GPT-3 (Brown et al., 2020). While only an illustrative solution, it nevertheless takes into account the dialectal context of the tweet, resulting in a model less prone to dialectal and racial biases (§6). Overall, our findings indicate that debiasing a model already trained on biased toxic language data can be challenging, compared to relabeling the data to remove existing biases. Our code and data are publicly available on GitHub.

Biases in Toxic Language Detection
We test the use of debiasing methods for the task of toxic language detection, which aims to flag rude, offensive, hateful, or toxic language on the internet, with the goal of moderating online communities (Roberts, 2019; Vidgen et al., 2019). This task differs in several ways from the natural language understanding (NLU) tasks that debiasing methods have been successful on, such as textual entailment (e.g., SNLI, MNLI; Bowman et al., 2015; Williams et al., 2018) or reading comprehension (e.g., SQuAD; Rajpurkar et al., 2016). First, compared to these NLU tasks, where there is one correct label, the toxicity of language is inherently more nuanced, subjective, and contextual, which causes toxic language datasets to have lower agreement in general (Ross et al., 2017). Second, the dataset biases in NLU are predominantly artifacts introduced during data creation (e.g., negations, exaggerations; Schwartz et al., 2017; Gururangan et al., 2018), whereas those in toxic language detection are grounded in the social dynamics of the world (Spears, 1998; Technau, 2018). For example, viewing AAE as a more toxic or less proper variety of English is a form of linguistic discrimination that upholds racial hierarchies in the United States (Rosa and Flores, 2017).
In this work, we consider two broad categories of toxic language dataset biases: lexical (§2.1) and dialectal (§2.2). Our experiments focus on a single, widely used dataset (§2.3) from Founta et al. (2018).

Lexical Biases (TOXTRIG)
Current toxic language detection systems often rely on the presence or absence of certain words (e.g., swear words, identity mentions) to make their predictions (Dixon et al., 2018; Dinan et al., 2019). While most previous analyses of this bias relied on a simple list of "bad" words (Davidson et al., 2019; Dinan et al., 2019), we take a more nuanced view of how lexical items can convey toxicity, inspired by work in the pragmatics and sociolinguistics of rudeness (Dynel, 2015; Kasper, 1990, inter alia). Specifically, we manually split our full list of words into three distinct categories depending on the extent to which they carry profane or hateful meanings or are simply associated with hateful contexts. We refer to the full set of words as TOXTRIG, for Toxicity Triggers, which is included in our released repository.

Non-offensive minority identity mentions (NOI) refers to descriptive mentions of minoritized demographic or social identities (e.g., gay, female, Muslim). While these mentions are not usually inherently offensive by themselves, they are often found in offensive statements that are hateful towards minorities (Dixon et al., 2018). We detect these identity mentions in text using a list of 26 regular expressions.
Possibly offensive minority identity mentions (OI) are mentions of minoritized identities that could denote profanity or hate depending on pragmatic and contextual interpretations. This includes slurs and objectifying outdated terms to refer to minority groups, which are usually understood as attacks. Additionally, this includes reclaimed slurs (queer, n*gga), which connote less offensive intent when spoken by in-group members compared to out-group members (Croom, 2013).
Possibly offensive non-identity mentions (ONI) contains swear words and other profanities, which are usually offensive but not associated with any social group (e.g., f*ck, sh*t). Note that the pragmatic interpretation of these words is not necessarily always toxic or offensive (Dynel, 2012), as they are often used to convey closeness between the speaker and listener or to emphasize the emotionality of a statement (e.g., the second example in Figure 1).

Dialectal Biases (AAE)
Current toxic language detection systems also associate higher toxicity with dialectal markers of African American English (AAE; Sap et al., 2019; Davidson et al., 2019). Since AAE is a variety of English that is common among African Americans and often signals a cultural identity in the US (Green, 2002), this dialect-based racial bias causes speech by Black authors to be suppressed more often than speech by non-Black authors (Sap et al., 2019), thereby exacerbating racial inequality (Rosa, 2019).
In our experiments, we estimate the dialect of a tweet using a topic model from Blodgett et al. (2016). This model was trained on 60M tweets, with each tweet's dialect inferred from the geolocation of its author; it yields the probability of a tweet being in each of four dialects (African-American English, white-aligned English, Hispanic, and other). In this study, we only focus on African-American English (AAE) and white-aligned English (WAE) tweets; both definitions are based on US English, as per Blodgett et al. (2016). 7 Our experiments either use the probability of a tweet being in these dialects, or assign tweets their estimated-most-probable dialect.
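Concretely, the hard dialect assignment reduces to an argmax over the four probabilities produced by the topic model. A minimal sketch (the probability vector is assumed to come from the Blodgett et al. (2016) model; the label ordering below is our own convention, not its actual API):

```python
# Order of the four dialect probabilities is an assumption for illustration.
DIALECTS = ("AAE", "WAE", "Hispanic", "other")

def most_probable_dialect(probs):
    """Map a 4-way dialect probability vector to its argmax label."""
    return max(zip(DIALECTS, probs), key=lambda pair: pair[1])[0]
```

Experiments then either consume the raw AAE/WAE probabilities directly, or this hard label.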

Dataset for Toxic Language Detection
We focus our analyses on a widely used hate speech dataset of English tweets (Founta et al., 2018). The tweets were collected using a multi-round bootstrapping procedure and were labeled out of context 8 for toxic language. We focus on the 86k tweets annotated as hateful, abusive, or neither, and discard those labeled as spam. We aggregate the abusive and hateful labels into a single toxic category, yielding 32k toxic and 54k non-toxic tweets. 9

Debiasing Methods
We consider two types of debiasing methods from the current literature. The first type addresses known, pre-defined biases, such as lexical and dialectal biases for hate speech detection, via a model-based approach involving additional training objectives (§3.1). In contrast, the second type is agnostic to prior knowledge about biases, and instead filters out examples that appear "too easy" and might hence contain spurious correlations (§3.2).

Debiased Training for Pre-Defined Toxicity Biases
We use the LEARNED-MIXIN method of Clark et al. (2019), which achieved high out-of-distribution (OOD) performance on several NLU tasks, for debiased training. This method trains an ensemble containing a bias-only model, which uses only pre-defined features corresponding to known biases, and a full model, which uses all features. Intuitively, the ensemble encourages the full model to rely more on features unrelated to the biases. Once trained, the bias-only model is discarded, and only the "bias-free" full model is used for inference, following Clark et al. (2019).

Bias-only model Given its effectiveness on bag-of-words (BoW) features, we use an SVM classifier as the lexical-bias-only model. For example, the TOXTRIG-only model counts the frequency of TOXTRIG words in each tweet. Our dialectal-bias-only model uses the probabilities of dialects (AAE, WAE, Hispanic, and other) obtained from a dialect detector (Blodgett et al., 2016) as features in an SVM classifier.

7 We avoid using disputed terms such as general American English, standard American English, or mainstream US English, which are frequently used for WAE, since we believe that no dialect should be privileged with the designation "general", "standard", or "mainstream" (Rosa, 2019).
8 Only the tweet text (no profile information or conversational context) was shown to annotators.
9 We also explored using another widely used hate speech dataset (Davidson et al., 2017), which collected tweets using a seed list of swear words and slurs. However, in line with findings by Xia et al. (2020), debiasing led to degenerate behavior due to the data collection process, as discussed in Appendix B.
Full model We fine-tune a RoBERTa-large classifier (Liu et al., 2019), a state-of-the-art classifier for the toxicity detection task. See Appendix A.1 for more modeling details.
Note that we only consider the LEARNED-MIXIN-ONI and LEARNED-MIXIN-TOXTRIG models for lexical debiasing, due to the poor accuracies of the bias-only models for NOI and OI.
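The product-of-experts combination at the heart of LEARNED-MIXIN can be sketched as follows. This is a deliberately minimal, illustrative version: the instance-dependent learned gate g(x) and the entropy penalty of Clark et al. (2019) are abstracted into a single scalar `g`.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def learned_mixin_combine(full_logits, bias_log_probs, g):
    """Training-time ensemble: log p_ensemble is proportional to
    full_logits + g * bias_log_probs, where g >= 0 weights the bias-only
    model. At inference time the bias branch is dropped and full_logits
    are used alone, pushing the full model to carry the bias-free signal."""
    return log_softmax([f + g * b
                        for f, b in zip(full_logits, bias_log_probs)])
```

With `g = 0` the ensemble degenerates to the full model; training with `g > 0` lets the bias-only branch absorb the easy, bias-aligned signal.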

Data Filtering for Spurious Biases
In addition to debiasing methods that handle known biases, we also explore automated approaches which filter out instances exhibiting unspecified, spurious biases. Specifically, we describe below two data selection methods that have shown strong OOD performance.
AFLite (Bras et al., 2020) is an algorithm based on the key intuition that examples predicted correctly by the simplest methods likely exhibit spurious biases. An ensemble of simple linear classifiers is trained and tested on different partitions of the data; test instances which are "predictable", i.e., classified correctly by most classifiers in the ensemble, are discarded. The algorithm is iterative, repeating until a target data size is reached. Models trained on the filtered dataset achieve higher performance on OOD and adversarially constructed test sets, compared to the original model, on several text and image classification datasets, indicating a reduction in spurious biases in the filtered data.
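The iterative filtering loop can be sketched as below. This is a simplified stand-in, not the reference implementation: the "simple linear classifier" is a nearest-centroid rule, predictability is counted only over held-out occurrences, and the batch size `cut` and other parameters are illustrative assumptions.

```python
import random
from statistics import mean

def train_centroid(examples):
    """Stand-in for AFLite's simple linear classifiers: per-class feature
    centroids with a nearest-centroid (linear) decision rule."""
    by_label = {}
    for x, y in examples:
        by_label.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*xs)] for y, xs in by_label.items()}

def predict(centroids, x):
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def aflite(data, target_size, n_models=8, train_frac=0.5, cut=4, seed=0):
    """Iteratively discard the instances most often classified correctly
    when held out, until the target size is reached."""
    rng = random.Random(seed)
    data = list(data)
    while len(data) > target_size:
        correct = [0] * len(data)  # times each instance was predicted correctly
        for _ in range(n_models):
            idx = list(range(len(data)))
            rng.shuffle(idx)
            split = int(train_frac * len(idx))
            model = train_centroid([data[i] for i in idx[:split]])
            for i in idx[split:]:
                x, y = data[i]
                if predict(model, x) == y:
                    correct[i] += 1
        k = min(cut, len(data) - target_size)
        drop = set(sorted(range(len(data)),
                          key=lambda i: correct[i], reverse=True)[:k])
        data = [ex for i, ex in enumerate(data) if i not in drop]
    return data
```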
DataMaps (Swayamdipta et al., 2020) show the presence of distinct regions in a dataset, namely easy, hard, and ambiguous, defined with respect to a given model. These regions are discovered based on the training dynamics of a model, determined by the model's confidence in the true class for each example, as well as the variability of this confidence, across training epochs. Swayamdipta et al. (2020) show that training exclusively on the hard and ambiguous regions of the data results in high OOD performance, indicating a lower prevalence of spurious biases. The easy region is the largest in size for RoBERTa; however, experiments showed that training exclusively on these examples hurt OOD generalization on different NLU tasks. Following this work, we create DataMaps-Easy, DataMaps-Ambiguous, and DataMaps-Hard subsets for our dataset.
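The two data-map coordinates are cheap to compute from per-epoch training dynamics; a minimal sketch (the region thresholds below are illustrative assumptions — in practice the regions are taken as top-ranked slices by each coordinate, not fixed cutoffs):

```python
from statistics import mean, pstdev

def data_map_coords(gold_probs_per_epoch):
    """For one training example: confidence = mean probability assigned
    to the gold label across epochs; variability = its std. deviation."""
    return mean(gold_probs_per_epoch), pstdev(gold_probs_per_epoch)

def region(confidence, variability, var_hi=0.2, conf_hi=0.75, conf_lo=0.25):
    """Assign an example to an illustrative easy/hard/ambiguous region."""
    if variability >= var_hi:
        return "ambiguous"   # model keeps changing its mind
    if confidence >= conf_hi:
        return "easy"        # consistently right
    if confidence <= conf_lo:
        return "hard"        # consistently wrong
    return "other"
```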
Following Swayamdipta et al. (2020), we set the target filtered subset size to 33% of the original training set for both filtering methods; our filtering additionally preserves the original label proportions. We then fine-tune a RoBERTa-large classifier on these filtered subsets; see Appendix A.2 for more details.

Experiments: Lexical Biases
We investigate the effect of debiasing approaches (§3) on removing lexical biases in hate speech detection. First, we discuss the evaluation framework for measuring bias reduction (§4.1). We then present quantitative (§4.2) and qualitative (§4.3) results on lexical bias removal for all debiasing approaches, and OOD evaluation for debiased training methods (§4.4). See Appendix A.3 for hyperparameters and other experimental settings.

Evaluation Framework
We report the performance of all models as overall accuracy and F1 with respect to the toxic class. Given that current hate speech systems tend to rely heavily on the presence of NOI, OI, and ONI mentions (§2.1) for labeling text as toxic, we use the false positive rate (FPR) over each of these categories to measure the degree of bias in the model, following Hardt et al. (2016) and Xia et al. (2020). Specifically, we report the FPR of a model on tweets containing NOI (FPR_NOI), OI (FPR_OI), and ONI (FPR_ONI) mentions, as well as the F1 corresponding to each of these categories. Intuitively, the lower these FPR values, the fewer lexical associations for toxicity the model infers, and hence the less biased it is.
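For concreteness, the per-category FPR can be computed as below (a minimal sketch; labels and predictions are coded 1 for toxic, and `mask` marks tweets containing the relevant TOXTRIG category):

```python
def false_positive_rate(labels, preds, mask):
    """FPR over the subset of tweets where mask is True (e.g., tweets
    containing an NOI, OI, or ONI mention): among the non-toxic tweets
    in that subset, the fraction the model flags as toxic."""
    negatives = [(y, p) for y, p, m in zip(labels, preds, mask) if m and y == 0]
    if not negatives:
        return 0.0
    return sum(p for _, p in negatives) / len(negatives)
```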

Evaluation for Filtered Datasets
We additionally consider metrics based on spurious lexical associations for the data filtering approaches. These measure the prevalence of spurious surface patterns in the filtered datasets, which might propagate to models trained on the data. Specifically, we report the Pearson's correlation between a tweet's gold-standard toxicity label and whether or not it contains NOI, OI, or ONI mentions, denoted R_NOI, R_OI, and R_ONI, respectively; lower values indicate a reduction in lexical biases.
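The association metric here is the standard Pearson correlation between two 0/1 vectors (the toxicity labels and a binary TOXTRIG-mention indicator); a minimal implementation:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences,
    e.g., 0/1 toxicity labels vs. a 0/1 mention indicator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)
```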
Baselines We compare against two natural baselines: a vanilla RoBERTa-large classifier trained on the original dataset (Original), and, for comparison with the data filtering methods, a classifier trained on a random selection of the training data (Random). The Random baseline is trained on 33% of the training data, matching the filtered subset size.

Results for Lexical Bias Reduction
First, we measure the reduction of lexical biases in the filtered datasets given by AFLite and DataMaps. As shown in Table 1, the subsets given by AFLite and by the ambiguous and hard regions of DataMaps reduce the overall associations between TOXTRIG words and toxicity, compared to the original and random baselines; DataMaps-Hard shows the largest reduction. On the other hand, as expected, DataMaps-Easy shows an increased association between TOXTRIG mentions and toxicity, indicating that these examples display overt lexical biases.

Table 2 shows results for lexical bias reduction using both debiased training approaches, as well as models trained on datasets filtered using AFLite and all three regions from DataMaps. Both debiased training approaches, LMIXIN-ONI and LMIXIN-TOXTRIG, reduce FPR_ONI as well as FPR_OI by a large amount. However, both approaches also hurt in-distribution test performance, indicating that ONI and other TOXTRIG features are essential for good performance. In contrast, the models trained on the hard and ambiguous subsets from DataMaps both preserve in-distribution performance, even though they are trained on only a third of the original data. They also reduce the rate of falsely predicting NOI mentions as toxic (FPR_NOI), while not showing much improvement for ONI and maintaining the FPR_OI of the original baseline. Surprisingly, the model trained on the easy subset from DataMaps shows good bias reduction on the NOI and ONI categories, while matching the random selection baseline for OI. This is despite DataMaps-Easy showing an increased association between TOXTRIG mentions and toxicity (Table 1). Notably, the F1 for all categories suffers under this model, indicating that it is less competent than the baseline. These results suggest that reduced associations in the data might not necessarily lead to debiased models trained on the same data. Overall, no single approach outperforms all others across different categories for lexical debiasing.

Qualitative Analysis
A qualitative study of the Founta et al. (2018) test set shows the presence of many annotation errors. We show three representative annotation errors in Table 3. The first example contains an atypical example of toxicity, directed towards white folks, which the annotators might have been unaware of; it also contains a link which annotators had access to, but not models. The second contains the word p*ss, which the annotators may have relied on for their assessment. The third encourages violence/abuse towards an identity which isn't typically the target of violence. Interestingly, the DataMaps-Easy predictions agree with all the gold-standard annotations; perhaps such annotation errors and ambiguity are responsible for the performance discussed in §4.2. These annotation ambiguities might also impair our measurement of models' performance and debiasing effects, and expose a limitation of these datasets.

Table 3: Examples of test set tweets with their gold-standard annotations and predictions from models trained on the DataMaps-Hard (DM-Hard) and DataMaps-Easy (DM-Easy) subsets. We anonymize the usernames to protect user privacy.

Adversarial Evaluation: ONI-Adv
To further study the reliance of debiased models on ONI words, we use the test set from Dinan et al. (2019), which contains 3,000 manually created sentences, of which 300 are toxic. The toxic sentences barely contain any ONI mentions, e.g., Cheese is made by french people who smell. We call this test set ONI-Adv (for adversarial) since it challenges models with a reversal in the association between toxicity and offensive non-identity words (e.g., "f*ck", "sh*t"). We report F1 for all models in Figure 2, which shows how well a model identifies toxicity in offensive tweets that do not contain overt lexical cues of toxicity. The debiased training approaches improve over the baselines; data filtering methods do not. One reason for this might be that the data filtering methods were trained on much less data than both LMIXIN models. Regardless, none of the models we test are good at predicting subtle, non-overt toxicity.

Experiments: Dialectal and Racial Biases
We test the efficacy of the bias reduction methods from §3 for dialectal bias (§2.2) reduction.

Dialectal Biases
For our dialectal bias experiments, we first infer the dialect of a tweet as described in §2.2. Then, analogous to the lexical bias evaluation, we quantify dialectal debiasing using the Pearson's correlation between the estimated probability of AAE and toxicity (R_AAE), and the false positive rate of models on AAE tweets (FPR_AAE). See Appendix A.3 for hyperparameters and other experimental settings.

Results in Table 4 show that almost all data filtering and debiasing methods reduce dialectal biases, with DataMaps-Easy as the exception (consistent with Table 1). Notably, DataMaps-Hard performs the best at dialectal debiasing, both in terms of toxicity-AAE correlation (R_AAE) and in terms of false flagging of toxicity (FPR_AAE). Interestingly, most models' decrease in false flagging is small, suggesting room for improvement.

Figure 2: Challenge set evaluation for lexical biases, comparing all debiasing methods with baselines, using the ONI-Adv test set. Takeaway: F1 (↑) measures show that all models perform poorly at identifying toxic text not containing overt lexical cues of toxicity. In general, debiased training approaches outperform the original model on this challenge set, while data filtering is not as effective.

Racial Biases
To quantify the real-world impact of dialect-based racial bias, we measure the rates of toxicity predicted by models on a corpus of tweets for which the race of authors is available, but not annotations of toxicity. Specifically, we consider the dataset released by Preoţiuc-Pietro and Ungar (2018), which consists of 5.4M tweets, collected from 4,132 survey participants (3,184 White, 374 African American) with self-reported race/ethnicity and Twitter user handles. 12 We quantify our models' racial bias by measuring the difference in rates of flagging tweets by African American authors and those by white authors, following Sap et al. (2019). 13

Listed in Table 5, our results show that automatic debiasing methods do not consistently decrease the racial discrepancy in flagging toxicity. Notably, the toxicity rates on tweets by African American authors, and the differences compared to white authors, are similar across all debiasing methods and baselines, except for DataMaps-Easy, which shows the most racial bias in toxicity flagging. Surprisingly, DataMaps-Hard, which mitigated dialectal bias the best out of all debiasing methods, also shows a high discrepancy between author races. Confirming previous results, this suggests that debiasing these systems requires more than automatic debiasing methods.

12 For efficiency, we randomly select 12k tweets from the dataset as the OOD test set.
13 Note that we assume that authors from all races have the same likelihood of writing toxic language.
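The disparity measures reported in Table 5 reduce to simple rate comparisons over the two author groups; a minimal sketch:

```python
def racial_disparity(flags_aa, flags_white):
    """flags_*: lists of 0/1 toxicity predictions on tweets by
    self-identified African American / white authors, respectively."""
    aa_tox = sum(flags_aa) / len(flags_aa)
    w_tox = sum(flags_white) / len(flags_white)
    return {"AA-Tox.": aa_tox,
            "W-Tox.": w_tox,
            "delta": aa_tox - w_tox,  # difference column in Table 5
            "AA/W": aa_tox / w_tox if w_tox else float("inf")}
```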

Towards Data Relabeling
Based on our quantitative and qualitative analyses, we believe there is still room for improvement in debiasing hate speech detection. Therefore, we turn our attention to the role of label noise in datasets. Partly inspired by our qualitative analyses of debiased models' predictions, we design a proof-of-concept study in which we automatically correct the labels of tweets using an (automatic) dialectal translation of each tweet, inspired by previous work showing that highlighting an AAE tweet's dialect led annotators to label it as less toxic (Sap et al., 2019). We conclude this study by discussing the limitations and ethical implications of the synthetic data, and by cautioning against its real-world application.

Table 5: Racial disparity in toxicity prediction reported on Preoţiuc-Pietro and Ungar (2018). W-Tox. indicates the % of white users' tweets flagged as toxic, AA-Tox. the % of African American users' tweets flagged as toxic, ∆ the difference between AA-Tox. and W-Tox., and AA/W the ratio of AA-Tox. to W-Tox. Takeaway: methods generally fail at debiasing on this OOD test set, except the relabeling approach, which shows some benefit.
Focusing on dialectal bias, our key assumption is that an AAE tweet and its corresponding WAE version should have the same toxicity label; therefore, toxic AAE tweets whose WAE versions are non-toxic are candidates for label correction. 14 However, gold-standard translations of AAE to WAE would require qualified translators, and automatic AAE-to-WAE translation systems do not exist, to the best of our knowledge. Therefore, as a proof of concept, we set up an AAE-to-WAE "translation" system using the few-shot capabilities of the GPT-3 language model (Brown et al., 2020). Under this mechanism, we prompt GPT-3 with four translation pairs (taken from Spears, 1998) and an AAE tweet from our training data, and generate its WAE "translation". The list of prompts, as well as further details, are provided in Appendix C. Note that we do not recommend this approach for building large-scale parallel data for dialects, as discussed under ethical implications and limitations.
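Mechanically, few-shot prompting here amounts to concatenating demonstration pairs before the new tweet and letting the model continue. The sketch below uses an assumed "AAE:/WAE:" prompt format; the demonstrations are placeholders, not the actual four pairs from Spears (1998) listed in the paper's Appendix C.

```python
def build_translation_prompt(demo_pairs, aae_tweet):
    """Assemble a few-shot AAE -> WAE 'translation' prompt; the language
    model's continuation after the final 'WAE:' is taken as the
    translation of the new tweet."""
    lines = []
    for aae, wae in demo_pairs:
        lines.append(f"AAE: {aae}")
        lines.append(f"WAE: {wae}")
    lines.append(f"AAE: {aae_tweet}")
    lines.append("WAE:")
    return "\n".join(lines)
```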
Next, as per our heuristic, we only relabel toxic AAE tweets whose WAE translation is predicted as non-toxic by either our vanilla classifier trained on the original Founta et al. (2018) dataset, or an identical classifier trained on the WAE-translated tweets. The resulting dataset (AAE-relabeled) is the same size as the original dataset, but with 954 (12%) of the 8,260 toxic AAE tweets relabeled as non-toxic (examples in Table 6). To assess the validity of the relabeling, the first three authors manually annotated the toxicity of 50 randomly selected relabeled tweets. On average, the authors agreed with 84% of the relabeling decisions.

14 Note that this assumption does not hold for lexical items, because substituting lexical items (e.g., swapping a minority mention for a majority mention) would drastically change the denotational meaning of the sentence.
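The relabeling rule itself is a small decision function; a sketch, with classifier predictions represented as strings:

```python
def relabel(label, is_aae, wae_preds):
    """Flip a toxic AAE tweet to non-toxic when at least one of the two
    classifiers (vanilla, or trained on WAE translations) predicts its
    WAE translation as non-toxic; leave all other tweets unchanged."""
    if label == "toxic" and is_aae and "non-toxic" in wae_preds:
        return "non-toxic"
    return label
```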
Then, we evaluate the dialectal bias of AAE-relabeled and quantify the dialectal and racial prediction biases of a RoBERTa-large classifier trained on AAE-relabeled, following §5. As shown in the last row of Table 4, this relabeling scheme decreases dialectal bias more than any other debiasing method, specifically as measured by the FPR on AAE tweets, at the cost of a one-point drop in F1 score. The F1 scores on the "gold" test data (Table 4) are not fully reliable, as the test data contain label biases, and better performance could come from exploiting those biases. As shown in Table 5, the model trained on AAE-relabeled has the lowest racial disparity in toxicity flagging rates compared to all other methods.
These results highlight that debiasing methods are much less effective at mitigating dialectal dataset biases than data relabeling. For future investigations, we recommend obtaining human-written AAE-WAE pairs (e.g., as done by Groenwold et al., 2020). Additionally, to ensure less biased toxicity labeling, we recommend recruiting AAE speakers or experts, to avoid over-identification of AAE markers as toxic (Spears, 1998; Croom, 2013). Alternatively, we recommend exploring more holistic representations of social biases or toxicity (e.g., Social Bias Frames; Sap et al., 2020).

Ethical Implications & Limitations
The above synthetic setting is meant to illustrate the role of labeling quality on biases in annotations. We strongly caution against using this approach in real-world applications, such as building parallel datasets for dialects. First, due to how its training data was selected, GPT-3 has likely not been exposed to many African American English varieties during training (Jo and Gebru, 2020). Second, pretrained language models are known to generate toxic language at non-trivial rates (Gehman et al., 2020), which could cause differential toxicity in the translations.

Related Work
Debiasing Toxicity Detection As the popularity of hate speech and toxic language detection systems has grown, several biases have been found in datasets and models, spurring various debiasing efforts to mitigate these individual biases (e.g., gender bias, racial bias; Park et al., 2018; Sap et al., 2019; Davidson et al., 2019). Some work tackles identity-based biases, e.g., using data re-balancing (Dixon et al., 2018) or adversarial feature learning (Vaidya et al., 2019). Less work has tackled racial or dialectal bias. Notably, Xia et al. (2020) use adversarial training to prevent the model from associating toxicity with AAE, showing only small improvements in fairness. Based on those results, we do not explore adversarial methods, opting instead for ensemble-based methods of pre-defined bias reduction. In contemporary work, Mozafari et al. (2020) use a re-weighting mechanism, which shows some effect in mitigating racial bias; we leave evaluating this method in our setting for future work. In contrast to all previous work, our experiments also measure the effectiveness of bias-agnostic methods.
Other General Debiasing Methods Several approaches for debiasing NLU tasks have been proposed lately. Some rely on adversarial training to remove protected attributes (e.g., gender or race) from a model's internal representations (Zhang et al., 2018; Xia et al., 2020). Other approaches include confidence regularization (Utama et al., 2020), as well as other product-of-experts approaches (He et al., 2019; Karimi Mahabadi et al., 2020) similar to the debiased training approach of Clark et al. (2019), which is the only debiased training method we employ, due to its relatively strong performance.

Conclusion
We investigate whether toxic language detection systems can be debiased using recently introduced methods for debiasing text classification in NLU tasks. Focusing on two types of biases, lexical and dialectal, our experiments show that these methods face significant challenges in reducing biased behavior in toxicity detectors. This indicates that biases in toxic language detection might be different in nature from the spurious associations studied in typical NLU settings. We also studied a synthetic scheme for relabeling examples with potential dialectal biases; our results indicate that correcting noisy labels results in better bias reduction. Our findings suggest that instead of solely relying on the development of automatic debiasing methods for existing, imperfect datasets, future work should focus primarily on the quality of the underlying data for hate speech detection, such as accounting for speaker identity and dialect. Indeed, such efforts could act as an important step towards making systems less discriminatory, and hence safe and usable.

Table 7: Lexical and dialectal associations with toxicity in the original dataset (Davidson et al., 2017) and various filtered counterparts. Random, AFLite, and DataMaps all contain only 50% of the original data after filtering. (We could not perform downsampling on these datasets due to their heavily skewed label distribution.) A lower Pearson R correlation value indicates fewer superficial patterns in the dataset, and thus less bias. The easy subset gives the best results here due to its severely imbalanced label distribution.

Table 9: Dialectal bias evaluation for all debiasing methods, on both the in-distribution test set as well as out-of-distribution dialect and race priming test sets. In addition to accuracy and F1, we report the false positive rate with respect to tweets in AAE (FPR_AAE), reflecting dialectal bias (lower is less biased). Each method is based on a RoBERTa-large classifier.