Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists

Natural Language Processing (NLP) models risk overfitting to specific terms in the training data, thereby reducing their performance, fairness, and generalizability. E.g., neural hate speech detection models are strongly influenced by identity terms like gay or women, resulting in false positives, severe unintended bias, and lower performance. Most mitigation techniques use lists of identity terms or samples from the target domain during training. However, this approach requires a-priori knowledge and introduces further bias if important terms are neglected. Instead, we propose a knowledge-free Entropy-based Attention Regularization (EAR) to discourage overfitting to training-specific terms. An additional objective function penalizes tokens with low self-attention entropy. We fine-tune BERT via EAR: the resulting model matches or exceeds state-of-the-art performance for hate speech classification and bias metrics on three benchmark corpora in English and Italian. EAR also reveals overfitting terms, i.e., terms most likely to induce bias, to help identify their effect on the model, task, and predictions.


Introduction
Online hate speech is growing at a rapid pace, with effects that can result in dangerous criminal acts offline. Due to its verbal nature, various Natural Language Processing approaches have been proposed (Qian et al., 2018; Indurthi et al., 2019; Attanasio and Pastor, 2020; Kennedy et al., 2020; Vidgen et al., 2021, inter alia). Recently, detection performance has significantly improved with the use of large pre-trained language models based on Transformers (Vaswani et al., 2017), such as Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019).

Figure 1: False positive from BERT as a hate speech detector. The darker and taller the bar, the higher the overfitting on the term.

However, several works have shown that by fine-tuning neural language models on hate speech detection, the classifiers obtained contain severe unintended bias (Dixon et al., 2018), i.e., they perform better or worse when texts mention specific identity terms (such as gay, Muslim, or woman). As a result, a sentence like "As a Muslim woman, I agree" would be wrongly classified as hate speech, purely due to the presence of two identity terms, i.e., terms referring to specific groups based on their sociodemographic features. One cause of false positives is selection bias in the keyword-driven collection of corpora (Ousidhoum et al., 2020). Figure 1 shows a false positive example for a fine-tuned BERT model on hate speech detection. Ideally, the model should rely on the words adore and you. Instead, BERT overfitted to the word Girl and associated it with a hateful context. This unwanted effect demonstrates the issues of lexical overfitting, and how they cause unintended bias on identity terms.
Various methods have been proposed to mitigate and measure (unintended) bias (Elazar and Goldberg, 2018; Park et al., 2018; Dixon et al., 2018; Nozza et al., 2019; Kennedy et al., 2020; Vaidya et al., 2020). However, all those methods rely on the availability of a set of identity terms. This is a severe limitation, which hinders the generalizability and applicability of hate detection models to real-world contexts. For example, a model designed to reduce the unintended bias on gender-related terms (such as woman, wife) will not address unintended bias for religious affiliation. So practitioners must decide a-priori "which vulnerable groups are present in our data?"

We propose an Entropy-based Attention Regularization (EAR) that forces the model to build token representations by attending to a wider context, i.e., to consider a larger number of tokens from the rest of the sentence. We measure the attended context as the entropy of the self-attention weight distribution over the input sequence. We use EAR as a regularization term in the loss computation to maximize each token's entropy. We apply EAR to BERT. The resulting model (BERT+EAR) significantly improves performance on unintended bias mitigation in English and Italian. In addition, it requires no a-priori knowledge (e.g., sets of identity terms), making it fairer and more general. The contextualized representations EAR induces avoid basing the classification on individual terms and, ultimately, mitigate lexical overfitting and intrinsic bias from pre-trained weights.
As a training by-product, EAR lets us extract the overfitting terms, i.e., terms accounting for a narrower context that most likely induce unintended bias. These terms can highlight possible weaknesses in the model: from the over-sensitivity of pre-trained weights to specific words (Sheng et al., 2019; Nangia et al., 2020; Vig et al., 2020), to over-specialization of training corpora on the keywords used for collecting data (Ousidhoum et al., 2020).
Note that while we show results on BERT, EAR is applicable to any attention-based architecture.
Contributions. EAR is a novel entropy-based attention regularization method to mitigate unintended bias by reducing lexical overfitting. It is applied to all terms, so it does not need a-priori domain knowledge (e.g., predefined term lists). Independent of domain-specific information, EAR generalizes better to different languages and contexts compared to similar approaches. Attention entropy is used to extract a list of the most likely biased terms. EAR code is available at https://github.com/g8a9/ear.

Entropy-based Attention Regularization
Attention was originally designed for aligning target and source sequences in machine translation (Graves, 2013; Bahdanau et al., 2015). However, in the Transformer architecture (Vaswani et al., 2017), it has become a means to account for lexical influence and long-range dependencies. It also provides useful information about the importance of a term for the output (Wiegreffe and Pinter, 2019; Brunner et al., 2020; Sun and Marasović, 2021). Here, we describe the notion of attention entropy and how EAR uses it in BERT. Note, though, that EAR can be used with any attention-based architecture.

Figure 2: Self-attention distribution on tokens Girl (solid orange) and you (shaded blue). Attention for Girl is concentrated on its representation: its entropy is low. Attention for you is spread: its entropy is high.
Self-attention in Transformers. The Transformer model consists of two connected units, an encoder and a decoder, designed for sequence-to-sequence tasks.
A transformer encoder applies scaled dot-product self-attention over the input tokens to compute $N$ independent attention heads. Let $E = [e_0, \ldots, e_{d_s}]$ be the sequence of input embeddings, with $e_i \in \mathbb{R}^{d_m}$. For the $h$-th attention head and $i$-th position, each embedding $e_i$ is projected into a query $q_{h,i}$, a key $k_{h,i}$, and a value $v_{h,i}$. So each token expresses an attention distribution over all input embeddings as

$$a_{h,i} = \mathrm{softmax}\left(\frac{q_{h,i} K_h^\top}{\sqrt{d_k}}\right) \quad (1)$$

where $K_h$ is the matrix of keys and $d_k$ their dimension. Attention weights $a_{h,i} = [a_{h,i,0}, \ldots, a_{h,i,d_s}]$, where $a_{h,i,j} \in [0, 1]$ and $\sum_j a_{h,i,j} = 1$, can be seen as a soft indexing over the values. Since the values are projections of the tokens themselves, each weight in self-attention measures the contribution of its token to the attention head and, in turn, to the new token representation. We provide additional details on the self-attention mechanism in Appendix A.
Attention entropy. Information entropy was first introduced in Shannon (1948), and measures the average information content of a random variable $X$ with the set $[x_0, \ldots, x_n]$ of possible outcomes. It is defined as

$$H(X) = -\sum_{i=0}^{n} p(x_i) \log p(x_i) \quad (2)$$

Following Ghader and Monz (2017), we compute the entropy in the self-attention heads by interpreting each token's attention distribution as a probability mass function of a discrete random variable. The input embeddings are the possible outcomes, and the attention weights their probabilities.
For the sake of simplicity, we now discuss the computation of attention entropy of a single token in a standard transformer encoder. Attention weights are first averaged over heads by defining $a'_{i,j} = \frac{1}{N} \sum_{h} a_{h,i,j}$ as the mean attention that the token at position $i$ pays to the token at position $j$. Then, we define a probability mass function by applying a softmax operator:

$$p_i = \mathrm{softmax}([a'_{i,0}, \ldots, a'_{i,d_s}]) \quad (3)$$

We define the attention entropy as follows:

$$H_i = -\sum_{j=0}^{d_s} p_{i,j} \log p_{i,j} \quad (4)$$

Intuitively, attention entropy measures the degree of contextualization while constructing the upper layer's embedding. A large entropy suggests that a wide context contributes to the new embedding, while a small entropy tells the opposite: only a few tokens are deemed relevant. From a broader viewpoint, contextualized tokens improve the information passage between consecutive layers by re-distributing the information content for every unit involved.
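The per-token entropy computation above can be sketched in PyTorch; the tensor shapes and the toy attention patterns are illustrative assumptions, not part of the released EAR code.

```python
import torch

def attention_entropy(attn, eps=1e-9):
    """Entropy of each token's attention distribution.

    attn: (num_heads, seq_len, seq_len) self-attention weights of one
    layer for one sentence. Returns (seq_len,) entropies, one per token.
    """
    mean_attn = attn.mean(dim=0)                # a'_{i,j}: average over heads
    p = torch.softmax(mean_attn, dim=-1)        # re-normalize rows into a pmf
    return -(p * (p + eps).log()).sum(dim=-1)   # H_i = -sum_j p_ij log p_ij

# Spread attention (high entropy) vs. attention peaked on one token (low).
uniform = torch.full((2, 4, 4), 0.25)
peaked = torch.eye(4).expand(2, 4, 4)
assert attention_entropy(uniform).mean() > attention_entropy(peaked).mean()
```

The softmax re-normalization mirrors Equation 3: even after averaging over heads, each row is forced back into a proper probability mass function before the entropy is taken.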
Figure 2 shows a toy example of self-attention distributions for two arbitrary tokens. Solid orange bars correspond to $a'_{\text{Girl},j}$, while shaded blue bars correspond to $a'_{\text{you},j}$. The toy example illustrates the correlation between attention distributions and entropy. The representation of you uses a wider context and, thus, it has a higher attention entropy. Note that, if present, we discard padding tokens from the attention entropy computation. Conversely, we include special tokens when required by the downstream task.

EAR in BERT. We introduced attention entropy above as a proxy for the degree of contextualization of token representations. Following this intuition, we propose BERT with EAR mitigation (BERT+EAR), a novel model trained to learn tokens with maximal self-attention entropy over the input sequence. We fine-tune BERT+EAR on the downstream task of hate speech detection. Note, though, that the approach is feasible for any classification task. In classification models, having more contextualized tokens prevents individual terms from driving the classification outcome by receiving a disproportionate share of attention.
Although EAR is applicable to any Transformer-based model, we base our approach here on the BERT (Devlin et al., 2019) base architecture. BERT provides an informative case study, given the number of architectures it has spawned and the recent interest in its attention patterns (Clark et al., 2019b; Kovaleva et al., 2019; Serrano and Smith, 2019). BERT consists of twelve stacked transformer encoders, each running self-attention on the output of the previous encoder. In BERT+EAR, for every transformer layer in the architecture, we build new tokens with the maximal information content coming from the previous layer. Using Equation 4, we first compute the attention entropy of each token in the input sentence. We then take their mean and define the average contextualization for the $\ell$-th layer as

$$H^\ell = \frac{1}{d_s} \sum_{i=0}^{d_s} H^\ell_i$$

where $H^\ell_i$ is the attention entropy of the token at position $i$, and $d_s$ is the length of the input sequence (excluding the padding tokens but including the [CLS] and [SEP] special tokens). Finally, we introduce a new regularization term to the model loss to maximize the entropy at each layer:

$$L = L_C + \alpha L_R, \qquad L_R = -\sum_{\ell} H^\ell$$

$L$ is the total loss, $L_C$ and $L_R$ are the classification and regularization loss, respectively, and $\alpha \in \mathbb{R}$ is the regularization strength. As in previous work, $L_C$ is the Cross-Entropy loss obtained with a linear layer on top of the last encoder as a classification head. It receives the [CLS] embedding and outputs the probability of the positive class (Hate).
The new regularization term $L_R$ frames the task of maximal contextualization learning in the network. This framing has several advantages over existing approaches. First, it is a sum of differentiable terms and is hence differentiable. We can thus optimize BERT+EAR with classical back-propagation updates. Second, the regularization is agnostic to specific identity terms. It instead induces the network to learn contextualized tokens globally. This induction is crucial to regularize biased terms that might not be known in advance. Finally, note that $L_R$ pools each layer's entropy-based contribution $H^\ell$. Each term $H^\ell$ in turn depends only on the attention entropy defined in Equation 4. This makes the setup a general framework not limited to BERT: $L_R$ can be used to evaluate and maximize token contextualization in any attention-based architecture.
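A minimal PyTorch sketch of the combined objective, assuming attention tensors in the shape returned by a Hugging Face BERT called with output_attentions=True; padding handling is omitted here for brevity, and the function names are ours.

```python
import torch

def ear_regularizer(attentions, eps=1e-9):
    """EAR term: negative mean attention entropy, summed over layers.

    attentions: list of per-layer tensors, each of shape
    (batch, num_heads, seq_len, seq_len). Minimizing the returned
    value maximizes the average contextualization H^l of every layer.
    """
    layer_entropies = []
    for attn in attentions:
        mean_attn = attn.mean(dim=1)               # average over heads
        p = torch.softmax(mean_attn, dim=-1)       # per-token pmf
        ent = -(p * (p + eps).log()).sum(dim=-1)   # (batch, seq_len)
        layer_entropies.append(ent.mean())         # H^l, averaged over batch
    return -torch.stack(layer_entropies).sum()     # L_R = -sum_l H^l

def total_loss(logits, labels, attentions, alpha=0.01):
    """L = L_C + alpha * L_R (alpha = 0.01 as in the paper's appendix)."""
    l_c = torch.nn.functional.cross_entropy(logits, labels)
    return l_c + alpha * ear_regularizer(attentions)
```

The sign flip is the key design choice: since optimizers minimize the loss, maximizing entropy requires $L_R$ to be the negated sum of layer entropies.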
Figure 3 shows a graphical overview of BERT+EAR. Each layer provides a contextualization score contributing to the loss independently, where layers with low average contextualization increase the loss the most. Note also that, similarly to He et al. (2016), $L_R$ introduces skip connections between layers and the classification head, providing shorter paths for the contextualization information to flow.
Insights from attention entropy. On the one hand, we use attention entropy maximization to train BERT+EAR and test its classification and bias mitigation performance. On the other hand, we can leverage attention entropy to automatically extract the tokens with the lowest contextualization, which are the most likely to induce unintended bias. When a sentence is fed through a model like BERT, we can inspect the attention distribution of its terms.
We propose to exploit entropy, and hence contextualization, to gain insights into any attention-based model. Given a corpus and a model we want to inspect, we repeatedly query the model with sentences from the corpus and collect each token's attention entropy. Finally, we take each token's mean entropy to measure its impact on bias, where lower is worse. Note that the same term can impact bias differently depending on the sentence.
While our approach works for any attention-based model and data set, we test it on fine-tuned classifiers to extract the biased terms learned on the training data set. We discuss this functionality in Section 5.
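The corpus-level aggregation can be sketched as follows. The helper and the toy (token, entropy) pairs are illustrative; in practice the pairs would be collected by running the fine-tuned model with output_attentions=True and applying the entropy computation of Equation 4 to each token.

```python
from collections import defaultdict

def rank_by_entropy(token_entropy_pairs):
    """Tokens sorted by mean attention entropy, lowest first.

    The lowest-entropy tokens are the least contextualized and thus
    the most likely to induce unintended bias.
    `token_entropy_pairs`: iterable of (token, entropy) observations
    gathered over all sentences of a corpus.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for token, h in token_entropy_pairs:
        sums[token] += h
        counts[token] += 1
    return sorted(sums, key=lambda t: sums[t] / counts[t])

# Toy observations: "girl" is consistently low-entropy (overfitted).
pairs = [("girl", 0.4), ("you", 2.1), ("girl", 0.6), ("adore", 1.9)]
assert rank_by_entropy(pairs) == ["girl", "adore", "you"]
```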

Experimental settings
In this work, we consider the problem of unintended bias (Dixon et al., 2018): "a model contains unintended bias if it performs better for comments containing some particular identity terms than for comments containing others".
Datasets. Unintended bias is measured on synthetic test sets, artificially generated by filling manually defined contexts with identity terms (e.g., "I hate all ___", "I love all ___"). By construction, each identity term appears 50% of the time in hateful contexts and 50% in non-hateful ones. If a model then classifies the instances related to one identity term differently than the others, the model contains unintended bias towards that term, e.g., if every instance containing the term women is labelled hateful, independently of the context. Synthetic test sets simulate new data, so a model with low performance on this set demonstrates poor generalization and is unsuitable for real-world contexts and applications.
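A minimal sketch of this template-filling scheme; the templates and identity terms below are invented for illustration, not the ones used in the benchmark test sets.

```python
from itertools import product

# Illustrative templates and identity terms (not from the paper).
hateful_templates = ["I hate all {}.", "All {} are disgusting."]
benign_templates = ["I love all {}.", "All {} are wonderful."]
identity_terms = ["women", "muslims", "immigrants"]

synthetic = [(t.format(x), 1) for t, x in product(hateful_templates, identity_terms)]
synthetic += [(t.format(x), 0) for t, x in product(benign_templates, identity_terms)]

# By construction, each identity term appears equally often in hateful
# and non-hateful contexts, so the label distribution is exactly 50/50.
assert sum(label for _, label in synthetic) * 2 == len(synthetic)
```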
We test BERT+EAR on hate speech datasets with associated synthetic test sets to measure unintended bias.
MISOGYNY (EN) (Fersini et al., 2018) is a state-of-the-art corpus for misogyny detection in English. The related synthetic test set (Nozza et al., 2019) was created via several manually defined templates, with synonyms of "woman" as identity terms. MISOGYNY (ITA) (Fersini et al., 2020) is the benchmark corpus for misogyny detection in Italian. Its synthetic test set was generated similarly to the English one. This dataset allows us to study EAR's impact on cross-lingual adaptation.
MULTILINGUAL AND MULTI-ASPECT HATE SPEECH (MLMA) (Ousidhoum et al., 2019) consists of tweets with various hate speech targets. We work on its English part. We use the synthetic test set provided in Dixon et al. (2018), generated by slotting a wide range of identity terms into manually defined templates.
Table 1 reports statistics of the data sets. Alongside the size of the train, test, and validation sets, we also report the percentage of hateful instances to show the class balance. Note that MLMA is highly unbalanced, with 88% of instances associated with the hateful class. Note also that the original MULTILINGUAL AND MULTI-ASPECT dataset comes in a multi-label, multi-class setting. Following Ousidhoum et al. (2021), we used the Hostility dimension of the dataset as target label and created a binary Hate label from it as follows: we considered single-labeled "Normal" instances to be non-hate/non-toxic and all the other instances to be toxic.
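The Hostility-to-binary mapping described above can be sketched as follows, assuming each instance carries its list of hostility labels.

```python
def to_binary_hate(hostility_labels):
    """Map MLMA Hostility labels to a binary Hate label (sketch).

    An instance is non-toxic only if it is single-labeled "Normal";
    every other labeling is considered toxic.
    """
    return 0 if hostility_labels == ["Normal"] else 1

assert to_binary_hate(["Normal"]) == 0
assert to_binary_hate(["abusive"]) == 1
assert to_binary_hate(["Normal", "offensive"]) == 1  # multi-label -> toxic
```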
To further characterize our data sets, we explore the aspect of selection bias, reporting the measure B² (Ousidhoum et al., 2020). The metric ranges from 0 to 1 and evaluates how likely topics of the data set are to contain keywords of the data collection. Values above 0.7 indicate high selection bias, implying the need for debiasing procedures.
We also report the size and number of identity terms used in the synthetic test sets. The percentage of hateful content is perfectly balanced (50%), since each identity term must appear in exactly the same contexts as the others to measure unintended bias. See Appendix B for the list of identity terms and further preprocessing details.

Metrics
We use the weighted F1-score and the binary F1-score of the hateful class (F1_w and F1_hate) as classification metrics. We consider both due to the class imbalance of the test sets (see Table 1).
We compute the unintended bias metrics from Dixon et al. (2018) and Borkan et al. (2019). They are computed from differences in the score distributions between instances mentioning a specific identity term (subgroup distribution) and the rest (background distribution). The three per-term AUC-based bias scores are: 1) Subgroup AUC (AUC_subgroup) calculates AUC only on the data subset of a given identity term. A low value means the model performs poorly in distinguishing between hateful and non-hateful comments that mention the identity term.
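The three per-term scores can be sketched with scikit-learn; the function name and the toy data are illustrative, and only the subgroup/background masks differ between the metrics.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_term_bias_aucs(y_true, scores, mentions_term):
    """Per-term AUC-based bias metrics (sketch, after Borkan et al., 2019).

    y_true: 0/1 gold labels; scores: model scores for the hateful class;
    mentions_term: boolean mask, True where the identity term appears.
    """
    y, s, m = (np.asarray(a) for a in (y_true, scores, mentions_term))
    subgroup = m                              # only the identity-term subset
    bpsn = (~m & (y == 1)) | (m & (y == 0))   # background pos + subgroup neg
    bnsp = (~m & (y == 0)) | (m & (y == 1))   # background neg + subgroup pos
    return {name: roc_auc_score(y[mask], s[mask])
            for name, mask in [("subgroup", subgroup),
                               ("bpsn", bpsn), ("bnsp", bnsp)]}

# Toy, perfectly-ranked data: every metric reaches 1.0.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
mentions = [True] * 4 + [False] * 4
aucs = per_term_bias_aucs(y_true, scores, mentions)
```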

Baselines
We compare BERT+EAR against the following existing approaches: (1) BERT (Devlin et al., 2019); (2) BERT+SOC mitigation (Kennedy et al., 2020), where the authors modify BERT's loss to lower the importance weight of identity terms, computed with the Sampling-and-Occlusion (SOC) algorithm (Jin et al., 2019); (3) BERT with data augmentation, obtained by training the model on additional samples from Wikipedia articles (assumed to be non-hateful) to balance the distribution of specific identity terms. Nozza et al. (2019) extracted these additional non-hateful samples from an external Twitter corpus (Waseem and Hovy, 2016).
To assess the impact of different term lists, we also consider two further versions of BERT+SOC mitigation: one tests the effect of missing identity terms, and the other translates the identity terms to adapt to a new language.

Experimental Results
Table 2 shows classification and bias metrics on both the synthetic and standard test sets for the three corpora, i.e., MISOGYNY (EN) (top), MISOGYNY (ITA) (middle), and MLMA (bottom). The top rows in each table section report the performance of hate speech detection models specifically proposed for the respective dataset. The lower rows show the results of the baselines and BERT+EAR. BERT+SOC mitigation uses the identity terms from Kennedy et al. (2020) (see Appendix C), unless a different identity term list is specified (e.g., "BERT+SOC mitigation, translated ITs").
BERT+EAR obtains comparable and, in most cases, better performance on all three datasets than all state-of-the-art debiasing approaches, which are based on (i) knowledge of identity terms and (ii) data augmentation techniques. However, identity terms are not always readily available, which severely limits the generalization of those approaches. Similarly, there are several drawbacks to data augmentation with (assumed) non-hateful samples containing the identity terms. 1) Data augmentation is expensive: it requires filtering a large dataset (usually Wikipedia) and retraining the model with a much larger set of instances. 2) Data augmentation with task-specific identity terms requires prior knowledge of those terms, and is therefore limited by the authors' knowledge. 3) The overlap between identity terms in the evaluation set and the augmented data inevitably (but somewhat unfairly) improves the performance on the synthetic dataset.
BERT+EAR is overall the best debiasing model considering the proposed bias metrics. The only exception is MISOGYNY (EN), for which BERT+EAR has lower AUC_bnsp and AUC_bpsn than BERT+SOC mitigation. The latter's advantage, however, comes with high variability in the results. BERT+SOC mitigation seems more sensitive to random initialization: the standard deviation over 10 runs is 37%, compared to 13% for BERT+EAR. Figure 4 shows the AUC_subgroup metric broken down by identity term on MISOGYNY (EN). We compare BERT and BERT+EAR over 10 different initialization runs. EAR improves BERT across all identity terms.

Most existing models and AUC-based metrics for unintended bias focus only on the false positives (i.e., non-hateful instances wrongly recognized as hateful). While correctly recognizing hateful instances is important, we believe that the problem of false negatives is equally important. Since BERT+EAR does not rely on identity term lists, it regularizes terms that impact both the positive and negative class. BERT+EAR obtains an average decrease of 15.04% in false negative rate compared to BERT and BERT+SOC mitigation. Indeed, the performance difference between BERT+EAR vs. BERT and BERT+SOC is mainly due to non-hateful instances (∼95% of the time). Reducing the impact of overfitting terms like f*ck and p*ssy in MISOGYNY (EN) causes BERT+EAR to consider a larger context and correctly label such instances as non-hateful.

Error Analysis
Table 3 shows tweets from the MISOGYNY (EN) data set which have been correctly predicted by BERT+EAR but misclassified by BERT or BERT+SOC. These tweets serve as qualitative examples of the effectiveness of forcing the model to attend to a wider context and not overfit to training-specific terms, exploiting the richness of information (Nozza et al., 2017). The examples are an excerpt of the most common cases where BERT+EAR classifies non-hateful examples correctly: (1) when slurs or negative words (such as sk*nk) are used in a non-hateful context, like slang or lyrics, (2) when many words associated with misogyny appear in the sentence (e.g., rape, abuse), and (3) when the hateful target is male and the instance should not be classified as misogynous. The use of a wider context allows BERT+EAR to identify such non-misogynous instances, unlike BERT and BERT+SOC. In particular, BERT+SOC is even more biased in these cases because its debiasing technique overly relies on specific terms (e.g., woman) and increases overfitting to training-specific examples.

Impact of predefined identity terms
We also analyze the impact of predefined identity term lists on performance by evaluating the effect of (i) missing identity terms, and (ii) adapting to a new language where the list is unavailable.
First, we remove from the BERT+SOC identity term list every term that appears at least once in the MISOGYNY (EN) evaluation set, here women and woman out of 24 terms. This reflects the real-world case where the identity term list does not contain a specific group present in the data. The significant performance drop resulting from this case (Table 2, top, "missing ITs") highlights a strong weakness of term-based mitigation strategies.
Second, we analyze the case where identity terms need to be adapted to a new language, e.g., Italian. We translated the English identity terms from BERT+SOC to Italian via Google Translate. Table 2 (middle, "translated ITs") shows that the performance is lower than BERT+EAR's. A simple translation of predefined identity terms is therefore not an option for cross-lingual settings. This aligns with the findings of Nozza (2021), which demonstrated that cross-lingual hate speech detection is limited by the use of non-hateful, language-specific taboo interjections that are not directly translatable.
In sum, we demonstrated that relying on a predefined list of identity terms strongly limits the performance and generalizability of the model. In contrast, BERT+EAR's independence from any predefined terms makes it the ideal model in real-world scenarios.

Extracting overfitting terms
While being the core of EAR, attention entropy serves another purpose. Once standard fine-tuning is concluded (i.e., with no regularization involved), models have overfitted to specific terms. We identify these terms using attention entropy.
To extract the most indicative terms, we replicate training conditions. Specifically, we run inference on all the training data using a fine-tuned checkpoint and a standard BERT tokenizer. We collect attention entropy values for each term and average them over all training instances. Terms with the lowest average entropy show the highest overfitting, as the model learned them with a narrow context. Retrieving these terms after training allows us to gain insights into the domain- and language-specific aspects driving the outcome.
Table 4 shows the top 10 terms with the highest lexical overfitting on the studied datasets, extracted from the corresponding fine-tuned models. We extract terms strongly correlated with the positive class, e.g., womens*ck (97%), shut (96%), n*gger (92%), sb*rro (97%), c*lone (95%). Note that these terms are not frequent in the corpus. Overfitting terms appear with an average document frequency of only 4.7%, while the most frequent terms have 32.5% average document frequency across datasets. These results suggest that the higher the class polarization of a token, the narrower the context BERT will use to learn its representation, and the higher the overfitting.

Related Work
The first works to study bias measurement and mitigation in neural representations aimed at removing implicit gender bias from word embeddings (Bolukbasi et al., 2016; Caliskan et al., 2017; Garg et al., 2018; Romanov et al., 2019; Ravfogel et al., 2020). More recently, researchers have started to focus on contextualized sentence representations and effective neural models for understanding the presence and resolution of bias (Nozza et al., 2021; Ousidhoum et al., 2021).
While the majority of proposed approaches focus on data augmentation (Dixon et al., 2018; Nozza et al., 2019; Sharma et al., 2020; Bartl et al., 2020; de Vassimon Manela et al., 2021), different approaches have been proposed for bias mitigation that intervene directly in the objective function. Kennedy et al. (2020) proposed to apply regularization during training to the explanation-based importance of identity terms, obtained with Sampling-and-Occlusion (SOC) explanations (Jin et al., 2019). Kaneko and Bollegala (2021) proposed a method for debiasing pre-trained contextual representations by retaining the learned semantic information for gender-related words (e.g., she, woman, he, man) while simultaneously removing any stereotypical biases in the pre-trained model. Zhou et al. (2021) exploited debiasing methods for natural language understanding (Clark et al., 2019a) to explicitly determine how much to trust the bias given the input. Vaidya et al. (2020) proposed a multi-task learning model for predicting the presence of identity terms alongside the toxicity of a sentence.
The main drawback of all aforementioned works is their strict reliance on a set of predefined identity terms.This list can be either defined manually by experts or extracted a-priori from the data set.In both cases, the subsequent debiasing models will be strongly affected by these biased terms, limiting the applicability of the trained model to new data.This is a severe limitation, since it is not always possible to retrain a model on new data to reduce bias, resulting in limited use in real-world cases.

Conclusion
We introduce EAR, a regularization approach applicable to any attention-based model. Our approach does not require any a-priori knowledge of identity terms, e.g., lists. This feature (i) allows us to generalize to different languages and contexts, and (ii) avoids neglecting important terms, thus preventing the introduction of further bias. As part of the training procedure, EAR also discovers the impact of relevant domain-specific terms. This automatic term extraction provides researchers with an analysis tool to improve data collection and bias mitigation approaches.
EAR, applied to BERT, reliably classifies data with competitive performance and substantially improves various bias metrics.BERT+EAR generalizes better to new domains and languages than similar methods.
In future work, we will apply EAR-based models to different downstream tasks to both improve bias mitigation and automatically extract biased terms.

A Details on self-attention in Transformers
The Transformer (Vaswani et al., 2017) is the building block of many recent neural language models. A Transformer model consists of two connected units, an encoder and a decoder, which align a source and a target sequence. Differently from the original formulation, large language models such as BERT drop the decoder and use the remaining encoder to process a single input sequence.
A transformer encoder consists of a multi-head self-attention block and a position-wise, fully connected feed-forward neural network. Both the self-attention block and the feed-forward network adopt a residual skip connection and layer normalization. We provide details for a standard forward pass in the encoder. In attention blocks, the multi-head output is computed with Scaled Dot-Product Attention between a set of queries and keys of dimension $d_k$, and a set of values of dimension $d_v$. Let $Q$, $K$, and $V$ be the respective matrix representations. The attention is then computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

To improve expressiveness, the operation is performed on $N$ different, independent linear projections of the same queries, keys, and values, so that $N$ attention heads are produced. The heads are then concatenated, projected back to the original input space, and finally fed through the fully connected neural network to produce the next layer's embeddings. Let $E = [e_0, \ldots, e_{d_s}]$ be the sequence of input embeddings, with $e_i \in \mathbb{R}^{d_m}$. In the specific case of a transformer encoder, queries, keys, and values correspond to the input embeddings, i.e., $Q = K = V = E$. As such, the output of the multi-head self-attention block is computed by applying the previous equation to the $N$ token projections, then concatenating and projecting back to the original space:

$$\mathrm{MultiHead}(E) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_N)\,W^O, \qquad \mathrm{head}_h = \mathrm{Attention}(E W_h^Q, E W_h^K, E W_h^V)$$
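A NumPy sketch of scaled dot-product self-attention as defined above: a single head with no learned projections, for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))              # d_s = 5 tokens, d_m = 8
out, w = attention(E, E, E)              # self-attention: Q = K = V = E
assert np.allclose(w.sum(axis=-1), 1.0)  # each row is a distribution
```

Each row of `w` is exactly the attention distribution whose entropy EAR maximizes.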

B Experimental setup
Hyper-parameters. All our experiments use the Hugging Face transformers library (Wolf et al., 2020). We base our models and tokenizers on the bert-base-uncased checkpoint for English tasks and on the dbmdz/bert-base-italian-uncased checkpoint for Italian. We pre-process and tokenize our data using the standard pre-trained BERT tokenizer, with a maximum sequence length of 120 and right padding. We train all models with the following hyper-parameters: batch size = 64, learning rate = 0.00002, weight decay = 0.01, learning rate warmup steps = 10%, full precision, maximum number of training epochs = 30, and early stopping after 5 epochs without validation loss improvement. Table 2 reports results of BERT+EAR trained for 20 epochs with no early stopping and regularization strength α = 0.01. We chose the latter parameters with a grid search on α ∈ [0.0001, 0.001, 0.01, 0.1, 1] and epochs ∈ [10, 20, 30, 40, 50]. When fine-tuning on MULTILINGUAL AND MULTI-ASPECT, we use a weighted cross-entropy classification loss (L_C) to discount class unbalance. Specifically, we normalize the loss for data points belonging to class C by the prior probability of C, evaluated as its relative frequency in the training set.
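The class-weighted cross-entropy can be sketched in PyTorch; the label counts below are illustrative of MLMA's roughly 88% hateful share, not the actual dataset.

```python
import torch

# Illustrative training labels: ~88% hateful (class 1), 12% not (class 0).
train_labels = torch.tensor([1] * 88 + [0] * 12)
freqs = torch.bincount(train_labels).float() / len(train_labels)

# Normalize each class's loss by its prior: weight_C = 1 / p(C), so that
# errors on the minority (non-hateful) class cost more.
loss_fn = torch.nn.CrossEntropyLoss(weight=1.0 / freqs)

logits = torch.zeros(4, 2)              # an uninformative classifier
labels = torch.tensor([0, 1, 0, 1])
loss = loss_fn(logits, labels)
```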
We trained all models with 10 different initialization seeds per parameter configuration and averaged over them to obtain stable results and meaningfully compute significance.

Statistical significance
We compute the statistical significance of BERT+EAR over BERT and BERT with SOC mitigation via bootstrap sampling, following Søgaard et al. (2014), using the • and △ symbols (and their filled counterparts for stronger significance), respectively. We use 1000 bootstrap samples and a sample size of 20%. For hate speech, significance can only be computed on F1-scores, since bias metrics require an assumption about the label distribution across identity terms that is not given.
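A sketch of the bootstrap procedure under stated assumptions: the exact test statistic in Søgaard et al. (2014) may differ; here system B is simply compared against system A on resampled 20% subsets.

```python
import numpy as np

def bootstrap_p_value(metric, preds_a, preds_b, gold,
                      n_samples=1000, sample_frac=0.2, seed=0):
    """Fraction of bootstrap samples where system B matches or beats
    system A; a small value suggests A's advantage is significant.

    metric: callable(gold_subset, preds_subset) -> score (higher = better).
    """
    rng = np.random.default_rng(seed)
    n = len(gold)
    k = max(1, int(n * sample_frac))
    wins = 0
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=k)   # resample with replacement
        if metric(gold[idx], preds_b[idx]) >= metric(gold[idx], preds_a[idx]):
            wins += 1
    return wins / n_samples
```

A usage example with accuracy as the metric: a perfect system A against an always-wrong system B yields a p-value of 0.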

Selection bias
We computed the B² metric following Ousidhoum et al. (2020). Specifically, we ran the authors' code on each of our training datasets, using the query keywords used to sample each dataset. For queries composed of multiple words, we split them and considered each word a separate keyword.

Dataset preprocessing
The original MULTILINGUAL AND MULTI-ASPECT dataset comes in a multi-label, multi-class setting. Following Ousidhoum et al. (2021), we used the Hostility dimension of the dataset as target label and created a binary Hate label from it as follows: we considered single-labeled "Normal" instances to be non-hate/non-toxic and all the other instances to be toxic.

Computation time
We report NVIDIA Tesla V100 PCIE-16GB-equivalent computation time for the tested models. Averaging across the three presented data sets, training and evaluating 10 seeds of BERT+EAR (without early stopping) requires 22 hours, compared to 72 hours for BERT+SOC and 7 hours for BERT. The regularization of attention entropy does not significantly affect the computation time.
CO₂ emissions. Experiments were conducted using a private infrastructure with an estimated carbon efficiency of 0.432 kgCO₂eq/kWh. A cumulative 319 hours of computation was performed on hardware of type Tesla V100-PCIE-16GB (TDP of 300W). Total emissions are estimated to be 41.34 kgCO₂eq. Estimations were conducted using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Figure 3 :
Figure 3: Overview of BERT+EAR. Grey boxes are Transformer layers; each builds a token with attention entropy $H^\ell_i$. The right green box pools the layer-wise contextualization contributions and outputs the regularization loss. The first layer's self-attention distribution (bottom) is shown for you (shaded blue) and Girl (solid orange).

Figure 4 :
Figure 4: AUC subgroup results broken down by identity term on MISOGYNY (EN).
Table 3 example tweets (body text, partially garbled in extraction):
- such a f*cking hoe, I love it - the new Kanye and Lil Pump
- I kings make women feel comfortable about their sexuality.
- school drive me insane. like cool b*tch! im depressed too!! doesnt mean im a f*cking c

Table 1 :
Statistics of the data sets.
2) Background Positive Subgroup Negative (AUC_bpsn) calculates AUC on the hateful background examples and the non-hateful subgroup examples. A low value means that the model confuses non-hateful examples that mention the identity term with hateful examples that do not. 3) Background Negative Subgroup Positive (AUC_bnsp) calculates AUC on the non-hateful background examples and the hateful subgroup examples. A low value means that the model confuses hateful examples that mention the identity term with non-hateful examples that do not.

Table 3 :
Examples of MISOGYNY (EN) tweets misclassified by BERT or BERT+SOC, and correctly classified by BERT+EAR. Next to the tweet text, we report the ground truth label and the prediction of each model. Exact phrasing changed to protect privacy.