Dynamically Refined Regularization for Improving Cross-corpora Hate Speech Detection

Hate speech classifiers exhibit substantial performance degradation when evaluated on datasets different from the source. This is due to spurious correlations, learned from the training corpus, between hate speech labels and words that are not necessarily relevant to hateful language. Previous work has attempted to mitigate this problem by regularizing specific terms from pre-defined static dictionaries. While this has been demonstrated to improve the generalizability of classifiers, the coverage of such methods is limited and the dictionaries require regular manual updates from human experts. In this paper, we propose to automatically identify and reduce spurious correlations using attribution methods, with dynamic refinement of the list of terms to be regularized during training. Our approach is flexible and improves cross-corpora performance over previous work, both independently and in combination with pre-defined dictionaries.


Introduction
The relative sparsity of hateful content in the real world means that many of the standard hate speech corpora are crawled through keyword-based sampling (Poletto et al., 2021) rather than random sampling. Thus, hate speech classifiers (D'Sa et al., 2020; Mozafari et al., 2019; Badjatiya et al., 2017) often learn spurious correlations from the training corpus (Wiegand et al., 2019), leading to a substantial performance degradation when evaluated on a corpus with a different distribution (Yin and Zubiaga, 2021; Bose et al., 2021; Florio et al., 2020; Arango et al., 2019; Swamy et al., 2019; Karan and Šnajder, 2018).
Recent work has proposed regularization mechanisms to penalize spurious correlations by attempting to explain model predictions using feature attribution methods (Ross et al., 2017; Rieger et al., 2020; Adebayo et al., 2020). These methods assign importance scores to the input tokens that contribute most towards a particular prediction (Lundberg and Lee, 2017). For instance, Liu and Avci (2019) penalize the attributions assigned to tokens contained in a manually curated dictionary of group identifiers (e.g. women, jews) that are often known to be targets of hate. Kennedy et al. (2020) extract group identifiers manually from the top tokens indicated by a bag-of-words logistic regression model trained on the source corpus. However, regularizing only group identifiers limits the coverage of such approaches, and may not capture other forms of corpus-specific correlations learned by the classifier, limiting its performance on a new corpus. Moreover, such manually curated lists may not always remain up-to-date because new terms emerge frequently (Grieve et al., 2018). While Yao et al. (2021) do not use such lists for refining models in different target domains, their method still requires input from human annotators.
In this paper, we hypothesize that the classification errors on a small annotated subset of the target corpus can reveal spurious correlations between tokens and hate speech labels learned from the source (see Table 1). To this end, we propose Dynamic Model Refinement (D-Ref), a new method to identify and penalize spurious tokens using feature attribution methods. We demonstrate that D-Ref improves the overall cross-corpora performance both independently and in combination with pre-defined dictionaries.

Dynamic Model Refinement (D-Ref)
In this section, we describe the general theoretical framework of the proposed approach. We assume that during training our hate speech classification model has access to the source training corpus D_S^train and a small validation set D_T^val from a target corpus with a different distribution, following a setting similar to Maharana and Bansal (2020). Our Dynamic Model Refinement (D-Ref) approach consists of two recurring steps across epochs: (i) we first extract a set of spurious tokens using D_T^val at the end of every epoch; and (ii) we then penalize the extracted tokens during the next epoch.
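The two recurring steps above can be sketched as a simple training loop. The helper names `extract_fn` and `penalize_fn` are hypothetical stand-ins for the attribution-based token extraction and the penalization mechanisms described in the following subsections; this is a minimal illustration, not the paper's implementation.

```python
# Minimal sketch of the D-Ref loop: penalize last epoch's spurious tokens
# while training (step ii), then re-extract spurious tokens on the small
# target validation set at the end of the epoch (step i).
def d_ref_train(model, train_source, val_target, epochs, extract_fn, penalize_fn):
    spurious = set()  # S_{ep_i}: refreshed at the end of every epoch
    history = []
    for epoch in range(epochs):
        # step (ii): one training epoch on the source, penalizing S_{ep_{i-1}}
        model = penalize_fn(model, train_source, spurious)
        # step (i): extract spurious tokens from D_T^val with the current model
        spurious = extract_fn(model, val_target)
        history.append(set(spurious))
    return model, history
```

The first epoch runs with an empty penalty set, so the loop degrades gracefully to vanilla fine-tuning before any errors on the target validation set have been observed.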

Extraction of Spurious Tokens
Global token-ranking in source corpus: We first begin by identifying the tokens from D_S^train that are highly correlated with the hate/non-hate labels. These tokens are suitable candidates for causing source-specific spurious correlations, restricting generalizability to a new corpus.
For that purpose, at the end of every training epoch ep_i, we first obtain a global class-specific ranked list of tokens from D_S^train. This is achieved by computing global attributions per token tok and class c (gl-atr_tok^c) from its attribution per instance j (loc-atr_tok^j), averaged across all training instances classified as c by the source model trained until ep_i:

gl-atr_tok^c = [ Σ_{j=1}^{|D_S^train|} 1_{ŷ_j = c} Σ_{occurrences of tok in j} loc-atr_tok^j ] / [ Σ_{j=1}^{|D_S^train|} 1_{ŷ_j = c} #(occurrences of tok in j) ]    (1)

Here c ∈ {hate, non-hate}, ŷ is the predicted class and 1 is the indicator function. Prior to this, the loc-atr_tok^j are individually normalized using a sigmoid to obtain values in a closed range. Rarely occurring tokens and stop-words are not considered for the global ranking. The gl-atr_tok^c values are sorted from the highest globally attributed token to the lowest, which yields two ranked token-lists [gl-hate, gl-nhate]_{ep_i}.
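The occurrence-weighted average of Equation 1 can be sketched as follows. The argument names are illustrative: `instances` is a list of token lists, `local_attr` a parallel list of per-occurrence attribution scores (assumed already sigmoid-normalized), and `predictions` the model's predicted classes.

```python
from collections import defaultdict

# Sketch of Equation 1: for each token, average its per-occurrence local
# attributions over all instances the source model predicts as class c.
def global_attributions(instances, predictions, local_attr, c):
    num = defaultdict(float)  # sum of loc-atr over every occurrence of tok
    den = defaultdict(int)    # number of occurrences of tok
    for tokens, y_hat, scores in zip(instances, predictions, local_attr):
        if y_hat != c:        # indicator 1[y_hat = c]
            continue
        for tok, s in zip(tokens, scores):
            num[tok] += s
            den[tok] += 1
    return {tok: num[tok] / den[tok] for tok in num}
```

Sorting the returned dictionary by value then yields the ranked list gl-hate (for c = hate) or gl-nhate (for c = non-hate).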

Instance-level local ranking in target corpus:
We hypothesize that the tokens highly correlated with the hate/non-hate classes in the source, but also causing misclassifications in the target, most likely contribute to spurious source-specific correlations, and may not be important for hate speech labels. Thus, we identify the tokens that cause misclassifications in D_T^val, and then obtain a list of spurious tokens dynamically after every epoch ep_i.
We rank the tokens in the target instances from D_T^val based on their loc-atr_tok^j, from the highest attributed token per instance j to the lowest, and select the top k tokens per instance, where k is a hyper-parameter. We treat the two error cases of False Positives (FP) and False Negatives (FN) separately. Here the hate class is considered the positive class.
Since the tokens responsible for FP may also be important for the True Positives (TP), we only extract those that have high attributions for FP, but not for TP. A further filtering step is then applied, where only the tokens also present in the top N of the ranked gl-hate list are retained. This discards the tokens that may not be globally correlated with a class with respect to the source model, and yields tok_FP for D_T^val. Similarly, the top k tokens corresponding to FN instances are extracted, wherein those common to the True Negatives (TN) are discarded, and subsequent filtering based on gl-nhate is performed, yielding tok_FN.
This step thus yields a list of possible spurious tokens at the end of ep_i: S_{ep_i} = [tok_FP, tok_FN]_{ep_i}.
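The set algebra of this extraction step can be sketched directly. The inputs are assumed to be the per-error-case top-k token lists and the two globally ranked lists; all argument names are illustrative.

```python
# Sketch of the spurious-token extraction: tokens highly attributed in the
# error cases (FP/FN) but not in the corresponding correct cases (TP/TN),
# intersected with the top-N globally ranked tokens of the matching class.
def extract_spurious(topk, ranked, N):
    def one_side(err, correct, global_list):
        return (set(err) - set(correct)) & set(global_list[:N])
    tok_fp = one_side(topk["FP"], topk["TP"], ranked["gl-hate"])
    tok_fn = one_side(topk["FN"], topk["TN"], ranked["gl-nhate"])
    return sorted(tok_fp | tok_fn)  # S_{ep_i}
```

The set difference implements "high attribution for FP but not for TP", and the intersection implements the top-N global filtering step.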

Penalizing the Extracted Spurious Tokens
In this step, we attempt to reduce the importance assigned by the source model to the extracted spurious tokens, by penalizing the terms in S_{ep_i} during the next epoch ep_{i+1}. We propose three different ways of token penalization. Tok-mask: In this case, we simply mask the tokens from S_{ep_i} present in D_S^train after every ep_i and then train the source model during ep_{i+1}.
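Tok-mask amounts to a simple substitution pass over the source training instances before the next epoch. A minimal sketch, assuming a lower-cased spurious-token set and BERT's "[MASK]" token:

```python
# Sketch of Tok-mask: replace every occurrence of a spurious token from
# S_{ep_i} in a tokenized source instance with the mask token.
def tok_mask(tokens, spurious, mask_token="[MASK]"):
    return [mask_token if t.lower() in spurious else t for t in tokens]
```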
Reg: Since token masking might eliminate substantial information, we instead regularize the model using S_{ep_i}. The attributions assigned to these terms are pushed towards zero by the following learning objective on D_S^train:

L = L' + λ L_atr, with L_atr = Σ_{t ∈ S_{ep_i}} φ(t)²    (2)

where L' is the classification loss and L_atr is the attribution loss. Here φ(t) is the attribution score for the token t. Intuitively, this should reduce the importance of tokens contributing to source-specific patterns and encourage learning more general information. Both losses are computed over D_S^train.
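The regularized objective can be sketched as below. The squared attribution term is an assumption in the spirit of Liu and Avci (2019); `phi` is a hypothetical mapping from tokens to their attribution scores for the current batch.

```python
# Sketch of the Reg objective: classification loss plus lambda-weighted
# attribution loss over the extracted spurious tokens, pushing phi(t) -> 0.
def dref_objective(cls_loss, phi, spurious, lam):
    l_atr = sum(phi.get(t, 0.0) ** 2 for t in spurious)  # L_atr (squared form assumed)
    return cls_loss + lam * l_atr                        # L = L' + lambda * L_atr
```

In a real model both terms would be differentiable tensors so that the penalty back-propagates into the encoder; here plain floats illustrate the arithmetic.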
Comb: We finally combine S_{ep_i} with the pre-defined group identifiers from Liu and Avci (2019) and Kennedy et al. (2020) to perform regularization using Equation 2. We surmise that repeating these steps at the end of every epoch should reduce the source-specific correlations while the source model gets trained. We use three different attribution methods: (i) Scaled Attention (α∇α) (Serrano and Smith, 2019): Here the attention weights α_i are scaled with their corresponding gradients ∇α_i = ∂ŷ/∂α_i, where ŷ is the predicted label. Serrano and Smith (2019) show that combining an attention weight with its gradient can better indicate token importance for model predictions than using the attention weights alone.
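Scaled attention is the elementwise product α_i · ∂ŷ/∂α_i. A minimal sketch, with a toy linear predictor (attention-weighted sum of per-token values) standing in for a real model, and finite differences standing in for autograd:

```python
# Sketch of scaled attention: each attention weight is multiplied by the
# gradient of the prediction with respect to that weight.
def scaled_attention(alpha, token_values, eps=1e-6):
    def y_hat(a):  # toy prediction: attention-weighted sum of token values
        return sum(ai * vi for ai, vi in zip(a, token_values))
    importances = []
    for i in range(len(alpha)):
        bumped = list(alpha)
        bumped[i] += eps
        grad_i = (y_hat(bumped) - y_hat(alpha)) / eps  # ~ d y_hat / d alpha_i
        importances.append(alpha[i] * grad_i)          # alpha_i * grad_i
    return importances
```

For the linear toy predictor the gradient with respect to α_i is simply the token value v_i, so the importance reduces to α_i · v_i.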
(ii) Integrated Gradients (IG) (Sundararajan et al., 2017): This method is based on the notion that the gradient of a prediction function with respect to the input can indicate the sensitivity of the prediction to each input dimension. It aggregates the gradients along a path from an uninformative reference input (e.g. the zero embedding vector) towards the actual input, such that the predictions change from uncertainty to certainty.
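A one-dimensional sketch of the path integral makes the idea concrete; finite-difference gradients and a midpoint Riemann sum stand in for autograd and the exact integral:

```python
# Sketch of Integrated Gradients for a scalar input: accumulate gradients
# along the straight path from an uninformative baseline to the input,
# then scale by (input - baseline).
def integrated_gradients(f, x, baseline=0.0, steps=100, eps=1e-6):
    total_grad = 0.0
    for k in range(1, steps + 1):
        # midpoint of the k-th segment on the baseline -> x path
        point = baseline + (k - 0.5) / steps * (x - baseline)
        total_grad += (f(point + eps) - f(point)) / eps
    return (x - baseline) * total_grad / steps
```

A useful sanity check is the completeness property: the attribution should approximate f(x) − f(baseline).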
(iii) Deep Learning Important FeaTures (DeepLIFT/DL) (Shrikumar et al., 2017): This aims to explain the difference of the output from a reference output in terms of the difference of the input from a reference input. Given a target output neuron t, a reference activation t_0 of t, and Δt = t - t_0, it computes the contribution scores C_{Δx_i Δt} of each input neuron x_i that are necessary and sufficient to compute t, such that Σ_{i=1}^{n} C_{Δx_i Δt} = Δt. The reference input could be the zero embedding vector.

Data We use HatEval, Waseem and Dynamic (54.4% hate; train: 32,497; val: 1,016; test: 4,062). We reduce the size of the available D_T^val in Dynamic by randomly sampling 25% of the validation set (4,064 instances). We remove URLs, split hashtags into words using the CrazyTokenizer, remove infrequent Twitter handles, punctuation marks and numbers, and convert the text into lower-case. See Appendix A for a detailed discussion of the corpora.

Experimental
Baselines We compare D-Ref with the following baselines: (i) BERT Van-FT (Devlin et al., 2019): vanilla fine-tuning on D_S^train without regularization; (ii) a Convolutional Neural Network with regularization of pre-defined group identifier terms, using IG for feature attribution (Liu and Avci, 2019); (iii) BERT using two variations for regularization: (a) all the mentioned group identifiers, (b) group identifiers extracted from the top features of a bag-of-words logistic regression trained on each individual corpus (Kennedy et al., 2020); (iv) a χ²-test with one degree of freedom and Yates's correction (Kilgarriff, 2001) to extract tokens tok from D_S^train that reject the null hypothesis with 95% confidence. The null hypothesis states that, in terms of tok, both D_S^train and D_T^val are random samples of the same larger population. We then regularize the attribution scores assigned to these terms with BERT. (v) Pre-def: BERT with regularization of the combined pre-defined group identifiers from (ii) and (iii).
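The χ²-test baseline can be sketched on a 2×2 contingency table per token, assuming rows are the two corpora and columns are occurrences of the token versus all other tokens:

```python
# Sketch of the chi-squared test with Yates's continuity correction on a
# 2x2 table [[a, b], [c, d]]: a = count of tok in D_S^train, b = all other
# tokens in D_S^train; c, d analogously for D_T^val.
def yates_chi2(a, b, c, d):
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2.0) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# With one degree of freedom, a statistic above 3.841 rejects the null
# hypothesis at 95% confidence, flagging tok for regularization.
CRITICAL_95 = 3.841
```

This matches the standard corrected statistic for 2×2 tables; scipy's `chi2_contingency` with `correction=True` computes the same quantity.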

Model training
We use pre-trained BERT (Devlin et al., 2019) for our approach. We train all the models on D_S^train from the source and evaluate on D_T^test from the target. The best models for all the baselines and D-Ref are selected by tuning over D_T^val. See Appendix B for details on hyper-parameter tuning.

Cross-corpora Predictive Performance
Table 2 presents macro-F1 scores across five random initializations of each experiment using six cross-corpora pairs. We observe that overall, all feature-attribution methods with D-Ref improve the cross-corpora performance (statistical significance is assessed with the paired bootstrap test (Dror et al., 2018; Efron and Tibshirani, 1993) at a 95% confidence interval). The weaker results of the χ²-test baseline suggest that although the terms obtained through the χ²-test from the source indicate differences across domains, they may not necessarily be important for the prediction of hate/non-hate labels by the source model, and may not contribute to source-specific spurious correlations. We find that D-Ref-Reg with IG and DL achieves better average macro-F1 of 58.9 and 58.4 respectively, compared to the corresponding Pre-def (IG) and Pre-def (DL) that obtain an average of 57.6. D-Ref-Reg (α∇α) provides an average macro-F1 of 58.7, comparable to Pre-def (α∇α) with 58.8. However, D-Ref-Reg achieves significantly improved scores in more cases than Pre-def across all the attribution methods, i.e. 4/6 cases (α∇α), 3/6 cases (IG) and 3/6 cases (DL) with D-Ref-Reg, compared to 3/6 (α∇α), 1/6 (IG) and none (DL) with Pre-def. D-Ref-Tok-mask exhibits improvements on average (α∇α: 57.8, IG: 58.2, DL: 57.8) over Van-FT, demonstrating the effectiveness of the token extraction mechanism of D-Ref. Finally, D-Ref-Comb displays the best overall performance, with the highest average score of 59.0. We attribute this improvement of D-Ref to its increased coverage through dynamic token extraction and the reduction of spurious source-specific correlations, while the baselines only penalize the group identifiers. A dynamic approach also corrects the model during training, before it can become fully biased towards these tokens. Finally, it can incorporate the pre-defined lists along with the extracted tokens, and further improve the performance.

Domain-Adaptation Approaches
(iv) HATN (Hierarchical Attention Transfer Network) (Li et al., 2017, 2018): This approach uses attention and a domain-adversarial pivot extraction mechanism. (v) Sarwar and Murdock (2021): This adopts a data-augmentation strategy for cross-domain hate-speech detection, leveraging a negative emotion dataset (Go et al., 2009). They construct a weakly labeled augmented dataset by training a sequence tagger. We use D_T^val from the target for model selection for all the above methods. Table 3 shows the results of comparing against other DA approaches. We note that the average performance of all the other DA approaches on this task is lower than Van-MLM-FT, as discussed in our previous work (Bose et al., 2021). The χ²-test, on average, fails to surpass the vanilla baseline. Besides, the DA approach proposed for cross-domain hate-speech detection by Sarwar and Murdock (2021) also yields an overall drop in performance. They perform data-augmentation by replacing relevant words from an external negative emotion dataset with tagged hateful terms from the target domain. We find that a major portion of the augmented instances lack meaning, and this negatively impacts the adaptation. However, across all feature attribution methods, D-Ref-Reg improves the cross-corpora performance compared to Van-MLM-FT and the DA approaches, with average macro-F1 of 59.6 (α∇α), 59.8 (IG), and 60.8 (DL), compared to 58.1 from Van-MLM-FT.

Qualitative Analysis
Table 4 shows the change in attributions for some instances in D_T^test from Dynamic that were misclassified by Van-FT but correctly classified by our D-Ref-Reg (IG). Van-FT wrongly attributes higher importance to 'f*cking' and 's*cks' for the hate class in the first example, and to 'blacks' and 'queers' for non-hate in the second, due to source-specific correlations. However, D-Ref-Reg (IG) extracts and penalizes abusive tokens like {s*ck, a**hole, d*ck} for the former, causing FP, and {africans, dark, queer} for the latter, causing FN. Our approach not only penalizes the exact tokens, but also those with similar meaning (e.g. 'blacks' is contextually close to 'dark', 'africans'), giving more importance to the context around the spurious tokens. See Appendix C for the token lists.

Conclusion
We proposed a dynamic approach for automatic token extraction with regularization of the source model, such that spurious source-specific correlations are reduced. Our approach shows consistent cross-corpora performance improvements both independently and in combination with pre-defined tokens. Future work includes applying our method to other cross-domain text classification tasks and exploring how explanation faithfulness can be improved in out-of-domain settings (Chrysostomou and Aletras, 2022).

A Data Description
While HatEval and Waseem are sampled from Twitter, Dynamic is generated using a human-and-model-in-the-loop process. These corpora have been collected across different time frames, and hence they involve different topics of discussion, which are also determined to a large extent by the keywords used for sampling. As such, the problem of dataset bias with spurious correlations is induced by such focused sampling procedures (Wiegand et al., 2019), as used in Waseem and HatEval.
For instance, in Waseem, a large portion of the tweets available at the time of our experiments consists of hate tweets directed against women, which results in False Positives for instances from other corpora that contain women-related terms. We observed that most of the racist tweets had already been removed and were unavailable for experiments. HatEval, on the other hand, has a mix of tweets directed against women and immigrants, and hence it demonstrates decent performance when evaluated over Waseem, which consists of sexist tweets. On the contrary, Dynamic contains annotator-generated tweets that include challenging perturbations. For instance, it includes non-hate instances like 'It's wonderful having gay people around here', 'I hate the concept of hate', 'Tea is f*cking disgusting', which can easily fool a classifier learned on biased datasets, resulting in these instances being classified as hateful. Moreover, this corpus covers different targets of hate. As such, when Dynamic is used as the target corpus, the spurious correlations learned by the source classifier become clearly visible, and are captured and penalized by D-Ref while the source model gets trained. The data used in this work are publicly available, and download links are provided in the respective original articles, which are referenced in this paper. However, in the case of Waseem, where only tweet IDs are provided, some tweets might be unavailable.

B Implementation Details
We leverage the pre-trained BERT-base model for our experiments. We use a batch size of 8, a learning rate of 1 × 10⁻⁵ and the Adam optimizer with decoupled weight decay regularization (Loshchilov and Hutter, 2019) for Van-FT, Van-MLM-FT, D-Ref and Pre-def. For Integrated Gradients, following Liu and Avci (2019), the interpolated embeddings are treated as constants while back-propagating the loss from the regularization term. An all-zero embedding vector is used as the baseline input for both Integrated Gradients and DeepLIFT. We use the original code, as provided by the respective authors, for all the prior approaches. For Pre-def, we combined the pre-defined lists from Kennedy et al. (2020) and Liu and Avci (2019) and regularized their attribution scores over BERT with α∇α, IG, and DL as feature attribution methods.
We implement the data-augmentation approach proposed by Sarwar and Murdock (2021) ourselves due to the absence of an available implementation.
Following the description in the paper, we prepare the training data for the sequence tagger by labeling all the terms in the hateful instances from the source corpus that are also present in the lexicon from hatebase.org. However, we do not tokenize the lexicon obtained from hatebase.org while searching for the corresponding matching terms in the source corpus. We convert the lexicon into lower-case and look for exact matches in the source corpus.
For D-Ref, we set the value of the top N tokens used from the ranked {gl-hate, gl-nhate} lists to 500. The values of k ∈ top {10%, 20%, 30%, 40%} of the instance length in D-Ref, and λ in both D-Ref and Pre-def, are selected through hyper-parameter tuning over D_T^val using a random seed. For α∇α and DeepLIFT, λ ∈ {0.1, 0.5, 1, 10, 20, 30, 40, 50, 60}, and for IG, λ ∈ {1, 10, 20, 30, 40, 50, 60}. We run supervised fine-tuning on D_S^train for 6 epochs with all the BERT models (prior approaches and D-Ref). We select the models by tuning over D_T^val from the target corpus with respect to macro-F1 scores. Table 5 presents the macro-F1 scores obtained on the validation set for D-Ref and the prior approaches.

C Tokens extracted in different epochs
The list of error-causing tokens for False Positives (FP) and False Negatives (FN) in D_T^val, extracted for the cases presented in Section 3.4, is given below. We underline the tokens present in the visualization examples (both in Table 4 and below), e.g. 'Don't get me wrong I don't hate asians, but I definitely don't like them'. Since the Waseem dataset is made available as tweet IDs, we observed that it mostly contains sexist comments, while most of the racist content must have been removed before we could crawl it. Hence, the tokens related to race mostly occur in non-hate contexts, causing FN.
Even though some error-causing tokens remain in the list until the end, their overall effect should be reduced as the regularization is performed throughout the training procedure, which causes improvement in macro F1.

D In-corpus performance
We present the in-corpus performance, i.e. the performance on the source corpus in terms of macro-F1 scores, obtained when the source model is refined for the corresponding target corpus using D-Ref-Reg, in Table 6. For D-Ref-Reg, the model is tuned over the target corpus validation set.
We further compare D-Ref-Reg with various Domain Adaptation (DA) methods. However, such methods typically leverage the unlabeled training set from the target domain (D_T^train). We first continue pre-training the BERT model on D_T^train following Rietzler et al. (2020). Then, we perform supervised fine-tuning and regularization on D_S^train using D-Ref-Reg (Masked Language Model + D-Ref-Reg). We compare against the following methods: (i) BERT Van-MLM-FT: MLM training of BERT on D_T^train and supervised fine-tuning on D_S^train. (ii) BERT PERL (Pivot-based Encoder Representation of Language) (Ben-David et al., 2020): This performs pivot-based fine-tuning using the MLM objective of BERT by masking and predicting the pivot terms present in the combination of D_S^train and the unlabeled D_T^train. Here pivots are terms that frequently appear in the unlabeled data of both the source and target corpora and are predictive of the source labels. (iii) BERT-AAD (Adversarial Adaptation with Distillation) (Ryu and Lee, 2020): This is a domain-adversarial approach with BERT where a target encoder is adapted with an adversarial objective that leverages D_S^train and D_T^train.
For a fair comparison, we initialize (v) and (vi) with the MLM-trained BERT on D_T^train, while the other methods already make use of D_T^train for adaptation. Since D-Ref-Reg and Van-MLM-FT use identical MLM pre-training on D_T^train, the improvements of D-Ref-Reg (average macro-F1 of 59.6 (α∇α), 59.8 (IG) and 60.8 (DL), compared to 58.1 from Van-MLM-FT) can be attributed to the dynamic token extraction of our method. More generally, when the larger set of unannotated target-domain instances D_T^train is unavailable, D-Ref can identify and correct spurious correlations on the source using a small amount of annotated instances from the target D_T^val, as demonstrated in Section 3.2. When a sufficient number of unannotated instances from the target corpus is available, D-Ref can yield further cross-corpora improvements by leveraging the unannotated target instances with MLM pre-training.

Table 1 :
Spurious correlations learned by the source classifier between the shaded tokens and the hate label.

Table 2 :
Macro-F1 scores on source → target pairs (H: HatEval, D: Dynamic, W: Waseem). Bold denotes the best performing approach in each column for every feature attribution method. * denotes statistical significance compared to Van-FT with the paired bootstrap test.