Exposing the limits of Zero-shot Cross-lingual Hate Speech Detection

Reducing and counteracting hate speech on social media is a significant concern. Most of the proposed automatic methods are developed and evaluated exclusively on English, and very few consistently labeled, non-English resources have been proposed. Learning to detect hate speech in English and transferring to unseen languages seems an immediate solution. This work is the first to shed light on the limits of this zero-shot, cross-lingual transfer learning framework for hate speech detection. We use benchmark data sets in English, Italian, and Spanish to detect hate speech towards immigrants and women. Investigating post-hoc explanations of the models, we discover that non-hateful, language-specific taboo interjections are misinterpreted as signals of hate speech. Our findings demonstrate that zero-shot, cross-lingual models cannot be used as they are, but need to be carefully designed.


Introduction
An increasing propagation of hate speech has been detected on social media platforms (e.g., Twitter), where (pseudo-)anonymity enables people to target others without being recognized or easily traced. While this societal issue has attracted many studies in the NLP community, it comes with three important challenges. First, "hate speech" covers a wide range of target types, including misogyny, racism, and various other forms. While these types often intersect, they require different approaches.
Second, available labeled corpora refer to different definitions of hate speech, collection strategies, and annotation frameworks (Fortuna and Nunes, 2018). This lack of consistency strongly limits research on hate speech, which ultimately needs to apply cross-domain or transfer learning approaches for using different corpora.
Third, most of the research on hate speech detection considers only English, and only a limited number of labeled corpora are available (Fortuna and Nunes, 2018; Vidgen and Derczynski, 2021; Poletto et al., 2020). However, hate speech is not specific to any one language, and approaches proposed for English may not fit other languages. Each language exhibits different complexities in dealing with gender and reflects different cultural ideas around it.
The lack of models and labeled corpora for non-English languages seems a perfect application for zero-shot, cross-lingual learning (Lamprinidis et al., 2021). But is it? In this paper, we investigate the limitations of zero-shot, cross-lingual solutions based on mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) on benchmark data sets of hate speech against immigrants and women in English, Italian, and Spanish.
Our analysis demonstrates that these approaches have significant limitations: (1) they are not able to capture common (taboo) language-specific expressions, and (2) they do not transfer to different hate speech target types. We show that these limitations are due to the high presence of language- and target-specific taboo interjections in non-hateful contexts, like porca puttana or puta. 1 While derogatory for women, these terms are often used as intensifiers in non-hateful contexts, blurring the lines for detection. Since English does not use equivalent words in the same way, zero-shot, cross-lingual models never observe them in the training data. Consequently, these models consider the literal meaning of these terms as individual words, treating them as misogynous hate speech. These findings demonstrate that, at present, cross-lingual, zero-shot transfer learning is not a solution for solving the lack of models and labeled corpora in non-English languages for hate speech detection.

Contributions 1) We investigate different learning frameworks on benchmark corpora for the detection of hate speech targeting women and immigrants; 2) we expose the limits of zero-shot, cross-lingual solutions using the multilingual BERT model (mBERT); 3) we show interpretable results through post-hoc explanations.

Zero-shot, Cross-lingual Hate Speech Detection
We investigate different learning settings: 1) zero-shot, cross-lingual, i.e., training on one language and testing on unseen languages; 2) monolingual, i.e., training and testing on the same language; 3) few-shot, cross-lingual, i.e., training on one language plus a small percentage of samples from the test language, and testing on the test language; 4) augmented cross-lingual, i.e., training on several languages and testing on a language included in the training.
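To make the four settings concrete, here is a hedged sketch of how the training pool for each setting could be assembled, assuming per-language pandas DataFrames of labeled tweets; all names are illustrative assumptions, not taken from the released code.

```python
# Sketch: assembling the training pool for each learning setting.
# source_train / target_train are hypothetical per-language DataFrames
# with "text" and "label" columns.
import pandas as pd

def build_training_pool(setting, source_train, target_train, few_shot_frac=0.01):
    if setting == "zero_shot":    # train on other language(s) only
        return source_train
    if setting == "monolingual":  # train on the test language itself
        return target_train
    if setting == "few_shot":     # add a small sample of the test language
        shot = target_train.sample(frac=few_shot_frac, random_state=0)
        return pd.concat([source_train, shot], ignore_index=True)
    if setting == "augmented":    # pool all languages, test language included
        return pd.concat([source_train, target_train], ignore_index=True)
    raise ValueError(f"unknown setting: {setting}")
```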
Multilingual BERT Recently, contextual embeddings pretrained on large corpora have substantially advanced research on several major Natural Language Processing (NLP) tasks (Nozza et al., 2020). In particular, multilingual BERT (mBERT) (Devlin et al., 2019), a model pretrained on monolingual Wikipedia dumps in 104 languages, has shown surprisingly good abilities for zero-shot, cross-lingual model transfer on different NLP tasks (Pires et al., 2019). In this paper, we fine-tune the mBERT model on the task of hate speech detection considering data from one or multiple languages.
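As a rough illustration of this fine-tuning step, the following is a minimal sketch using the Hugging Face transformers library; the helper function, maximum sequence length, and toy batch are our assumptions rather than the authors' exact pipeline (only the Adam learning rate and batch size are stated, in Appendix C.2).

```python
# Minimal mBERT fine-tuning sketch (Hugging Face transformers); helper names,
# max_length, and the toy batch are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)  # hateful vs. non-hateful

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

def step(texts, labels):
    """One training step on a batch of tweets."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor(labels)).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Swapping in XLM-R (see the dedicated section below) amounts to loading xlm-roberta-base with the corresponding Auto* classes.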
Post-hoc Explanation One of the biggest limitations of using complex black-box models, such as BERT, is the lack of interpretability. Following Kennedy et al. (2020), we use the Sampling and Occlusion (SOC) algorithm to produce post-hoc explanations of the model's predictions; a simplified sketch of the occlusion idea appears at the end of this section.

Data To assess the cross-lingual evaluation framework, we use hate speech benchmark data sets with consistent definitions, annotation schema, and collection strategies (see Appendix C). For English and Spanish, we adopt the data sets proposed in the shared task on hate speech against immigrants and women on Twitter (HatEval) (Basile et al., 2019). For Italian, we consider two different corpora proposed for Evalita shared tasks (Caselli et al., 2018).

Results Table 2 shows the macro-averaged F1 score for hate speech detection on different training and test languages (in rows and columns, respectively). Underlined numbers refer to the monolingual setting, while zero-shot, cross-lingual results are italicized. As baselines, we report the best-performing model for each of the considered data sets, released in conjunction with the shared tasks. 2 Since the aim of this paper is to investigate the classification abilities of cross-lingual, zero-shot models, we do not aim to overcome the baselines but to provide comparable results.
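As promised above, here is a hedged, simplified sketch of occlusion-based importance: mask one word at a time and measure the drop in the target-class logit. This illustrates the idea only; the full SOC algorithm additionally samples replacement contexts around the masked phrase, so the scores below are merely a rough proxy.

```python
# Simplified occlusion importance (illustrative; not the full SOC algorithm).
import torch

@torch.no_grad()
def occlusion_importance(model, tokenizer, text, target_class=1):
    """Score each word by how much masking it lowers the target-class logit."""
    def logit(t):
        enc = tokenizer(t, return_tensors="pt", truncation=True)
        return model(**enc).logits[0, target_class].item()
    words = text.split()
    base = logit(text)
    scores = []
    for i, w in enumerate(words):
        masked = " ".join(words[:i] + [tokenizer.mask_token] + words[i + 1:])
        scores.append((w, base - logit(masked)))
    return scores
```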

Hate speech towards immigrants
Observing monolingual results (underlined numbers in Table 2), we see that training and testing in English gives the poorest performance. This behavior is due to an over-sensitivity to specific words/hashtags used during data collection (e.g., #SendThemBack, #StopTheInvasion), which leads to overfitting. In Appendix A, we report the SOC explanation of a misclassified tweet containing these hashtags. We confirm this finding by training the monolingual English model on data deprived of these hashtags, which leads to a higher macro-F1 (from 0.368 to 0.438).
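The hashtag-ablation check is straightforward to reproduce; a hedged sketch follows (the two hashtags are from the paper, but the full collection-keyword list is not shown here, so the pattern is illustrative).

```python
# Sketch of the hashtag-ablation check: strip the collection hashtags from
# the English training tweets before fine-tuning (hashtag list is partial
# and illustrative).
import re

COLLECTION_TAGS = re.compile(r"#(SendThemBack|StopTheInvasion)\b", re.IGNORECASE)

def strip_collection_hashtags(tweet: str) -> str:
    return COLLECTION_TAGS.sub("", tweet).strip()

assert strip_collection_hashtags("#SendThemBack now!") == "now!"
```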
The zero-shot, cross-lingual configuration (italic numbers in Table 2) shows very different results between the two targets. Zero-shot learning obtains good performance for detecting hate speech towards immigrants: when testing on Italian and Spanish, results are very similar; when testing on English, training on a different language is better than including English data, resulting in a 22% macro-F1 improvement on average. This is because training sets based on other languages do not contain the above-mentioned specific words and therefore do not suffer from over-sensitization.

Hate speech towards women
Concerning hate speech towards women, the zero-shot, cross-lingual model obtains significantly lower performance for Spanish and Italian. To better understand this substantially different finding, we analyze wrongly labeled instances. We discover that zero-shot, cross-lingual models are strongly influenced by common, language-specific taboo interjections, mislabeling non-hateful text as misogynous. In particular, they are misled by expressions that contain literal insults towards women but are not misogynistic per se. For example, in Spanish, beyond its misogynistic meaning, the word puta (literally bitch) is also used as an exclamation of surprise (e.g., puta mierda). The Italian expressions porca troia and porca puttana (literally porca (pig) + troia/puttana (slut)) are very generic taboo interjections that do not have a misogynistic connotation. It is important to note that these interjections are not directly translatable and are usually used in combination, e.g., porca + puttana, puta + mierda.
To demonstrate this finding, in Table 3 we report the number of times a zero-shot, cross-lingual model correctly predicts the labels of instances containing taboo interjections for Italian and Spanish (i.e., porca puttana, porca troia, puta). The high frequency of instances containing taboo interjections (29% and 78% of the test sets), due also to the keyword-driven collection strategy, proves the importance of understanding these linguistic expressions.

Figure 1 shows the SOC explanation of a non-hateful tweet correctly classified by the monolingual Italian model and wrongly classified by the zero-shot, cross-lingual model trained on English and Spanish data. As expected, training and testing on Italian teaches the model that porca puttana is a very generic exclamation that does not imply misogyny (high importance score for the non-misogynous prediction). However, when training on other languages, this taboo interjection is not recognized because it is strictly tied to the test language. We observe that zero-shot, cross-lingual models consider the literal meaning of individual words and consequently treat terms like porca puttana as misogynous regardless of their use in context.
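For concreteness, the Table 3 counts can be reproduced with a simple filter over the test set; a hedged sketch (the interjection lists follow the paper, the data structures are illustrative):

```python
# Sketch of the Table 3 analysis: how often the model is correct on test
# instances that contain a language-specific taboo interjection.
INTERJECTIONS = {"it": ["porca puttana", "porca troia"], "es": ["puta"]}

def accuracy_on_interjections(texts, gold, pred, lang):
    """Return (accuracy, count) over instances containing a taboo interjection."""
    hits, total = 0, 0
    for t, g, p in zip(texts, gold, pred):
        if any(x in t.lower() for x in INTERJECTIONS[lang]):
            total += 1
            hits += int(g == p)
    return (hits / total if total else None), total
```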
To further validate this major finding, we conduct an additional experiment on the corpus of hate speech towards women: we train few-shot, cross-lingual models by randomly sampling 1% of training data in the test language. The results, averaged over 10 runs, in terms of macro-F1 are: 0.660 for ES+EN⇒IT; 0.702 for EN+IT⇒ES. The significant improvements over zero-shot performance prove that misogyny detection is strongly entangled with common, language-specific taboo interjections that are very frequent in the data set.
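A hedged sketch of this few-shot protocol follows; train_mbert and macro_f1 are hypothetical stand-ins for the fine-tuning and evaluation routines, not names from the released code.

```python
# Few-shot experiment sketch: average macro-F1 over 10 random 1% samples of
# target-language training data. train_mbert and macro_f1 are hypothetical.
import numpy as np
import pandas as pd

def few_shot_macro_f1(source_train, target_train, target_test, runs=10):
    scores = []
    for seed in range(runs):
        shot = target_train.sample(frac=0.01, random_state=seed)
        model = train_mbert(pd.concat([source_train, shot], ignore_index=True))
        scores.append(macro_f1(model, target_test))
    return float(np.mean(scores))
```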

Hate speech towards immigrants and women
Finally, to demonstrate the need to treat target types separately, we run the zero-shot, cross-lingual model on the merged data sets of hate speech towards immigrants and women. The results in terms of macro-F1 are: 0.572 for ES+IT⇒EN; 0.513 for ES+EN⇒IT; 0.632 for EN+IT⇒ES (see Appendix B). Following Stappen et al. (2020), these scores suggest a sufficient adaptation by the models. However, they represent a compromise between the high results of zero-shot, cross-lingual detection of hate speech against immigrants and the low results of detection of hate speech against women. By showing the results for the two targets separately, we demonstrate that zero-shot, cross-lingual models suffer from limitations when detecting hate speech against women and that, in general, zero-shot, cross-lingual hate speech detection has yet to be solved.

Impact of language-specific taboo interjections on XLM-R
In order to understand whether common, language-specific taboo interjections play a role in other language models, we conducted experiments with XLM-R (Conneau et al., 2020). XLM-R is a large cross-lingual language model based on RoBERTa (Liu et al., 2019), trained on 2.5TB of filtered CommonCrawl data, which significantly outperformed mBERT on a variety of cross-lingual benchmarks. XLM-R achieves high macro-F1 scores in monolingual settings for detecting hate speech towards women in Italian and Spanish (0.806 for IT⇒IT; 0.859 for ES⇒ES). Similar to the previously presented findings, we observe a significant drop of 36% in macro-F1 in the zero-shot, cross-lingual settings (0.604 for EN⇒IT; 0.511 for ES⇒IT; 0.404 for IT⇒ES; 0.658 for EN⇒ES). The drop is most evident when training on Spanish and testing on Italian and vice versa. These results on XLM-R provide further evidence of the role that language-specific taboo interjections play in degrading performance.

Related Work
Only a few studies have investigated hate speech detection across different languages. Steimel et al. (2019) asked which factors affect multilingual settings for German and English, concluding that a shared classification algorithm is not conceivable due to the lack of comparability between corpora. In Sohn and Lee (2019), the authors proposed a multi-channel model exploiting multilingual BERT and language-specific BERT models for Chinese, English, German, and Italian. Finally, Stappen et al. (2020) proposed a novel, attention-based classification block for performing zero- and few-shot, cross-lingual learning on the HatEval data set. While they state that transfer learning is effective for hate speech detection, we argue that there is a need to investigate hate speech targets separately, since these models consistently fail at misogyny classification.

Conclusion
We demonstrate that cross-lingual, zero-shot transfer learning, in its traditional settings, is not a feasible solution for solving the lack of models and labeled corpora for hate speech detection. We argue that hate speech is language-specific, that NLP approaches to identifying it must account for this specificity, and that the adoption of related techniques must be done with care (Bianchi and Hovy, 2021). We plan to expand this evaluation to other languages and to investigate solutions based on bias mitigation (Kennedy et al., 2020) and on pragmatic role-aware models (Holgate et al., 2018; Pamungkas et al., 2020) to reduce the impact of this problem on classification. Future work will also focus on modeling language's social factors (Hovy and Spruit, 2016; Hovy, 2018; Hovy and Yang, 2021), such as speaker and receiver characteristics, and study their impact on hate speech detection classifiers.

Ethical Considerations
We are aware that the inherent (gender) biases of sentence and word embeddings affect models' performance on detecting hate speech towards women (Bolukbasi et al., 2016; Sheng et al., 2019; Nangia et al., 2020; Nozza et al., 2021). We believe that this issue plays a role in the classification models. However, in this paper we extensively demonstrate that the presence of taboo interjections is one of the main hurdles that specifically hinders zero-shot, cross-lingual hate speech detection.
Finally, we want to highlight that the presented findings are specifically tied to the considered languages and data sets. We hope our work will generate more conscious research about the use of hate speech detection models in zero-shot, cross-lingual frameworks.

A Additional Post-Hoc Explanation
Figure 2 shows the hierarchically clustered explanations from SOC for an example of non-hateful speech wrongly classified as hateful by the monolingual English model. It is evident how the (incorrectly) high score of the hashtag eclipses the influence of non-hateful words such as days, kids, and school.

B Additional Results
Table 4: Results in terms of macro-F1 for the merged corpora containing hate speech towards immigrants and women (Immigrants+Women). Monolingual results are underlined; zero-shot, cross-lingual results are in italics. * = differs significantly from the monolingual setting at p ≤ 0.05; ** = significant difference at p ≤ 0.01.

C.1 Consistent Data Sets
We use benchmark hate speech data sets with consistent definitions, annotation schema, and collection strategies. All three data sets (Bosco et al., 2018; Fersini et al., 2018; Basile et al., 2019) refer to the same definitions of hate speech towards immigrants and women. 3 This paper focuses on the common binary classification task (hateful/non-hateful) across all data sets, ensuring the same annotation schema. Finally, all data sets have been collected by following three strategies: (1) monitoring potential victims of hate accounts, (2) downloading the history of identified haters, and (3) filtering Twitter streams with keywords, i.e., words, hashtags, and stems.

3 https://github.com/msang/hateval/blob/master/annotation_guidelines.md
For experimental evaluation, we use the data set splits provided in the associated shared task for comparability with previous work.

C.2 Implementation Details
We implement the proposed work using the public code implementation of the classification model presented by Kennedy et al. (2020). 4 We use their hyperparameter configuration for training: the batch size is set to 32, the learning rate of the Adam optimizer is set to 2 × 10^-5, and the loss function is binary cross-entropy.
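Expressed as a PyTorch sketch, the stated configuration maps directly onto standard objects; the model and data set arguments are placeholders for whatever classifier and corpus are being trained.

```python
# Stated training configuration: Adam at 2e-5, binary cross-entropy, batch 32.
# `model` and `train_dataset` are placeholders supplied by the caller.
import torch

def make_training_objects(model, train_dataset):
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
    criterion = torch.nn.BCEWithLogitsLoss()  # assumes a single-logit head
    loader = torch.utils.data.DataLoader(train_dataset, batch_size=32)
    return optimizer, criterion, loader
```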
Computing Infrastructure We independently run the experiments on two machines: the first is equipped with two NVIDIA RTX 2080 Ti GPUs and 64GB of RAM; the other is equipped with four NVIDIA GTX 1080 Ti GPUs and 32GB of RAM.