HATE-ITA: New Baselines for Hate Speech Detection in Italian

Online hate speech is a dangerous phenomenon that can (and should) be promptly and properly counteracted. While Natural Language Processing supplies appropriate algorithms for this objective, most research efforts are directed toward the English language. This strongly limits classification performance on non-English languages. In this paper, we test several learning frameworks for identifying hate speech in Italian text. We release HATE-ITA, a multi-language model trained on a large set of English data and the available Italian datasets. HATE-ITA performs better than mono-lingual models and seems to adapt well to language-specific slurs. We hope our findings will encourage research in other mid-to-low resource communities and provide a valuable benchmarking tool for the Italian community.


Introduction
Online hate speech is a dangerous phenomenon that can (and should) be promptly and properly counteracted. While Natural Language Processing supplies algorithms to achieve that, most research efforts are directed toward the English language. Indeed, there is now a plethora of approaches and corpora (Indurthi et al., 2019; Kennedy et al., 2020b; D'Sa et al., 2020; Mollas et al., 2022; Kiela et al., 2021, inter alia) that can be adopted for English hate speech detection.
However, this choice strongly limits classification performance in languages where fewer resources are available, like Italian. Researchers have put great effort into improving Italian models (Fersini et al., 2018; Bosco et al., 2018; Sanguinetti et al., 2018, 2020). However, previous work does not address the task systematically, so there is no clear evidence of how well these models perform. Moreover, a competitive baseline for hate speech detection in Italian does not yet exist. Current datasets are not broad enough to cover all the protected categories and generally contain only a few thousand samples. Data annotation is a costly process, and annotating hate speech requires tremendous care.
Multi-lingual models offer a possible way out of this issue. Nozza (2021) shows that combining multiple languages at training time can help overcome the apparent limitations of hate speech detection models. Building on those conclusions, we collect a large dataset of English hate speech data that we combine with the available Italian data. We use this new collection to train multi-lingual models and report their performance, with examples, across different Italian datasets.
The contribution of this short workshop paper is thus straightforward: we thoroughly evaluate and release to the community a set of models for Italian hate speech detection obtained by fine-tuning multi-lingual models (HATE-ITA). These models are wrapped in a high-level API that allows the community to easily access and use them for future research, and they set a new baseline on two state-of-the-art hate speech detection datasets in Italian. To the best of our knowledge, this is the first paper that showcases the use of a large English dataset in combination with a small portion of Italian data to create a robust resource for hate speech detection in Italian.
Our contributions are twofold: 1) our experiments show that multi-lingual models can effectively be used to cover missing ground in some mid-to-low resource languages; 2) while providing researchers with strong baselines, our models can also be used to study which areas and targets are not yet covered, thus guiding directions for future research (see Section 4.4). We release HATE-ITA as an open-source Python library.

Background
In this work, we treat hate speech detection as a binary task (hate/non-hate). To control the number of samples for each protected group in the training data, we consider the target of the hateful messages. We select target attributes based on the type of discrimination, namely origin, gender identity, sexual orientation, religious affiliation, and disability, and treat them as a superset of classes able to cover the majority of dataset-specific labels. We discarded the other and none classes from all the datasets because they might represent targets outside this set.

State-of-the-art Corpora
We now describe the datasets included in our training set. The English corpora have been selected by filtering a public list (https://hatespeechdata.com/) for the ones covering our desired targets.
Italian For Italian, we consider two corpora proposed for Evalita shared tasks (Caselli et al., 2018): the automatic misogyny identification challenge (AMI18) (Fersini et al., 2018) for hate speech towards women, and the part of the hate speech detection shared task (HaSpeeDe18) (Bosco et al., 2018) related to hate speech towards immigrants (Sanguinetti et al., 2018). Both datasets comprise 2,500 instances for training, 500 for validation, and 1,000 for testing.
English Ousidhoum et al. (2019) present MlMa, a multi-lingual multi-aspect hate speech analysis dataset in Arabic, English, and French. The dataset consists of tweets collected by querying language-specific keywords. Mollas et al. (2022) propose ETHOS, a multi-label English hate speech detection dataset of Reddit posts. They employ an automatic pre-annotation process where the posts are first labeled with a machine learning classifier; only the uncertain ones (within the [.4, .6] probability range) are manually labeled on a crowdsourcing platform. Following the authors, we binarise the value of each label (1 if value ≥ 0.5, else 0). The targets are identified only when a post is hateful, so we discard the non-hateful ones. Here, we map the targets national_origin and race to origin. Kennedy et al. (2020c) collected a large set of comments from different social media sources (YouTube, Twitter, and Reddit). The annotation was performed on a crowdsourcing platform where each comment received four ratings, and the authors further ensured that every annotator received comments across the whole hate speech scale. Since the dataset is annotated with a continuous hate score, we binarise it with two thresholds: 0 if value < -1 and 1 if value > 0.5. We merged the origin and race classes into the origin class. Mathew et al. (2021) collected English posts from the social media platforms Twitter and Gab. They then used a crowdsourcing platform to annotate each post as hate, offensive, or normal speech; annotators also had to select the target communities mentioned in the posts. The final label is obtained through majority voting, and we discard an instance when there is no majority (i.e., the three annotators assigned three different labels). Here, we binarise the labels as suggested by the authors into toxic (hate speech or offensive) and non-toxic (normal). We also map the targets following the grouping made in the paper (see Table 3 in Mathew et al. (2021)), with the only exception of Indigenous and Refugee, which we assign to the origin class. Kennedy et al. (2020a) presented the Gab Hate Corpus (GHC), a multi-label English corpus of posts from the social network gab.com. Comments were annotated by at least three trained annotators with the following classes: Call for Violence, Assault on Human Dignity, or Not Hateful. Following Kennedy et al. (2020b), we aggregate the first two to obtain the hateful class. We selected only the targets used in our study (removing political) and merged the nationality/regionalism and race or ethnicity classes into the origin class. Kiela et al. (2021) introduced a framework for dynamically creating benchmark corpora: annotators are asked to find adversarial examples, i.e., hard examples that a target model would misclassify. The obtained dataset also provides the target group. Here, we mapped their targets to ours, removing the ones not covered.
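The label-harmonisation rules above can be made concrete with a few small helper functions. This is an illustrative sketch of the rules described in the text, not our actual preprocessing code; function names and the mapping dictionary are ours.

```python
from collections import Counter

def binarise_ethos(value):
    """ETHOS: a continuous label >= 0.5 becomes hateful (1), else 0."""
    return 1 if value >= 0.5 else 0

def binarise_hate_scale(value):
    """Continuous hate score (Kennedy et al., 2020c): score < -1 is
    non-hateful, score > 0.5 is hateful; scores in between are
    discarded (returned as None)."""
    if value < -1:
        return 0
    if value > 0.5:
        return 1
    return None  # discarded

def majority_label(labels):
    """Majority voting over annotator labels (Mathew et al., 2021):
    keep the majority label, discard the instance (None) when all
    annotators disagree."""
    top, count = Counter(labels).most_common(1)[0]
    return top if count > 1 else None

# Hypothetical mapping of dataset-specific target names onto our superset.
TARGET_MAP = {
    "national_origin": "origin",
    "race": "origin",
    "nationality/regionalism": "origin",
    "race or ethnicity": "origin",
}
```

Running every corpus through rules of this shape yields a single binary dataset with a shared target vocabulary.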
Table 1 shows the size of the dataset created by combining all the aforementioned English corpora.

Experimental Methodology
Our experimental setup covers three aspects: 1) the performance of the different models on a train, validation, and test setup that we construct on our data; 2) the performance on different datasets (including two additional datasets that we treat as out-of-domain); and 3) a qualitative evaluation in which we use explainability methods to assess which words contribute most to the predictions.
For the models we train, we run three experimental frameworks: 1) mono-lingual (MONO), in which we train our models only on Italian data; 2) multi-lingual (MULTI), in which we combine the Italian and English data for training; and 3) zero-shot cross-lingual (ZERO), in which we train a model only on English data. All the models are tested on the Italian test data (Fersini et al., 2018; Sanguinetti et al., 2018).
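The three regimes differ only in which portion of the data enters training, while evaluation always uses the Italian test sets. A minimal sketch (the dataset variables are placeholders, not our released data loaders):

```python
def build_training_set(italian_data, english_data, regime):
    """Select the training data for the MONO / MULTI / ZERO regimes.
    Each dataset is a list of (text, label) pairs; evaluation always
    happens on the Italian test sets, regardless of the regime."""
    if regime == "MONO":
        return list(italian_data)
    if regime == "MULTI":
        return list(italian_data) + list(english_data)
    if regime == "ZERO":
        return list(english_data)
    raise ValueError(f"unknown regime: {regime}")
```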

Data Setup
We used the splits provided by the associated shared tasks for the Italian datasets. This setup enables a fair comparison with previous approaches. For AMI18 (Fersini et al., 2018), we isolated 500 instances from the training data to be used as the validation set. For the combined English data, we isolate 20% with stratified sampling to be used as the validation set. The details of the parameters used to fine-tune the models can be found in Appendix A. Models are trained for 5 epochs and evaluated every 50 steps, and we select the best checkpoint according to the validation loss.
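The stratified 20% validation split and the loss-based checkpoint selection can be sketched as follows. This is a simplified stand-in for our actual training loop (in practice one would typically rely on standard library utilities for both steps):

```python
import random
from collections import defaultdict

def stratified_split(examples, val_fraction=0.2, seed=0):
    """Split (text, label) pairs so that each label keeps roughly the
    same proportion in the train and validation partitions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_val = int(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

def best_checkpoint(eval_log):
    """Given (step, validation_loss) pairs logged at each evaluation
    (every 50 steps in our setup), return the step with the lowest
    validation loss."""
    return min(eval_log, key=lambda pair: pair[1])[0]
```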

Overall Results
Table 2 shows the results for the models that we trained, testing on the official splits of each Italian dataset (see Section 2.2). We found two crucial takeaways. First, the best multi-lingual model (XLM-Large) performs noticeably better than the best model trained only on mono-lingual data (mBERT). Second, models subject to multi-lingual training always outperform mono-lingual ones.
Recent research (Nozza et al., 2020) has shown that language-specific datasets are more effective when used to fine-tune language-specific models. Our results suggest that training only on the small set of Italian data is not enough even when using a language-specific model: joint fine-tuning with larger datasets is an effective way of obtaining more accurate hate speech classifiers. This is a very interesting result: considering the small amount of Italian data used by the multi-lingual model, it opens future applications of multi-lingual pipelines to low-resource languages. Finally, the increase in performance of the multi-lingual framework comes directly from the Italian data added to the training, since purely zero-shot cross-lingual models perform much worse than the mono-lingual ones.

Results by Dataset
This section shows the results split by dataset for our best multi-lingual models and for DeHateBert. We show the results on the test sets of Sanguinetti et al. (2018) and AMI18 (Fersini et al., 2018). Moreover, we also test on the complete test set of HaSpeeDe18 (Bosco et al., 2018) and on the shared task re-runs HaSpeeDe20 (Sanguinetti et al., 2020) and AMI20 (Fersini et al., 2020b). Unfortunately, DeHateBert was not fine-tuned following the guidelines described in Bosco et al. (2018), as its authors used different splits. For this reason, we cannot evaluate this model on HaSpeeDe18 and Sanguinetti et al. (2018) (some of the examples in those test sets were used for training).
Table 3 shows the results for each dataset. We do not show results for the mono-lingual Italian models as they perform much worse (see Table 2). These results show that our models have consistent performance over most categories. Indeed, XLM-Twitter beats DeHateBert by 39 and 19 points of F1 on AMI18 and AMI20, respectively. This outcome further demonstrates the need for protected-group coverage in the training set.

Results on Multi-Lingual HateCheck
We also evaluate on the recently introduced Multi-lingual HateCheck (MHC) (Röttger et al., 2022). MHC is a suite of functional tests for multi-lingual hate speech detection models that extends the original English HateCheck (Röttger et al., 2021). MHC tests several functionalities that can affect hate prediction (e.g., counterspeech, spelling variations, use of slurs). Here, we use only the Italian subset; MHC serves as an external testbed to validate our models.
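A functional-test suite like MHC reduces to labelled cases grouped by functionality, each scored independently. The toy harness below illustrates the idea; the example cases are invented for illustration and are not actual MHC items:

```python
def run_functional_tests(predict, cases):
    """Score a binary classifier per functionality.
    `predict` maps a text to 0 (non-hateful) or 1 (hateful);
    `cases` maps a functionality name to (text, gold_label) pairs.
    Returns the accuracy for each functionality."""
    results = {}
    for name, examples in cases.items():
        correct = sum(predict(text) == gold for text, gold in examples)
        results[name] = correct / len(examples)
    return results

# Invented examples in the spirit of MHC functionalities.
cases = {
    "slur_usage": [("I hate [GROUP]", 1)],
    "counterspeech": [("Saying 'I hate [GROUP]' is wrong", 0)],
}
```

Per-functionality scores like these expose failure modes (e.g., a model that flags counterspeech as hateful) that a single aggregate F1 would hide.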
Results in Table 4 show the consistent performance of our models.XLM-Twitter and XLM-Large strongly outperform the results of the original baseline proposed by Röttger et al. (2022).

Qualitative Evaluation
Figure 1 reports token-contribution explanations for four correct predictions from our multi-lingual XLM-Large. The texts are complex Italian examples that standard models usually misclassify (Nozza, 2021). We extracted token contributions using the interpretability suite provided by Attanasio et al. (2022b). The first two examples regard the taboo Italian expression p*rca p*ttana (literally p*rca (pig) + p*ttana (sl*t)). When used separately (porca and puttana), the two words should be read literally; when used together, they form a taboo expression that does not have a misogynistic connotation. The latter two examples regard the ambiguous Italian term finocchi, which means fennels in a food-related context but can also be translated as f*ggots when referred to individuals.
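Token contributions of the kind shown in Figure 1 can be approximated with a simple occlusion scheme; this is a generic sketch of the idea, not the specific method of the interpretability suite we used. Each token's contribution is the drop in the model's hate score when that token is removed:

```python
def occlusion_attributions(score_fn, tokens):
    """Attribute a prediction to individual tokens by masking each
    one in turn: contribution = score(full input) - score(input
    without the token). `score_fn` maps a token list to a hate score."""
    full_score = score_fn(tokens)
    contributions = []
    for i in range(len(tokens)):
        masked = tokens[:i] + tokens[i + 1:]
        contributions.append(full_score - score_fn(masked))
    return contributions
```

With a context-sensitive model, the same token (e.g., finocchi) would receive a large contribution in a hateful context and a negligible one in a food-related context.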
Related Work

In NLP, the scarcity of data in languages beyond English has generated interest in zero-shot learning (Srivastava et al., 2018; Ponti et al., 2019; Pfeiffer et al., 2020; Wu et al., 2020; Bianchi et al., 2021, 2022, inter alia) and in its application to hate speech detection (Corazza et al., 2020; Stappen et al., 2020; Aluru et al., 2020; Leite et al., 2020; Rodríguez et al., 2021; Feng et al., 2020; Pelicon et al., 2021). In particular, Aluru et al. (2020) exploited several deep learning models and multi-lingual embeddings to perform an extensive analysis of 16 datasets in 9 different languages in few- and zero-shot learning settings. Rodríguez et al. (2021) use the pre-trained Language-Agnostic BERT Sentence Embeddings (Feng et al., 2020), obtaining good results. Other research efforts focused on translating English data to enrich data availability in other languages, with mixed results: Ibrohim and Budi (2019) show that translations do not bring good results with traditional machine learning classifiers, while more sophisticated pipelines of translation and pre-training can indeed provide some improvement over standard benchmarks (Pamungkas et al., 2021; Wang and Banko, 2021).

Conclusion
This paper presents HATE-ITA, a novel resource for Italian hate speech detection on social media text. Researchers can use this new set of models as a more reliable benchmark against which to assess the quality of new systems. However, this is just a first step: we do not claim to have released the final model for Italian hate speech detection, and HATE-ITA requires careful benchmarking to understand whether it can accurately capture hate speech against other targets.

Figure 1 examples:
• IT: Come si fa a rompere la lavatrice p*rca p*ttana (EN: How the hell can you break the washing machine)
• IT: Sono arrivati i finocchi (EN: Here come the f*ggots)
• IT: È arrivata l'insalata di finocchi (EN: Here comes the fennel salad)

Table 1: Statistics of the English dataset.

Table 3: Results on different benchmark datasets for the multi-lingual models.

Table 4: Results on Multi-lingual HateCheck. We report the F1 score for the hateful and non-hateful cases, and the overall macro-F1 score.