ANDES at SemEval-2020 Task 12: A Jointly-trained BERT Multilingual Model for Offensive Language Detection

This paper describes our participation in SemEval-2020 Task 12: Multilingual Offensive Language Detection. We jointly trained a single model by fine-tuning Multilingual BERT to tackle the task across all the proposed languages: English, Danish, Turkish, Greek and Arabic. Our single model achieved competitive results, with performance close to that of the top-performing systems despite sharing the same parameters across all languages. We also conducted zero-shot and few-shot experiments to analyze transfer performance among these languages. We make our code public for further research.


Introduction
Offensive language, hate speech, cyberbullying, and abusive language detection are topics that have attracted a lot of interest in recent years, particularly because of the need in social media to stop, or at least diminish, this omnipresent phenomenon. Behind this interest are not only social implications but also practical implications for companies: recently, some large advertisers have withdrawn from social media because they consider that platforms full of "divisiveness and hate speech" do not give value to their companies (Hern, 2020).
The differences between these categories are loosely defined. Waseem et al. (2017) propose using two axes to understand the particularities of each discourse: whether the abusive language is directed to a specific individual or to a generalized group, and whether the offending discourse is explicit or implicit.
Even when online platforms prohibit behavior that crosses the line into abuse, these rules are frequently violated. Automated moderation algorithms are necessary to moderate user-generated content faster and better, or to serve as a tool that helps human moderators reduce the volume of offensive content still present on online platforms. To address this problem, offensive language detection is usually framed as a binary classification problem in which the input is a post (tweet, message, comment, etc.) and the output is whether the post is offensive or not.
Several features of offensive language make it complex to detect. The offenders could intentionally hide offensive words by substituting letters with special characters or numbers. Content with irony or sarcasm could be harmful even when all words are polite, and vice versa. The real intention behind speech is sometimes difficult to detect even for humans.
According to Chatzakou et al. (2017), in the case of Twitter, each tweet provides a fairly limited context. Therefore, an offensive post may be identified as unoffensive if the context is not taken into account. Despite several published proposals on detecting offensive language, hate speech, and other related concepts, there is no consolidated or effective strategy that could be applied to different languages and domains.
Building a quality labeled dataset is an expensive task in time and effort. Most available datasets are in English, and often datasets are not published due to privacy concerns, making research on offensive language detection difficult for other languages. The generalization power of learning models would allow us to transfer knowledge to languages with poor resources.
In this work, we describe our participation in SemEval 2020 Task 12: OffensEval, an offensive language detection task in five different languages. We propose a single multilingual system based on Multilingual BERT (Devlin et al., 2018) to jointly address the offensive language detection. We trained our system using all the available training data and evaluated its performance for each of the languages instead of adjusting language-specific models. Our single model performs fairly well, achieving performances comparable to the winning teams for each language. To further explore the multilingual dimension of BERT, we analyze zero-shot and few-shot capabilities in a cross-lingual setting for the task in question. We make our code public for further research.

Background
The detection of offensive language, cyberbullying and hate speech are tasks that are closely connected and often confused (Malmasi and Zampieri, 2018). Several machine learning models addressing hate speech or offensive language detection have been proposed in recent years; in particular, deep learning models (Gambäck and Sikdar, 2017; Park and Fung, 2017; Badjatiya et al., 2017; Agrawal and Awekar, 2018; Bisht et al., 2020; Gertner et al., 2019; Pérez and Luque, 2019) have become increasingly popular among researchers working on this task. Despite the growing interest in the area, the models are usually trained and evaluated on very specific English datasets, and their generalizability to other contexts or languages is still a challenge. Moreover, building these datasets is difficult (Waseem, 2016) and the reported performance is often overestimated (Arango et al., 2019; Wiegand et al., 2019).
There are only a few studies addressing multilingual detection of these subtypes of abusive language in the related literature. In these, the authors proposed single systems that can be used to classify data in different languages. Common features in this kind of model are multilingual word representations such as MUSE (Conneau et al., 2017) or Multilingual BERT (Devlin et al., 2019). Other authors combined word embeddings with tweet-level features (Corazza et al., 2020) or linguistic features (Benito et al., 2019).
In most cases, the multilingual models are trained and tested independently for each language and do not combine different languages in a single evaluation. An exception is the approach proposed by Bojkovský and Pikuliak (2019), where the authors trained deep neural network architectures with a concatenation of English and Spanish datasets to classify data in both languages.
Although Multilingual BERT models have been tested as end-to-end solutions for several tasks, they have not been widely explored for offensive language detection. Pires et al. (2019) tested the zero-shot capability of BERT for transferring knowledge from one language to another in named entity recognition and part-of-speech tagging, obtaining strong results.

Data
The dataset used in this work is described thoroughly in . We decided to replace the distantly-supervised English training dataset presented in this task by the OLID dataset of Zampieri et al. (2019). This is because the focus of this work is the multilingual capabilities of BERT, and the datasets of the languages added to this task (Danish (Sigurbergsson and Derczynski, 2020), Greek (Pitenis et al., 2020), Arabic (Mubarak et al., 2020) and Turkish (Çöltekin, 2020)) more closely resemble the OLID dataset in that they were manually annotated.

System Overview
Our model is a fine-tuned version of Multilingual BERT (Devlin et al., 2018). This architecture (which has become the state of the art for most NLP tasks) consists of a stack of transformer blocks (Vaswani et al., 2017) pretrained on two tasks: masked language modeling (also known as the Cloze task) and next sentence prediction. For the downstream task, an output layer is added and the model is fine-tuned with very low learning rates. Multilingual BERT follows the same training procedure as single-language BERT but uses a concatenated dataset of 104 languages, and it has been shown to have surprising cross-lingual capabilities, even among languages that do not share scripts (Pires et al., 2019).
The implementation used in this work is the pretrained multilingual BERT-base model from the HuggingFace library (Wolf et al., 2019). This model consists of 12 transformer blocks, 12 self-attention heads, and a hidden layer size of 768. On top of it, we added a linear layer and applied a sigmoid function to the outputs. We trained the model for 10 epochs with a batch size of 32 and a dropout probability of 0.1, setting the initial learning rate at 5 × 10⁻⁵ and using binary cross entropy as the loss function. Adam with linear warm-up over 10% of the steps was used to optimize the loss.
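As a rough illustration of the optimization settings above, the following sketch computes a learning rate with linear warm-up over 10% of the steps and a binary cross-entropy loss on a single output logit. It is a minimal stand-alone sketch, not the code used for the submission: the function names are hypothetical, and the linear decay to zero after warm-up is an assumption, since the text only specifies the warm-up phase.

```python
import math

PEAK_LR = 5e-5          # initial learning rate reported in the text
WARMUP_FRACTION = 0.1   # linear warm-up over 10% of the steps


def learning_rate(step: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step.

    Warm-up is linear from 0 to PEAK_LR. The linear decay to zero
    afterwards is an assumption; the text only describes the warm-up.
    """
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / max(1, total_steps - warmup_steps)


def binary_cross_entropy(logit: float, label: int) -> float:
    """Binary cross entropy of sigmoid(logit) against a 0/1 label.

    Written in the numerically stable form
    max(x, 0) - x * y + log(1 + exp(-|x|)),
    which equals -[y log(p) + (1 - y) log(1 - p)] with p = sigmoid(x).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for step in (0, 50, 100, 550, 1000):
        print(step, learning_rate(step, total))
    print(binary_cross_entropy(2.0, 1), binary_cross_entropy(2.0, 0))
```

With these settings the rate rises during the first tenth of training and peaks at 5 × 10⁻⁵, which keeps early updates to the pretrained weights small.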

Experimental Setup
We tested several experimental configurations using the data described in Table 1. The main purpose of our different setups is to test the capability of multilingual models not only in inside-language evaluation but also generalizing knowledge from one language to another. The generalization capability of offensive language detection models across different languages has been poorly explored.
For monolingual evaluation, we trained our model using each one of the training sets and the corresponding validation and testing sets. This experimental setup allows us not only to test the model in a specific language but also to obtain reference values to be compared to the ones obtained in the multilingual experimental setups. We refer to this setting as BERT Lang (BERT Greek, BERT English, etc).
As a second configuration, we opted for multilingual training. We trained our model with the concatenation of all the training sets and evaluated it over each test set. The purpose of this experiment is to find languages that contribute positively to the monolingual classification. We call this setting BERT All.
It might be argued that in monolingual settings it would have been a better option to simply use the BERT version trained specifically for that language. However, we decided to use the multilingual version to have comparable results.
To assess the multilingual potential of BERT for this task, we also performed some zero-shot and few-shot experiments. Zero-shot experiments consisted of training the model in one language and evaluating it in a different one, that is, training with language A and testing with language B. Few-shot experiments, on the other hand, train the model in language A together with a small number of instances from B, and test its performance against B. This cross-lingual generalization is desirable for tackling the same problem in low-resourced languages. In Section 5 we discuss the results in each case.
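The training configurations described in this section differ only in how the training set is assembled. The sketch below makes that explicit for the zero-shot, few-shot and joint (BERT All) settings. It is an illustrative sketch under assumptions: the function names, the `(text, label)` representation and the random sampling of target-language instances are hypothetical and not taken from our released code.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (text, label), with label 1 = offensive


def zero_shot_train(train: Dict[str, List[Example]], source: str) -> List[Example]:
    """Zero-shot: train on language A only; language B is never seen."""
    return list(train[source])


def few_shot_train(
    train: Dict[str, List[Example]],
    source: str,
    target: str,
    fraction: float,
    seed: int = 0,
) -> List[Example]:
    """Few-shot: all of language A plus a random fraction of language B."""
    rng = random.Random(seed)
    target_pool = train[target]
    k = int(round(fraction * len(target_pool)))
    return list(train[source]) + rng.sample(target_pool, k)


def joint_train(train: Dict[str, List[Example]]) -> List[Example]:
    """Joint setting: concatenate the training sets of every language."""
    return [example for lang in sorted(train) for example in train[lang]]


if __name__ == "__main__":
    # Tiny placeholder data, not drawn from the task datasets.
    toy = {
        "en": [("en text %d" % i, i % 2) for i in range(6)],
        "da": [("da text %d" % i, i % 2) for i in range(10)],
    }
    print(len(zero_shot_train(toy, "en")))            # English only
    print(len(few_shot_train(toy, "en", "da", 0.2)))  # English + 20% of Danish
    print(len(joint_train(toy)))                      # everything
```

In every case the model is evaluated on the test set of the target language, so the test data never enters these training lists.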
The evaluation metric proposed for this task is Macro F1.
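For reference, Macro F1 is the unweighted mean of the per-class F1 scores, so the offensive and not-offensive classes count equally regardless of their frequency. The following is a minimal sketch of that computation for binary labels; the function names are illustrative and this is not the official task scorer.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], classes=(0, 1)) -> float:
    """Unweighted average of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


if __name__ == "__main__":
    gold = [1, 1, 0, 0]
    pred = [1, 0, 0, 0]
    print(round(macro_f1(gold, pred), 4))
```

Because both classes weigh the same, a classifier that always predicts the majority class scores poorly on this metric even when its accuracy is high.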

Error Analysis and Model Interpretation
To analyze the reasons behind the errors of our model, we used the Captum library (Kokhlikyan et al., 2019) implementation of Integrated Gradients (Sundararajan et al., 2017) to obtain more information about the importance of each token in classifying a tweet as offensive. Given a model and a sentence, this method returns a value for each token representing its contribution towards the positive class (offensive) or towards the negative class (not offensive).
Table 2 shows the results for each of our trained classifiers. The classifier submitted to the competition is BERT All, even though some monolingual settings perform better; for instance, BERT Greek has better results for Greek than BERT All. However, BERT All performs fairly well, with small differences from the best-performing system for each language. For most languages it stays in the "top cluster" of the competition, most notably in Danish, where it achieved the 7th position. It is worth remembering that BERT All is a single model for all the languages, which reduces the number of models needed (in our case, from five to one).
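Returning to the attribution method mentioned at the start of this section, Integrated Gradients assigns to each input feature the difference between the input and a baseline, multiplied by the average gradient of the model output along the straight path between them. The sketch below approximates that integral with a midpoint Riemann sum on a toy logistic model. It is only an illustration of the technique: the toy model, its weights and the all-zeros baseline are assumptions, and it does not use the Captum API or BERT.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def model(x: List[float], w: List[float], bias: float) -> float:
    """Toy stand-in for the classifier: probability of the offensive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)


def gradient(x: List[float], w: List[float], bias: float) -> List[float]:
    """Analytic gradient of the toy model output with respect to each input."""
    p = model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(
    x: List[float],
    baseline: List[float],
    w: List[float],
    bias: float,
    steps: int = 200,
) -> List[float]:
    """Midpoint Riemann approximation of Integrated Gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grads = gradient(point, w, bias)
        totals = [t + g for t, g in zip(totals, grads)]
    # Scale the average gradient by the distance from the baseline.
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, totals)]


if __name__ == "__main__":
    weights = [2.0, -1.0, 0.5]   # hypothetical per-token weights
    tokens = [1.0, 1.0, 1.0]     # one feature per token
    zeros = [0.0, 0.0, 0.0]      # assumed baseline
    attributions = integrated_gradients(tokens, zeros, weights, bias=-0.5)
    print(attributions)
    # Completeness: the attributions sum to the change in model output.
    print(sum(attributions), model(tokens, weights, -0.5) - model(zeros, weights, -0.5))
```

Positive values correspond to tokens pushing the prediction towards the offensive class and negative values to tokens pulling it away, which is how the colors in the word-importance figure should be read.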

Results
In most cases, BERT All showed results equal to or slightly worse than the monolingual setting, suggesting (at first sight) that adding other languages does not contribute to the overall performance. In the case of Turkish, however, there is a slight increase in F-score from 0.766 to 0.773. More interesting is the case of the Danish dataset, where the F-score increased from 0.74 to 0.77.
To find out the impact of the other languages on the Danish results, we conducted a data augmentation experiment: we augmented the Danish training dataset with data from other languages and classified the tweets in the Danish dataset. Table 3 shows the results of this experiment. The results are, in general, reasonable. Adding data from different languages does not dramatically affect the monolingual results. The addition of the Arabic dataset turned out to be the most successful, despite Arabic having the worst result in the zero-shot experiment. It is somewhat surprising that a language from a very different family can have a positive impact on the performance of the Danish classifier.
Regarding the zero-shot cross-lingual capability (the non-diagonal entries in Table 2), it can be observed that there is no transfer learning in this mode. That is, no classifier trained in one language performs successfully when tested in a different one. The only exception is Danish-English: as we can see, training with English and testing against Danish gives around 0.65 Macro F1. However, a close examination of the true positives in this case leads us to conclude that the transfer is mainly due to vocabulary sharing and code switching.
Few-shot experiments yielded slightly better results. Figure 1 displays the performance of classifiers trained with 5%, 10%, 15% . . . of the instances of the Danish dataset (blue line) and the performance of training the same classifier but also using another dataset: English or Arabic. Using 20% of the Danish data together with the English dataset achieves a Macro F1 score of 0.69, while almost double the amount of data is needed to reach the same performance with Danish-only training. Using the Arabic dataset yields a marginal improvement in performance.
Figure 2a displays the confusion matrix for our model against the English test set. We can observe that most errors come from false positives. Figure 2b shows the word importance for a couple of false positive examples. It turns out that numerous spurious correlations are learned by our classifier: words such as "Trump", "disgusting" and "racist" push the sentence towards being classified as offensive. Also, more complex constructions, such as reports of offensive incidents, are not understood by the classifier: for instance, "They call me b*tch" should not be marked as offensive.
[Figure 2b caption: word importance for each token. Red means that the token is pushing towards the "offensive" class, whereas green pulls towards the "not offensive" class.]

Error Analysis
It is also important to notice that there are a number of instances that seem to be mislabelled. For instance, these examples were labelled as non-offensive:
• what border agent let those snakes on a plane? did they have passports? i bet you those snakes did not have any legal identification
• It's Me and IDGAF About Nothing -Females with nigga mindsets are dangerous
• Fuck knows how Coutinho fits in the team but I'll take him Please don't forget to buy a CB though for christ sake.
Whereas these are considered offensive:
• Crazy that as we get older and go through certain shit you just want to keep it to yourself.
• @USER Trump lied..and then coldly said it now appeared his building was the tallest in NYC. He's a sick, sick twist.
• Knew sis was a liar but I got soft anyways smh
• It's crazy how people make excuses for them to walk out of a person's life...
In some cases the instances seem to be wrongly labelled, and in other cases there are inconsistencies regarding swearing: is saying "sh*t" or "f*ck" considered offensive? It is known that annotating offensive language (along with cyberbullying, hate speech and so on) is a difficult task, and that non-skilled annotators produce datasets with a lot of noise.

Conclusion
In this work, we explored the capabilities of Multilingual BERT for offensive language detection in a multilingual scenario. We evaluated our model in different experimental setups: training and testing it individually for each language; training on data from one language and testing on others (zero-shot mode); and training a single model on all languages.
It is worth noting that, in spite of the good performance of Multilingual BERT in zero-shot cross-lingual evaluation for other tasks, it did not work well for this one. The only exception is the English-Danish cross-lingual evaluation, and it is mainly due to vocabulary intersection. Further work is needed to analyze why these zero-shot experiments failed. Nonetheless, the only experiment performed in few-shot mode showed slightly better results, allowing a Danish model trained with just a hundred examples to achieve reasonable performance.
On the other hand, training all the languages jointly resulted in a single model having a similar performance overall. This is interesting as having a single model instead of five has practical implications, in particular concerning the resources needed for big models such as BERT.
The relationship of offensive language among different languages is something to be studied in more depth. At first glance, our experiments showed no zero-shot transference; a few-shot experiment showed (in the case of two languages of similar typology) some transference. Further study is needed to explain these results.