GruPaTo at SemEval-2020 Task 12: Retraining mBERT on Social Media and Fine-tuned Offensive Language Models

We introduce an approach to multilingual Offensive Language Detection based on the mBERT transformer model. We download extra training data from Twitter in English, Danish, and Turkish, and use it to re-train the model. We then fine-tune the model on the provided training data and, in some configurations, implement a transfer learning approach exploiting the typological relatedness between English and Danish. Our systems obtained good results across the three languages (.9036 for EN, .7619 for DA, and .7789 for TR).


Introduction
The growth of Social Media has seen the spread of two different but connected phenomena: on the one hand, these platforms have helped to create a more open and connected world; on the other, they have contributed to the spread of offensive and abusive behaviors. Although the use of "bad language" is intimately connected with freedom of speech, the phenomenon has become so pervasive that developing Natural Language Processing (NLP) systems that automatically and efficiently detect and classify offensive on-line content is a pressing need (Nobata et al., 2016; Kennedy et al., 2017).
SemEval-2020 Task 12: OffensEval 2 is a follow-up edition of SemEval-2019 Task 6: OffensEval (Zampieri et al., 2019a), and it addresses the problem of offensive language detection in Twitter messages by focusing on two open issues: multilingualism and hierarchical tagset annotation.
The multilingualism issue is targeted by providing, for the first time, data in 5 different languages, namely English, Danish, Greek, Turkish, and Arabic, under a shared definition of offensive language. The languages cover different values of the typological spectrum in terms of type (Fusional vs. Agglutinative), language family (Indo-European vs. Altaic vs. Afro-Asiatic), genus (Germanic vs. Greek vs. Turkic vs. Semitic), subject-object-verb 1 word order (SVO vs. no dominant order vs. SOV vs. VSO) (Dryer and Haspelmath, 2013; Ramat and Baldry, 2011), as well as writing systems. The multilingual aspect poses two additional challenges: (i.) availability of NLP tools and language resources, as some of the proposed languages are considered low-resourced (e.g., Danish and Greek) (Rehm and Uszkoreit, 2013); and (ii.) differences in the perceived offensiveness of a message. In particular, given that offensiveness is a highly subjective category, since a message is always "offensive for someone" (Vidgen et al., 2019), different communities of speakers may have different perceptions of what is offensive. The use of a shared definition is a way of mitigating potential differences across communities, but this aspect cannot be ignored in the development of a system for offensive language detection.
The hierarchical annotation tagset is reflected in three sub-tasks, namely:
• Sub-task A: Offensive language identification: The task consists in predicting whether a tweet is offensive or not. The definition of "offensive message" (OFF) is based on SemEval-2019 Task 6 (Zampieri et al., 2019b), namely "posts containing any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct. This includes insults, threats, and posts containing profane language or swear words." (Zampieri et al., 2019a, p. 1416)
• Sub-task B: Automatic categorization of offense types: The task consists in predicting the type of offense. It applies only to messages labelled as offensive (OFF) in sub-task A. Two categories are distinguished: targeted offense, which applies when the message is offensive towards an individual, a group, or others; and untargeted, which applies when the message is offensive but does not contain any specific target.
• Sub-task C: Offense target identification: The task consists in identifying the type of target of an offensive message, such as an individual, a group, or any other type not fitting into the first two categories (e.g., an organization, a situation, an event, or an issue).

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
1 Generally, subject and object are used in an informal semantic sense to denote the more agent-like and more patient-like elements (Dryer, 2013).
Sub-task A is proposed for all languages, while sub-tasks B and C are proposed for English only. Manually annotated training data are available for all languages; no language has an official development dataset. English has a special place in this edition: the organizers provided only automatically annotated material, i.e., silver data, together with the manually annotated training and test data from the 2019 edition. The availability of silver data calls for innovative ways of using it, such as fully re-training an existing pre-trained language model rather than directly employing it in a supervised system.

Related Work
Previous work on offensive language detection and related phenomena (i.e., abusive language, hate speech, cyberbullying) has seen the deployment of different system architectures with varying levels of performance. Broadly, we can observe three major waves of systems: (i.) discrete linear models (Waseem and Hovy, 2016; Karan and Šnajder, 2018); (ii.) neural networks (Cimino et al., 2018; Kshirsagar et al., 2018; Mitrović et al., 2019); and (iii.) pre-trained language models (Liu et al., 2019). Discrete linear models are very competitive and powerful methods that have been successfully applied to identify offensive/abusive language, and in many cases they outperform more complex approaches based on neural networks (Montani and Schüller, 2018). While neural networks appear to behave inconsistently when applied to offensive/abusive language datasets, pre-trained language models have further confirmed their predictive power.
Recently, Swamy et al. (2019) conducted the first systematic comparison of these three families of models against four different datasets of offensive/abusive language. Feature selection and pre-processing were kept to a minimum, while the hyper-parameters (i.e., sequence length, dropout, and class weights) were fine-tuned. Results confirm BERT as the best performing model. However, improvements (or decrements) across models (per dataset) are minimal. The fluctuating behavior of neural network models is further confirmed, with performances lower than those of a linear model (i.e., an SVM) on two datasets.

System overview
The system we propose builds on top of recent work on pre-trained language models à la BERT (Devlin et al., 2019). This family of recently proposed models is based on learning a language model in an unsupervised fashion from a large amount of data (pre-training), and on a subsequent step that trains the model to solve a specific task on annotated data (fine-tuning). BERT, in particular, employs bidirectional Transformer-based encoders and a masking task for the pre-training: it learns to predict randomly removed words in context, thereby learning contextual word representations. We differentiate from the standard fine-tuning approach by adding a retraining step. It is undisputed that BERT and BERT-like models are the new state of the art in NLP; however, these models are trained on a massive amount of what could be labelled "standard" natural language data, such as news articles, Wikipedia pages, and books. None of these models is "ready to be used" for Social Media data. 2 In our perspective, retraining BERT has two beneficial effects: first, it improves the tuning of the model towards the Social Media language variety, and, second, it reduces the effort spent on pre-processing and cleaning the data for fine-tuning.

Figure 1: System illustration: the mBERT model is first re-trained with the MLM objective on language-specific Twitter messages, fine-tuned on the language-specific training set, and then applied to classify new messages.
A further aspect we took into account is multilingualism. We aimed at developing a unified approach that could be easily applied across the different languages. The lack of monolingual BERT models for all the languages in the task 3 guided us to select multilingual BERT (mBERT) (Devlin et al., 2019; Pires et al., 2019). mBERT consists of 12 stacked transformers, with a hidden layer size of 768 and 12 self-attention heads, like its monolingual English counterpart, BERT-Base. mBERT is pretrained on the concatenation of monolingual Wikipedia pages in 104 languages with a shared word-piece vocabulary; it does not make use of any special marker to signal the input language, nor does it have any mechanism that explicitly indicates that translation-equivalent pairs should have similar representations. Figure 1 graphically illustrates our approach. For each language, we collect potentially offensive tweets and use them to retrain the mBERT model by applying the Masked Language Model (MLM) objective. This provides us with new "shifted" mBERT models along three dimensions: (i.) language variety (i.e., Social Media); (ii.) language (i.e., English, Danish, Turkish); and (iii.) polarity (i.e., an offensive-oriented model). After retraining, the new model is fine-tuned and applied to the test data. We added a linear classifier on top of the pooled output for the [CLS] token to generate the predictions. The general architecture differs only with respect to the language-specific data used in the re-training and fine-tuning steps. We developed our system for Sub-task A: Offensive language identification in three languages, namely English, Danish (Sigurbergsson and Derczynski, 2020), and Turkish (Çöltekin, 2020). Code, additional training data, and models are publicly available at https://github.com/davidecolla/Offenseval2020. The following paragraphs describe the process of collecting the additional data per language used to retrain mBERT.
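The MLM objective used in the retraining step can be illustrated with BERT's standard masking recipe (15% of positions selected; of those, 80% replaced by [MASK], 10% by a random token, 10% kept unchanged). The following is a toy pure-Python sketch of the corruption procedure only, not the actual Transformers implementation; the vocabulary and whitespace tokenization are placeholders for mBERT's real word-piece machinery.

```python
import random

MASK_TOKEN = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # stand-in for mBERT's word pieces

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Corrupt a token sequence following BERT's MLM recipe: each position is
    selected with probability mask_prob; a selected token becomes [MASK] 80%
    of the time, a random vocabulary token 10%, and stays unchanged 10%.
    Returns the corrupted sequence and a dict {position: original token}."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        targets[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN
        elif roll < 0.9:
            corrupted[i] = rng.choice(TOY_VOCAB)
        # else: leave the token in place (it is still a prediction target)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens)
```

During retraining, the model is optimized to predict each entry of `targets` from the corrupted context; positions outside `targets` contribute no loss.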
Danish We compiled, in a semi-automatic way, a list of potentially offensive seed terms by combining three methods: (i.) keyword extraction using TF-IDF from the OFF messages in the training data; (ii.) the conservative portion of HurtLex v1.2 (Bassignana et al., 2018); and (iii.) a list of 140 Danish offensive terms from Wiktionary. We thus generated two collections of offensive tweets: the first, D1, contains 197k tokens (7,690 tweets); the second, D2, extends D1 with an additional 330k tokens obtained using the Wiktionary list, reaching 527k tokens (20,994 tweets).
Turkish Similarly to Danish, we compiled a list of potentially offensive seed terms using the same methods: (i.) keywords extracted with TF-IDF from all OFF messages in the training data; (ii.) the conservative portion of HurtLex; and (iii.) Turkish offensive terms from Wiktionary (19 terms). We generated only one collection, T1, with 5.7 million tokens (392,674 tweets).
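The TF-IDF keyword extraction used in method (i.) for both languages can be sketched as follows. This is a simplified illustration, assuming whitespace tokenization and raw counts; it is not our exact pipeline, whose pre-processing is described in the Appendix.

```python
import math
from collections import Counter

def tfidf_keywords(off_docs, all_docs, top_k=5):
    """Rank terms by TF-IDF: term frequency counted over the OFF messages,
    inverse document frequency over the whole corpus."""
    tokenized = [doc.lower().split() for doc in all_docs]
    n_docs = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))
    tf = Counter(term for doc in off_docs for term in doc.lower().split())
    total = sum(tf.values())
    scores = {term: (count / total) * math.log(n_docs / df[term])
              for term, count in tf.items()}
    return [term for term, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

# Toy corpus: seed terms are the words frequent in OFF messages but rare overall.
off = ["you fool fool", "such a jerk"]
neutral = ["the weather is nice today", "nice game everyone"]
seeds = tfidf_keywords(off, off + neutral, top_k=1)
```

Terms that dominate the OFF subset while appearing in few documents overall receive the highest scores, which is the behavior needed to surface candidate offensive seed terms.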

Experimental setup
We used the mBERT pre-trained model available via the huggingface Transformers library. 4 After retraining mBERT for each language, we fine-tuned the models using the training data made available by the task organizers. For English, we used the training set of OffensEval 2019. In all fine-tuning settings, we used a standard learning rate of 2e-5, a batch size of 32, and 4 training epochs. Pre-processing steps are reported in the Appendix.
mBERT was retrained on each tweet collection per language separately, generating three models for English (mBERT-E1, mBERT-E2, and mBERT-E3), two for Danish (mBERT-D1, mBERT-D2), and one for Turkish (mBERT-T1). In addition to fine-tuning the retrained models per language, we experimented with a transfer learning approach for Danish (mBERT-D3). The choice was inspired by the close typological connection between English and Danish and by the limited amount of retraining data we retrieved for Danish. We fine-tuned the retrained English model (mBERT-E3) on the Danish training data. We hypothesize that mBERT-E3 could be more robust than the retrained monolingual Danish models (mBERT-D1 and mBERT-D2) because of the larger amount of retraining material biasing mBERT towards the Social Media language variety and offensive content. At the same time, given the typological similarity of English and Danish and the multilingual nature of mBERT, the additional language-specific fine-tuning step on Danish training data should prevent the language mismatch from harming performance.
We ran an internal evaluation to verify that the proposed system works and to select the best retrained model (at least for English and Danish). Evaluations were conducted using the OffensEval 2019 test data for English, while for Danish and Turkish we split the OffensEval 2 training data, retaining 90% for fine-tuning and 10% for testing. On the basis of the results (see Table 3 in the Appendix for details), we selected the following systems: mBERT-E3 for English (retrained on the E3 tweet collection), mBERT-D3 for Danish (the transfer learning model), and mBERT-T1 for Turkish.
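The 90/10 internal split can be sketched as follows. We show a label-stratified variant, which keeps the OFF/NOT proportions constant across the two partitions; this is an illustrative assumption, as a plain random split would also be consistent with the description above.

```python
import random

def stratified_split(examples, labels, test_frac=0.1, seed=13):
    """Split labelled examples into train/test while keeping the label
    distribution roughly constant, by splitting each class separately."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    train, test = [], []
    for y, xs in by_label.items():
        rng.shuffle(xs)
        cut = max(1, int(len(xs) * test_frac))  # at least one test item per class
        test.extend((x, y) for x in xs[:cut])
        train.extend((x, y) for x in xs[cut:])
    return train, test

# Toy data mimicking a class-imbalanced offensive-language training set.
tweets = [f"tweet_{i}" for i in range(100)]
labels = ["OFF"] * 20 + ["NOT"] * 80
train, test = stratified_split(tweets, labels)
```

Stratification matters here because the OFF class is the minority class in all three languages; a non-stratified 10% sample could otherwise contain very few OFF examples.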
We ran our experiments on a machine with the following configuration: NVIDIA K40 GPU, Intel Xeon E5-2680 v3 processor, and 64GB of RAM (Aldinucci et al., 2017). Retraining mBERT on the largest data collection, namely E3, took eight hours per epoch, while the fine-tuning, performed on the same machine, ran for two hours per epoch.

Table 1 reports the results on the blind test data. For Turkish and English, our approach obtained very competitive results compared to the top-ranking systems, with deltas lower than 0.05 in both cases. On the other hand, the results for the transfer learning approach in Danish are disappointing. Although we obtained very good results on the NOT class (both Precision and Recall higher than .90), transfer learning did not manage to boost the OFF class. We also evaluated the original monolingual models for Danish, mBERT-D1 and mBERT-D2. Both models obtained top-ranking macro-F1 scores (.8138 and .8195, respectively) and show a higher Precision for the OFF class when compared to the transfer learning model (.8214 and .8518 vs. .6285, respectively), while maintaining similar Recall (.5609 for both mBERT-D1 and mBERT-D2 vs. .5365 for mBERT-D3). The performance on the NOT class is comparable across all models for Danish. Generally, the NOT class obtains good results across the three languages, while systems underperform on the OFF class. Table 1 also highlights a different behavior between the English model and those for the other two languages, namely a higher Recall on the OFF class. Since the main difference between the systems is the additional training data, a possible explanation for this behavior could be a higher offensiveness load in the retraining data, which may bias the model towards the OFF class.
In particular, the additional training data for our English model were collected based on higher-quality lexical resources, while the Danish and Turkish data had to rely on potentially high-coverage but low-precision lists of lexical items. Tables 2a-2c depict the confusion matrices between the predictions and the gold standard data from the task organizers. The classifiers clearly appear asymmetrically biased, confirming the observation based on the scores. A qualitative error analysis on the output of the classifiers across the three languages has shown some common patterns behind the misclassifications. We have observed that False Positives (NOT gold → OFF prediction) tend to be dominated by instances containing mildly offensive terms (e.g., EN to suck, DA latterlige [ridiculous], TR boktan [shitty]) or terms carrying negative polarity, such as EN ignorant, DA tosse [fool]. As for the False Negatives (OFF gold → NOT prediction), we observe two trends. The first: messages contain strong offensive lexical cues that are misspelled (e.g., EN stoopid), difficult to find in common lexicons of abusive terms (e.g., EN twat), or idiomatic (e.g., TR kapak olsun [lit. "get a cover"]). The second concerns the presence of ambiguous words (e.g., EN jerk, in @USER Wings over and it's not even a question (sweet chili & Jamaican jerk hanger) 5 ), implicitly offensive messages (DA NED MED SVENSKEN! [down with the Swedes] 6 ; TR Şimdi sana anlatsam anlamıcan o yüzden boşver [if I explained it to you now you wouldn't understand, so never mind] 7 ), presence of irony (TR @USER aşırı komikmiş kardeş ilk esprin mi [it's too funny bro, is this your first joke?] 8 ), or harsh criticism (TR Türkçe pop gibisin sesin güzel konuşmaların boş güzellik [you are like Turkish pop: your voice is beautiful, your words are empty] 9 ).
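The per-class Precision/Recall and macro-F1 scores discussed above can be recovered directly from a confusion matrix. The sketch below uses made-up counts, not the actual figures from Tables 2a-2c.

```python
def prf(tp, fp, fn):
    """Precision, Recall, F1 for one class from its confusion-matrix counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_f1(confusion):
    """confusion[gold][pred] = count of gold-class messages predicted as pred.
    Macro-F1 is the unweighted mean of the per-class F1 scores, so the
    minority OFF class counts as much as the majority NOT class."""
    labels = list(confusion)
    f1s = []
    for c in labels:
        tp = confusion[c][c]
        fp = sum(confusion[g][c] for g in labels if g != c)
        fn = sum(confusion[c][p] for p in labels if p != c)
        f1s.append(prf(tp, fp, fn)[2])
    return sum(f1s) / len(f1s)

# Toy binary matrix (illustrative counts only).
confusion = {"NOT": {"NOT": 90, "OFF": 10},
             "OFF": {"NOT": 10, "OFF": 40}}
```

Because macro-F1 averages the classes without weighting, a low F1 on the OFF class drags the overall score down even when the NOT class is nearly perfect, which is exactly the pattern observed for Danish.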

Conclusion
Our system combines re-training and fine-tuning of mBERT. The re-training step was added to bias mBERT towards three aspects: language variety, language, and polarity. The bias in the classifier is sensitive to how the additional training materials are collected: the results across the three languages show that the quality of the retraining data depends on the quality of the language resources and of the strategies used to retrieve them. At the fine-tuning step, this aspect appears to impact mainly the Recall for the OFF class, as shown by the EN results compared to DA and TR.
Among the phenomena we detected as sources of noise in the classification, their explicitness appears to play an important role. Recent work has focused on this aspect (Kumar et al., 2020; Caselli et al., 2020) by proposing more fine-grained levels of annotation.

5 instance ID: A2825
6 instance ID: 1695
7 instance ID: 32854
8 instance ID: 38605
9 instance ID: 43122
In future work, we will focus on the hurdles of figurative and idiomatic language usage in offensive messages, following the approach in Mladenović et al. (2017), by enriching HurtLex with multi-word expressions (MWEs) automatically extracted from corpora in multiple languages.