KS@LTH at SemEval-2020 Task 12: Fine-tuning Multi- and Monolingual Transformer Models for Offensive Language Detection

This paper describes the KS@LTH system for SemEval-2020 Task 12 OffensEval2: Multilingual Offensive Language Identification in Social Media. We compare mono- and multilingual models based on fine-tuning pre-trained transformer models for offensive language identification in Arabic, Greek, English and Turkish. For Danish, we explore the possibility of fine-tuning a model pre-trained on a similar language, Swedish, as well as cross-lingual training together with English.


Introduction
Offensive language is a prevalent phenomenon in many online communities and social media platforms. Due to the vast amount of content, it is often infeasible to manually moderate all user-submitted content. Computational methods for identifying this type of content are one possible way to help mitigate the problem. Different aspects of the problem such as aggression (Kumar et al., 2018), cyberbullying (Sprugnoli et al., 2018) and hate speech (Malmasi and Zampieri, 2017) have been studied in recent work. OffensEval 2019 used a new three-level hierarchical annotation schema to capture multiple aspects of offensive language in one framework (Zampieri et al., 2019a).
While much of the previous work is focused on English, offensive language detection is a multilingual problem. Apart from country-specific communities, large social media platforms such as Facebook and Twitter have many users interacting in their native tongue. Recent work on offensive language detection has addressed other languages such as German (Wiegand et al., 2018), Arabic (Mulki et al., 2019), Italian (Sanguinetti et al., 2018), and Spanish (Fersini et al., 2018). In OffensEval 2020, the first-level task of offensive language detection has been expanded to cover five languages: Arabic, Danish, English, Greek, and Turkish.
Transfer learning is nothing new in NLP, but over time the pre-training has become more complex, incorporating more context. In recent years, language models based on the transformer architecture, pre-trained on large amounts of unlabeled text and then fine-tuned on downstream tasks, have been used to achieve state-of-the-art (SOTA) results on many natural language benchmarks (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019). In OffensEval 2019, seven of the top ten models used BERT in some way (Zampieri et al., 2019b). One of the advantages of transfer learning is that it can potentially reduce the amount of labeled data that is needed. The model can learn general features of language from a large unannotated corpus during pre-training. Task-specific features can then be learned from a smaller annotated corpus. On some datasets, using a pre-trained language model has been shown to match the results of models trained from scratch on ten times more data. Adding language model fine-tuning on unlabeled domain-specific text can potentially reduce the need for labeled data even more (Howard and Ruder, 2018).
One obstacle to using large transformer models is that the pre-training step is expensive. Megatron-LM, for example, has 8.3 billion parameters and was trained over 9 days on 512 GPUs (Shoeybi et al., 2019). In comparison, the fine-tuning step is relatively inexpensive. This makes model sharing an important part of applying large transformer models to many tasks. The HuggingFace Transformers library provides a platform for sharing models developed by researchers and the community, and a unified API for using them (Wolf et al., 2019).
One additional challenge with multilingual offensive language detection is low-resource languages. Such languages might lack both unlabeled data for pre-training and labeled data for fine-tuning. One possible solution in such cases is to use multilingual models. Such models can achieve lower perplexity than monolingual models for language modeling of low-resource languages (Conneau and Lample, 2019). In some contexts, multilingual models can even outperform monolingual models on downstream tasks. When labeled data is lacking, they have also been shown to perform well on zero-shot cross-lingual classification tasks. This type of transfer works best between typologically similar languages. However, transfer is possible to some extent even between languages with different scripts (Pires et al., 2019).
This paper describes our system for OffensEval 2020. We participated in Sub-task A: Offensive language identification for all language tracks. Based on the recent success of the transformer architecture, we compared monolingual BERT models for Arabic, English, Greek, and Turkish with the XLM-R multilingual model. We found that the monolingual models outperform the multilingual models for all languages on the development data. We used models available through the HuggingFace Transformers library. Since no monolingual models were available for Danish, we initially compared a Swedish BERT model with multilingual XLM-R. We found that the Swedish model worked reasonably well on the development data, while XLM-R only predicted the majority class for most runs. We hypothesized that this is due to the small and imbalanced Danish dataset; similar high-variance results have been seen for BERT in Devlin et al. (2018) and Phang et al. (2019). To get around the problem of the small dataset, we tried cross-lingual training of Danish and English using XLM-R, which outperformed the Swedish BERT model.
In Section 2 we give a short description of the task and data used. Section 3 presents our approach, describing data preprocessing, models and training approach. Section 4 shows our results on the test data for OffensEval 2020.

Task and Data
OffensEval 2020 uses a multilingual dataset of Twitter posts (tweets), with annotations following the hierarchical annotation schema proposed by Zampieri et al. (2019a). Only the first level of annotation is provided for all languages. This level discriminates between two kinds of tweets:
• Offensive (OFF): Tweets containing any form of offensive language. This includes insults, threats, and profanity.
• Not Offensive (NOT): Tweets not containing any form of unacceptable language.
The goal of the task is to distinguish between offensive and not offensive tweets. Macro-averaged F1-score is used as the evaluation metric. Table 1 shows a summary of the labeled training datasets for each language. All the datasets are imbalanced to some extent, with the majority of tweets being labeled as not offensive. Danish is the most extreme in this regard, having only 13% of tweets labeled as offensive. We can also see that the size of the datasets varies considerably, with Turkish having about ten times as many labeled instances as Danish.
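To make the evaluation metric concrete, the following is a minimal sketch of macro-averaged F1 for the two classes OFF and NOT. The function names and the toy labels are illustrative assumptions and are not taken from the task's official scorer.

```python
from typing import Sequence

LABELS = ("OFF", "NOT")


def f1_for_label(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1-score when `label` is treated as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels=LABELS) -> float:
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Toy, made-up data: 13% OFF, and a classifier that always predicts NOT.
gold = ["NOT"] * 87 + ["OFF"] * 13
majority = ["NOT"] * 100
print(round(macro_f1(gold, majority), 3))  # 0.465
```

Because both classes count equally, a majority-class predictor scores below 0.5 on such imbalanced data even though its accuracy is 87%.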

Data Preprocessing
A minimal amount of preprocessing was done. We applied only two operations to all languages:
1. Multiple consecutive user mentions were replaced with a single @User to reduce sequence length and noise.
2. All tweets were truncated or padded to a common length. This length was chosen separately for each language to be the smallest length longer than 95% of all tweets in the training set.
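The two operations above can be sketched as follows. This is only an illustration under stated assumptions: the mention pattern, the [PAD] symbol, measuring length in tokens, and all function names are hypothetical choices, since the text does not specify them.

```python
import re
from typing import List

# Assumption: a mention looks like "@name"; two or more in a row form a run.
MENTION_RUN = re.compile(r"@\w+(?:\s+@\w+)+")


def collapse_mentions(tweet: str) -> str:
    """Replace each run of consecutive user mentions with a single @User."""
    return MENTION_RUN.sub("@User", tweet)


def cutoff_length(lengths: List[int], coverage_percent: int = 95) -> int:
    """Smallest length that covers at least `coverage_percent` of the tweets."""
    ordered = sorted(lengths)
    # Integer ceiling of coverage_percent * n / 100, turned into a 0-based index.
    index = (coverage_percent * len(ordered) + 99) // 100 - 1
    return ordered[index]


def pad_or_truncate(tokens: List[str], length: int, pad: str = "[PAD]") -> List[str]:
    """Cut the token list to `length`, or fill it up with the padding symbol."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))


print(collapse_mentions("@USER @USER @USER thanks"))
print(cutoff_length(list(range(1, 101))))
print(pad_or_truncate(["so", "cute"], 4))
```

Choosing the cutoff from the training set only, as described above, keeps the development and test data out of the preprocessing decisions.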
Additional processing was done on the external datasets for English. We sampled about 10,000 additional tweets from Davidson et al. (2017). Samples were chosen such that the complete labeled tweet dataset became balanced. Tweets that at least 3 annotators labeled as either offensive language or hate speech were labeled as OFF. Tweets for which all annotators agreed on the neither-class were labeled as NOT. A balanced dataset of 13,000 Wikipedia comments, from the Kaggle dataset, was also added. To be consistent with the Twitter data, all comments were at most 280 characters long. Any comment having at least two of the labels toxic, severe toxic, obscene, threat, insult, or identity hate was labeled as OFF. Comments with no negative labels were labeled as NOT. For both datasets, we replaced URLs with a URL token, and for the tweet dataset, we replaced user mentions with @User.
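The label mapping for the two external datasets can be sketched as below. The argument names and the underscore-style label keys are assumptions for illustration. The sketch also assumes that the "at least 3 annotators" rule counts hate speech and offensive language votes together, and that examples matching neither rule are left out.

```python
from typing import Dict, Optional

# Assumed key names for the six toxicity labels mentioned in the text.
TOXICITY_LABELS = ("toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate")
MAX_CHARS = 280  # keep comments comparable in length to tweets


def label_davidson_tweet(hate_votes: int, offensive_votes: int, neither_votes: int) -> Optional[str]:
    """Map annotator vote counts to OFF/NOT, or None if the tweet is skipped."""
    total = hate_votes + offensive_votes + neither_votes
    if hate_votes + offensive_votes >= 3:
        return "OFF"
    if total > 0 and neither_votes == total:
        return "NOT"
    return None  # ambiguous annotation: not used


def label_wikipedia_comment(text: str, flags: Dict[str, int]) -> Optional[str]:
    """Map the binary toxicity flags of one comment to OFF/NOT, or None."""
    if len(text) > MAX_CHARS:
        return None
    negative = sum(flags.get(name, 0) for name in TOXICITY_LABELS)
    if negative >= 2:
        return "OFF"
    if negative == 0:
        return "NOT"
    return None  # exactly one negative label: covered by neither rule


print(label_davidson_tweet(1, 2, 0))
print(label_wikipedia_comment("have a nice day", {}))
```

Dropping the borderline cases trades dataset size for cleaner labels, which matches the conservative thresholds described in the text.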
Additionally we sampled 400,000 tweets from the English silver standard data using confidence scores as weights. These were then filtered down further to the 40,000 tweets with highest confidence using our model as described in section 3.3.

Models
Vaswani et al. (2017) initially introduced the transformer architecture in the context of machine translation. While previous approaches relied on convolutional and recurrent neural networks, they showed that a relatively simple architecture based on feed-forward neural networks and attention mechanisms could provide better results while being more parallelizable and faster to train. Like previous sequence-to-sequence models, the transformer consists of two main components: an encoder component and a decoder component. Radford et al. (2018) trained a left-to-right language model, GPT, using only the decoder part of the transformer and fine-tuned it on multiple downstream tasks with minimal task-specific changes. Devlin et al. (2018) showed the importance of bidirectional pre-training for certain types of tasks by obtaining new SOTA results on 11 NLP benchmarks, including an almost 8 point improvement on GLUE. Their model architecture, named BERT (Bidirectional Encoder Representations from Transformers), is the architecture we used for all monolingual models apart from English.
Since the decoder component of the transformer already does masking of subsequent positions, it is a natural choice for the next word prediction language modeling task used by GPT. To be able to train a bidirectional language model, BERT instead uses the encoder part of the transformer. Apart from increasing the size, it is almost identical to the initial transformer implementation. BERT consists of a stack of encoders, 12 for BERT BASE and 24 for BERT LARGE, compared to 6 in the original transformer.
Each encoder, in turn, consists of two main parts: a self-attention layer followed by a feed-forward neural network. Self-attention is the mechanism which allows the transformer to consider other words in the sequence when encoding the current word. BERT increases the number of attention heads from 8 in the original Transformer to 12 for BERT BASE and 16 for BERT LARGE. Finally, the hidden size is also increased from 512 to 768 and 1024 for BERT BASE and BERT LARGE, respectively.
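As an illustration of how self-attention lets each position draw on the other words, the following is a minimal single-head sketch of the scaled dot-product attention of Vaswani et al. (2017). It deliberately omits the learned query, key and value projections and the multiple heads used in BERT, and the toy vectors are made up.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """For each query, return the attention-weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this position to every position, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


# Two positions with identical keys receive equal attention weights.
print(self_attention([[1.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]]))
```

In the encoder, every position attends to all positions in both directions, which is what allows BERT to build bidirectional representations.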
We used pre-trained BERT language models without changes to the base architecture. For the fine-tuning step, we followed the approach for single sentence classification suggested by Devlin et al. (2018). A single fully connected classification layer was added to the base model. A special [CLS] token was prepended to all inputs. The contextual representation of this token was used as an embedding for the complete sentence, and passed to the classification head. The complete base model was fine-tuned during training.
Liu et al. (2019) showed that BERT is undertrained. Their model, RoBERTa, uses exactly the same architecture as BERT. RoBERTa outperforms BERT simply by training on more data, with larger batches, for a longer time. Some additional simple changes in the pre-training approach, such as removing one of the pre-training objectives and training on longer sequences, improved the results even further. This is the monolingual model we used for English. There were no pre-trained RoBERTa models available for the other languages. The fine-tuning approach is identical to the one used for BERT.
Similarly, in the multilingual context, the XLM-RoBERTa (XLM-R) model we used achieves much of its improvement over previous multilingual models by using several orders of magnitude more data. The authors of XLM-R also find that vocabulary size has a large impact when many languages are used. Again, XLM-R uses the same model architecture as BERT. However, the increase of vocabulary size from 30K to 250K leads to an increase of the total number of parameters from 110M and 335M to 270M and 550M for the BASE and LARGE models, respectively. All five languages are present among the 100 languages used during pre-training of XLM-R. The fine-tuning approach is identical to the one used for the previous models.
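A back-of-the-envelope check shows that the larger vocabulary accounts for most of the parameter increase quoted above. The sketch assumes that each additional vocabulary entry costs one embedding vector of the hidden size (768 for BASE, 1024 for LARGE) and uses the rounded figures from the text, so it is an approximation rather than an exact count.

```python
HIDDEN_SIZE = {"BASE": 768, "LARGE": 1024}                 # embedding width per token
BERT_PARAMS = {"BASE": 110_000_000, "LARGE": 335_000_000}  # totals quoted in the text
BERT_VOCAB, XLMR_VOCAB = 30_000, 250_000                   # rounded vocabulary sizes


def estimated_xlmr_params(size: str) -> int:
    """BERT parameter count plus the extra rows of the embedding matrix."""
    extra_embeddings = (XLMR_VOCAB - BERT_VOCAB) * HIDDEN_SIZE[size]
    return BERT_PARAMS[size] + extra_embeddings


for size in ("BASE", "LARGE"):
    print(size, round(estimated_xlmr_params(size) / 1e6), "M")
```

The estimates of roughly 279M and 560M land within a few percent of the quoted 270M and 550M; the remaining gap plausibly comes from the rounded inputs.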
A summary of the different pre-trained models that we used for each language is provided below:
Table 2: Mean and maximum F1 macro on the development sets for five random restarts on each language and model.

Experiments
We carried out the initial experimentation and the hyperparameter selection using the English data from OffensEval 2019. We followed the fine-tuning procedure recommended for BERT by Devlin et al. (2018). We tested the following parameters, where the best performing values are underlined:
• Batch size: 16, 32
• Learning rate (Adam): 5e-5, 3e-5, 2e-5
• Epochs: 2, 3, 4
The dropout was kept constant at 0.1 for all layers. Overall we found that fine-tuning was relatively insensitive to batch size and learning rate. However, most random restarts seemed to overfit when using more than 2 epochs. The same hyperparameters were then used for all further experiments.
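The search space above is small enough to enumerate exhaustively. The sketch below lists every combination; whether all combinations were actually run is not stated in the text, so the full grid is an assumption, as are the dictionary keys.

```python
from itertools import product
from typing import Dict, List

SEARCH_SPACE = {
    "batch_size": [16, 32],
    "learning_rate": [5e-5, 3e-5, 2e-5],
    "epochs": [2, 3, 4],
}
DROPOUT = 0.1  # kept constant for all layers


def grid(space: Dict[str, list]) -> List[dict]:
    """Every combination of the listed values, with the fixed dropout added."""
    names = list(space)
    return [dict(zip(names, combo), dropout=DROPOUT) for combo in product(*space.values())]


configs = grid(SEARCH_SPACE)
print(len(configs))
print(configs[0])
```

With 2 x 3 x 3 = 18 configurations, and fine-tuning being cheap relative to pre-training, such a grid is feasible on a single dataset.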
For each language, 20% of the data was set aside as a development set and used for model selection. For each model, we ran five random restarts with different data shuffling and classifier head layer initialization. The model with the best macro-averaged F1-score on the development set was then used for submission. Table 2 summarizes the results we obtained.
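The selection procedure described above can be sketched as follows. The split here is a plain random split (the text does not say whether it was stratified), the development scores are made-up numbers, and the function names are hypothetical.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def split_train_dev(examples: Sequence, dev_fraction: float = 0.2, seed: int = 0) -> Tuple[List, List]:
    """Shuffle once and set aside a fraction of the data as development set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


def select_best_run(dev_scores: Dict[int, float]) -> Tuple[int, float, float]:
    """Return (best restart, mean score, max score) over the random restarts."""
    best = max(dev_scores, key=dev_scores.get)
    return best, mean(dev_scores.values()), dev_scores[best]


train, dev = split_train_dev(range(100))
# Made-up macro F1 scores for five restarts; one run collapsed to the majority class.
scores = {0: 0.70, 1: 0.74, 2: 0.49, 3: 0.72, 4: 0.71}
print(len(train), len(dev))
print(select_best_run(scores))
```

Reporting both the mean and the maximum over restarts makes unstable runs visible, which matters for small datasets such as the Danish one.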
For English, the training was done in two steps. Initially, we trained the model using only the labeled data. We then used this model to label 400,000 samples from the silver standard data. We labeled the 20,000 instances with the highest scores as OFF and the 20,000 instances with the lowest scores as NOT. We finally added these 40,000 tweets to the training set used to train the final model.
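The second training step can be sketched as a simple ranking of the silver-standard tweets by model score. The sketch assumes each tweet comes with a single score interpreted as the model's confidence in OFF; the function name and the toy scores are illustrative, and `per_class` would be 20,000 in the setting described above.

```python
from typing import List, Tuple


def select_silver_labels(scored: List[Tuple[str, float]], per_class: int) -> List[Tuple[str, str]]:
    """Label the highest-scoring tweets OFF and the lowest-scoring ones NOT."""
    if per_class <= 0 or 2 * per_class > len(scored):
        raise ValueError("per_class must be positive and at most half the data")
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    off = [(tweet, "OFF") for tweet, _ in ranked[:per_class]]
    not_off = [(tweet, "NOT") for tweet, _ in ranked[-per_class:]]
    return off + not_off


# Toy, made-up scores standing in for model predictions.
scored = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.8), ("e", 0.2)]
print(select_silver_labels(scored, per_class=2))
```

Taking only the two extremes of the ranking keeps the added data balanced and discards the tweets on which the first model is least certain.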
For Danish, we initially failed to train XLM-R to predict anything other than the majority class. Since XLM-R has shown promising cross-lingual transfer results, we tried training Danish together with English. We did this by shuffling the Danish training data with the English data from OffensEval 2019. We evaluated the models only on the Danish development dataset.
Table 3 shows our results on the official test data. The figures are similar to those we obtained on the development dataset. Danish shows the largest drop in performance, going from 0.813 on the development dataset to 0.775 on the test dataset. Nonetheless, since the development set was rather small, it is difficult to draw conclusions about the generalization performance.

Impact of external and silver standard data
Previous work has shown that models for offensive language detection often generalize poorly to other datasets (Karan and Šnajder, 2018; Swamy et al., 2019; Arango et al., 2019). This is especially true when evaluating across domains, e.g. between Twitter and Wikipedia, but also within the same domain. Some features are likely platform-specific and some datasets focus on specific aspects of offensive language. The data collection process can also lead to some types of content being overrepresented. We tried to determine the impact of the different English datasets we used by retraining the model on different subsets of the data. The results on the test set are shown in Table 4. All the labeled datasets perform reasonably well on their own. Surprisingly, the sampled Wikipedia data performs just as well as the OffensEval 2019 data. The sampled data from Davidson et al. (2017) performs worse. This might be due to it being smaller and oversampled to contain more offensive tweets. This hypothesis is also supported by the fact that when used with the OffensEval 2019 data, the results are comparable with the submitted model. Finally, the silver standard data seems to be most useful when the original labeled dataset is small.
Table 4: Results on the English test set using different subsets of the training data. For the combination OffensEval19 + Silver, the silver standard data was processed using the approach described previously, but only using OffensEval19 for the initial training.

Error analysis
To get a better understanding of the kind of mistakes the system makes, we studied some of the misclassified instances. To get some indications of what words are important for the classification of a given sentence, we applied LIME (Ribeiro et al., 2016). In short, LIME estimates the importance of a word by:
1. Generating many distorted versions of the original tweet.
2. Applying the original classifier to the distorted tweets.
3. Training a white-box model to predict the output of the original classifier given a version of the tweet.
Table 5 shows five instances from the English OffensEval 2019 dataset, where the classifier assigned a high confidence to the wrong class. Examples 1 and 2 are very short and the profanity dominates the other words. Both examples look like reasonable classifications. However, the same thing seems to happen in Example 3. The word shit dominates the otherwise inoffensive sentence. Example 4 has no direct profanity. Looking at bigrams using LIME, stinking cute is correctly identified as inoffensive. Example 5 doesn't seem to have any offensive language. It is possible that it could be considered offensive given external knowledge about the people mentioned. Given only the tweet, the classification looks reasonable.
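The three LIME steps listed earlier in this section can be illustrated with a simplified perturbation-based sketch. It is not the LIME library and not the exact procedure: it enumerates all keep/drop masks instead of sampling, and it replaces the fitted white-box model of step 3 with a difference of mean scores. The toy classifier is a hypothetical stand-in for the fine-tuned model.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, List


def toy_classifier(tokens: List[str]) -> float:
    """Hypothetical stand-in for the fine-tuned model: returns P(OFF)."""
    return 0.9 if "shit" in tokens else 0.1


def word_importance(tokens: List[str], classify: Callable[[List[str]], float]) -> Dict[str, float]:
    """Estimate how much each (unique) word pushes the score towards OFF."""
    # Step 1: distorted versions, here every way of keeping or dropping words.
    masks = list(product([0, 1], repeat=len(tokens)))
    # Step 2: score every distorted tweet with the original classifier.
    scores = [classify([t for t, keep in zip(tokens, mask) if keep]) for mask in masks]
    # Step 3 (simplified): compare mean scores with and without each word
    # instead of fitting a linear white-box model.
    importance = {}
    for i, token in enumerate(tokens):
        kept = mean(s for mask, s in zip(masks, scores) if mask[i])
        dropped = mean(s for mask, s in zip(masks, scores) if not mask[i])
        importance[token] = kept - dropped
    return importance


print(word_importance(["Noname", "is", "the", "shit"], toy_classifier))
```

In this toy setting a single word carries the whole prediction, mirroring the behavior seen in Example 3, where one profane word dominates an otherwise inoffensive sentence.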

# | Tweet | Prediction | Label
1 | Are you fucking+ serious? URL | OFF | NOT
2 | And dicks+. URL | OFF | NOT
3 | #Room25 is actually incredible, Noname is the shit+, always has been, and I'm seein her in like 5 days in Melbourne. Life is good. Have a nice day. | OFF | NOT
4 | @User Aw she is so stinking cute+! How old is she now? | NOT | OFF
5 | #ChristineBlaseyFord is your #Kavanaugh accuser. #Liberals try this EVERY time. #ConfirmJudgeKavanaugh URL | NOT | OFF
Table 5: Examples of misclassifications for English. Using LIME, we marked words that have a large impact on the classification. A + indicates agreement with the predicted label and a - indicates disagreement.

Conclusions
In the context of offensive language detection for multiple languages, we found that fine-tuning transformer models works well. Monolingual models outperform multilingual models for all languages studied. However, multilingual models can still be a viable alternative when no monolingual models are available. When the amount of labeled data is small, they can also be used for cross-lingual transfer. We showed the positive effect of cross-lingual transfer when augmenting Danish with English.