UPB at SemEval-2020 Task 12: Multilingual Offensive Language Detection on Social Media by Fine-tuning a Variety of BERT-based Models

Offensive language detection is one of the most challenging problems in natural language processing, made pressing by the rising presence of this phenomenon in online social media. This paper describes our Transformer-based solutions for identifying offensive language on Twitter in five languages (i.e., English, Arabic, Danish, Greek, and Turkish), which were employed in Subtask A of the Offenseval 2020 shared task. Several neural architectures (i.e., BERT, mBERT, RoBERTa, XLM-RoBERTa, and ALBERT), pre-trained on both single-language and multilingual corpora, were fine-tuned and compared using multiple combinations of datasets. Finally, the highest-scoring models were used for our submissions in the competition, which ranked our team 21st of 85, 28th of 53, 19th of 39, 16th of 37, and 10th of 46 for English, Arabic, Danish, Greek, and Turkish, respectively.


Introduction
Social media platforms are gaining increasing popularity for both personal and political communication.
Recent studies have uncovered disturbing trends in communication on the Internet. For example, Pew Research Center discovered that 60% of Internet users have witnessed a form of online harassment, while 41% have personally experienced it. Most of the latter group report that their most recent such experience occurred on a social media platform. Although most of these platforms provide ways of reporting offensive or hateful content, only 9% of the victims have considered using these tools.
Traditionally, identifying and removing offensive or hateful content on the Internet has been performed by human moderators who inspect each piece of content flagged by users and label it appropriately. This process has two major disadvantages. The first, as previously mentioned, is that only a very small proportion of users even consider using the reporting tools provided by the platforms. The second is the continuously growing volume of data that needs to be analyzed.
However, automated offensive language detection on social media is itself a very complicated problem. This is because labeling an offensive language dataset has proved to be very challenging: every individual reacts differently to the same content, and consensus on the label of a given piece of content is often difficult to obtain (Waseem et al., 2017).
The SemEval-2019 shared task 6 (Zampieri et al., 2019b), Offenseval 2019, was the first competition oriented towards detecting offensive language in social media, specifically on Twitter. The SemEval-2020 shared task 12, Offenseval 2020, poses the same problem; the novelty is a very large, automatically labeled English dataset, together with smaller datasets for four other languages: Arabic, Danish, Greek, and Turkish. This paper describes the Transformer-based machine learning models we used in our submissions for each language in Subtask A, where the goal is to identify offensive language in tweets, a binary classification problem.
The remainder of the paper is structured as follows: Section 2 briefly analyzes state-of-the-art approaches. Section 3 presents the methods employed for automated offensive language detection. Section 4 describes the data used in this study. Section 5 presents the evaluation process, and Section 6 draws the conclusions.

Related work
Automating offensive language detection has become a necessity on today's Internet, especially on communication platforms, and research efforts in this direction have grown substantially. Early approaches to a related problem include the detection of racist texts in web pages (Greevy and Smeaton, 2004), where the authors used part-of-speech tags as inputs to support vector machines.
This type of problem has gained a lot of interest in the last decade, as advanced machine learning techniques have been developed for NLP tasks and computing power has become increasingly accessible. Cambria et al. (2010) proposed the detection of web trolling (i.e., posting outrageous messages that are meant to provoke an emotional response). Hate speech and offensive language detection in Twitter samples was analyzed by Davidson et al. (2017). Their study presents a framework for differentiating between profanity and hate speech, and describes the annotation process of such a dataset. Moreover, they report experiments with various text preprocessing techniques and use logistic regression for hate speech and offensive language classification. Malmasi and Zampieri (2017) continued the experiments on this dataset using n-gram and skip-gram features.
More recently, neural networks have gained traction for this type of problem. For example, Gambäck and Sikdar (2017) relied on a Convolutional Neural Network (CNN) (Kim, 2014) to surpass the state-of-the-art results on the previously mentioned dataset. Zhang et al. (2018) further improved these results by combining two deep learning architectures, namely a CNN and a Gated Recurrent Unit (Cho et al., 2014).
A series of surveys and shared tasks on detecting offensive, abusive, hateful, or toxic content online have taken place in recent years. Schmidt and Wiegand (2017) provided a comprehensive survey of methods for automatically recognizing hate speech, focusing mostly on non-neural approaches. Shared tasks addressing problems in the same area include both editions of Abusive Language Online (Fišer et al., 2018), which focused mostly on cyberbullying; TRAC (Kumar et al., 2018), which mainly studied aggressiveness; and HASOC (Mandl et al., 2019), which, like the SemEval-2019 Task 5 competition (Basile et al., 2019), addressed the problem of hate speech.

Methods

Baseline
As a baseline, we used a non-neural approach that employs the XGBoost algorithm (Chen and Guestrin, 2016) for classification and multiple text processing techniques for feature extraction:
• Firstly, the lemmas of the words were extracted and TF-IDF scores were computed for the resulting n-grams, with n = 1, 2, 3.
• Secondly, part-of-speech tags were extracted using the NLTK Python package (Loper and Bird, 2002) and the TF-IDF scores were computed for the tag n-grams obtained, with n = 1, 2, 3.
• Sentiment analysis features were obtained using the VADER tool (Hutto et al., 2015), which is based on a mapping between lexical features and sentiment scores.
• Finally, other lexical features were added, such as the number of characters, words, and syllables, and the Flesch-Kincaid readability score (Kincaid et al., 1975); a sketch of the full pipeline follows this list.
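For illustration, a minimal sketch of this feature-extraction pipeline is given below. It assumes the scikit-learn, NLTK, vaderSentiment, and xgboost packages, and that train_texts and train_labels hold the training tweets and their binary labels; all function and variable names are ours, not taken from the actual implementation.

```python
# Baseline sketch: TF-IDF over lemma and POS-tag n-grams, VADER sentiment,
# and simple lexical counts, fed to an XGBoost classifier.
# Requires the NLTK resources punkt, wordnet, and averaged_perceptron_tagger.
import nltk
import numpy as np
from nltk.stem import WordNetLemmatizer
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from xgboost import XGBClassifier

lemmatizer = WordNetLemmatizer()
analyzer = SentimentIntensityAnalyzer()

def lemma_string(text):
    # Lemmatize each token; word n-gram TF-IDF is computed over this string.
    return " ".join(lemmatizer.lemmatize(tok) for tok in nltk.word_tokenize(text.lower()))

def pos_string(text):
    # Part-of-speech tag sequence, used for the tag n-gram TF-IDF features.
    return " ".join(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text)))

def lexical_features(text):
    # Character/word counts plus the VADER compound sentiment score;
    # syllable counts and Flesch-Kincaid would be added here in the same way.
    words = nltk.word_tokenize(text)
    return [len(text), len(words), analyzer.polarity_scores(text)["compound"]]

def build_features(texts, lemma_vec, pos_vec, fit=False):
    lemmas = [lemma_string(t) for t in texts]
    tags = [pos_string(t) for t in texts]
    if fit:
        X_lem, X_pos = lemma_vec.fit_transform(lemmas), pos_vec.fit_transform(tags)
    else:
        X_lem, X_pos = lemma_vec.transform(lemmas), pos_vec.transform(tags)
    X_lex = csr_matrix(np.array([lexical_features(t) for t in texts]))
    return hstack([X_lem, X_pos, X_lex]).tocsr()

lemma_vec = TfidfVectorizer(ngram_range=(1, 3))
pos_vec = TfidfVectorizer(ngram_range=(1, 3))
X_train = build_features(train_texts, lemma_vec, pos_vec, fit=True)
clf = XGBClassifier().fit(X_train, train_labels)
```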

BERT
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) is a deep learning architecture designed for NLP tasks by the Google team. It combines WordPiece embeddings (Wu et al., 2016) with the Transformer (Vaswani et al., 2017), the last major breakthrough in NLP. BERT models have significantly outperformed state-of-the-art approaches on various text classification and question answering benchmarks. The architecture is a multi-layer Transformer encoder, its novelty being the use of bidirectional attention instead of recurrent units. A BERT model can be applied to any NLP classification task through a technique called fine-tuning: a model pre-trained on a very large and comprehensive corpus is trained further for the target classification task, in our case offensive language identification (a minimal sketch is given after the list below). Several pre-trained BERT versions are available, differing in model size (i.e., the number of Transformer encoder layers) and in the corpora used for pre-training. We therefore experimented with the following BERT-based models:
• BERT-base, which is pre-trained on the English Wikipedia corpus.
• BERT-base for Danish, which is pre-trained on the entire dump of Danish Wikipedia pages.
• multilingual BERT (mBERT) (Pires et al., 2019), which is pre-trained on the Wikipedia pages of the top 104 languages.
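The fine-tuning procedure shared by all these models can be sketched with the transformers package as follows; the mBERT checkpoint name is real, but the miniature batch and its labels are purely illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a pre-trained checkpoint and attach a fresh binary classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

# WordPiece-tokenize a (toy) batch of tweets with padding and truncation.
batch = tokenizer(["you are lovely", "you are awful"],
                  padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
labels = torch.tensor([0, 1])  # 0 = not offensive, 1 = offensive

# One fine-tuning step: the model returns the classification loss directly
# when labels are provided.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
```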

RoBERTa and XLM-RoBERTa
Liu et al. (2019) analyzed the BERT model and concluded that it was under-trained, claiming that hyperparameter choices can significantly impact the results. The robust pre-training method they proposed, namely RoBERTa, achieved better performance on the same NLP tasks. Building on this work, Conneau et al. (2019) developed XLM-RoBERTa for multilingual NLP tasks. Like mBERT, it is pre-trained on more than 100 languages, and it manages to outperform mBERT. Interestingly, the most significant improvements were obtained for under-represented languages, which recommends it for all five languages in Subtask A of the current competition. Here, we used the base architectures of RoBERTa for English and of XLM-RoBERTa, both pre-trained on large amounts of language-specific CommonCrawl data.

ALBERT
ALBERT (Lan et al., 2019), A Lite BERT, is a BERT variant that introduces two novel parameter-reduction techniques, resulting in lower resource consumption at training time while obtaining performance similar to the original BERT model. Moreover, ALBERT uses a self-supervised loss that focuses on modeling inter-sentence coherence, which helps obtain better results on NLP tasks with multi-sentence inputs. We fine-tuned the ALBERT-base model for our English language experiments.

Data
To analyze the influence of extending the competition-provided training datasets with corpora constructed for related tasks, we used two additional datasets for fine-tuning the previously mentioned models. Each dataset is summarized in Table 1. All the datasets have a similar structure (i.e., binary labels, unbalanced classes, and a positive-label ratio between 10% and 50%).

Offenseval 2020 Dataset
The Offenseval 2020 dataset is composed of five subsets of tweets, one for each language: English, Arabic (Mubarak et al., 2020), Danish (Sigurbergsson and Derczynski, 2020), Greek (Pitenis et al., 2020), and Turkish (Çöltekin, 2020). The last four are similar in structure: for each sample, the Subtask A label is binary, indicating whether the tweet contains offensive language. The English subset has a more complex annotation scheme for Subtask A: for each sample, both the average and the standard deviation of the scores assigned by a pool of semi-supervised learning models are given. After exploring the distribution of these values, we used the following heuristic, transcribed as code after this list, to convert them to binary labels:
• If the average model score is greater than 0.6, the sample is labeled as positive.
• If the model average score is between 0.5 and 0.6, and the standard deviation is smaller than 0.1 (there is a consensus), the sample is labeled as positive.
• All other samples are labeled as negative.
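Written out as code, the heuristic is a direct transcription of the three rules above (the function name is ours):

```python
def to_binary_label(avg_score, std_score):
    """Convert the pooled model scores of an English sample to a binary label."""
    if avg_score > 0.6:
        return 1  # confidently offensive
    if 0.5 <= avg_score <= 0.6 and std_score < 0.1:
        return 1  # borderline average, but the models agree (consensus)
    return 0      # everything else is treated as not offensive
```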
Using this method, we obtained 114,223 English tweet samples labeled as positive. Finally, 10% of each language subset was set aside for validation purposes, preserving the label distribution.

OLID Dataset

Zampieri et al. (2019a) introduced the Offensive Language Identification Dataset (OLID) for the Offenseval 2019 shared task; it was the starting point for the Offenseval 2020 dataset. OLID contains only English tweets, and its Subtask A label (i.e., offensive language identification) is binary.

HASOC Dataset
The Hate Speech and Offensive Content Identification (HASOC) dataset was proposed for the HASOC 2019 competition (Mandl et al., 2019), with the goal of identifying both hate speech and offensive content in Indo-European languages (i.e., English, German, and Hindi). Task 1 of this competition required identifying hateful or offensive tweet samples using binary labels, so the dataset can be considered similar to the Offenseval 2020 subsets.

Preprocessing
The main preprocessing step is performed by the BERT-specific tokenizer, which splits a sentence into tokens in a WordPiece manner. Additional Twitter-specific steps were performed beforehand (see the sketch below), such as:
• Replacing the emojis with their corresponding textual representation using the emoji Python package.
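A sketch of these steps, assuming the emoji package and a transformers tokenizer (the checkpoint name is only an example):

```python
import emoji
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def preprocess(tweet):
    # Replace each emoji with its textual alias,
    # e.g. "😂" -> ":face_with_tears_of_joy:".
    return emoji.demojize(tweet)

# The BERT tokenizer then splits the text in a WordPiece manner,
# marking sub-word continuations with "##".
tokens = tokenizer.tokenize(preprocess("can't stop laughing 😂"))
```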
For the non-English languages, we also explored translating the texts to English as a preprocessing step, in order to allow the use of an English-only pre-trained model. We used the Yandex translation service, but the translation quality proved to be poor: while most words were correctly translated, the syntax and meaning of the sentences were lost. For instance, the Turkish tweet "yeniden dogup gelsem çocuk kalır büyümezdiiiim" is translated as "re-born child grow I do I remain", while a more accurate translation is "If I were born again, I would be a child and I would not grow up".

Experiments
All the experiments were performed on an Ubuntu machine with 64GB RAM and one NVIDIA Titan X GPU. These hardware limitations are the reason we only experimented with the base versions of the previously mentioned architectures. The transformers Python package (Wolf et al., 2019) was used for training and evaluating the Transformer-based models, and each model was fine-tuned for four epochs. We used the Adam algorithm (Kingma and Ba, 2014) with weight decay for optimization and a learning rate of 2e-5.
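In code, this optimization setup corresponds roughly to the following sketch; the weight-decay value and train_loader are assumptions, while the learning rate and epoch count match the values stated above, and model is a sequence-classification model as loaded earlier.

```python
from torch.optim import AdamW  # Adam with decoupled weight decay

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

model.train()
for epoch in range(4):  # each model was fine-tuned for four epochs
    for batch, labels in train_loader:  # train_loader: an assumed DataLoader
        optimizer.zero_grad()
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
```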
For each language, multiple experiments were performed with the same architecture using several combinations of datasets for fine-tuning, in order to assess the impact of each dataset on the performance of a given model. For some experiments, this meant that several datasets were simply concatenated and used as a single fine-tuning set, as shown in the sketch below. No additional handling of the data was required, as all of it shares the binary label structure presented in Section 4.
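Because every corpus shares this binary format, combining them amounts to a plain concatenation, sketched here with pandas (the frame and column names are hypothetical):

```python
import pandas as pd

# Each frame holds "text" and "label" (0 = NOT, 1 = OFF) columns.
fine_tuning_set = pd.concat(
    [offenseval_danish, offenseval_greek, hasoc_english],
    ignore_index=True,
).sample(frac=1, random_state=42)  # shuffle the combined examples
```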
The results obtained on the validation sets are summarized in Tables 2, 3, 4, 5, and 6 for English, Arabic, Danish, Greek, and Turkish, respectively. The reported metrics are computed on the Offenseval 2020 language-specific validation sets described in Section 4. For each language, the highest validation F1-score is highlighted; the corresponding model was selected for predicting on the language-specific competition test data in our final submission.

Results on the English Subset. Firstly, we observe that, although the baseline classifier obtains a non-negligible result, the Transformer-based models outperform it, even when pre-trained for multilingual tasks, demonstrating the performance improvement this type of model brings to automated offensive language detection.
Secondly, ALBERT and RoBERTa perform better than the BERT-base architecture, confirming that they better exploit the Transformer's representational power. Furthermore, even the multilingual pre-trained model performs better than the English-specific baseline, although significantly worse than the English-only pre-trained models.
Furthermore, there is no evidence that adding the OLID and HASOC English datasets to the fine-tuning data affects the results in any way, most likely because these datasets are very small compared to the Offenseval 2020 English dataset. Finally, the best performing model is RoBERTa fine-tuned without the two additional subsets.
Results on the non-English Subsets. The very low scores obtained by the baseline approach on the non-English subsets are explained by the fact that most of the preprocessing employed is English-specific and could not be applied to other languages. The approach of automatically translating the texts and then applying an English pre-trained and fine-tuned model also fails, with most of the resulting F1-scores at least 10 points lower than the best score.
Another interesting observation is that the results consistently improve as more data is added to the fine-tuning set, even if the added data is in a different language than the validation set, showing that the multilingual models are able to learn cross-lingual features. As expected, the XLM-RoBERTa model outperforms mBERT in most experimental setups. In contrast to the English results, adding the HASOC subsets seems to improve the scores, with the sole exception of the Turkish subset. This could be partially explained by the fact that the HASOC dataset contains two other languages. For instance, the German subset of the HASOC data may have boosted the performance of the multilingual models on the Danish validation set, because Danish, a North Germanic language, is closely related to German.
Finally, an interesting particularity can be observed for the Danish dataset: the Danish pre-trained BERT model, fine-tuned using only the very small Danish training set, outperformed even the multilingual model fine-tuned on all the available data. This suggests that, for under-represented languages, a language-specific pre-trained model can perform better than a multilingual one, even with smaller amounts of fine-tuning data.
Results on the Leaderboard. The results and rankings obtained by our submissions are shown in Table 7, alongside the best performing teams. The F1-scores obtained on the Offenseval 2020 Subtask A competition test sets are 91.05%, 82.19%, 73.80%, 81.40%, and 77% for English, Arabic, Danish, Greek, and Turkish, respectively.

Conclusions
This work presented our approaches to automatically detecting offensive language in multilingual tweets, as part of SemEval-2020 Task 12. We showed that fine-tuning pre-trained Transformer-based models can successfully classify offensive language, and we experimented with several such architectures, fine-tuned on multiple combinations of datasets.

Comparing the validation performances against the test set results, we found that the latter were better for the non-English languages, which shows that our models generalize well and also that the test data may be easier to classify than the development data. The smallest positive difference between validation and test performance was obtained for the Danish subset, which may indicate that the XLM-RoBERTa model could have been more suitable for Danish as well: the small difference observed in the validation phase could have been outweighed by the generalization power conferred by the large multilingual fine-tuning dataset.

Moreover, the performance of the multilingual models increased not only with the size of the fine-tuning dataset, but also with the number of languages it contains. The results also suggest that the potential of multilingual Transformer-based models in offensive language detection could be further realized if larger datasets were available for non-English languages. For future work, we intend to explore transfer learning methods in order to leverage datasets constructed for similar tasks in the same language.