Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection

Nowadays, offensive content in social media has become a serious problem, and automatically detecting offensive language is an essential task. In this paper, we build an offensive language detection system that combines multi-task learning with BERT-based models. Using a pre-trained language model such as BERT, we can effectively learn the representations of noisy text in social media. Moreover, to boost the performance of offensive language detection, we leverage the supervision signals from other related tasks. In the OffensEval-2020 competition, our model achieves a 91.51% F1 score on English Sub-task A, which is comparable to the first place (92.23% F1). An empirical analysis is provided to explain the effectiveness of our approaches.


Introduction
Nowadays, offensive content has invaded social media and become a serious problem for government organizations, online communities, and social media platforms. Therefore, it is essential to automatically detect and throttle offensive content before it appears in social media. Previous studies have investigated different aspects of offensive language, such as abusive language (Nobata et al., 2016; Mubarak et al., 2017) and hate speech (Malmasi and Zampieri, 2017; Davidson et al., 2017).
Recently, Zampieri et al. (2019a) first studied the target of offensive language on Twitter, and this work was later expanded into a multilingual version, which is practical for studying hate speech concerning a specific target. The task is based on a three-level hierarchical annotation schema that encompasses the following three general sub-tasks: (A) Offensive Language Detection; (B) Categorization of Offensive Language; (C) Offensive Language Target Identification.
To tackle this problem, we emphasize that it is crucial to leverage a pre-trained language model (e.g., BERT (Devlin et al., 2018)) to better understand the meaning of sentences and generate expressive word-level representations, due to the inherent data noise (e.g., misspellings, grammatical mistakes) in social media (e.g., Twitter). In addition, we hypothesize that internal connections exist among the three general sub-tasks, and that to improve one task, we can leverage the information of the other two. Therefore, we first generate representations of the input text based on the pre-trained language model BERT, and then conduct multi-task learning on top of these representations.
Experimental results show that leveraging more task information can improve the offensive language detection performance. In the OffensEval-2020 competition, our system achieves a 91.51% macro-F1 score on English Sub-task A (ranked 7th out of 85 submissions). Notably, only the OLID (Zampieri et al., 2019a) is used to train our model, and no additional data is used. Our code is available at: https://github.com/wenliangdai/multi-task-offensive-language-detection.

Related Work
In general, offensive language detection includes some particular types, such as aggression identification (Kumar et al., 2018), bullying detection (Huang et al., 2014), and hate speech identification (Park and Fung, 2017). Chen et al. (2012) applied concepts from NLP to exploit the lexical and syntactic features of sentences for offensive language detection. Huang et al. (2014) integrated textual features with social network features, which significantly improved cyberbullying detection. Park and Fung (2017) and Gambäck and Sikdar (2017) used convolutional neural networks for hate-speech detection in tweets.
Recently, Zampieri et al. (2019a) introduced an offensive language identification dataset, which aims to detect the type and the target of offensive posts in social media. This dataset was later expanded into a multilingual version, which advances multilingual research in this area.
Pre-trained language models, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), have achieved great performance on a variety of tasks. Many recent papers have used a basic recipe of fine-tuning such pre-trained models on a certain domain (Azzouza et al., 2019; Lee et al., 2019; Beltagy et al., 2019) or on downstream tasks (Howard and Ruder, 2018; Liu et al., 2019b; Su et al., 2019).

Datasets
In this project, two datasets are involved: the datasets of OffensEval-2019 and OffensEval-2020. In this section, we introduce their details and discuss our data pre-processing methods. Table 1 shows the types of labels and how they overlap.

Offensive Language Identification Dataset (OLID)
The OLID (Zampieri et al., 2019b) is a hierarchical dataset for identifying the type and the target of offensive texts in social media. The dataset was collected from Twitter and is publicly available. There are 14,100 tweets in total, of which 13,240 are in the training set and 860 are in the test set. For each tweet, there are three levels of labels: (A) Offensive/Not-Offensive, (B) Targeted-Insult/Untargeted, (C) Individual/Group/Other. The relationship between them is hierarchical: if a tweet is offensive, it can have a target or no target; if it is offensive towards a specific target, the target can be an individual, a group, or some other object. This dataset was used in the OffensEval-2019 competition at SemEval-2019 (Zampieri et al., 2019c). The competition contains three sub-tasks, each of which corresponds to recognizing one level of labels in the dataset.

Multilingual Offensive Language Identification Dataset (MOLID)
A multilingual offensive language detection dataset is proposed in the OffensEval-2020 competition at SemEval-2020. It contains five languages: Arabic, Danish, English, Greek, and Turkish. The English data, similar to the OLID (Zampieri et al., 2019b), still has three levels, but this time only confidence scores generated by different models are provided rather than human-annotated labels. In addition, the data in level A is separated from levels B and C: level A contains 9,089,140 tweets, while levels B and C contain a different set of 188,973 tweets. The remaining languages only have data in level A, but with human-annotated labels.

Data Pre-processing
Data pre-processing is crucial to this task, as the data from Twitter is noisy and sometimes disordered. Moreover, people tend to use more emojis and hashtags on Twitter, which are unusual in other situations. Firstly, all characters are converted to lowercase, and the spaces at both ends are stripped. Then, inspired by (Zampieri et al., 2019c; Liu et al., 2019a), we further process the dataset in five specific aspects:

Emoji to word. We convert all emojis to words with the corresponding semantic meanings. For example, 👍 is converted to thumbs up. We achieve this by first utilizing a third-party Python library 1 , and then removing useless punctuation in the result.
Hashtag segmentation. All hashtags in the tweets are segmented by recognizing the capital characters. For example, #KeithEllisonAbuse is transformed to keith ellison abuse. This is also achieved by using a third-party Python library 2 .
User mention replacement. After reviewing the dataset, we found that the token @USER appears very frequently (a single tweet can contain several of them), which is a typical phenomenon in tweets. As a result, for tweets with more than one @USER token, we replace all of them with a single @USERS token. In this way, we remove redundant words while keeping the key information, which is useful for recognizing targets if there are any.
Rare word substitution. We substitute some out-of-vocabulary (OOV) words with their synonyms. For example, every URL is replaced with a special token, http.
Truncation. We truncate all tweets to a maximum length of 64 tokens. Although this discards some information in the data, it lowers the GPU memory usage and slightly improves the performance.
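Taken together, these steps can be sketched in a few lines of Python. This is an illustrative reconstruction, not our exact pipeline: the one-entry emoji mapping and the regex-based hashtag splitter stand in for the third-party libraries mentioned above.

```python
import re

# Tiny illustrative stand-in for the emoji-to-word library.
EMOJI_TO_WORD = {"\U0001F44D": " thumbs up "}

def segment_hashtag(match):
    # "#KeithEllisonAbuse" -> "keith ellison abuse" (split on capital letters)
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", match.group(1))
    return " ".join(w.lower() for w in words)

def preprocess(tweet, max_len=64):
    text = tweet.strip()
    for emo, word in EMOJI_TO_WORD.items():       # emoji to word
        text = text.replace(emo, word)
    # hashtags are segmented before lowercasing, since capitals mark word boundaries
    text = re.sub(r"#(\w+)", segment_hashtag, text)
    text = text.lower()
    if text.count("@user") > 1:                   # user mention replacement
        text = re.sub(r"(?:@user\s*)+", "@users ", text)
    # rare word substitution: URLs (the literal URL token in OLID) become "http"
    text = re.sub(r"https?://\S+|\burl\b", "http", text)
    return " ".join(text.split()[:max_len])       # truncation to 64 tokens
```

For example, `preprocess("@USER @USER #KeithEllisonAbuse is bad URL")` yields `"@users keith ellison abuse is bad http"`.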

Methodology
We propose a Multi-Task Learning (MTL) method (Figure 1(b)) for this offensive language detection task. It takes advantage of the nature of the OLID (Zampieri et al., 2019b), and achieves an excellent result comparable to state-of-the-art performance with only the OLID (Zampieri et al., 2019b) and no external data resources. A thorough analysis is provided in Section 5.2 to explain the reasons for not using the new multilingual dataset created in OffensEval-2020.

Task Description
OffensEval-2020 is a shared task organized at the SemEval-2020 workshop. As mentioned in Section 3.2, it proposes a multilingual offensive language detection dataset which contains five different languages. It has three sub-tasks: (A) Offensive Language Detection; (B) Categorization of Offensive Language; (C) Offensive Language Target Identification. In this paper, we mainly focus on Sub-task A for the English data.

Baseline
We re-implement the model of the best-performing team (Liu et al., 2019a) in OffensEval-2019 (Zampieri et al., 2019c) as our baseline. As illustrated in Figure 1(a), Liu et al. (2019a) fine-tuned the pre-trained model BERT (Devlin et al., 2018) by adding a linear layer on top of it.
BERT. Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) is a large-scale masked language model based on the encoder of the Transformer model (Vaswani et al., 2017). It is pre-trained on the BookCorpus (Zhu et al., 2015) and English Wikipedia datasets using two unsupervised tasks: (a) Masked Language Model (MLM) and (b) Next Sentence Prediction (NSP). In MLM, 15% of the input tokens are masked, and the model is trained to recover them at the output. In NSP, two sentences are fed into the model, and it is trained to predict whether the second sentence is the actual next sentence of the first. As shown in (Devlin et al., 2018), by fine-tuning, BERT achieves superior results on many NLP downstream tasks.
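The MLM objective can be illustrated with a small sketch. Note this is a simplification: the actual BERT recipe replaces 10% of the selected tokens with random words and keeps 10% unchanged, whereas this version only masks.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Illustrative MLM masking: hide ~15% of tokens for the model to recover."""
    rng = random.Random(seed)   # fixed seed for reproducibility of the example
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok    # positions the language model is trained to predict
        else:
            masked.append(tok)
    return masked, targets
```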

Multi-task Offense Detection Model
In recent years, the multi-task learning (MTL) technique has been used in many machine learning fields to improve the performance and generalization ability of a model (Kang et al., 2011; Long and Wang, 2015; Kokkinos, 2016; Güler et al., 2018; Liu and Zhao, 2018; Dankers et al., 2019). Generally, MTL has three advantages. Firstly, with multiple supervision signals, it can improve the quality of representation learning, because a good representation should perform well on more tasks. Secondly, MTL can help the model generalize better, because multiple tasks introduce more noise and prevent the model from over-fitting. Thirdly, some features that are hard to learn from one task may be easier to learn from another. MTL provides complementary supervision to each task and makes it possible to eavesdrop on other tasks and obtain more information.
For this task, MTL is a very effective strategy. As mentioned in Section 3.1 and shown in Table 1, the three labels in the OLID are hierarchical, and they are designed to be inclusive from top to bottom. This makes it possible for one sub-task to eavesdrop information from the other tasks. For example, if a tweet is labelled as Targeted in sub-task B, then it must be classified as Offensive in sub-task A.
Our MTL architecture is shown in Figure 1(b). The bottom part is a BERT model, which is shared among all three sub-tasks. The upper parts are three separate modules dedicated to each sub-task; each module contains a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997). The input X is first fed into the shared BERT; then each sub-task module takes the contextualized embeddings generated by BERT and produces a probability distribution over its own target labels. The overall loss L is calculated as L = Σ_{i∈I} w_i L_i, where I = {A, B, C} and w_i is the loss weight for each task-specific cross-entropy loss L_i, with Σ_{i∈I} w_i = 1. The loss weights are chosen by cross-validation.
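A minimal PyTorch sketch of this architecture is given below. The `encoder` argument stands in for the shared BERT (any module that maps token ids to contextual embeddings fits); the class names, the use of the last LSTM state, and the default loss weights are illustrative choices, not necessarily our exact implementation.

```python
import torch
import torch.nn as nn

class MultiTaskOffenseModel(nn.Module):
    """Shared encoder (BERT in the paper) + one LSTM head per sub-task.

    num_labels: A=2 (OFF/NOT), B=2 (TIN/UNT), C=3 (IND/GRP/OTH).
    """
    def __init__(self, encoder, hidden_size, num_labels=(2, 2, 3)):
        super().__init__()
        self.encoder = encoder  # maps token ids -> (batch, seq, hidden_size)
        self.heads = nn.ModuleList(
            nn.ModuleDict({
                "lstm": nn.LSTM(hidden_size, hidden_size, batch_first=True),
                "clf": nn.Linear(hidden_size, n),
            })
            for n in num_labels
        )

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)            # shared contextual embeddings
        logits = []
        for head in self.heads:
            out, _ = head["lstm"](hidden)           # task-specific LSTM
            logits.append(head["clf"](out[:, -1]))  # last state -> class logits
        return logits

def multitask_loss(logits, labels, weights=(0.4, 0.3, 0.3)):
    # L = sum_i w_i * L_i, with the weights summing to 1 (chosen by cross-validation)
    ce = nn.CrossEntropyLoss()
    return sum(w * ce(lg, y) for w, lg, y in zip(weights, logits, labels))
```

With a toy encoder such as `nn.Embedding(vocab_size, hidden_size)`, a batch of token ids yields three logit tensors, one per sub-task, whose weighted cross-entropy losses are summed into the overall loss.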

Experiments
During the training phase, we evaluate our models on the test set of the OLID (OffensEval-2019). As a reference, we also evaluate them on the test set of the MOLID (OffensEval-2020), which was only released after the submission date.

Experimental Settings
To find the optimal architecture for this task among the models we have, we set up five different experiments. For the first two, we train our baseline model on the OLID and the MOLID separately. As the MOLID's labels are AVG CONF scores between 0 and 1 rather than binary classes, we set the threshold to 0.3 based on statistical analysis to convert the MOLID into a classification dataset. Then, we set up an experiment that pre-trains the baseline model on the MOLID and fine-tunes it on the OLID, utilizing the pre-training strategy described in Section 5.2. Finally, we train our Multi-task Offense Detection Model only on the OLID and tune the hyper-parameters based on Sub-task A. To further improve the generalization performance of our method, we ensemble five MTL models with different random seeds and generate the final results through majority voting.

Table 2: Experimental results on Sub-task A. The evaluation metric is the macro-F1 score, which is the official metric of OffensEval-2020.
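The two data-handling details above, converting AVG CONF scores into binary labels and majority voting over the ensemble, can be sketched as follows. We assume here that AVG CONF is the averaged confidence that a tweet is offensive; the function names are our own.

```python
from collections import Counter

def conf_to_label(avg_conf, threshold=0.3):
    # Assumption: AVG CONF is the averaged confidence of the OFF class,
    # so scores above the 0.3 threshold are treated as offensive.
    return "OFF" if avg_conf > threshold else "NOT"

def majority_vote(per_model_preds):
    # per_model_preds: one label list per ensembled model, aligned by example.
    # With five models and two classes, ties cannot occur.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*per_model_preds)]
```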
To evaluate the performance of each model, we use macro-F1, which is computed as the simple arithmetic mean of the per-class F1 scores. Since the OLID released its test set last year, we use this test set as our validation set and optimize the hyper-parameters manually over successive runs on it. For our best MTL model, we set the learning rate to 3e-6 and the batch size to 32; the loss weights of sub-tasks A, B, and C are 0.4, 0.3, and 0.3 respectively. We train the model for a maximum of 20 epochs and utilize an early-stopping strategy that stops training if the validation macro-F1 does not increase for three consecutive epochs. Our code is implemented in PyTorch, and all experiments are run on a single GTX 1080Ti.
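The early-stopping rule can be captured in a small helper; the class name and interface below are our own illustration of the strategy, not code from our repository.

```python
class EarlyStopping:
    """Stop when the validation macro-F1 has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, macro_f1):
        # Call once per epoch; returns True when training should stop.
        if macro_f1 > self.best:
            self.best = macro_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

For example, validation scores of 0.70, 0.72, 0.71, 0.71, 0.71 trigger the stop after the third epoch with no improvement over 0.72.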

Result Analysis
The results in Table 2 show the macro-F1 scores on the test sets of the OLID and the MOLID, and they are consistent except for the model with pre-training. Our ensembled MTL model achieves the best performance on both test sets.
Pre-train vs. No pre-train on MOLID. The MOLID contains more than 9 million samples with AVG CONF scores. To make full use of this dataset, we adopt a pre-training strategy in which the model is first pre-trained on the MOLID and then fine-tuned on the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019b). To pre-train the model on the MOLID, we regard Sub-task A as a regression problem based on the AVG CONF score. Instead of setting a threshold to divide the data into two classes (OFF, NOT), we directly apply the Mean Squared Error (MSE) loss function to AVG CONF. However, our results show that conducting pre-training makes little difference. We believe this is because the MOLID contains a lot of noisy data, which is also the reason why the baseline model trained on the MOLID performs much worse than the one trained on the OLID.
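The regression pre-training objective amounts to swapping the binary classifier for a scalar output trained with MSE against AVG CONF. A minimal sketch, with a random tensor standing in for BERT's pooled representation and an assumed hidden size of 16:

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Scalar regression head used during pre-training in place of the classifier."""

    def __init__(self, hidden_size):
        super().__init__()
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, pooled):
        # (batch, hidden) -> (batch,) predicted AVG CONF scores
        return self.out(pooled).squeeze(-1)

pooled = torch.randn(4, 16)   # stand-in for BERT's pooled [CLS] representation
avg_conf = torch.rand(4)      # AVG CONF targets in [0, 1]
head = RegressionHead(16)
loss = nn.MSELoss()(head(pooled), avg_conf)
```

After pre-training with this objective, the regression head is discarded and the encoder is fine-tuned on the OLID classification labels.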
BERT and Multi-Task Learning. From the results, we find that incorporating BERT and multi-task learning substantially improves the macro-F1 score of Sub-task A. This can be attributed to two reasons. Firstly, the BERT model is pre-trained on a huge corpus, which helps to produce more meaningful representations of the input text. Meanwhile, the large model size increases the learning ability for the task. Secondly, with the large capacity of BERT, through multi-task learning, sub-task A can obtain more information from the shared parts of the model and become more certain about some cases. For example, if the label of sub-task B is NULL, then the label of sub-task A must be NOT; if the label of sub-task B is TIN or UNT, then the label of sub-task A must be OFF.

Conclusion and Future work
From all of our experiments, we conclude that MTL improves the performance of Sub-task A on both the OLID and the MOLID. Moreover, our findings show that pre-training Sub-task A as a regression task does not improve the model's performance. We think there are several paths for further work. Firstly, more combinations of the sub-tasks can be investigated for MTL. This can reveal more about the interaction between the sub-tasks and how much one influences another. Secondly, as mentioned in (Kokkinos, 2016), simultaneously updating the model's parameters during MTL can have negative effects on optimization, as the total gradients become too noisy. This becomes more significant when the number of tasks is large or the batch size is small. As a result, asynchronous optimization for each task may provide a more stable gradient descent.