Galileo at SemEval-2020 Task 12: Multi-lingual Learning for Offensive Language Identification Using Pre-trained Language Models

This paper describes Galileo's performance in SemEval-2020 Task 12 on detecting and categorizing offensive language in social media. For offensive language identification, we proposed a multi-lingual method using the pre-trained language models ERNIE and XLM-R. For offensive language categorization, we proposed a knowledge distillation method trained on soft labels generated by several supervised models. Our team participated in all three sub-tasks. In Sub-task A, Offensive Language Identification, we ranked first in terms of average F1 score across all languages, and we were the only team ranked among the top three in every language. We also took first place in Sub-task B, Automatic Categorization of Offense Types, and Sub-task C, Offense Target Identification.


Introduction
With the growing number of Internet users, cyber-violence has emerged and offensive language has become pervasive across social media. With anonymity as a "privilege", netizens hide behind their screens and behave in ways most of them would not in reality. Government organizations, online communities, and technology companies are therefore all striving for ways to detect aggressive language in social media and help build a friendlier online environment.
Manual filtering is very time consuming, and it can cause post-traumatic stress disorder-like symptoms in human annotators. One of the most common strategies (Waseem et al., 2017; Kumar et al., 2018) to tackle the problem is to train systems capable of recognizing offensive content, which can then be deleted or set aside for human moderation.
SemEval-2020 Task 12 is the second edition of OffensEval (Zampieri et al., 2019). In this competition, the organizers offer datasets in five languages: Arabic (Mubarak et al., 2020), Danish (Sigurbergsson and Derczynski, 2020), English, Turkish (Çöltekin, 2020) and Greek (Pitenis et al., 2020). In Sub-task A, participants must predict whether a post uses offensive language. The organizers also provide two other sub-tasks, which mainly focus on English, to predict the type and target of offensive language.
Participating in all three sub-tasks, we proposed several methods based on the pre-trained language models ERNIE and XLM-R. In Sub-task A, we scored 0.9199, 0.851, 0.8258, 0.802 and 0.8989 in English, Greek, Turkish, Danish and Arabic respectively. We ranked first in average F1 score and placed in the top three across all languages. In Sub-task B and Sub-task C, we also took first place with scores of 0.7462 and 0.7145. In the following sections, we elaborate on the methods, dataset and experiments of our system.

Pre-trained Language Models

Pre-trained language models are first trained on large-scale unsupervised corpora and then transferred to down-stream tasks for task-specific fine-tuning. The following are some representative works. Peters et al. (2018) proposed context-sensitive word vectors (ELMo) that enhance downstream tasks by acting as features. Radford et al. (2018) proposed GPT, which builds context-sensitive embeddings using the Transformer (Vaswani et al., 2017). Devlin et al. (2019) trained a bidirectional language model (BERT) through a task similar to Cloze. Yang et al. (2019) proposed a permuted language model (XLNet), a generalized autoregressive pre-training method. Liu et al. (2019b) removed the next sentence prediction task and pre-trained longer to obtain a better pre-trained model (RoBERTa). Clark et al. (2019) proposed jointly training a generator and a discriminator in ELECTRA. Lan et al. (2019) and Raffel et al. (2019) explored larger model structures while optimizing the pre-training strategy in ALBERT and T5. Sun et al. (2019) enhanced pre-trained language models with full masking of spans in ERNIE. Sun et al. (2020) proposed continual multi-task pre-training and several pre-training tasks in ERNIE 2.0. The ERNIE 2.0 researchers recently released a new version with improvements to knowledge masking and application-oriented tasks, aiming to advance the model's general semantic representation capability.
To improve the knowledge masking strategy, they proposed a new mutual-information-based dynamic knowledge masking algorithm. They also constructed pre-training tasks specific to different applications; for example, they added a coreference resolution task that identifies all expressions in a text referring to the same entity. For more details, please refer to this website1.
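Knowledge masking operates on whole words or entity spans rather than individual subword tokens. As a simplified illustration of span-level masking only (the mutual-information-based selection of spans described above is not reproduced here, and the token names are illustrative):

```python
import random

def mask_span(tokens, span_len=2, mask_token="[MASK]", seed=0):
    """Mask one contiguous span of tokens, as in span-level masking.

    Simplified sketch: real knowledge masking additionally chooses
    which spans to mask (e.g. entities, or spans selected by mutual
    information) instead of a uniformly random position.
    """
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - span_len + 1)
    masked = list(tokens)
    for i in range(start, start + span_len):
        masked[i] = mask_token  # the model must recover these tokens
    return masked, tokens[start:start + span_len]

masked, targets = mask_span(["harry", "potter", "is", "a", "series", "of", "novels"])
```

The model is then trained to predict the original tokens in `targets` from the surrounding unmasked context, which forces it to encode multi-token units rather than single subwords.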

Cross-lingual Pre-trained Language Models
In addition, there has been a lot of work on multilingual language models. Devlin et al. (2019) provided a multilingual version of BERT that demonstrates surprising cross-language capabilities (Wu and Dredze, 2019). Conneau and Lample (2019) proposed two tasks, Masked Language Model and Translation Language Model, to model monolingual corpora and bilingual parallel corpora respectively. Huang et al. (2019) proposed Unicoder, which incorporates more bilingual parallel corpus modeling methods. Song et al. (2019) and Liu et al. (2020) proposed modeling methods that are more suitable for machine translation tasks in MASS and MBART. XLM-R applied the ideas of RoBERTa and achieved better results than XLM.
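The Translation Language Model objective concatenates a bilingual sentence pair and masks tokens on both sides, so the model can attend across languages to recover them. A minimal sketch of how such an input might be assembled (token and separator names are illustrative; real implementations also add language and position embeddings):

```python
import random

def build_tlm_input(src_tokens, tgt_tokens, mask_rate=0.3,
                    mask_token="[MASK]", sep_token="</s>", seed=0):
    """Concatenate a parallel sentence pair and randomly mask tokens.

    Sketch of the Translation Language Model (TLM) input of
    Conneau and Lample (2019); simplified for illustration.
    """
    rng = random.Random(seed)
    tokens = src_tokens + [sep_token] + tgt_tokens
    masked, labels = [], []
    for tok in tokens:
        if tok != sep_token and rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)   # prediction target
        else:
            masked.append(tok)
            labels.append(None)  # not a prediction target
    return masked, labels

inp, labels = build_tlm_input(["the", "cat", "sat"], ["le", "chat", "assis"])
```

Because masked tokens in one language can often be recovered from the aligned sentence in the other language, this objective encourages aligned cross-lingual representations.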

Methods of Offensive Language Detection and Categorization
In the last few years, there have been several studies applying computational methods to cope with offensive language. Waseem et al. (2017) proposed a typology that captures central similarities and differences between subtasks. Subsequent work trained multi-class classifiers to distinguish between these different categories, and employed supervised classification along with feature sets including n-grams, skip-grams and clustering-based word representations. There are also several workshops devoted to this problem, such as AWL2 and TRAC3 (Kumar et al., 2018).

Multi-lingual Offensive Language Detection
In Sub-task A, we aimed to build a unified approach to detect offensive language in all languages.
Our algorithm has two steps. In the first step, pre-training on large-scale multilingual unsupervised text yields a unified pre-trained model that learns representations for all languages together. In the second step, the pre-trained model is fine-tuned with labeled data. The detailed process is shown in Figure 1.

Multi-lingual Pretraining with Unsupervised Data
Multi-lingual Fine-tuning with Offensive Detection Data

This approach can benefit from datasets in other languages and enhances the generality of the model. We will compare methods trained on multilingual data with those trained on monolingual data in Section 5.
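Multi-lingual fine-tuning simply pools the labeled examples from all languages into one shuffled training set before fine-tuning the shared pre-trained model. A minimal sketch of the data pooling step (the dataset contents below are illustrative, not taken from the shared task data):

```python
import random

def mix_multilingual(datasets, seed=42):
    """Pool per-language labeled data into one shuffled training set.

    `datasets` maps a language code to a list of (text, label) pairs.
    Keeping the language code with each example allows per-language
    evaluation later.
    """
    pooled = [(lang, text, label)
              for lang, pairs in datasets.items()
              for text, label in pairs]
    random.Random(seed).shuffle(pooled)
    return pooled

data = mix_multilingual({
    "en": [("you are awful", "OFF"), ("nice day", "NOT")],
    "da": [("god morgen", "NOT")],
})
```

The pooled set is then fed to a single multi-lingual model, so low-resource languages benefit from examples in higher-resource ones.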

Offensive Language Categorization using Knowledge Distillation trained on Soft Labels
In Sub-task B and Sub-task C, we constructed a knowledge distillation approach (Hinton et al., 2015; Liu et al., 2019a). Several supervised models calculated the probability of each label, and a weighted combination of these probabilities (which we call a soft label) was generated. The student model was then trained on those soft labels. The detailed process is shown in Figure 2.
Suppose that X is the contextual embedding of the token [CLS], which can be viewed as the semantic representation of the input sentence. Let Q(c|X) be the class probabilities produced by the ensemble of several supervised models. The probability P_r(c|X) that X is labeled as class c is predicted by a softmax layer. We use the standard cross entropy loss to learn the soft target: -Σ_c Q(c|X) log P_r(c|X). We used ERNIE 2.0 and ALBERT as our candidate pre-trained language models in Sub-task B and Sub-task C.
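The soft-target cross entropy above can be computed directly from the two distributions. A minimal sketch (the example probability values are illustrative):

```python
import math

def soft_cross_entropy(teacher_probs, student_probs):
    """Cross entropy -sum_c Q(c|X) log P_r(c|X) between the ensemble's
    soft label Q(c|X) and the student's predicted distribution P_r(c|X)."""
    return -sum(q * math.log(p) for q, p in zip(teacher_probs, student_probs))

# A soft label like [0.7, 0.3] carries more information about the
# teachers' uncertainty than the hard label [1, 0] derived from it.
loss = soft_cross_entropy([0.7, 0.3], [0.6, 0.4])
```

When the teacher distribution is a one-hot vector, this reduces to the usual hard-label cross entropy, so distillation training only changes the targets, not the loss function.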

Dataset
We used the datasets of OffensEval 2019 and OffensEval 2020 as our training data. In OffensEval 2019, the organizers provided a dataset of English tweets annotated using a hierarchical three-level annotation scheme. In OffensEval 2020, the organizers did not provide additional English training data; instead, they provided training data in four other languages: Turkish, Danish, Greek and Arabic. In addition, they provided a large amount of weakly labeled data generated by several supervised models.

Sub-Task A - Offensive Language Identification

In Sub-task A, the goal is to discriminate between offensive and non-offensive posts. Offensive posts include insults, threats, and posts containing any form of untargeted profanity. Each instance is assigned one of the following two labels: 'NOT' means posts that do not contain offense or profanity; 'OFF' means posts containing any form of non-acceptable language or a targeted offense. To avoid uneven proportions of data across languages, we did not use the weakly labeled English data from OffensEval 2020. Instead, we used a mix of English data from OffensEval 2019 (both training and test data) and the OffensEval 2020 training data in the other four languages as our training data. Details are shown in Table 1.

Sub-Task B - Automatic Offense Language Categorization

In Sub-task B, the goal is to predict the type of offense. The two types in Sub-task B are the following: 'TIN' means posts containing an insult or threat to an individual, a group, or others; 'UNT' means posts containing non-targeted profanity and swearing. The dataset consists of two parts: a small portion of manually annotated data from OffensEval 2019 and a large portion of data from OffensEval 2020 constructed using multiple supervision models. Each instance in the OffensEval 2020 training data provides the confidence that it has a target of attack. Details are shown in Table 2.
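Since each OffensEval 2020 instance comes with only a confidence that the post is targeted, a two-class soft label can be derived directly from that number. A minimal sketch (the confidence value below is illustrative):

```python
def confidence_to_soft_label(conf_tin):
    """Turn the provided confidence that a post is targeted ('TIN')
    into a two-class soft label over ('TIN', 'UNT')."""
    if not 0.0 <= conf_tin <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return {"TIN": conf_tin, "UNT": 1.0 - conf_tin}

label = confidence_to_soft_label(0.85)
```

These derived distributions can then serve as the teacher probabilities Q(c|X) in the distillation loss, with no extra inference over the weakly labeled data.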

Sub-Task C - Offense Target Identification

In Sub-task C, the goal is to predict the target of the offense. The three labels in Sub-task C are the following: 'IND' means posts targeting an individual; 'GRP' means the target of the offensive post is a group of people; 'OTH' means the target of the offensive post does not belong to either of the previous two categories. As with Sub-task B, all training data in OffensEval 2020 provide the confidence for each label. Details are shown in Table 3.

Table 4: Results for Sub-Task A. We report the F1 score for all languages.

Results of Sub-task B and Sub-task C
In both Sub-task B and Sub-task C, we compared the hard-target-based approach with the soft-target-based approach. Two models were used for validation: ALBERT-XXLarge and ERNIE 2.0. The results are shown in Table 5 and Table 6, where it can be seen that the knowledge distillation approach is helpful for offensive language categorization. As with Sub-task A, the metric used is the average F1 score over all labels. Again, for reliability, we report the average score of 5 repeated experiments. We also list our final submitted results below, which were obtained using a ten-fold cross-validation-based ensemble of ERNIE 2.0.
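The submitted predictions were an ensemble over cross-validation folds. A minimal sketch of averaging per-fold class probabilities for one example and taking the argmax (three folds are shown for brevity instead of ten; the probability values are illustrative):

```python
def ensemble_folds(fold_probs, labels):
    """Average per-fold class probabilities and pick the best label.

    `fold_probs` is a list of per-fold probability vectors for one
    example (ten vectors in ten-fold cross-validation).
    """
    n = len(fold_probs)
    avg = [sum(fold[c] for fold in fold_probs) / n
           for c in range(len(labels))]
    best = max(range(len(labels)), key=avg.__getitem__)
    return labels[best], avg

pred, avg = ensemble_folds(
    [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1]],
    ["IND", "GRP", "OTH"])
```

Averaging probabilities rather than votes keeps the ensemble's confidence information, which also matters when exporting soft labels for distillation.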

Table 6: Results for Sub-task C

Conclusion

In this paper, we presented our approach to detecting and categorizing offensive language in social media. We proposed a multi-lingual learning method to detect offensive language and a knowledge distillation method to categorize offensive language. We will further explore multilingual offensive language identification in the future, e.g. by validating the zero-shot performance of our model in more languages.