KEIS@JUST at SemEval-2020 Task 12: Identifying Multilingual Offensive Tweets Using Weighted Ensemble and Fine-Tuned BERT

This paper presents the participation of our team, KEIS@JUST, in SemEval-2020 Task 12, a shared task on multilingual offensive language identification. We participated in all provided languages and subtasks except sub-task A for English. We developed two main approaches. The first, applied to Arabic and English, is a weighted ensemble of Bi-GRU and CNN models with Gaussian noise and global pooling layers, whose predictions are multiplied by weights to improve overall performance. The second, applied to the other languages, uses transfer learning from BERT together with recurrent neural networks (Bi-LSTM and Bi-GRU) followed by a global average pooling layer. Word embeddings and contextual embeddings were used as features, and data augmentation was used only for Arabic.


Introduction
Natural language processing (NLP) has attracted growing attention from researchers, especially with the rapid growth of social media such as Twitter, Facebook, YouTube comments, and microblogs. Consequently, the identification of offensive, aggressive, and hate-speech language, that is, the automatic detection of such content in textual data, has emerged as a research problem. The main motivation is to reduce the effect of hate speech and offensive or aggressive language on user attitudes and content, particularly on social media.
Offensive language detection in Arabic social media is a challenging task. Arabic contains violent words that can appear in both violent and non-violent contexts, and Arabic is also a language of many dialects (Elfardy and Diab, 2013). For instance, the Arabic word for "killing" carries a violent meaning in some contexts and a non-violent meaning in others, as seen in the tweets of Alhelbawy et al. (2016). Most studies on offensive language detection have addressed English. Research on Arabic in this area of NLP has been limited by the lack of resources compared with English (Mubarak et al., 2017; Abozinadah et al., 2015). The studies conducted on Arabic were evaluated on small datasets collected through the Twitter API; for example, Mubarak et al. (2017) evaluated their approach on 1100 annotated tweets.
In this paper, we describe the participation of our team, KEIS@JUST, in SemEval-2020 Task 12, a multilingual shared task on offensive language covering Arabic, English, Danish, Greek, and Turkish. We participated in all languages for the provided subtasks except sub-task A for English. We implemented two approaches. The first, applied to Arabic and English, is a weighted ensemble of Bi-GRU and CNN models with Gaussian noise and global pooling layers, whose predictions are multiplied by weights to improve overall performance. This approach was used for sub-task A (offensive language identification), sub-task B (automatic categorization of offense types), and sub-task C (offense target identification). The second, applied to the other languages, uses state-of-the-art transfer learning from the BERT multi-cased 12A pre-trained model together with recurrent neural networks (Bi-LSTM and Bi-GRU) followed by a global average pooling layer. This approach was used for sub-task A (offensive language identification). Word embeddings and contextual embeddings were used as features. Data augmentation was used only for Arabic, relying on the AraVec embedding (Soliman et al., 2017) to create additional data for training the model. We evaluated our systems on the multilingual dataset provided by OffensEval 2020. The best results of the KEIS@JUST team ranked 11th out of 56 teams with 86.55% F1-macro for Arabic, 12th out of 39 teams with 76.1% F1-macro for Danish, 28th out of 37 teams with 76.1% F1-macro for Greek, and 32nd out of 46 teams with 73.3% F1-macro for Turkish.

Related Work
Offensive content on social media has recently attracted attention (Schmidt and Wiegand, 2017; Founta et al., 2018) because of its negative effects on users, for instance demeaning comments or hate speech. Detecting offensive language on Arabic social media is an important step toward protecting society from these effects. Several previous studies have presented comprehensive surveys that describe the key aspects of the task (Schmidt and Wiegand, 2017; Fortuna and Nunes, 2018). Davidson et al. (2017) present a dataset for hate speech detection, Kumar et al. (2018) present a dataset for aggressive language, and Zampieri et al. (2019a) present the OLID dataset used in the previous shared task on offensive language. Spertus (1997) describes one of the earliest efforts in hate speech detection, which used a decision-tree-based classifier. Offensive language identification has also been studied for languages other than English, such as Arabic (Mubarak et al., 2017; Al-Hassan and Al-Dossari, 2019) and German (Ross et al., 2017; Fišer et al., 2017; Su et al., 2017).
Research on offensive language within the Arabic research community is scarce. Abdelfatah et al. (2017) applied k-means to violent utterances on Twitter; they used MADAMIRA to extract morphological features and TF-IDF to represent the dataset in the vector space model. Malmasi and Zampieri (2017) presented a system that detects hate speech using n-gram lexical features and a linear SVM classifier. Similarly, Alakrot et al. (2018) introduced an SVM classifier trained on word-level features, using n-grams and stemming. Mulki et al. (2019) proposed L-HSAB, the first dataset for hate speech and abusive language. The dataset was collected through the Twitter API, focusing on Syrian and Lebanese tweets rich in toxic utterances, and was used to train a Naive Bayes classifier.
For English, several researchers have used transformers. For instance, a fine-tuning technique for Bidirectional Encoder Representations from Transformers (BERT) has been proposed with word unigrams, word2vec, and Hatebase as features. Similarly, Zhu et al. (2019) introduced a fine-tuned BERT-based classifier that relies on a linear SVM trained on character n-grams as features. Pelicon et al. (2019) proposed a fine-tuned BERT and LSTM neural network architecture with automatically and manually crafted features, namely word embeddings, TF-IDF, POS sequences, BOW, the length of the longest punctuation sequence, and the sentiment of the tweets. Other researchers applied machine learning and deep learning. For instance, Mahata et al. (2019) proposed an ensemble consisting of a Convolutional Neural Network, a Bidirectional LSTM with attention, and a Bidirectional LSTM + Bidirectional GRU. Han et al. (2019) presented two approaches: a bidirectional GRU and a probabilistic model, modified sentence offensiveness calculation (MSOC), trained using word2vec embeddings.

Methodology
In this section, we describe the shared task on Multilingual Offensive Language Identification in Social Media (OffensEval 2020) and the implemented system.

Sub-task A
Offensive language identification aims to identify whether a tweet contains unacceptable language (profanity) or offensive content. Sub-task A is multilingual and covers five languages: Arabic, English, Danish, Greek, and Turkish. It is a binary classification task in which each tweet is labeled offensive (OFF) or not offensive (NOT). Our team (KEIS@JUST) participated in sub-task A for all languages except English.

Sub-task B
Automatic categorization of offense types aims to identify whether an offensive tweet contains targeted or untargeted profanity and swearing. This sub-task is a binary classification task provided only for English, in which each tweet is labeled targeted (TIN) or untargeted (UNT).

Sub-task C
Offense target identification aims to determine whether the target of the offense in a tweet falls under one of three tags: an individual (IND), a group (GRP), or other (OTH). The OTH tag covers several kinds of targets (e.g., a situation, an organization, an event, or an issue). This sub-task is provided only for English.

Weighted Ensemble (KEIS-BiGRUCNN)
The intuition behind ensembles, namely combining the predictions of several models, goes back to Yu (1977), who combined two regression techniques; later, Dasarathy and Sheela (1979) presented the combination of two or more models. In this work, we use a weighted ensemble of two models, KEIS-BiGRU and KEIS-CNN, with the aim of boosting the performance of our system. The following provides more details on the implemented techniques.
• Bidirectional-GRU (KEIS-BiGRU): Recurrent Neural Networks (RNNs) suffer from the vanishing gradient problem. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) was proposed to solve this problem, as was the Gated Recurrent Unit (GRU) (Chung et al., 2014). The GRU architecture uses two gates: a reset gate and an update gate.
The model starts by passing a sequence of words through an embedding layer, followed by a Bidirectional GRU layer of 128 neurons and then Gaussian noise (0.1). Afterward, global average pooling is used to extract the discriminative features of the input tweet for the next layer. A dense layer of 35 neurons is applied, followed by a dropout layer (0.2) to prevent overfitting. Finally, the output layer is a dense layer of one neuron with a sigmoid function.
• Convolutional Neural Network (KEIS-CNN): The CNN (Kim, 2014) provides a remarkable enhancement in the performance of NLP tasks. It can capture linguistic patterns from windows of consecutive words represented as embedding vectors.
The model starts by passing a sequence of words through an embedding layer followed by Gaussian noise (0.1). Because Conv2D expects a different input shape, the output of the previous layer is reshaped to be compatible. Four Conv2D layers are used, with filter sizes of 1, 3, 5, and 7, respectively. Each layer has 36 filters, which aim to capture local information features. Each Conv2D layer is then passed to a 2D max pooling layer (MaxPool2D). In the last step, the pooled outputs are concatenated to obtain a richer representation. Before the next layer, we apply a dropout layer (0.25) to reduce overfitting, followed by a dense layer of 35 neurons. The output is fed into a single sigmoid unit, which gives the output class of the given tweet.
For training, the KEIS-BiGRUCNN ensemble used the pre-trained AraVec embedding (Soliman et al., 2017) with 300 dimensions for Arabic.
In contrast, for English we applied the word2vec embedding proposed by Mikolov et al. (2013); the pre-trained embedding, with 400 dimensions, is available on GitHub. Several hyper-parameters were used for optimization. Table 1 provides the value of each parameter used during training for both approaches. It is worth mentioning that we used the AMSGrad optimizer (Tran and others, 2019), an updated version of the Adam optimizer that gave a slight improvement in system performance. In the final step, after obtaining the predictions of each model, we multiplied them by the best-chosen weights to enhance the overall results. The ensemble architecture is shown in Fig. 1. KEIS-BiGRUCNN was used to solve sub-task A for Arabic and sub-tasks B and C for English.
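The weighting step can be illustrated with a minimal sketch. The text does not state how the weights were chosen, so the grid search, the convex combination (the two weights sum to 1), the 0.5 threshold, the use of accuracy as the selection score, and the toy probabilities below are all assumptions made for illustration.

```python
def weighted_ensemble(p_bigru, p_cnn, w_bigru, w_cnn):
    """Multiply each model's sigmoid outputs by its weight and add them."""
    return [w_bigru * a + w_cnn * b for a, b in zip(p_bigru, p_cnn)]


def to_labels(probs, threshold=0.5):
    """Turn probabilities into binary labels (1 = positive class)."""
    return [1 if p >= threshold else 0 for p in probs]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def search_weights(p_bigru, p_cnn, y_true, steps=10):
    """Try w = 0.0, 0.1, ..., 1.0 for the Bi-GRU and 1 - w for the CNN
    (assumed scheme) and keep the first weight with the best score."""
    best_w, best_score = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        preds = to_labels(weighted_ensemble(p_bigru, p_cnn, w, 1 - w))
        score = accuracy(y_true, preds)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score


# Toy validation data (hypothetical): each model alone misclassifies one tweet.
y_true = [1, 1, 0, 0]
p_bigru = [0.9, 0.3, 0.2, 0.2]
p_cnn = [0.6, 0.8, 0.1, 0.6]

best_w, best_score = search_weights(p_bigru, p_cnn, y_true)
print(best_w, best_score)
```

On this toy data, each model alone reaches 0.75 accuracy, while a suitably weighted combination classifies all four tweets correctly, which is the kind of gain a weighted ensemble is meant to provide.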

BERT Fine-Tuned (BERT-Bi)
In recent years, contextual embeddings have shown significant progress in NLP research. According to results reported in several studies (e.g., Zhu et al. (2019)), they outperform earlier deep learning approaches. The Transformer is an encoder-decoder architecture built on attention mechanisms. In particular, Google released BERT (Devlin et al., 2018), which stands for Bidirectional Encoder Representations from Transformers. Our intention is to solve the offensive language detection shared task by fine-tuning BERT, adding a Gaussian noise layer followed by a bidirectional LSTM (Hochreiter and Schmidhuber, 1997).
The implemented BERT-Bi model is based on a transfer learning architecture, which is commonly used, especially in image classification and computer vision (Litjens et al., 2017). As mentioned in Sec. 2, transformers show promising results compared with deep learning approaches. The BERT developers created several pre-trained models, such as uncased, cased, and multi-cased, to represent the semantic relationships in text; BERT can also be applied as an independent classifier in different NLP domains (e.g., offensive language detection). In this work, we used the multi-cased model, since it was trained on multiple languages, to tackle the shared task. The BERT-Bi architecture, shown in Fig. 2, was used to solve sub-task A for three languages: Greek, Danish, and Turkish. The special [CLS] token is added at the beginning of each tokenized tweet, and the special [SEP] token is added at the end. The attention mask is represented as an array of 1's and 0's. Several parameters were used to implement the proposed model. According to the experimental results, the best parameters were: batch size = 16, optimizer = Adam, learning rate = 2e-5, and BERT max length = 60.
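The input formatting described above ([CLS], [SEP], and the attention mask) can be sketched in plain Python. This is only an illustration: the whitespace split stands in for BERT's WordPiece tokenizer, and the truncation and [PAD] padding to the maximum length of 60 are assumptions about how the fixed length was enforced.

```python
MAX_LEN = 60  # BERT max length reported in the text


def build_bert_inputs(tokens, max_len=MAX_LEN, pad_token="[PAD]"):
    """Wrap tokens with [CLS]/[SEP], pad or truncate to max_len,
    and build the matching attention mask (1 = real token, 0 = padding)."""
    tokens = tokens[: max_len - 2]          # leave room for the two special tokens
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    attention_mask = [1] * len(sequence)
    padding = max_len - len(sequence)
    sequence += [pad_token] * padding
    attention_mask += [0] * padding
    return sequence, attention_mask


# Whitespace splitting is a stand-in for WordPiece tokenization (assumption).
tokens = "this is a toy tweet".split()
sequence, mask = build_bert_inputs(tokens)
print(sequence[:8])
print(mask[:8])
```

A real pipeline would additionally map the tokens to vocabulary ids using the tokenizer shipped with the pre-trained model.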
A multilingual dataset was provided in five languages: Arabic, Danish, English, Greek, and Turkish. The annotation follows the hierarchical tagset of the previous Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a). The dataset provided for SemEval-2020 Task 12 was obtained from Twitter using its APIs. OffensEval 2020 provided three sub-tasks: (1) whether the tweet is offensive (OFF) or not offensive (NOT), (2) whether the tweet is targeted (TIN) or untargeted (UNT), and (3) whether the target is an individual (IND), a group (GRP), or other (OTH). The provided dataset is multilingual and imbalanced with respect to the label distribution of each sub-task, and it includes the labels for the three sub-tasks in a tab-separated file format. Table 2 shows the distribution of the available dataset. Table 3 provides examples from the dataset for all languages.

Data Pre-processing
Pre-processing is an essential step for social network data such as Facebook posts and tweets, which contain noisy data and slang. From the raw text, we removed special characters, punctuation marks such as * , @ # - ( ), URLs, and user mentions. Normalization was necessary because some words are written in shortened form; elongation was also removed (e.g., congrats). Finally, numbers and English characters were removed for Arabic, and emojis were removed as well.
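A minimal sketch of these cleaning steps is given below. The regular expressions, the order of the steps, and the rule that collapses three or more repeated characters into one are assumptions; the text lists which elements were removed but not how.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
LATIN_OR_DIGIT_RE = re.compile(r"[A-Za-z]+|\d+")
ELONGATION_RE = re.compile(r"(.)\1{2,}")  # three or more repeats of one character


def strip_symbols(text):
    """Replace punctuation, special characters, and emojis with spaces.
    Letters, digits, whitespace, and combining marks are kept."""
    kept = []
    for ch in text:
        if ch.isalnum() or ch.isspace() or unicodedata.combining(ch):
            kept.append(ch)
        else:
            kept.append(" ")
    return "".join(kept)


def clean_tweet(text, arabic=False):
    text = URL_RE.sub(" ", text)        # URLs
    text = MENTION_RE.sub(" ", text)    # user mentions
    text = strip_symbols(text)          # punctuation, special characters, emojis
    if arabic:
        # Numbers and English characters are removed only for Arabic.
        text = LATIN_OR_DIGIT_RE.sub(" ", text)
    text = ELONGATION_RE.sub(r"\1", text)  # collapse elongation (assumed rule)
    return " ".join(text.split())


print(clean_tweet("Congraaaats @user!!! http://t.co/abc :)"))  # Congrats
```

Collapsing repeats to a single character is a heuristic and can over-correct words that legitimately contain a tripled letter; a dictionary lookup would be needed to handle those cases.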

Embeddings
Several well-known word embeddings are available for extracting vector representations of the input tweets, with the aim of capturing the semantic features of each word and the relationships among words: Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), AraVec (Soliman et al., 2017), and the more recent contextual embeddings ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018). In this work, we used AraVec, Word2Vec, and the pre-trained BERT embedding to train our models. BERT is a language representation model that has become the state of the art for most NLP research.

Data Augmentation
Data augmentation is a way to improve the performance of NLP models, and it should be based on a deep understanding of the dataset, including its structure and content. The impact of a data augmentation technique depends on the technique itself, since each one is able to learn something different compared to

Discussion
Our results were obtained using the CodaLab user name SAJA; the team name is KEIS@JUST. The results on the validation set are presented in Table 4 for Arabic and Table 5 for the other languages. Table 4 presents the results using data augmentation for Arabic and shows how augmentation improved overall performance: the KEIS-BiGRUCNN approach achieved F1 = 87.9% on the Arabic validation set. The BERT-Bi approach achieved F1 = 78% on the Danish validation set (see Table 5). As mentioned in Sec. 3.2, the KEIS@JUST system consists of (a) KEIS-BiGRUCNN, used to solve sub-task A for Arabic and sub-tasks B and C for English, and (b) KEIS-BERT-Bi, used to solve sub-task A for the other languages. To prevent overfitting during training, early stopping and checkpoints were used with the training and validation sets, keeping track of the loss value at the end of each training epoch. Learning rate reduction was also used. Fig. 3 and Fig. 4 show the model training.
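The interaction between checkpointing, learning rate reduction, and early stopping can be sketched by replaying a sequence of validation losses. The patience values, the reduction factor, the starting learning rate, and the loss values are illustrative assumptions, not the settings used in our experiments.

```python
def monitor_training(val_losses, patience=3, lr=1e-3, lr_patience=2, lr_factor=0.5):
    """Replay per-epoch validation losses and report where training would stop.
    All default values are illustrative assumptions."""
    best_loss = float("inf")
    best_epoch = None  # epoch whose weights would be saved as the checkpoint
    wait = 0           # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0  # checkpoint saved here
        else:
            wait += 1
            if wait % lr_patience == 0:
                lr *= lr_factor  # reduce the learning rate on a plateau
            if wait >= patience:
                return {"best_epoch": best_epoch, "stopped_epoch": epoch, "lr": lr}
    return {"best_epoch": best_epoch, "stopped_epoch": len(val_losses), "lr": lr}


# Hypothetical validation losses: the loss stops improving after epoch 3.
print(monitor_training([0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39]))
```

Deep learning frameworks typically provide this behavior out of the box; in Keras, for example, the `EarlyStopping`, `ModelCheckpoint`, and `ReduceLROnPlateau` callbacks play these roles.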

Results and Findings
To evaluate the implemented approaches, we used F1-macro, following the shared task instructions. Table 6 presents the results of the participants' models for Arabic, Danish, and Greek
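For reference, F1-macro is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class on an imbalanced dataset. The sketch below computes it from scratch; the label lists are toy values, not task data.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: F1(OFF) = 0.5 and F1(NOT) = 0.75, so F1-macro = 0.625.
y_true = ["OFF", "NOT", "NOT", "NOT", "OFF", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
print(f1_macro(y_true, y_pred))  # 0.625
```

In this toy example accuracy is about 0.67, while F1-macro is 0.625 because the weaker score on the OFF class is given equal weight.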

Conclusion
In this paper, we presented the KEIS@JUST participation in SemEval-2020 Task 12, a shared task on multilingual offensive language identification. We participated in all provided languages and subtasks except sub-task A for English. We developed two main approaches. The first, applied to Arabic and English, is a weighted ensemble of Bi-GRU and CNN models with Gaussian noise and global pooling layers, whose predictions are multiplied by weights to improve overall performance. For the other languages, we investigated the impact of a transfer learning approach based on the BERT transformer together with recurrent neural networks (Bi-LSTM and Bi-GRU) followed by a global average pooling layer. Word embeddings and contextual embeddings were used as features, and we investigated how data augmentation affects the results on the Arabic dataset.