iCompass at SemEval-2020 Task 12: From a Syntax-ignorant N-gram Embeddings Model to a Deep Bidirectional Language Model

We describe the system we submitted to SemEval 2020 Task 12, “Multilingual Offensive Language Identification in Social Media”, specifically subtask 4A-Arabic. We propose three Arabic offensive language identification models: Tw-StAR, BERT and BERT+BiLSTM. Two Arabic abusive/hate datasets, L-HSAB and T-HSAB, were added to the training dataset. The final submission was chosen based on the best performance, which was achieved by the BERT+BiLSTM model.


Introduction
With the freedom of expression granted after the revolutions in Arab countries, sensitive topics such as religion and politics have become popular and widely discussed across social media platforms. On the downside, however, offensive language spreads easily. Indeed, recent events such as the Persian Gulf crisis, the parliamentary and presidential elections held in Tunisia, or a football game between two Arab clubs caused intense debates, most of which took place on social media networks and led to a surge of offensive speech. This calls for tools that identify such offensive content online.
Analyzing Arabic offensive language is particularly challenging due to the complex nature and morphology of the Arabic language. Furthermore, Arabic on social media is mostly informal and written in Arabic dialects.
We describe our participation in SemEval 2020 Task 12, "Multilingual Offensive Language Identification in Social Media" Zampieri et al. (2020), specifically Task 12 subtask 4A-Arabic, "Arabic Offensive Language Identification in Social Media" Mubarak et al. (2020), under the team name "iCompass". The task is to distinguish offensive posts, which contain any form of non-acceptable language (profanity), from non-offensive posts.
The remainder of the paper is organized as follows: in Section 2, we introduce the L-HSAB and T-HSAB datasets. In Section 3, we describe the preprocessing step. Section 4 introduces the learning strategies and datasets used in the presented models. Results are reviewed and discussed in Section 5, while Section 6 concludes the study.

Arabic offensive language datasets
The dataset provided for SemEval Task 12 subtask A-Arabic consists mainly of comments in the Egyptian dialect, with a few in other dialects such as Libyan, Sudanese and Syrian. Table 1 gives some comments in different dialects after the preprocessing phase.

Data Preprocessing
Data was preprocessed by cleaning the tweets of social media-inherited symbols (such as Rt, <LF> and @), URLs, usernames, dates, retweets, symbols, punctuation, emojis and non-Arabic characters, in order to remove noise and keep only the Arabic text. Table 2 shows comment number 2372 of the training set before and after preprocessing.
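The cleaning step can be sketched in Python as follows. This is a minimal illustration, not the authors' actual code: the paper does not list its regular expressions, so every pattern and the name clean_tweet are assumptions. In particular, dropping Arabic diacritics and keeping only the letters U+0621 to U+064A is one plausible reading of "Arabic text only".

```python
import re

# All patterns below are assumptions; the paper does not give its exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
MARKER_RE = re.compile(r"<LF>|\bRT\b", flags=re.IGNORECASE)
# Arabic diacritics (fathatan through sukun) are deleted without adding a space.
DIACRITICS_RE = re.compile(r"[\u064B-\u0652]")
# Anything that is not an Arabic letter (U+0621 to U+064A) or whitespace.
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Strip URLs, mentions, markers and non-Arabic characters from a tweet."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = MARKER_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    text = NON_ARABIC_RE.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return WHITESPACE_RE.sub(" ", text).strip()
```

Mentions are removed before the non-Arabic filter so that usernames written in Arabic script do not survive as ordinary words.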

Used models and Learning Strategies
In this section, we describe the learning strategies and the different architectures used, and briefly review the mechanism of each. We used three classification approaches.

Tw-StAR
Tw-StAR is a syntax-ignorant n-gram embeddings model used in sentiment analysis of several Arabic dialects Mulki et al. (2019b). Tw-StAR's embeddings are composed and learned using an unordered composition function and a shallow neural model. The model achieved state-of-the-art results at SemEval-2017 Task 4: "Sentiment Analysis in Twitter" Mulki et al. (2017), including for the Arabic language, and at SemEval-2018 Task 1: "Affect in Tweets", subtask Ec "Detecting Emotions (multi-label classification)" Mulki et al. (2018).
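To illustrate what an unordered composition of n-gram embeddings means, the sketch below averages the vectors of all word n-grams in a comment. It is not Tw-StAR's implementation: the averaging function, the contiguous n-gram scheme, the dimension DIM and the hash-based toy_embedding (a stand-in for learned vectors) are all assumptions made for illustration.

```python
import hashlib
import math
from typing import List

DIM = 8  # illustrative embedding size (assumption)


def word_ngrams(tokens: List[str], max_n: int) -> List[str]:
    """Return all contiguous word n-grams for n = 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def toy_embedding(ngram: str) -> List[float]:
    """Deterministic stand-in for a learned n-gram vector (hypothetical)."""
    digest = hashlib.md5(ngram.encode("utf-8")).digest()
    return [byte / 255.0 - 0.5 for byte in digest[:DIM]]


def compose(ngrams: List[str]) -> List[float]:
    """Average the n-gram vectors; the result does not depend on their order."""
    vectors = [toy_embedding(g) for g in ngrams]
    if not vectors:
        return [0.0] * DIM
    return [math.fsum(column) / len(vectors) for column in zip(*vectors)]
```

Because the composition is a plain average, shuffling the list of n-grams leaves the comment vector unchanged, which is the sense in which such a model ignores syntax. In a real system the resulting vector would feed a shallow neural classifier.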
In addition, we used an offensive lexicon extracted from the L-HSAB and T-HSAB datasets. The lexicon is composed of offensive words in Arabic dialects, such as the words meaning "thief" and "traitor". We combined the predictions of our classifier with the results of the lexicon-based method to obtain a hybrid solution. This hybrid solution degraded the results, since it can neither identify negations nor detect non-offensive comments that contain offensive words; Table 3 gives an example.
Arabic: ...
English: this was a good and fearful person, not a thief and a traitor as they said
Table 3: Example of a non-offensive comment that contains offensive words.
The Tw-StAR+Lexicon approach classified this comment as offensive because it contains the words meaning "thief" and "traitor", which appear in the lexicon of offensive words and mislead the prediction for this type of sentence. This approach decreased the performance, as shown in Table 5.
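The failure mode described above can be reproduced with a small sketch. The paper does not state how the classifier and lexicon outputs were combined, so the OR rule below is an assumption, as are the function names and the English stand-ins for the Arabic lexicon entries.

```python
from typing import Set

# Hypothetical English stand-ins for the Arabic lexicon entries.
OFFENSIVE_LEXICON: Set[str] = {"thief", "traitor"}


def lexicon_flag(text: str, lexicon: Set[str] = OFFENSIVE_LEXICON) -> bool:
    """True when any token of the text appears in the offensive lexicon."""
    return any(token.strip(".,!?") in lexicon for token in text.lower().split())


def hybrid_label(classifier_label: str, text: str) -> str:
    """Assumed OR rule: a lexicon hit overrides a NOT prediction."""
    if classifier_label == "OFF" or lexicon_flag(text):
        return "OFF"
    return "NOT"


comment = ("this was a good and fearful person, "
           "not a thief and a traitor as they said")
# Even if the classifier says NOT, the lexicon match forces OFF,
# because the rule has no notion of the negation "not a thief".
print(hybrid_label("NOT", comment))
```

Under this rule the negated comment is labelled OFF, matching the misclassification reported for the Tw-StAR+Lexicon approach.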

BERT
Context-free models such as word2vec or GloVe generate a single embedding for each word in the vocabulary. In contrast, contextual models generate a representation of each word based on the other words in the sentence. For this reason, we selected Bidirectional Encoder Representations from Transformers (BERT) Devlin et al. (2019), in its light multilingual version, as a contextual language model and embedding technique.
BERT is a deep bidirectional language model, pretrained on large corpora (BooksCorpus: 800M words, and Wikipedia: 2,500M words), that can be fine-tuned to solve many NLP tasks such as named entity recognition (NER), question answering (QA) and text classification Devlin et al. (2019).
We used the BERT base multilingual cased model, with 12 transformer layers, 12 attention heads and 110M parameters. We used the pretrained model and trained the classifier to predict the probabilities of the labels (OFF or NOT), tuning the hyperparameter values to obtain the best performance. Values for fine-tuning are stated in Table 4.

BERT + BiLSTM
The dataset was tokenized using the BERT tokenizer, which maps Arabic words to their indexes. The BERT embedding matrix was used at the embedding layer, and a BiLSTM model was then used as the classifier to predict the label probabilities.
Hyperparameter     Tw-StAR   BERT   BERT+BiLSTM
Batch size         128       8      16
Sequence length    -         128    128
Epochs             6         3      6
N-grams (N)        8         -      -
Table 4: Hyperparameter values.

Results and Discussion
Three datasets are provided by SemEval 2020 Task 12 subtask 4A-Arabic: TRAIN (7000 comments) for training models, DEV (1000 comments) for tuning models, and TEST (2000 comments) for the official evaluation. Data was preprocessed using regular expression matching and substitution provided by the re Python module. Once the data was preprocessed and the features extracted, the training dataset was split into 80% for training and 10% for cross-validation. We have trained our Tw-StAR, BERT, and BiLSTM models on the DEV set. Table 5 lists the results of the four classification architectures: Tw-StAR, Tw-StAR+Lexicon, BERT, and BERT+BiLSTM. BERT achieved a slight improvement over the Tw-StAR baseline and BiLSTM, reaching the best performance with a macro F-score (Macro F1) of 0.825 and an accuracy of 0.898.
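For reference, the two reported metrics can be computed as in the sketch below. The definitions are the standard ones (macro F1 is the unweighted mean of the per-label F1 scores); the function names and the toy labels in the usage note are illustrative and are not taken from the task data.

```python
from typing import Sequence

LABELS = ("OFF", "NOT")


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 score of one label, treating that label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

Because macro F1 weights OFF and NOT equally, it penalizes a model that neglects the less frequent class more than accuracy does, which is why the two scores can differ noticeably on imbalanced data.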

In order to improve performance, we used the T-HSAB and L-HSAB datasets for training in addition to the training dataset provided by SemEval 2020 Task 12 subtask 4A-Arabic. The characteristics of each dataset are given in Table 6. Considering the results in Table 5 and Table 7, the supervised learning-based model with the BiLSTM algorithm achieved the best average F-score (micro and macro) compared to the BERT base and Tw-StAR models. Hence, BERT+BiLSTM was used to produce the TEST set classification results for the final submission.

Conclusion and Future work
Three classification architectures were used to identify offensive language in Arabic tweets. The best macro F-score was obtained by the BiLSTM classification architecture using BERT base multilingual word embeddings, which was selected for the final submission.