LISAC FSDM-USMBA Team at SemEval-2020 Task 12: Overcoming AraBERT’s pretrain-finetune discrepancy for Arabic offensive language identification

AraBERT is an Arabic version of the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) model. The latter has achieved good performance in a variety of Natural Language Processing (NLP) tasks. In this paper, we propose an effective AraBERT embeddings-based method for identifying offensive Arabic language on Twitter. First, we pre-process tweets by handling emojis and including their Arabic meanings. To overcome the pretrain-finetune discrepancy, we substitute each detected emoji with the special token [MASK] in both the fine-tuning and inference phases. Then, we represent tweet tokens by applying the AraBERT model. Finally, we feed the tweet representation into a sigmoid function to decide whether a tweet is offensive or not. The proposed method achieved the best results on the OffensEval 2020: Arabic task, reaching a macro F1 score of 90.17%.


Introduction
Negative social media behaviors, including cyberbullying, trolling, and offensive language, are intended to hurt or embarrass a victim. Preventing antisocial internet use therefore protects internet users, letting them have a positive online experience without any type of harassment. Zampieri et al. (2019) introduced a shared task for identifying and categorizing offensive language in social media. The task is divided into three sub-tasks: Sub-task A: detect whether a post is offensive (OFF) or not (NOT); Sub-task B: categorize the offense type of an offensive post as targeted insult (TIN), targeted threat (TTH), or untargeted (UNT); and Sub-task C: identify the offense target as an individual (IND), a group of people (GRP), an organization or entity (ORG), or other (OTH). The three sub-tasks were run for English, while only Sub-task A was run for Arabic, Danish, Greek, and Turkish.
The majority of OffensEval 2019 (Zampieri et al., 2019) participants used deep neural network models such as Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) to build their offensive language detectors. These models can learn and automatically extract complex features from raw data, and have thus achieved state-of-the-art results in different Natural Language Processing (NLP) tasks.
In this paper, we follow the same path by building an Arabic offensive language detector based on BERT, more specifically AraBERT (Antoun et al., 2020). Our method overcomes AraBERT's pretrain-finetune discrepancy by including the special token [MASK] during the fine-tuning and inference phases. We obtained promising results: we achieved a 90.17% macro F1 score on the test set and ranked first among the participants.
This paper is organised as follows: Section 2 describes the proposed method; Section 3 presents the experimental settings; finally, Section 4 concludes and outlines future directions. This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

Method
Our method is composed of three main steps: 1) tweet preprocessing; 2) tweet representation; and 3) tweet classification.

Tweet Preprocessing
The aim of this step is to extract new features from emojis and then tokenize tweets so that they can be fed to the AraBERT model (Devlin et al., 2019; Antoun et al., 2020). Given that a tweet may contain both words and emojis, we handle emojis in a special way. Our preprocessing pipeline thus follows these phases: 1) Detect emojis: if emojis exist, we extract the position and the meaning of each emoji; 2) Substitute emojis with the special token [MASK] and translate the emojis' meanings from English to Arabic. By substituting emojis with [MASK], we overcome the pretrain-finetune discrepancy of AraBERT (Yang et al., 2019), which arises because special tokens such as [MASK], used by AraBERT during pretraining, are absent from task-specific datasets at the fine-tuning step; 3) Concatenate the emoji-free tweet with the Arabic meanings of its emojis. The special token [CLS] is added to the head of each sentence, and the special token [SEP] delimits the sentence and the emojis' Arabic meanings; 4) Tokenize the output sentence. All words except the special tokens are segmented by the Farasa segmenter (Abdelali et al., 2016) and then tokenized with the AraBERT tokenizer. If the tweet is free of emojis, it is passed directly to the Farasa segmenter and the AraBERT tokenizer. Figure 1 illustrates the flowchart of our tweet preprocessing step.
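Phases 1–3 above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a toy emoji-to-Arabic-meaning dictionary in place of a full emoji lexicon and translation step, and it omits the Farasa segmentation and AraBERT tokenization of phase 4.

```python
import re

# Hypothetical emoji -> Arabic meaning lookup (illustration only; the paper
# extracts English emoji meanings and translates them to Arabic).
EMOJI_MEANINGS = {
    "\U0001F600": "وجه مبتسم",  # grinning face
    "\U0001F621": "وجه غاضب",   # angry face
}

def preprocess_tweet(tweet):
    """Replace each emoji with [MASK], collect its Arabic meaning, and build
    the input string: [CLS] tweet [SEP] meanings [SEP]."""
    meanings = []
    chars = []
    for ch in tweet:
        if ch in EMOJI_MEANINGS:
            meanings.append(EMOJI_MEANINGS[ch])
            chars.append(" [MASK] ")   # keep the emoji's position as [MASK]
        else:
            chars.append(ch)
    text = re.sub(r"\s+", " ", "".join(chars)).strip()
    if meanings:
        return f"[CLS] {text} [SEP] {' '.join(meanings)} [SEP]"
    return f"[CLS] {text} [SEP]"     # emoji-free tweets skip the second segment
```

For example, a tweet ending with an angry-face emoji becomes `[CLS] … [MASK] [SEP] وجه غاضب [SEP]`, so the [MASK] token seen at pretraining time also appears at fine-tuning and inference time.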

Tweet Representation
After preprocessing a tweet, we split the obtained tokens into two segments. The first contains the tweet tokens, while the second contains the tokens of the Arabic meanings of the detected emojis. The inputs of the AraBERT model are then the token indices and segment ids. The AraBERT model (Antoun et al., 2020) is applied to compute token embeddings. This model has the same configuration as the BERT BASE model (Devlin et al., 2019): 12 encoder blocks, a hidden dimension of 768, 12 attention heads, a maximum sequence length of 512, and a total of ∼110M parameters. The model is trained on two objectives: 1) Masked Language Modeling, where the model is trained to predict a masked token; and 2) Next Sentence Prediction, in which the model is optimized to predict whether the second sentence follows the first. AraBERT was pre-trained on 70 million Arabic sentences, corresponding to ∼24GB of text. Finally, a tweet representation is computed by applying a global max pooling function on top of the AraBERT token representations. Figure 2 shows the flowchart of the tweet representation step.

Tweet Classification
The last step of our method consists of classifying a tweet given its representation. To compute the likelihood that a tweet is offensive, we apply a sigmoid function to the tweet representation and train the model to minimize the binary cross-entropy loss. Note that during training, the AraBERT parameters are fine-tuned on this specific task: Arabic offensive language identification on Twitter.
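A sketch of this classification head, assuming a learned weight vector `w` and bias `b` (hypothetical names; in practice this is a dense layer trained jointly with AraBERT):

```python
import numpy as np

def sigmoid(z):
    """Squash a score into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_offensive(tweet_repr, w, b, threshold=0.5):
    """Score the pooled tweet representation and decide OFF vs. NOT."""
    p = sigmoid(np.dot(w, tweet_repr) + b)
    return ("OFF" if p >= threshold else "NOT", p)

def bce_loss(p, y):
    """Binary cross-entropy for a single example (y in {0, 1})."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```

During training, the gradient of this loss flows back through the pooling layer into all AraBERT parameters, which is what fine-tunes the encoder on the task.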

Dataset
The dataset used in OffensEval 2020: Arabic was proposed by Mubarak et al. (2020). It is a large dataset containing 10,000 tweets labeled as OFF or NOT, for offensive and not offensive. The dataset is split as follows: the training set includes 7,000 tweets (70%), the development set contains 1,000 tweets (10%), and the test set contains 2,000 tweets (20%). Figure 3 shows the label distributions of the training and development sets. The test labels were kept hidden in order to evaluate the participants' models.

Experimental settings
We used the TensorFlow 2.0 and keras-bert libraries to build and train our models. During training, the number of epochs was 5 and the batch size was 32 for all models. We applied the Adam optimizer with a learning rate of 10^-4, warmup, and weight decay. The maximum input sequence length of AraBERT was 128. All our experiments were run on Google Colab.
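The warmup-then-decay behaviour can be sketched as a learning-rate schedule. This is an assumption for illustration: the paper does not specify the exact schedule shape, so a common linear-warmup/linear-decay variant is shown here.

```python
def lr_schedule(step, total_steps, base_lr=1e-4, warmup_frac=0.1):
    """Linear warmup from 0 to base_lr over the first warmup_frac of
    training, then linear decay back to 0 (hypothetical schedule shape;
    the actual keras-bert warmup configuration may differ)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step - 1) / remaining)
```

Warmup avoids large, destabilizing updates to the pretrained AraBERT weights in the first few batches, when the randomly initialized classification head produces noisy gradients.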

Evaluation
In this work, we built three models following the same steps described in Section 2, where each model has a different preprocessing step. The first model, named vanilla, does not apply any emoji preprocessing: the sentence is directly segmented with the Farasa segmenter and then tokenized with the AraBERT tokenizer. The second model, AraBERTEmojisIn, substitutes emojis with their Arabic meanings. In the last model, AraBERTEmojisOut, we substitute emojis with the special token [MASK] and create another input segment containing the Arabic meaning of each emoji within the tweet; the meanings are separated by the special token [SEP]. Table 1 reports the obtained results in terms of precision, recall, and F1 score. As illustrated in Figure 4, we also used the ROC curve to further compare the vanilla, AraBERTEmojisIn, and AraBERTEmojisOut models. The AraBERTEmojisOut model achieves the best results in terms of macro F1 score, weighted F1 score, and AUC. This is due to the use of AraBERT embeddings and the way the model handles the pretrain-finetune discrepancy by including the [MASK] token in the training and inference phases. The predictions of our AraBERTEmojisOut model achieved a 90.17% macro F1 score and ranked first on the "OffensEval 2020: Arabic" participants list. To illustrate the performance of the AraBERTEmojisOut model on the test data in more detail, we computed the confusion matrix shown in Figure 5. The model achieved 96% precision and 96% recall on the NOT class, and 83% precision and 85% recall on the OFF class. This gap is explained by the fact that the training set is unbalanced: the number of NOT samples is approximately twice the number of OFF samples.
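As a quick consistency check, the reported macro F1 can be recovered from the per-class precision and recall figures above: computing F1 per class and averaging gives approximately 0.90, in line with the reported 90.17% up to rounding of the per-class values.

```python
def f1(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_not = f1(0.96, 0.96)            # 0.96 for the NOT class
f1_off = f1(0.83, 0.85)            # ~0.84 for the OFF class
macro_f1 = (f1_not + f1_off) / 2   # ~0.90, consistent with the reported score
```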

Conclusion
In this paper, we described our method for identifying offensive Arabic language on Twitter. The preprocessing step consists of extracting new features from emojis to enrich the information within tweets. Afterward, the AraBERT model is applied to build tweet representations that take into consideration all the words and emojis contained in the tweet. To enhance the tweet representation, all AraBERT parameters were fine-tuned on this specific task: Arabic offensive language identification on Twitter. Finally, a tweet's label is predicted by applying a sigmoid function to its representation. Our method ranked first among the OffensEval 2020: Arabic participants in terms of macro F1 score. To extend our work, we plan to improve tweet representations by applying more advanced contextual embeddings, such as an Arabic version of XLNet (Yang et al., 2019).