CyberTronics at SemEval-2020 Task 12: Multilingual Offensive Language Identification over Social Media

The SemEval-2020 Task 12 (OffensEval) challenge focuses on detecting offensive language in posts and comments on social media. The task has been organized for several languages, e.g., Arabic, Danish, English, Greek and Turkish. It features three related sub-tasks for English: sub-task A is to discriminate between offensive and non-offensive posts, sub-task B focuses on the type of offensive content in the post, and in sub-task C, proposed systems have to identify the target of the offensive posts. The corpus for each language is built from posts and comments on Twitter, a popular social media platform. We participated in this challenge and submitted results for different languages. The present work describes different machine learning and deep learning techniques, involving various classifiers and feature engineering schemes, and analyzes their performance for offensiveness prediction. The experimental analysis on the training set shows that an SVM using language-specific pre-trained word embeddings (fastText) outperforms the other methods. Our system achieves a macro-averaged F1 score of 0.45 for Arabic, 0.43 for Greek and 0.54 for Turkish.


Introduction
Offensive language is very common in social media nowadays. Individual users frequently take advantage of the perceived anonymity of computer-mediated communication to engage in behavior that many of them would not consider in real life. The SemEval-2020 Task 12 (OffensEval) challenge focuses on predicting the presence of offensive language in social media. The main goal of this task is to instigate discussion on the creation of reusable benchmarks for evaluating proposed algorithms, by exploring issues of evaluation methodology and other processes related to the creation of test collections. The given corpora are built from posts and comments on Twitter, a popular social media platform. The organizers set up a multilingual offensive language classification task with a particular focus on Twitter posts, releasing separate corpora for the individual languages, e.g., Arabic (Mubarak et al. (2020)), Danish (Sigurbergsson et al. (2020)), English, Greek (Pitenis et al. (2020a)) and Turkish (Çöltekin (2020)), with the aim of identifying and capturing offensive language. All the languages except English have only one task, namely sub-task A, which is to identify offensive language. For English, the task is divided into three sub-tasks: sub-task A for offensive language identification, sub-task B for offense type categorization, in which the offense is categorized as either targeted or untargeted, and sub-task C, which focuses on identifying the target of the offense.
In this paper, different machine learning and deep learning frameworks are proposed to accomplish the given task. Each of the proposed systems is implemented using language-specific fastText pre-trained word vectors, developed by crawling the web. Support Vector Machine (SVM) is widely used for text categorization, as introduced by Tong et al. (2001). Fan et al. (2008) recommended the linear kernel for text categorization, as it performs well when there are many features; hence a linear SVM has been used in our experiments. The Convolutional Neural Network (CNN) was introduced by LeCun (1998) for the extraction of local features, and later proved to be the standard choice for image processing tasks. Hochreiter and Schmidhuber (1997) introduced Long Short-Term Memory (LSTM) to capture the implicit ordering of sequence data in terms of words and sentences, and it is widely used in various Natural Language Processing tasks. Accordingly, we propose an SVM as the machine learning classifier and a hybrid CNN-LSTM network as the deep learning classification system, and implement both for efficiently identifying the presence of offensive language in text. The results on the test set submitted to the challenge suggest that these frameworks achieve reasonably good performance, although some submissions to this task outperform our proposed frameworks. This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
The paper is organised as follows. A review of related literature is provided in Section 2. The corpora used in these experiments are described in Section 3. The proposed machine learning and deep learning frameworks are explained in Section 4. Section 5 describes the experimental evaluation, and the conclusion is presented in Section 6.

Related work
Identifying offensive language on social media has become an increasingly pressing issue over the past few years. Fortuna et al. (2018) presented a survey of different text mining approaches for effectively addressing this problem. Every language has its own rules, comprising different syntactic and semantic conventions; accomplishing the same goal across different languages therefore requires different methodologies and approaches.

Arabic language
Offensive language detection in Arabic is challenging due to the lexical variations among different Arabic dialects. Mubarak (2017) proposed an abusive language detection framework for Arabic social media. They first extracted a list of offensive words and related hashtags from common patterns used in offensive and rude conversations, and then classified Twitter users according to the presence or absence of these words in their tweets. Alakrot (2018a) presented an Arabic corpus collected from YouTube comments for the detection of offensive language, together with a brief statistical analysis for predictive modelling. Alakrot (2018b) subsequently introduced a Support Vector Machine classifier with combinations of different features and a variety of preprocessing techniques for detecting offensive language in Arabic text.

Greek language
Greek has a distinct writing system, the Greek alphabet, and forms an independent branch of the Indo-European language family. Consequently, there is a lack of computational tools available for analyzing and processing Greek. Lekea et al. (2018) presented a methodology for automatically detecting the presence of hate speech in Greek text. Pitenis et al. (2020b) introduced the first annotated Greek corpus for offensive language detection, together with a detailed data analysis.

Turkish language
Relatively little research has been carried out on low-resource languages such as Turkish. S.A. Özel et al. (2017) introduced a text-based approach for detecting cyberbullying in social media text. The authors collected a Turkish corpus from Instagram posts and Twitter messages written in Turkish, and applied a few text classification algorithms for predictive analysis.

Data Description
The corpora released as part of SemEval-2020 Task 12 are collections of posts and comments from a set of Twitter users. Each corpus is divided into two categories: Offensive (OFF) and Not-offensive (NOT). Figure 1 shows the distribution of the classes in the data provided for the Arabic, Greek and Turkish languages, respectively. The distributions clearly show the imbalance in class labels. An overview of each corpus is presented in Table 1.

Proposed Methodologies
The given corpus for each language of SemEval-2020 Task 12 is further divided into two sets, namely a training set and a validation set. The new training set is formed by randomly choosing 80% of the tweet samples from the NOT and OFF categories; the remaining 20% of each category forms the validation set. To train our models, a Support Vector Machine, a classical machine learning text classification algorithm, is utilized. As the deep learning model, a hybrid network integrating a CNN with an LSTM has been trained for the task. Our proposed framework consists of three crucial layers: data preprocessing, feature extraction and text categorization. A high-level overview of the framework and the steps for producing clean text from raw text is shown in Figure 2. We keep this architecture uniform for offensive language detection across all three languages.
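The stratified 80/20 split described above can be sketched with scikit-learn's train_test_split; the function name and toy data below are illustrative, not the exact script used for the submissions.

```python
from sklearn.model_selection import train_test_split

def make_split(texts, labels, seed=42):
    """Stratified 80/20 split so the NOT/OFF ratio is preserved in both sets."""
    return train_test_split(
        texts, labels,
        test_size=0.20,      # 20% of each class goes to validation
        stratify=labels,     # sample per class, as described above
        random_state=seed,
    )

# Toy data: 6 non-offensive and 4 offensive samples.
texts = [f"tweet {i}" for i in range(10)]
labels = ["NOT"] * 6 + ["OFF"] * 4
X_train, X_val, y_train, y_val = make_split(texts, labels)
```

Stratification matters here because, as Section 3 notes, the class labels are imbalanced; a purely random split could leave the validation set with very few OFF samples.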

Data Preprocessing
In order to extract meaningful information from the available text data, it is important to remove noise and improve its quality before analysis. The data preprocessing steps involve removal of stopwords, unnecessary URLs, punctuation and Twitter mentions, followed by data normalization.
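A minimal sketch of these preprocessing steps, assuming a regex-based cleaner and an illustrative English stopword list (the actual runs would use language-specific stopword lists for Arabic, Greek and Turkish):

```python
import re
import string

STOPWORDS = {"the", "a", "an", "is", "on", "in", "and"}  # illustrative only

def clean_tweet(text):
    """Strip URLs, Twitter mentions, punctuation and stopwords from a tweet."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove Twitter mentions
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_tweet("@user check THIS out http://t.co/xyz !!!"))  # -> check this out
```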

Feature Engineering
Our proposed approach is entirely based on text features. It can be found in the literature that different linguistic and quantitative features contribute significantly to fine-grained text classification, as described by Argamon et al. (2007) and HaCohen et al. (2010). No classical machine learning or deep learning model can take raw text as its input; therefore, we convert the clean text into corresponding numerical feature vectors. We use language-specific word embedding models of embedding dimension 300 for each of the language corpora, along with TF-IDF/Count models. The language-specific embeddings are trained on Common Crawl and Wikipedia data using fastText 2 . These models were trained using CBOW (as described in Mikolov et al. (2013)) with position-weights, in dimension 300, with character n-grams of length 5, as described in Grave et al. (2018). In this setting, the vector of a word is predicted based on its context words. For example, we want to predict the vector of a particular word w_0 based on its context words w_{-n}, ..., w_{-1}, w_1, ..., w_n. A vector representation h of this context is obtained by averaging the corresponding word vectors, which can be defined as:

h = (1 / 2n) * ( v(w_{-n}) + ... + v(w_{-1}) + v(w_1) + ... + v(w_n) )
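The same averaging idea extends naturally to building a fixed-size document vector from word vectors. A hedged sketch, with toy 4-dimensional embeddings standing in for the 300-dimensional fastText vectors (the dictionary and tokens below are made up for illustration):

```python
import numpy as np

# Toy 4-d embeddings standing in for the 300-d fastText vectors.
EMB = {
    "offensive": np.array([1.0, 0.0, 0.0, 0.0]),
    "tweet":     np.array([0.0, 1.0, 0.0, 0.0]),
    "nice":      np.array([0.0, 0.0, 1.0, 0.0]),
}

def doc_vector(tokens, emb=EMB, dim=4):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = doc_vector(["offensive", "tweet", "unknown"])  # mean of the two known vectors
```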

Classification
The classical machine learning model has been implemented using scikit-learn 3 and the deep learning model has been implemented in Keras 4 on top of TensorFlow. The produced word vectors are then fed to the classifiers. As the ML classifier, we use an SVM; as the DL framework, we propose a hybrid CNN-LSTM network to identify offensive instances in Twitter text. For the SVM, the optimal set of parameters is as follows: regularization parameter (C) of 0.5, class weight kept balanced (as our corpora are highly imbalanced), and a linear kernel. Similarly, for the deep learning framework, the best set of hyperparameters is: a batch size of 16, a dropout probability of 0.5, and training for 20 epochs. To find the optimal set of hyperparameters, we used Bayesian optimization. An overview of our proposed deep neural network, a CNN-based LSTM network, can be seen in Figure 3.
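A hedged sketch of the SVM classifier with the parameters reported above (C = 0.5, balanced class weights, linear kernel); the TF-IDF features and toy tweets are illustrative stand-ins for the fastText document vectors used in the submitted runs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Linear SVM with the settings reported above; class_weight="balanced"
# compensates for the NOT/OFF imbalance in the corpora.
clf = make_pipeline(
    TfidfVectorizer(),
    LinearSVC(C=0.5, class_weight="balanced"),
)

X = ["you are awful", "have a great day", "awful terrible person", "what a nice day"]
y = ["OFF", "NOT", "OFF", "NOT"]
clf.fit(X, y)
pred = clf.predict(["awful person"])
```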

Experimental Results
We report the performance of both the machine learning and deep learning frameworks on our validation set for each corpus in Tables 2, 3 and 4, for the Arabic, Greek and Turkish languages, respectively. The performance of the classifiers is measured in terms of F1 score. These results are useful for analyzing the performance of the different proposed frameworks. The classical machine learning model, the Support Vector Machine (SVM), outperforms the deployed deep learning model. Figures 4, 5 and 6 present the confusion matrices of our submission for sub-task A for the Arabic, Greek and Turkish languages, respectively. The final results of the challenge are presented in Table 5. It can be observed that the macro F1 scores obtained on the test set are lower than those obtained on the training corpora. This may be due to the class imbalance problem, which leads to poor generalization capability of the classifiers.
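For reference, the macro-averaged F1 used throughout can be computed with scikit-learn; the labels below are made up for illustration.

```python
from sklearn.metrics import f1_score

y_true = ["OFF", "NOT", "NOT", "OFF", "NOT"]
y_pred = ["OFF", "NOT", "OFF", "OFF", "NOT"]

# Macro averaging gives NOT and OFF equal weight, regardless of how
# imbalanced the classes are, which is why the shared task uses it.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 2))  # -> 0.8
```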

Conclusion
In this paper, we have reported our proposed frameworks and their corresponding performance in sub-task A of SemEval-2020 Task 12 (OffensEval 2020: Multilingual Offensive Language Identification in Social Media) for the Arabic, Greek and Turkish languages. We have learned that our proposed frameworks did not perform well for two possible reasons: firstly, all the language corpora are highly imbalanced, i.e., there is an insufficient number of tweet samples in one of the classes; and secondly, for a low-resource language such as Greek, the quality of the produced word vectors is much lower than for more resourceful languages. We believe that an increase in the volume of data could add more potential to our framework, and that the use of external data resources could be advantageous.