NAYEL at SemEval-2020 Task 12: TF/IDF-Based Approach for Automatic Offensive Language Detection in Arabic Tweets

In this paper, we present the system submitted to “SemEval-2020 Task 12”. The proposed system aims to automatically identify offensive language in Arabic tweets. A machine learning based approach has been used to design our system. We implemented a linear classifier with Stochastic Gradient Descent (SGD) as the optimization algorithm. Our model reported f1-scores of 84.20% and 81.82% on the development set and test set respectively. The best-performing system and the last-ranked system reported f1-scores of 90.17% and 44.51% on the test set respectively.


Introduction
The tremendous usage of social media platforms makes it important to apply different Natural Language Processing (NLP) tasks to these platforms. Different tasks, such as cyberbullying identification, hate speech detection, sarcasm detection and offensive language detection, have attracted NLP researchers to concentrate on their automation (Kwok and Wang, 2013). One of these tasks that has gained research interest is automatic offensive language detection. Offensive language is widespread in social media. Computational offensive language detection is a way to identify such hostility and has shown promising performance (Nayel and L, 2019).
Arabic is a significant language with an immense number of speakers, as it is the official language of 22 countries (Guellil et al., 2019). It is recognized as the 4th most used language on the Internet (Boudad et al., 2018). Research in NLP for Arabic is constantly increasing. Automatic offensive language detection has become an important NLP task due to the overwhelming usage of social media. Automatic offensive language identification in Arabic is challenging due to the complexity of the Arabic language (Nayel, 2019).
In this paper we describe the model that was submitted to the offensive language detection shared task "OffensEval 2020" (Zampieri et al., 2020). Given a tweet, the task in brief is to determine whether it contains offensive language or not. The first edition, "OffensEval 2019", was held at SemEval 2019 (Zampieri et al., 2019b). A dataset of English tweets annotated using a hierarchical three-level annotation model was used in "OffensEval 2019" (Zampieri et al., 2019a). In "OffensEval 2020", in addition to English, four more languages were added to the dataset, namely Arabic, Danish, Greek and Turkish. We participated in "OffensEval 2020" for Arabic. A machine learning based approach was used to develop our submission. The Term Frequency/Inverse Document Frequency (TF/IDF) vector space model was used to represent the given tweets.

Related Work
Recently, offensive language detection has gained significant attention and many contributions have been recorded in this area (Waseem et al., 2017; Kumar et al., 2018; Mubarak et al., 2017; Mandl et  Neural Network (CNN) and Bidirectional Long Short-Term-Memory (BiLSTM) for offensive language detection. Nayel and Shashirekha used classical machine learning algorithms to detect hate speech in multi-lingual tweets (Nayel and L, 2019).

Task Description
Given a tweet, the objective of the task is to determine whether the tweet contains offensive language or not. Let C = {NOT, OFF} be a set of two classes, where NOT is the class of non-offensive tweets and OFF is the class of offensive tweets. We formulated the task as a binary classification problem that assigns one of the two predefined classes of C to a new unlabelled tweet.

Methodology
Our approach is based on the TF/IDF vector space model: each tweet is converted into a vector, and a linear classifier is then applied on the vector space. A linear classifier is a simple classifier that uses a set of linear discriminant functions to distinguish between different classes (Theodoridis and Koutroumbas, 2009).
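As an illustration of the discriminant-function view, a two-class linear classifier over C = {NOT, OFF} can be written as follows. The symbols x (the TF/IDF vector of a tweet), w (a weight vector) and b (a bias term) are notation assumed here for exposition; they are not taken from the submitted system.

```latex
g(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b,
\qquad
\hat{y} =
\begin{cases}
\text{OFF} & \text{if } g(\mathbf{x}) > 0,\\
\text{NOT} & \text{otherwise.}
\end{cases}
```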

General Framework
The general framework of the proposed model consists of the following stages:

Preprocessing
Preprocessing was the first stage in our pipeline. In this stage the following steps were applied to tweets:
1. Abbreviation Removal: '@USER', 'URL' and '<LF>' were commonly used in tweets. These are English abbreviations and refer to private information about users.
2. Elongation Elimination: The majority of Arabic tweets do not follow the standard rules of the Arabic language. A common habit of users is to repeat a specific letter in a word. Elongation elimination removes this redundancy to reduce the feature space. In our experiments, a letter is assumed to be redundant if it is repeated more than two times. For example, the words " " [pronounced "mabrook", meaning "congratulation"] and " " [pronounced "aaagel", meaning "urgent"] contain a redundant letter and will be reduced to " " and " " respectively.
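The two preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the submitted code: the function names are hypothetical, and collapsing an elongated letter to a single occurrence (`keep=1`) is an assumption, since the reduced forms are not recoverable from the text.

```python
import re

# Placeholder tokens named in the text; matched as whole whitespace-separated tokens.
PLACEHOLDER_TOKENS = {"@USER", "URL"}


def remove_abbreviations(tweet):
    """Drop the '@USER', 'URL' and '<LF>' placeholders from a tweet."""
    # '<LF>' may be glued to neighbouring words, so turn it into a space first.
    tweet = tweet.replace("<LF>", " ")
    tokens = [tok for tok in tweet.split() if tok not in PLACEHOLDER_TOKENS]
    return " ".join(tokens)


def eliminate_elongation(tweet, keep=1):
    """Collapse any word character repeated more than two times.

    `keep` is the number of copies left behind (assumed to be 1 here).
    """
    return re.sub(r"(\w)\1{2,}", lambda m: m.group(1) * keep, tweet)


def preprocess(tweet):
    """Apply abbreviation removal, then elongation elimination."""
    return eliminate_elongation(remove_abbreviations(tweet))
```

For instance, `preprocess("@USER mabrooook<LF>URL")` returns `"mabrok"` under these assumptions, while a letter repeated only twice (as in `"cool"`) is left untouched.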

Feature Extraction
The second stage in our pipeline was feature extraction. TF/IDF with a range of n-grams was used to represent all the tweets in the training set. TF/IDF was calculated as given in (Nayel and Shashirekha, 2017). We used n-grams up to length 3, i.e. unigram, bigram and trigram terms. For example, the sentence " " [pronounced "eldawry ya zamalek", meaning "League oh Zamalek"] has the following set of features: {" ", " ", " ", " ", " ", " "}.
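The feature set in the example above (three unigrams, two bigrams and one trigram) can be reproduced with the following sketch. The function names are hypothetical, and the weighting shown (relative term frequency multiplied by log(N/df)) is one common TF/IDF variant assumed for illustration; the exact formula of (Nayel and Shashirekha, 2017) may differ.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_vectors(tweets):
    """Map each tweet to a sparse {n-gram: weight} dictionary.

    Assumed weighting: tf = count / total n-grams in the tweet,
    idf = log(number of tweets / number of tweets containing the n-gram).
    """
    docs = [Counter(word_ngrams(tweet)) for tweet in tweets]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct n-gram once per tweet.
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in doc.items()
        })
    return vectors
```

With this variant, an n-gram that occurs in every tweet receives a weight of zero, so only terms that discriminate between tweets contribute to the vector.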

Training Classifier
In this phase, we used the features extracted in the previous phase to train the classifier. We tried a set of different classifiers, namely a linear classifier, Support Vector Machines (SVM) and Multilayer Perceptron (MLP), as well as an ensemble approach. According to the task's rules, only one run could be submitted. The output of the best-performing classifier on the development set was submitted.
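The ensemble approach can be sketched as a majority (hard) vote over the labels predicted by the individual classifiers. The function name and the input layout (one list of labels per classifier) are assumptions made for this illustration.

```python
from collections import Counter


def hard_vote(predictions):
    """Combine per-classifier label lists by majority vote.

    `predictions` holds one list of labels per classifier, all of the
    same length (one label per tweet). With an odd number of classifiers
    and two classes, no ties can occur.
    """
    voted = []
    for labels in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted
```

For three classifiers predicting `["OFF", "NOT", "OFF"]`, `["OFF", "OFF", "NOT"]` and `["NOT", "NOT", "NOT"]` on three tweets, the vote yields `["OFF", "NOT", "NOT"]`.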

Dataset
The dataset used to build the model was distributed by the organizers; it contains a set of tweets divided into training, development and test sets (Mubarak et al., 2020). Statistics about the training and development sets are given in

Experiments and Results
In the proposed models, the Stochastic Gradient Descent (SGD) optimization algorithm was used to optimize the parameters of the linear classifier. The loss function used in the linear classifier was the "Hinge" loss function (Rosasco et al., 2004). A linear kernel was used for the SVM classifier. In the MLP classifier, the logistic function was used as the activation function, with 20 neurons in the hidden layer. We used a hard voting approach to ensemble the outputs of all classifiers. The performance of the proposed classifiers on the development and test sets is reported as f1-score and given in Table 2.
The local context representation of tweets, TF/IDF, affected the performance of our model negatively. In addition, the usage of classical classification algorithms limits the performance of the proposed models. Deep learning models show improvement in different NLP tasks, as deep models depend on word embeddings (a semi-supervised approach for global word representation).
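The linear classifier configuration described in this section (hinge loss minimized with SGD) can be illustrated with a small pure-Python sketch over sparse feature dictionaries. It is not the submitted implementation: the function names, the learning rate, the L2 regularization strength, the number of epochs and the toy English features are all assumptions chosen for the example.

```python
import random


def train_sgd_hinge(X, y, epochs=20, lr=0.1, lam=1e-4, seed=0):
    """Train a linear classifier with hinge loss using plain SGD.

    X: list of sparse feature dicts {feature: value}.
    y: list of labels encoded as +1 (OFF) or -1 (NOT).
    Returns the weight dict and the bias.
    """
    rng = random.Random(seed)
    w, b = {}, 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, label = X[i], y[i]
            score = sum(w.get(f, 0.0) * v for f, v in x.items()) + b
            # Gradient of the L2 penalty: shrink every weight slightly.
            for f in w:
                w[f] *= 1.0 - lr * lam
            # Hinge loss max(0, 1 - label * score) is active inside the margin.
            if label * score < 1.0:
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + lr * label * v
                b += lr * label
    return w, b


def predict(w, b, x):
    """Return 'OFF' when the discriminant is positive, otherwise 'NOT'."""
    score = sum(w.get(f, 0.0) * v for f, v in x.items()) + b
    return "OFF" if score > 0 else "NOT"


# Toy, linearly separable data standing in for TF/IDF vectors.
X_toy = [{"bad": 1.0}, {"bad": 1.0, "awful": 1.0}, {"nice": 1.0}, {"nice": 1.0, "good": 1.0}]
y_toy = [1, 1, -1, -1]
w_toy, b_toy = train_sgd_hinge(X_toy, y_toy)
```

A weight is updated only when a tweet falls inside the margin, which is what distinguishes the hinge loss from, for example, the logistic loss, where every example contributes to each update.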

Conclusion
In these working notes, a model that performs satisfactorily on the given task has been presented. The model is based on a simple framework, where TF/IDF was used for weighting scores and classical machine learning algorithms were used as classifiers. Our work could be improved by using a deep learning architecture with better word representations. Another limitation of the model is that it does not use any external data other than the provided dataset, which may affect the results given the small size of the data. Incorporating related domain knowledge may improve the performance of the model.