Garain at SemEval-2020 Task 12: Sequence Based Deep Learning for Categorizing Offensive Language in Social Media

SemEval-2020 Task 12 was OffenseEval: Multilingual Offensive Language Identification inSocial Media (Zampieri et al., 2020). The task was subdivided into multiple languages anddatasets were provided for each one. The task was further divided into three sub-tasks: offensivelanguage identification, automatic categorization of offense types, and offense target identification.I participated in the task-C, that is, offense target identification. For preparing the proposed system,I made use of Deep Learning networks like LSTMs and frameworks like Keras which combine thebag of words model with automatically generated sequence based features and manually extractedfeatures from the given dataset. My system on training on 25% of the whole dataset achieves macro averaged f1 score of 47.763%.


Introduction
The ever-growing amount of user-generated data on social media platforms be it Facebook, Twitter, blogs or any other electronic medium introduces new challenges in terms of automatic content moderation, especially regarding hate speech and offensive language detection. Not only is hate speech more likely to happen on the Internet, where anonymity is easily obtained and speakers are psychologically distant from their audience, but its online nature also gives it a far-reaching and determinative impact (Shaw, 2011). User content mostly consists of microposts, where the context of a post can be missing or inferred only from current events. Manual verification of each posting by a human moderator is infeasible due to the high amount of postings created every day. Consequently, automated detection of such attacking postings is the only feasible way to counter this kind of hostility. However, this task is challenging because natural language is fraught with ambiguities, and language in social media is extremely noisy. The classification system that would be prepared for the task, needed to be generalized for various test corpora as well. In this paper I have described the system consisting of a sequential pipeline with text feature extraction and classification as its main components. Firstly, a bag-of-words model is used for encoding the sentences into corresponding integer sequence. Thereafter, vectors are generated from these sequences and fed to a series of BiLSTM layers for training. Then a softmax layer is used for ternary classification into the corresponding offensive language categories. The rest of the paper has been organized as follows. Section 2 describes the data, on which, the task was performed. The methodology followed is described in Section 3. This is followed by the results and concluding remarks in Section 4 and 5 respectively.

Data
The dataset that has been used to train and validate the model is an annotation following the tagset proposed in the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a) and has been used in OffensEval 2019. This year more accurate and quantified dataset was generated and annotated using a semisupervised approach to increase the quality and quantity of the dataset named the Semi-Supervised Offensive Language Identification Dataset (SOLID) . The resulting dataset that is the newer version of the OLID dataset has been used for training and validating purposes. It was collected from Twitter; the data being retrieved the data using the Twitter API by searching for keywords and constructions that are often included in offensive messages. The vast majority of content is not offensive so different strategies have been tried to keep a reasonable number of annotations in the offensive class amounting to around 26% of the dataset.
The dataset provided consisted of annotations in their original form along with the corresponding labels. Subtask C consisted of the labels IND, GRP and OTH.

IND
Offensive tweet targeting an individual GRP Offensive tweet targeting a group OTH Offensive tweet targeting neither group or individual

Methodology
My approach is to preprocess the annotations and then convert the annotations into a sequence of words and convert them into word embeddings. I then run a neural-network based algorithm on the processed tweet. Label based categorical division of data is given in Table 2. I have used SenticNet5 (Cambria et al., 2018) for finding sentiment values of individual words. The sentiment features play a vital role in context of offensive language as it is during sad state of mind and hatred towards someone that offensiveness is at its peak. The Sentiment values range from -1 to 1 depicting Negative, Neutral and Positive sentiments. The use of BiLSTM networks in our model might have resulted in better results. The work done by Sepp Hochreiter et al. (Hochreiter and Schmidhuber, 1997) led the road-map to newer domain of work by bringing the concept of memory into usage for sequence based learning problems. I first have taken the annotations and sent the raw data through some pre-processing steps, for which I took inspiration from the work on Hate Speech against immigrants in Twitter(Garain and Basu, 2019a), part of SemEval2019. The steps used here are built as an advancement of this work. It consisted of the following steps: 1. Replacing emojis and emoticons by their corresponding meanings (Garain, 2019) 2. Removing mentions 3. Replacing words like "XX" or "XXX" with "sexual" 4. Contracting whitespace

Removing URLs
In step 1, for example, ",-) " is replaced by "winking happy" ":-C" is replaced with "real unhappy" ";-(" is replaced with "crying" Similarly I replaced 110 emoticons by their feelings. The step 3 consists of manually identifying a feature of inclusion of words "XX" or "XXX" in sexual harassment cases and taking advantage of this identified feature. Using this extraction, contributed to increasing accuracy in classification to some extent. They played an important role during the modeltraining stage. The pre-processed annotations are treated as a sequence of words with interdependence among various words contributing to its meaning. I then take a Bag-of-words model approach as well as use pre-trained GloVe vectors. For the Bag-of-words approach I convert the annotations into one-hot vectors as shown in Fig. 1. The Bag-of-words approach outperformed so I have shown results related to this encoding of text. 1. Counts of words with positive sentiment, negative sentiment and neutral sentiment in English 2. Frequency of auxiliary verbs like "is", "was", "are", "were" 3. Subjectivity score of the tweet (calculated using predefined libraries) 4. Frequency of difficult and easy words  5. Number of question marks, exclamations and full-stops in the tweet 6. Frequency of words like "they", "he", "she", "we" My model for the sub-task is a neural-network based model. For the task, first, I have merged the manually-extracted features with the feature vector obtained after converting the processed tweet to one-hot encoding. The output is processed through an embedding layer which transformed the tweet into a 128 length vector. The embedding layer maps the unique indices in the one-hot vector to the embedding vector space. I pass the embeddings through a Bidirectional LSTM layer containing 256 units. This is followed by another bidirectional LSTM layer containing 512 units with its dropout and regular dropout set to 0.45 and activation being a sigmoid activation. This is followed by a Bidirectional LSTM layer with 128 units for better learning followed by a Dense Layer. The Dense layer consists of 3 units for three classes representing the classes Individual, Group and Others. The softmax layer gives a probability prediction percentage against each of the classes and thus finally gives the final results by assigning the label with maximum probability as output. The model is compiled using the Nadam optimization algorithm with a learning rate of 0.001. Categorical crossentropy is used as the loss function.
The architecture is depicted in Table 4.
Layer (  For the neural models in the language context, most popular are LSTMs (Long short term memory) which are a type of RNN (Recurrent neural network), which preserve the long term dependency of text. I use a Bidirectional-LSTM based approach to capture information from both the past and future context thus enhancing the context grasping capacity of the architecture. Figure 2 shows working of a BiLSTM unit.

Figure 2: BiLSTM Unit
The dataset is skewed in nature. If trained on the entire dataset without any validation, the model tends to completely overfit to the class with higher frequency as it leads to a higher training accuracy score. To overcome this problem, I have taken some steps. Firstly, the training data has already been splitted into two parts -one for training and one for validation thus, solving overfitting to some extent. The training is stopped when two consecutive epochs increased the measured loss function value and decrease in Validation accuracy for the validation set. Secondly, class weights have been assigned to the different classes present in the data which are chosen to be proportional to the inverse of the respective frequencies of the classes. Hypothetically, the model then gives equal weight to the skewed classes and this penalizes tendencies to overfit to the data.

Results
I participated in subtask C of OffenseEval 2020 which is task 12 of SemEval 2020 and my system worked quite well. I have included the automatically generated macro averaged evaluation metrics along with the detailed metrics calculated using the gold labels. The results are depicted in Tables 5-6   The confusion matrix is shown in Fig. 3.

Figure 3: Confusion Matrix and Kappa value
The skewness in the data for the OTH class has led to systematic ignorance of the system to learning features related to the class thus resulting in worse results compared to other class labels of Offensiveness.

Conclusion
In this paper, I have presented a model which performs satisfactorily in the given task. The model is based on a many2one sequence learning based architecture. There is scope for improvement by including more manually extracted features (like those removed in the pre-processing step) to increase the performance. I could use only 25% of the whole dataset due to lack of computational resources. Data required for a Deep Learning model is quite high. Using the whole database would surely give excellent results. Removing the data constraints might lead to better accuracies and f1 metrics. Use of regularizers can also lead to proper generalization of model, henceforth increasing the metrics.