UTFPR at SemEval 2020 Task 12: Identifying Offensive Tweets with Lightweight Ensembles

Offensive language is a common issue on social media platforms nowadays. In an effort to address this issue, the SemEval 2020 event held the OffensEval 2020 shared task, in which participants were challenged to develop systems that identify and classify offensive language in tweets. In this paper, we present a system that uses an ensemble model stacking a BOW model and a CNN model, which placed 29th in the ranking for English sub-task A.


Task Summary
The OffensEval shared task is part of the SemEval 2020 workshop and consists of developing a classification system for each of the sub-tasks described below:
• Task A: Categorization of a tweet as offensive or not.
• Task B: Classification of an offensive tweet as targeted at someone or untargeted.
• Task C: Identification of the target of the offensive tweet (individual, group, or other).
To help the participants create a solution for the shared task, datasets in five different languages were provided (Arabic (Mubarak et al., 2020), Danish (Sigurbergsson and Derczynski, 2020), English, Greek (Pitenis et al., 2020), and Turkish (Çöltekin, 2020)). However, only the English dataset covered all three sub-tasks; the other languages ran only sub-task A. Our team focused on the English dataset, so only this dataset is described below. For sub-task A, the organizers provided a dataset containing 9,087,118 instances; for sub-tasks B and C, datasets containing 188,973 instances each.
Our team focused on developing a system based on the English dataset, but only for sub-task A. For this sub-task, a total of 9,087,118 instances were released as the training and development set, and 3,887 instances as the test set.
Analyzing the training/dev sets, we found that they are tabular datasets with four columns: the first is the id, the second is the sentence, the third is the average, and the last is the standard deviation (see Table 1). The values in the third and fourth columns were estimated by a weakly supervised model trained by the task organizers. The average column ranges from 0 to 1, where a value near 0 indicates a less offensive sentence and a value near 1 indicates a more offensive sentence.
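For illustration, this four-column layout can be read with Python's csv module. The sketch below assumes a tab-separated file; the sample rows and dictionary keys are our own, not taken from the released data:

```python
import csv
import io

# Toy sample in the four-column layout: id, sentence, average, standard deviation.
sample = "1001\tthis is a tweet\t0.21\t0.11\n1002\tanother tweet\t0.87\t0.05\n"

rows = []
for tweet_id, sentence, average, std in csv.reader(io.StringIO(sample), delimiter="\t"):
    rows.append({
        "id": tweet_id,
        "sentence": sentence,
        "average": float(average),  # offensiveness score in [0, 1]
        "std": float(std),          # standard deviation of the score
    })

print(rows[1]["average"])  # -> 0.87
```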

Pre-processing
As the objective of this task is to classify a sentence as offensive (OFF) or not offensive (NOT) but the data was labeled only with numerical offensiveness scores, we set a threshold of 0.5 to determine what is offensive and what is not. Every sentence with an average value under 0.5 was labeled as not offensive (NOT), while every sentence with an average value equal to or above 0.5 was labeled as offensive (OFF). After doing that, we analyzed this new dataset and observed that it was imbalanced, with considerably more NOT labels than OFF labels. To balance it and improve the performance of our system, we used an under-sampling method to obtain equal counts for each label: sentences with the NOT label were randomly removed until the dataset was completely balanced (see Table 2). After balancing the dataset, no further processing was done on the training/dev data. We also decided to create a pseudo test set using the dataset provided in OffensEval 2019, so that we could use the OffensEval 2020 dev set for validation during training. The pseudo test set is a combination of the training and test sets provided for OffensEval 2019 sub-task A.
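The thresholding and under-sampling steps above can be sketched as follows. This is a minimal illustration on toy data; the 0.5 threshold matches the one described in the text, while the toy instances and the random seed are our own:

```python
import random
from collections import Counter

def label(avg, threshold=0.5):
    # Average >= 0.5 is offensive (OFF), otherwise not offensive (NOT).
    return "OFF" if avg >= threshold else "NOT"

# Toy (sentence, average) pairs; the real data has far more NOT than OFF instances.
data = [("a", 0.1), ("b", 0.2), ("c", 0.3), ("d", 0.4), ("e", 0.6), ("f", 0.9)]
labeled = [(sent, label(avg)) for sent, avg in data]

# Under-sample the majority class down to the size of the minority class.
rng = random.Random(42)
counts = Counter(lab for _, lab in labeled)
minority = min(counts, key=counts.get)
kept = [x for x in labeled if x[1] == minority]
majority = [x for x in labeled if x[1] != minority]
kept += rng.sample(majority, counts[minority])

print(Counter(lab for _, lab in kept))  # balanced: equal OFF and NOT counts
```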

Methodology
BOW: The bag-of-words model is a method to represent texts numerically. Given a set of texts, this model counts how many times each word appears in each text. Thus, using the data provided for the task, the whole training set is the corpus from which the vocabulary is built, and each sentence is then vectorized into an array where each column represents how many times the corresponding vocabulary word appeared in the sentence.
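As an illustration, a bag-of-words vectorizer can be built from scratch in a few lines. This is a generic sketch of the technique, not the exact vectorizer spaCy uses internally:

```python
from collections import Counter

def build_vocab(corpus):
    # Vocabulary: every distinct token in the training corpus, in sorted order.
    return sorted({tok for sent in corpus for tok in sent.lower().split()})

def vectorize(sentence, vocab):
    # Each position counts how often that vocabulary word appears in the sentence.
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

corpus = ["you are great", "you are so so bad"]
vocab = build_vocab(corpus)
print(vocab)                           # -> ['are', 'bad', 'great', 'so', 'you']
print(vectorize("so bad bad", vocab))  # -> [0, 2, 0, 1, 0]
```

Words outside the training vocabulary are simply dropped, which is the usual behavior of a plain bag-of-words representation.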
CNN: The Convolutional Neural Network (CNN) was first presented as an approach to object recognition in 1999 (LeCun et al., 1999). Since then, this kind of neural network has been applied as a solution to many other problems, including Natural Language Processing tasks such as text classification (Bhandare et al., 2016). A CNN model can detect variations of patterns in the input data, which enables good feature extraction even from noisy data (Kim, 2014). Looking for a robust and computationally efficient model, we chose this approach to address the task. This choice was also supported by the good results obtained by other teams that used CNN models at OffensEval 2019 (Zampieri et al., 2019).
spaCy TextCategorizer: The TextCategorizer 1 is a spaCy 2 model that, as its name suggests, is used for text classification, and it is the model used for the system herein described. The TextCategorizer takes four parameters: vocab, a Vocab object holding the vocabulary that will be used by the model; model, the language model which, if not provided, will be created based on the data; exclusive_classes (True or False), which decides whether the provided classes are mutually exclusive; and architecture ("ensemble", "simple_cnn", or "bow"), which is the architecture used by the classifier. Below are some relevant parameters we used for our model.
Training: We used the OffensEval 2020 training set to train the models, the OffensEval 2020 dev set to validate the models during training, and chose as our final submitted model the one with the highest score on the pseudo test set.
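The stacking idea behind the "ensemble" architecture can be illustrated with a toy sketch that averages the probabilities of two base models. This is our own simplification for illustration; the stand-in scoring functions are hypothetical and do not reflect spaCy's actual implementation:

```python
def bow_score(sentence):
    # Stand-in for the BOW model: flags sentences containing a listed bad word.
    bad_words = {"idiot", "stupid"}
    return 0.9 if any(w in bad_words for w in sentence.lower().split()) else 0.1

def cnn_score(sentence):
    # Stand-in for the CNN model: a fixed heuristic here, a learned score in practice.
    return 0.6 if "you" in sentence.lower().split() else 0.3

def ensemble_predict(sentence, threshold=0.5):
    # Average the two model probabilities and threshold the result.
    score = (bow_score(sentence) + cnn_score(sentence)) / 2
    return "OFF" if score >= threshold else "NOT"

print(ensemble_predict("you are an idiot"))  # -> OFF
print(ensemble_predict("have a nice day"))   # -> NOT
```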

Performance on Shared Task
The official evaluation metric for the OffensEval 2020 shared task is the macro-averaged F1-score. Table 4 shows the results obtained by the UTFPR system on the pseudo test set and the official OffensEval 2020 test set. As can be seen, the official test set scores were better than expected given the pseudo test set results. Analyzing the errors our system made, we found that it tends to classify as offensive ("OFF" label) tweets that contain bad words even when these words are used for emphasis, and it misclassifies as not offensive ("NOT" label) tweets that carry any level of irony.
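For reference, the macro-averaged F1-score computes the F1 of each class independently and averages the results, which weighs OFF and NOT equally regardless of class frequency. A minimal sketch with toy labels of our own:

```python
def macro_f1(y_true, y_pred, labels=("OFF", "NOT")):
    # Per-class F1 from true/false positives and false negatives, then averaged.
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / len(f1s)

y_true = ["OFF", "OFF", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT"]
print(round(macro_f1(y_true, y_pred), 4))  # -> 0.7333
```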

Table 4: Macro-averaged F1-scores of the UTFPR system on the pseudo test set and the official test set.

Dataset            F1-score
Pseudo test set    0.793
Official test set  0.909

Table 5 shows the top 3 and bottom 3 team scores on sub-task A. The score our system obtained placed us 29th in the ranking, less than 2 F-score percentage points below first place.

Conclusions
The UTFPR system herein presented for the OffensEval sub-task A used an ensemble model that stacks a BOW model and a CNN model. Despite being lightweight and easily adaptable to low-resource languages, our model performs well when compared to more sophisticated and resource-dependent systems: it placed 29th out of 82 participants, less than 2 F-score percentage points behind first place. Malicious users on social media platforms often write their posts using spelling variations of bad words to bypass algorithms that check for offensive language. Because of that, we plan to use text normalization techniques (mainly on misspelled words) in future work to train more robust and reliable models for this task.