I2C at SemEval-2020 Task 12: Simple but Effective Approaches to Offensive Speech Detection in Twitter

This paper describes the systems developed by the I2C Group to participate in Subtasks A and B in English, and Subtask A in Turkish and Arabic, of OffensEval (Task 12 of SemEval 2020). In our experiments we compare three architectures we developed: two based on Transformers and one based on classical machine learning algorithms. The proposed architectures are described, and the results obtained by our systems are presented.


Introduction
Social media are currently among the most popular mass media. Much of the world's population communicates and expresses opinions through these media. Unfortunately, the use of offensive language is very frequent in social networks and has detrimental effects on their users. In (Salminen et al. 2020), the authors categorize hate language as abusive language, aggression, cyberbullying, hatefulness, insults, personal attacks, provocation, racism, sexism, threats or toxicity. All of these have in common the use of offensive language, which is a major threat to social media platforms.
In recent years, the automatic detection of offensive language on social media using machine learning algorithms has aroused great interest among researchers. In (Schmidt and Wiegand 2017), the authors present a survey on automatic hate speech detection using NLP, and a critical overview of how the automatic detection of hate speech in text has evolved over recent years is given in (Fortuna and Nunes 2018). Early work on offensive language detection employed features such as bag of words and word and character n-grams in conjunction with classical machine learning classifiers (Nobata et al. 2016). These NLP approaches have the drawback of being dependent on the language of the text, so other machine learning approaches such as neural networks and deep learning have recently been adopted (Pitsilis, Ramampiaro, and Langseth 2018).
Most research on offensive language detection focuses on English, and there are few studies in other languages. In this sense, the 2020 edition of OffensEval is very novel, as it also offers corpora labeled in Turkish, Arabic, Danish and Greek.
In this paper, we present our participation in OffensEval (Task 12 of SemEval 2020) as the I2C team. We focus our efforts on Subtasks A and B for English, but we have also worked on Subtask A in Arabic and Turkish.

Data description and preprocess
We participated in Subtask A (Offensive language identification) in English, Turkish and Arabic, and in Subtask B (Automatic categorization of offense types) in English.
Dataset Subtask A in English. The dataset consisted of 9,075,418 file records, each with the following format:

ID | TWEET | AVG_CONF | CONF_STD

Where ID is the tweet identifier, TWEET is the text of the tweet, AVG_CONF is the average of the confidences predicted by several supervised models for that instance, and CONF_STD is the standard deviation of the confidence for an instance and its corresponding class.
Since there was no criterion to set the threshold for selecting tweets with offensive language, and based on our experience with other similar corpora, we decided that tweets with AVG_CONF greater than 0.6 should be considered offensive (1) and, otherwise, not offensive (0). Using this split, the number of tweets with offensive content was 1,042,905 and not offensive 8,032,513 (11% of the tweets are offensive as opposed to 89% with not offensive content). Due to the large number of tweets, we decided to reduce the dataset to 20,000 randomly chosen tweets. Different types of files, balanced and imbalanced (keeping the original percentages, to carry out the experiments described in Section 3), were created. We also created a "clean tweets" version of each of these datasets, in which the "#" of each hashtag was removed and hashtags with words joined by "_" were split into simple words. The "URL" and "@user" tokens were removed, but stop words were kept so as not to lose the sentence meaning. Finally, emojis were replaced by words representing the same meaning, using the demojize tool.
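The cleaning steps above can be sketched as follows (a minimal illustration, assuming the dataset's anonymized "URL" and "@USER" placeholders; the emoji replacement with the demojize tool is omitted):

```python
import re

def clean_tweet(tweet: str) -> str:
    """Clean a tweet as described in the paper: drop URL/@USER
    placeholders, strip '#' from hashtags and split words joined
    by '_'. Stop words are deliberately kept."""
    # Remove the anonymized URL and user placeholders.
    tweet = re.sub(r"\bURL\b|@USER\b", "", tweet, flags=re.IGNORECASE)
    # "#some_hash_tag" -> "some hash tag"
    tweet = re.sub(r"#(\w+)", lambda m: m.group(1).replace("_", " "), tweet)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", tweet).strip()

print(clean_tweet("@USER check URL #free_speech now"))  # -> check free speech now
```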
Dataset Subtask B in English. The dataset consisted of 188,974 file records in the same format as Subtask A. However, the meaning of AVG_CONF is different: low values mean the tweet does not contain a targeted insult. In this case, AVG_CONF indicates a targeted insult if its value is higher than 0.6, and no targeted insult if it is lower. The number of tweets with a targeted insult is 17,825 and 171,148 without, representing 9% and 91% respectively. Again, we decided to reduce the data sample to 20,000 randomly chosen file records, in two files, one balanced and one imbalanced with the original percentages. The rest of the preprocessing was the same as for Subtask A.
Dataset Subtask A in Turkish. The Turkish dataset for Subtask A consisted of 31,756 file records with the following format (Çöltekin 2020):

ID | INSTANCE | SUBA
Where ID is the tweet identifier, INSTANCE is the text of the tweet, and SUBA takes one of two values: NOT (the tweet has no offensive language or profanity) or OFF (the tweet has offensive language or profanity).
We changed the values NOT and OFF to 0 and 1 respectively. The number of tweets labelled as NOT was 25,622, while 6,134 were labelled OFF (81% class NOT, 19% class OFF). Because there are more tools for English, we tried translating each tweet into English, but the results were not as good as we expected, so we decided not to translate and to use each tweet in its original language. The rest of the preprocessing was the same as for Subtask A in English.
Dataset Subtask A in Arabic. The Arabic dataset consisted of 8,000 file records with the same format as the Turkish dataset, and we applied the same preprocessing as for the subtask in Turkish. The class distribution is as follows: 6,411 tweets are labelled NOT, while 1,589 are labelled OFF (80% class NOT and 20% class OFF).

Methodology and experiments
For all the subtasks, the methodology behind our approach was oriented towards finding the best combination of algorithms and training datasets to build our models.
We decided to test BERT (Devlin et al. 2018), Random Forest (Breiman 2001) and K-Neighbours (Laaksonen and Oja 1996) for Subtasks A and B in English, and only BERT for Subtask A in Arabic and Turkish.
To obtain the best model for each subtask, three architectures were tested. In the first (#Architecture 1), each tweet was preprocessed (hashtags, URLs, @users, emojis, etc. were removed) and represented with a bag-of-words model, using TF-IDF to weight the words. Stop words were also removed, all words were lowercased, and lemmatization was used to reduce the vocabulary. The resulting vector is passed as input to the machine learning method. The algorithms applied were Logistic Regression, SVM (RBF kernel), XGBoost, Random Forest and K-Neighbours. We added features for the presence of hate words in each tweet, as well as lexical characteristics and entities obtained with the spaCy tool. The number of times any of these hate words appears in the tweet gives the weight of the corresponding feature. Due to the large number of features, a selection was made: we used a method that eliminates all features except the best N of them, with N set to 20, 50 and 100, using the chi-squared test as the selection criterion.
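A minimal scikit-learn sketch of this kind of pipeline on toy data (the hate-word, lexical and spaCy entity features described above are omitted, and the toy sentences and labels are our own illustration):

```python
# Bag of words with TF-IDF weighting, chi-squared selection of the
# best N features, and a classical classifier, as in #Architecture 1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["you are awful", "have a nice day",
         "awful horrible person", "nice person"]
labels = [1, 0, 1, 0]  # 1 = offensive, 0 = not offensive (toy labels)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("select", SelectKBest(chi2, k=2)),  # best N features (N = 20/50/100 in the paper)
    ("clf", LogisticRegression()),
])
pipe.fit(texts, labels)
print(pipe.predict(["awful person"]))  # -> [1]
```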
For this architecture, all experiments used balanced and imbalanced datasets of 20,000 random tweets (80% training, 20% test). For the balanced dataset, 10,000 random tweets were selected from each of the classes in the original file; for the imbalanced dataset, we kept the same proportions as the original files, that is, 89% and 11% for Subtask A, and 91% and 9% for Subtask B.
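The construction of the two samples can be sketched as follows (a hypothetical helper using only the standard library; the function name and record layout are our own):

```python
import random

def sample_datasets(records, n_total=20000, seed=0):
    """Build a balanced sample (equal tweets per class) and an
    imbalanced sample that keeps the original class proportions.
    `records` is a list of (text, label) pairs with labels 0/1."""
    rng = random.Random(seed)
    pos = [r for r in records if r[1] == 1]
    neg = [r for r in records if r[1] == 0]
    # Balanced: n_total/2 random tweets from each class.
    balanced = rng.sample(pos, n_total // 2) + rng.sample(neg, n_total // 2)
    # Imbalanced: keep the original proportion of offensive tweets.
    n_pos = round(n_total * len(pos) / len(records))
    imbalanced = rng.sample(pos, n_pos) + rng.sample(neg, n_total - n_pos)
    return balanced, imbalanced
```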
The second tested architecture (#Architecture 2) is based on generating features with a set of Transformer models and using them to feed the same machine learning methods mentioned above (Figure 1). We decided to test the models BERT, TransformerXL, XLM, XLNet, DistilBERT, RoBERTa and XLM-RoBERTa. We did not apply preprocessing, in order to benefit from the representation capabilities of the transformers. We tokenized the tweets with the specific tokenizer provided for each model.
The third tested approach (#Architecture 3) is based only on transformers. We decided to test the BERT, RoBERTa, DistilBERT, ALBERT and XLM-RoBERTa pretrained models. Again, we did not apply preprocessing, in order to benefit from the representation capabilities of the transformers, and we tokenized the tweets with the specific tokenizer provided for each model. For fine-tuning BERT and the other models, we used the default parameters of their respective repositories, but trained for 2 epochs with a maximum sequence length of 128.
Because the original file was huge and imbalanced, for #Architecture 2 and #Architecture 3 we carried out two tests: one with a random dataset with the same proportion of tweets per class as the original file supplied by the organizers, and another with a balanced dataset (both datasets with 20,000 tweets), using a maximum sequence length of 128.

Results
In this section, we describe the systems developed for the SemEval20 Subtask A in English, Turkish and Arabic and Subtask B in English.
Subtasks A and B in English. First, the corpus of tweets provided by the organizers was preprocessed. Using #Architecture 1 described in the previous section, we tested the Logistic Regression, Random Forest, XGBoost, SVM and K-Neighbours algorithms. Results are shown in Tables 1, 2, 3 and 4 (best result marked in bold). Before testing #Architecture 2, we decided to test how transformers behave with balanced and imbalanced datasets. As for #Architecture 1, two datasets of 20,000 tweets each were built: for the balanced dataset, 10,000 tweets from each class were randomly selected, and for the imbalanced dataset, we selected 20,000 tweets with the same class ratio as the initial dataset.

The results are shown in Table 5 for Subtask A in English and in Table 6 for Subtask B. For both subtasks, as expected, the balanced dataset yielded, on the whole, the best results. Values of 0.33 and 0.34 were obtained when the model classified all instances as belonging to a single class.
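Scores around 0.33 are exactly what macro-averaged F1 produces for such a degenerate classifier: on a balanced test set, the predicted class gets precision 0.5 and recall 1.0 (F1 = 2/3), the other class gets F1 = 0, and the macro average is 1/3. A small check (assuming macro-averaged F1 was the metric behind these values):

```python
from sklearn.metrics import f1_score

y_true = [0] * 50 + [1] * 50   # balanced test set
y_pred = [1] * 100             # degenerate model: everything is class 1
# Class 1: precision 0.5, recall 1.0 -> F1 = 2/3; class 0: F1 = 0.
print(round(f1_score(y_true, y_pred, average="macro"), 2))  # -> 0.33
```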
Using a balanced dataset, we decided to fine-tune the pretrained models shown in Tables 5 and 6 for Subtasks A and B in English respectively. The best results were obtained by the bert-base-uncased and distilbert-base-cased pretrained models. Our approach for Subtasks A and B in English is based on bert-base-uncased fine-tuned with a dataset of 20,000 random tweets without the cleaning phase.
Results for the balanced training dataset for #Architecture 2 are shown in Tables 7 and 8.
Subtask A in Arabic. For this task, we decided to test only the BERT multilingual-cased pretrained model. Again, the supplied file was imbalanced. Due to the small number of records in the minority class, the original file was divided into train (85%) and test (15%), and several tests were carried out to find out which balancing technique was best suited to the multilingual BERT pretrained model. As can be seen in Table 12, the SMOTE-Tomek approach reached the best results, so we fine-tuned the multilingual BERT pretrained model with a balanced SMOTE-Tomek dataset obtained from the full dataset supplied by the organizers. We also tried translating each tweet into English, but the results did not improve. Results are shown in Table 12.

Conclusions
In this paper, we have presented I2C's contribution to SemEval20 for Subtask A in English, Arabic and Turkish and Subtask B in English. In order to achieve optimal results in the challenge proposed by the organization, three architectures were developed and tested: the first was based on classical machine learning methods for text classification; the second used a mixed architecture that generates features to embed the tweets and feed classical machine learning models; and in the last one we fine-tuned a set of pretrained Transformer models to classify the tweets. Among the three architectures, the one based only on transformers obtained the best results, in particular BERT using the bert-base-uncased pretrained model for the two tasks in English, and BERT using the bert-base-multilingual pretrained model for the tasks in Arabic and Turkish. We consider that more research should be devoted to multilingual solutions. Our experiments were done using the default parameters, so there is room for improvement through many adjustments that we plan to consider in future work.