Voice@SRIB at SemEval-2020 Tasks 9 and 12: Stacked Ensembling Method for Sentiment and Offensiveness Detection in Social Media

On social-media platforms such as Twitter, Facebook, and Reddit, people often use code-mixed languages such as Spanish-English or Hindi-English to express their opinions. In this paper, we describe the different models we used for the Sentimix and OffensEval tasks, including the use of an external dataset to train embeddings and our ensembling methods. Pre-trained embeddings usually help in multiple tasks such as sentence classification and machine translation. In our experiments, we applied both our own trained code-mixed embeddings and Twitter pre-trained embeddings to the SemEval tasks. We evaluate our models with macro F1-score, precision, accuracy, and recall on the datasets. We show that hyper-parameter tuning and data pre-processing steps substantially improve the scores. In our experiments, we achieve a 0.886 macro F1-score on the OffensEval Greek-language subtask in the post-evaluation period, whereas the highest score during the evaluation period was 0.852. We stood third in the Spanglish competition with our best F1-score of 0.756. Our Codalab username is asking28.


Introduction
SemEval Task 9, Sentimix (Patwa et al., 2020), is divided into two code-mixed subtasks, one for Hinglish (Hindi-English) and the other for Spanglish (Spanish-English). In the Spanglish task, the dataset contains tweets in Spanglish code-mixed language, labeled with three sentiment categories: positive, negative, and neutral. The task is to classify code-mixed tweets into these three sentiments. SemEval Task 12, OffensEval, is divided into subtasks for the English, Danish (Sigurbergsson and Derczynski, 2020), Arabic (Mubarak et al., 2020), Turkish (Çöltekin, 2020), and Greek (Pitenis et al., 2020) languages. The English task is further divided into three subtasks: A, B, and C. Subtask-A of OffensEval is offensive-language identification, Subtask-B is the categorization of offense types into targeted and untargeted, and Subtask-C is offensive-target identification as individual, group, or other.
In the last decade, there has been a proliferation in the use of social media websites, which has led to the pervasive use of hate-inducing speech and offensive language to express opinions. The use of profane language has been growing in face-to-face interactions as well as online communications in recent years. The anonymity provided by these websites and the lack of stringent action have led people to adopt aggressive behavior. Youth who experienced cyberbullying, as either an offender or a victim, had more suicidal thoughts and were more likely to attempt suicide than those who had not experienced such forms of peer aggression (Hinduja and Patchin, 2010). Hence it is necessary to automatically remove offensive and profane language in online environments.
Since the inflow of such content is huge, manual filtering is time-consuming and labor-intensive, making it almost impractical. For this reason, researchers have proposed automating the filtering process by training machine learning models on pre-annotated datasets: hate speech and offensive language (Davidson et al., 2017a; Malmasi and Zampieri, 2017), cyberbullying (Xu et al., 2012), and racism detection (Tulkens et al., 2016).
In this work (team name SRIB2020), we classify code-mixed tweets in different language pairs for the Sentimix tasks and monolingual tweets in different languages for the OffensEval tasks. In the OffensEval tasks, tweets are classified as offensive or non-offensive, whereas in the Sentimix tasks, tweets are classified as positive, negative, or neutral. In the Sentimix task, the "neutral" class is a bit ambiguous, as many tweets in the dataset with clearly positive or negative sentiment are labeled as neutral. The "neutral" class has a very thin boundary with the other two classes, "positive" and "negative". Consider two examples:
1. ID-7229 - WOO hoo Cricket world cup starts today. Good luck to host @englandcricket hope for a good start. - This sentence is positive in tone, but it is labeled as neutral.
2. ID-8199 - @hardikpandya7 best wishes for WorldCup and Eid-Mubarak from MUJAFFAR Hasan National General Secretary LJP URL - This tweet is positive in its sentiment, but it is labeled as neutral.
(Lal et al., 2019) first generate subword-level representations for sentences using a CNN architecture. The generated representations are used as inputs to a Dual Encoder Network consisting of two different BiLSTMs: the Collective and the Specific Encoder. The Collective Encoder captures the overall sentiment of the sentence, while the Specific Encoder uses an attention mechanism to focus on individual sentiment-bearing sub-words. (Sharma et al., 2016) have annotated data and developed a language identifier, a normalizer, a part-of-speech tagger, and a shallow parser for sentiment analysis of code-mixed data. (Pravalika et al., 2017) used a lexicon-lookup approach to perform domain-specific sentiment analysis. (Joshi et al., 2016) introduce learning sub-word-level representations in an LSTM (Subword-LSTM) architecture instead of character-level or word-level representations; this enables the model to learn information about the sentiment value of meaningful morphemes. (Choudhary et al., 2018) use the shared parameters of siamese networks to map sentences of code-mixed and standard languages to a common sentiment space, and introduce a primary clustering-based preprocessing method to capture variations of code-mixed transliterated words. Supervised learning techniques for hate detection, offensiveness detection, and target and sentiment classification on social media datasets have been explored in recent times. (Davidson et al., 2017b) described a multi-class classification of offensive language and hate speech in tweets using SVM, random forest, naive Bayes, and logistic regression. (Del Vigna et al., 2017) reported that a simple LSTM classifier performed no better than an ordinary SVM when evaluated on a small sample of Facebook data with only two classes (Hate, No-Hate) and three different levels of strength of hatred. (Pitsilis et al., 2018) propose a detection scheme that is an ensemble of recurrent neural network (RNN) classifiers; it incorporates various features associated with user-related information, such as the users' tendency towards racism or sexism.
The contributions of this paper can be summarised in the following key points: 1. We applied a variation of Focal Loss by using class weights along with the gamma parameter in the loss function.
2. We applied multiple preprocessing steps to the raw text; since on social media platforms people tend to use incorrect grammatical forms and spellings, this helped us increase F1-scores.
3. For the Hinglish subtask, we trained our own word embeddings by collecting code-mixed datasets from multiple sources.

The rest of the paper is organized as follows: Section 2 presents our methodology: data description, pre-processing steps, model description, and parameter tuning. Section 3 presents the experiments performed on the different models and their results. Finally, Section 4 discusses the conclusions drawn from these experiments and future work. Code is available at github 1.

Data Preprocessing

• In the demojisation step, the different types of emojis present in the corpus are converted into their corresponding text representations. Since the combined datasets contain a large number of tweets with many different types of emojis, it becomes necessary to convert them into text representations using the cheatsheet list 2.
• Different types of patterns were removed: URLs were replaced with a URL token, @USERNAME was converted to a USER token, and the # symbol of hashtags was removed from the dataset. The dataset is also cleaned of punctuation marks, as they are not needed to train the embeddings.
• Acronyms and contractions were replaced with their corresponding English words, using a dictionary mapping acronyms and contractions to their expanded forms. Acronyms such as 4ever are converted to forever, abt to about, cb to comeback, etc.; these are commonly used on social media platforms. Contractions such as can't, aren't, and i've were likewise converted to their full forms: cannot, are not, and I have.
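The cleaning steps above can be sketched in a few lines of Python; the dictionaries here are tiny illustrative samples, not our full acronym and contraction lists:

```python
import re

# Small illustrative samples of the acronym/contraction dictionaries.
ACRONYMS = {"4ever": "forever", "abt": "about", "cb": "comeback"}
CONTRACTIONS = {"can't": "cannot", "aren't": "are not", "i've": "i have"}

def clean_tweet(text):
    """Replace URLs/usernames, strip '#', and expand acronyms/contractions."""
    text = re.sub(r"https?://\S+|www\.\S+", "URL", text)  # URLs -> URL token
    text = re.sub(r"@\w+", "USER", text)                  # @username -> USER token
    text = text.replace("#", "")                          # drop the hashtag symbol
    tokens = []
    for tok in text.split():
        low = tok.lower()
        tokens.append(CONTRACTIONS.get(low, ACRONYMS.get(low, tok)))
    return " ".join(tokens)

print(clean_tweet("@bob can't wait 4ever see http://t.co/x #happy"))
# -> "USER cannot wait forever see URL happy"
```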
Spanglish Data Preprocessing - The NLTK Snowball Stemmer 3 package was used because it supports stemming in both English and Spanish; this flexibility played a key role in the Spanglish sentiment analysis system. The list of stop words was constructed from the stop-words corpus provided in NLTK: while pre-processing tokenized tweets, any word included in the NLTK English stop-word corpus is excluded. Close attention is paid to elongated words (e.g. "helloooooo", "orrrrrale"); after considering the possible features of elongated words, spelling normalization is applied to these tokens. It would also be beneficial to apply spelling normalization to slang or purposely misspelled words that are common in tweets and other informally written texts. We removed character repetition by deleting characters that occurred more than two times continuously, and emoticons are replaced with their corresponding text in the tweets.

English Data Preprocessing - Pre-processing steps such as emoticon replacement, contraction replacement, and acronym replacement are done in the same manner as for the previous datasets. On social media platforms, people tend to use short forms; for example, forget may be written as frgt. To deal with this problem we applied multiple spell-correction steps using PySpellchecker 4, which uses a Levenshtein-distance algorithm to find permutations within an edit distance of 2 from the original word. It then compares all permutations (insertions, deletions, replacements, and transpositions) to known words in a word-frequency list; words that occur more often in the frequency list are more likely to be the correct results. Finally, we delete characters having more than two continuous occurrences, as it is very rare for a character to occur more than twice in a row.
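To illustrate the spell-correction idea, here is a Norvig-style sketch of the edit-distance-2 candidate generation that PySpellchecker performs internally, together with the repeated-character removal step; the word-frequency list below is a toy stand-in for the real one:

```python
import re
from collections import Counter

# Toy word-frequency list; PySpellchecker ships a much larger one.
WORD_FREQ = Counter({"forget": 100, "forged": 5, "format": 20, "hello": 50})

def collapse_repeats(word):
    """Keep at most two continuous occurrences of any character."""
    return re.sub(r"(.)\1{2,}", r"\1\1", word)

def edits1(word):
    """All strings within edit distance 1: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Pick the most frequent known word within edit distance <= 2."""
    if word in WORD_FREQ:
        return word
    candidates = edits1(word) & set(WORD_FREQ)
    if not candidates:
        candidates = {e2 for e1 in edits1(word) for e2 in edits1(e1)} & set(WORD_FREQ)
    return max(candidates, key=WORD_FREQ.get) if candidates else word

print(collapse_repeats("helloooooo"))  # -> "helloo"
print(correct("frgt"))                 # -> "forget"
```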
Turkish Data Preprocessing - We followed the pre-processing steps mentioned above, with an extra step of Turkish word lemmatization using the lemmatization model by (Sak et al., 2008), which was trained on nearly one million Turkish sentences.
Arabic, Danish, and Greek Data Preprocessing - Arabic data is first transliterated to Roman script using the Classical Language Toolkit (CLTK) 5, and then all the steps used for the other languages are applied. Danish data was used as-is. For the Greek-language competition, we applied the Greek stemmer from 6 and then followed the pre-processing steps described above.

Model Description
We used an ensemble model for all of the tasks mentioned above, combining CNN-, self-attention-, and LSTM-based models.
Step 3: Learn a meta classifier H
12:   learn H based on D_h
13:   return H

In the above algorithm, T base-level classifiers are trained on the training dataset D. These base classifiers are denoted h_t, where t ranges from 1 to T. In the second step, a new dataset D_h is created for the meta classifier, in which the inputs are the base classifiers' outputs and the output is y_i. Once this dataset is created, the meta classifier H is trained on D_h. Finally, the algorithm returns the trained meta classifier H.
A stacking ensemble is used to train the RNN-, CNN-, and sequential self-attention-with-LSTM-based architectures together in our model. In stacking, the algorithm takes the outputs of the sub-models as inputs and attempts to learn how best to combine them to produce better predictions. The idea of stacking is to learn several weak learners and combine them by training a meta-model that outputs predictions based on the multiple predictions returned by these weak models (Zhou, 2012). In our case, we train the three different deep learning models on the labeled tweets separately. The outputs of these models are then used as the independent variables for training the stacked model, with the same labels as in the previous steps.
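The stacking procedure can be sketched with scikit-learn, using three simple classifiers as stand-ins for our deep models (CNN, LSTM, and self-attention) and synthetic data in place of the tweet features:

```python
# Minimal stacking sketch: simple scikit-learn classifiers stand in for
# our three deep models; synthetic data stands in for tweet features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: train the T base classifiers h_t on the training data D.
base = [LogisticRegression(max_iter=1000), GaussianNB(),
        DecisionTreeClassifier(max_depth=3)]
for h in base:
    h.fit(X_tr, y_tr)

# Step 2: build the meta dataset D_h from the base classifiers' outputs.
def meta_features(X):
    return np.column_stack([h.predict_proba(X)[:, 1] for h in base])

# Step 3: learn the meta classifier H on D_h and return it.
H = LogisticRegression().fit(meta_features(X_tr), y_tr)
print("stacked accuracy: %.3f" % H.score(meta_features(X_te), y_te))
```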

LSTM
After preprocessing, the dataset is split into two parts: a training set (90%) and a validation set (10%). For the recurrent-neural-network-based model, we used a single LSTM layer with 256 cell units, followed by a MaxPooling layer to take the maximum over all tokens, and then four dense layers with Dropout (dropout rate = 0.3) and BatchNormalization. The final layer is a softmax or sigmoid layer depending on the task: softmax is used in the Sentimix tasks and OffensEval English Subtask-C, and sigmoid is used for the remaining subtasks. The model is trained with the focal loss function (Lin et al., 2017) and the Adam optimizer (Kingma and Ba, 2014), with accuracy and F1-score as metrics. Since a tweet is limited to 140 characters including spaces, we set the maximum number of words in a sentence to 25, assuming an average word length of 5 characters plus 3 to 4 spaces.
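Our class-weighted variation of the focal loss can be sketched in NumPy as follows; the alpha weights and gamma value shown here are illustrative, not the tuned values:

```python
import numpy as np

def focal_loss(probs, targets, alpha, gamma=2.0):
    """Class-weighted focal loss: mean of -alpha_c * (1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class c."""
    p_t = probs[np.arange(len(targets)), targets]  # prob of the true class
    a_t = alpha[targets]                           # per-class weight
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t)))

probs = np.array([[0.9, 0.05, 0.05],   # confident, correct prediction
                  [0.4, 0.3, 0.3]])    # uncertain prediction
targets = np.array([0, 0])
alpha = np.array([1.0, 1.0, 1.0])      # illustrative unit class weights

# With gamma=0 and unit weights, this reduces to plain cross-entropy;
# gamma>0 down-weights the already well-classified example.
print(focal_loss(probs, targets, alpha, gamma=0.0))
print(focal_loss(probs, targets, alpha, gamma=2.0))
```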

CNN
A stack of convolutional neural networks (CNNs) is used to capture the hierarchical hidden relations among the embedding features. We trained a CNN model with three convolution layers with filter sizes of 3, 4, and 5 respectively, three max-pooling layers with a filter size of 2 and a stride of 2, and dense layers of size 4096 and 2048 with a Dropout rate of 0.2. The last dense layer is connected to a softmax or sigmoid layer depending on the task. The model is trained with the focal loss function (Lin et al., 2017) and the Adam optimizer (Kingma and Ba, 2014), with accuracy and F1-score as metrics. Tweets are padded in the same way as in the LSTM model.
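The convolution and max-pooling operations over padded token embeddings can be illustrated with a minimal NumPy sketch (a single filter of width 3 with 'valid' padding; the real model uses many filters of sizes 3, 4, and 5):

```python
import numpy as np

def conv1d_valid(seq, kernel):
    """1-D convolution over a (timesteps, emb_dim) sequence with one
    (width, emb_dim) filter, 'valid' padding: one value per window."""
    width = kernel.shape[0]
    n = seq.shape[0] - width + 1
    return np.array([np.sum(seq[i:i + width] * kernel) for i in range(n)])

def max_pool(x, size=2, stride=2):
    """Max pooling with the given window size and stride."""
    return np.array([x[i:i + size].max()
                     for i in range(0, len(x) - size + 1, stride)])

rng = np.random.default_rng(0)
seq = rng.normal(size=(25, 8))       # 25 padded tokens, 8-dim embeddings
kernel = rng.normal(size=(3, 8))     # one filter of width 3
feat = conv1d_valid(seq, kernel)     # 25 - 3 + 1 = 23 feature values
pooled = max_pool(feat)              # pooled down to 11 values
print(feat.shape, pooled.shape)      # (23,) (11,)
```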

Sequential self-attention model
We used attention as described in (Bahdanau et al., 2014) after a Gated Recurrent Unit layer (Chung et al., 2014), returning the cell output at each step. This model has 256 GRU cells, and the output state returned at each step is fed to a self-attention layer. This is followed by the same number of dense layers and the same dropout rate as in the LSTM model; the remaining parameters are also the same as in the LSTM model.
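The additive attention of (Bahdanau et al., 2014) over the GRU output states can be sketched in NumPy; the projection matrices here are randomly initialized purely for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(states, W, v):
    """Score each hidden state with v^T tanh(W h_t), normalize with softmax,
    and return the attention-weighted context vector plus the weights."""
    scores = np.array([v @ np.tanh(W @ h) for h in states])
    weights = softmax(scores)
    context = (weights[:, None] * states).sum(axis=0)
    return context, weights

rng = np.random.default_rng(0)
states = rng.normal(size=(25, 256))    # GRU output state at each of 25 steps
W = rng.normal(size=(64, 256)) * 0.05  # attention projection (illustrative)
v = rng.normal(size=(64,))             # scoring vector (illustrative)

context, weights = additive_attention(states, W, v)
print(context.shape, round(weights.sum(), 6))  # (256,) 1.0
```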

Experiments and Results
We performed our experiments with three deep-learning models: CNN, LSTM, and an ensemble of CNN, LSTM, and self-attention. In Spanglish, we achieved a macro F1-score of 0.770, precision of 0.749, and recall of 0.803 with the ensemble model on pre-processed data, whereas the corresponding figures were 0.709, 0.755, and 0.672 on raw text without pre-processing. In the Hinglish challenge, the ensemble model outperformed the other models on the pre-processed dataset: it achieved an F1-score of 0.682, precision of 0.695, and recall of 0.679, whereas the same model trained on raw data achieved 0.665, 0.681, and 0.665 respectively. From these results we infer that the cleaning steps helped to improve the results. We also experimented with and without pretrained embeddings in the Hinglish task, but they did not improve the scores. We trained the Hinglish embeddings with the FastText library (Bojanowski et al., 2016) on code-mixed Hinglish data collected from various sources, such as blogs and scraped Twitter data. Post-evaluation on the Spanglish task was not performed, since gold labels were not released after the competition. Error analyses of the Hinglish and Spanglish tasks are presented in Appendices A and B respectively.
BERT-multilingual and BERT-uncased models (Devlin et al., 2018) are trained and fine-tuned by adding three delta (dense) layers on top of the pre-trained models. We trained the BERT models in two ways: one by freezing the BERT pre-trained parameters, and the other by keeping the parameters trainable during the complete training process. From our experiments on the Hinglish and Spanglish datasets using these models and techniques 7, we found that the BERT-uncased model performed better than the BERT-multilingual model, and that keeping the pre-trained BERT parameters trainable during the complete process performed better than freezing them during fine-tuning. We attribute this behavior to the difference in data distributions between BERT pre-training and the Sentimix tasks.

(Table 3: Spanglish test-set evaluation.)

In the Greek task, our model achieved an F1-score of 0.886, which is more than the highest F1-score of 0.852 achieved during the evaluation period. In English Subtask-B, our model 9 achieved an F1-score of 0.685 in post-evaluation, which was 0.580 during the evaluation period. In Subtasks A and C, the F1-scores in the post-evaluation period are 0.9084 and 0.5106 respectively, with very small differences from the evaluation period. In the Danish task, our model 10 achieved an F1-score of 0.6585 in the post-evaluation period and 0.613 during the evaluation period. For the Turkish and Arabic tasks there are small differences between the evaluation and post-evaluation results. Error analyses of English Subtasks B and C are presented in Appendices C and D respectively.

Conclusion and Future work
In this paper, we present a description of the system we used in all the OffensEval and Sentimix tasks. With our best model, we achieved third position in the Spanglish task during the evaluation period. In post-evaluation experiments, our model achieves an F1-score higher than the highest score of the Greek task evaluation period. From our experiments, we found that the pre-processing steps played a huge role in increasing F1-scores; we also present the different data pre-processing steps that proved important. Since we used deep learning models, our system could not perform very well on tasks where the dataset was small, as in English Subtask-C. In the English subtasks we experimented with pre-trained embeddings 11 trained on a Twitter corpus, and found that they helped increase the F1-score in the English subtasks but did not help in the Hinglish task. The results obtained on the test data are lower than those obtained on the development set. We infer from our experiments that the F1-score partly depends on the distribution of classes in the training and development data used to tune hyper-parameters. In most of the tasks, the data is not distributed equally among the classes; exploratory data analysis reveals a huge difference in class distribution in the datasets.
Our system presents a solid baseline for sentiment analysis of code-mixed languages and offensiveness detection in multiple languages. In future work, we plan to add handcrafted features alongside the current features and train different machine learning models on them. We also plan to explore data augmentation techniques, as deep learning models need large amounts of data to train. The corpora available for training code-mixed language models, and models for languages other than English, are very small compared to the corpora used to train English language models; a lot of research needs to be done in this direction.
A Hinglish Error Analysis

We analyzed the validation tweets whose labels were incorrectly predicted. Our study found that in most cases, either the predicted class or the ground-truth class was labeled as neutral. In the Hinglish dataset, our model failed on 747 out of 1869 tweets in the validation set, and of those 747 tweets, 618 (82%) were labeled as "neutral" in either the ground-truth or the predicted labels. From this, we can say that the "neutral" class is a bit ambiguous in the Sentimix dataset. Consider the tweet "rubika di umar mein aap se kaafi chota hun par am big fan of yours kabhi naseeb ne chaha to". It is labelled as "neutral", whereas we feel it should be labeled as "positive". In most cases, our proposed model gets confused when a tweet is or can be labeled as "neutral".

B Spanglish Error analysis
We found a similar distribution in the Spanglish dataset. We performed validation on 2998 tweets, of which 1025 were erroneous predictions. Of these 1025 tweets, 764 (74%) were labeled as "neutral" in either the predicted or the ground-truth class. Hence, from our analysis, we find that labeling a tweet as "neutral" is somewhat ambiguous even for human annotators.

C OffensEval English Subtask-B Error Analysis
We analyzed the categorization of offensive tweets on 37795 tweets as validation data. On this validation set, we got 7339 incorrect predictions, and of these, 6912 tweets labeled as "targeted" were incorrectly predicted as "untargeted". Most targeted tweets contain pronouns like "you, your, they, them, ur, u, she, her, these, him, his, he", names of the personalities who are targeted, and words like "people, bitch, boys, girls, variations of nigga". Our model is biased towards such words in a sentence: it tends to predict any sentence containing such terms as "targeted". We feel that the dataset includes some incorrect labels, for example "might fuck around and sleep without my feet covered", which is not explicitly directed towards a person or a group. Our model sometimes fails to identify tweets that directly target by name; for example, "This is some high level shit. Someone needs to dumb it down for Trump voters". In most of the cases where our model fails to determine that a tweet is targeted, the target is from the "other" category, where the object is some event, situation, organization, or issue, for example "And here's another fucking breakdown" and "I'm sick of it all, April to August has been utter bullshit".

D OffensEval English Subtask-C Error Analysis
We analyzed target identification for OffensEval English Subtask-C on a validation set of 213 targeted tweets. Of these 213 data points, the target of 64 tweets was incorrectly predicted by our model. In this dataset we found that the targets of some tweets are incorrectly labeled; for example, "he should be ashamed of himself but he's not because he's #zionel" is targeted towards an individual but labeled as "Other" in the dataset, and "#arunjaitleystepdown he is most shameless #fm in history of india and audacity and shamelessness with which is lies in public is disgrace to post." is labeled as "Group"-targeted but is targeted towards an individual. Our model classifies both of these tweets correctly as targeted towards an "Individual". Our model may be biased towards some pronouns; for example, "Dollar for a phone. you all are fucking dumb." is classified as "Individual"-targeted, but its correct label is "Group"-targeted, possibly due to the presence of "you" in the sentence. Also, for the tweet "anyway this game sucks", the model predicts "Individual"-targeted, possibly because it cannot decode what "this" refers to in the context; here "this" refers to an event, a game.