Gundapusunil at SemEval-2020 Task 9: Syntactic Semantic LSTM Architecture for SENTIment Analysis of Code-MIXed Data

The phenomenon of mixing the vocabulary and syntax of multiple languages within the same utterance is called Code-Mixing. It is most evident in multilingual societies. In this paper, we describe a system developed for SemEval 2020 Task 9 on Sentiment Analysis of Hindi-English code-mixed social media text. Our system first generates two types of embeddings for the social media text: character level embeddings, which encode character level information and handle out-of-vocabulary entries, and FastText word embeddings, which capture morphology and semantics. These two embeddings are passed to an LSTM network, and the system outperformed the baseline model.


Introduction
Code-Mixing is a phenomenon which is evident in multilingual societies (Shana Poplack et al., 2003). It reflects the simultaneous use of the distinct grammatical systems and vocabularies of several languages in a single utterance or conversation. This style of communication is now widespread on popular social media platforms like Twitter, Facebook, and Instagram in the form of posts, comments, replies, and especially chats. It is common in multilingual societies like India, Canada, Ireland, South Africa, Switzerland, and many others.
India has officially recognized 22 regional languages. In multilingual societies like India, most social media users integrate a well-known language, like English, with their native languages. The 560 million Internet users in India exchange information by mixing their regional languages with a prominent language like English, which produces a huge amount of code-mixed social media text. One such trending combination is the mixing of Hindi and English, which results in Hinglish (Hi-En) code-mixed data. Consider the following example sentence, which illustrates the code-mixing phenomenon addressed in this paper: "Congratulations/Eng Sir/Eng Ji/Hin Dobara/Hin PM/Eng banee/Hin ki/Hin hardik/Hin subhkamnaye/Hin aapko/Hin ./O". (Translation into English: Congratulations sir, best wishes on becoming Prime Minister again.) The language tags /Hin, /Eng, and /O that follow the words correspond to Hindi, English and Other respectively.
Sentiment analysis of code-mixed data has recently become a prominent research area in NLP. Identifying the sentiment in code-mixed data is hard because it poses the following challenges: (i) Romanized code-mixed data is noisy and ambiguous in nature. (ii) The accessible datasets are too small to tune neural networks well. (iii) Non-standard spellings occur frequently (such as pyaaarr, goooood). (iv) Phrases and words are contracted (cmng for coming, IDK for I don't know). (v) Spellings vary: a single word pyaar (love) can be written as "piyar", "pyaarrrr", "peyar", "pyar", or "piyaar", etc. To handle these challenges, we postulate that FastText embeddings enrich the word vectors with sub-word information and that character level embeddings should help deep learning models handle unknown Hindi words.
In this paper, we propose models to predict the sentiment label of a given code-mixed tweet/text. The sentiment labels are positive, negative, or neutral, and the code-mixed languages selected are Hindi and English. The task is hosted on the CodaLab website, and our CodaLab username is gundapusunil. All our models are trained using only the training dataset provided by the SentiMix organizers. We started our experiments with traditional machine learning models like Logistic Regression and Support Vector Machines, but the f1-score on the development set stayed below 0.63. We then moved to more complex models based on Long Short Term Memory (LSTM) networks with different types of word embeddings, with which we were able to outperform the baseline model.
The paper is organized as follows. Section 1 introduces code-mixing and its challenges. Section 2 reviews related work on code-mixed data. Section 3 describes the SentiMix dataset. Section 4 discusses the pre-processing steps and compares machine learning and deep learning approaches with the baseline model. The results are reported in Section 5, and Section 6 concludes the paper.

Related Work
Several studies have been conducted in the areas of sentiment analysis and code-mixed data. One of the earliest studies on code-mixed data, by Gold (1967), addressed language identification and stated that the structure of a language is acquired by learning from the given text and an informant. Braj B. Kachru (1976) described the structure of multilingual languages and language dependency in the linguistic convergence of code-mixing from an Indian perspective. SentiWordNet for the English language, introduced by Esuli and Sebastiani (2006), became the primary source for sense based lexical analysis and opinion classification.
Later, researchers extended the work on machine learning and sentiment analysis methods to Indian languages. Siersdorfer, Chelaru et al. (2010) used SVM and Naïve Bayes classifiers to label millions of comments for sentiment polarity. R. Sharma and P. Bhattacharyya (2014) developed a lexicon-based sentiment analyzer for product reviews, and the same subjective lexicon-based model has been extended to the Punjabi language. For Malayalam movie reviews, D. S. Nair, J. P. Jayan et al. (2014, 2015) initially proposed a rule based system for sentiment analysis and later improved it with machine learning algorithms. The MIKE 2015 sentiment analysis task for Hindi and Bengali used a Multinomial Naïve Bayes classifier, with the features of the vector space constrained by filtering words based on WordNet.
In recent years, deep neural networks have brought a large improvement in sentiment analysis of English as well as Hinglish. M.G. Jhanwar, A. Das (2018) proposed an ensemble of a character-trigram based LSTM model and an n-gram based Multinomial Naïve Bayes model to classify the sentiment of Hinglish code-mixed data. Shalini K, Barathi Ganesh HB et al. (2018) examined the performance of distributed representation methods in sentiment analysis and reported comparisons among different machine learning and deep learning techniques. Other attempts include using sub-word level compositions with LSTMs to capture sentiment at the morpheme level (Joshi et al., 2016). We approached SemEval 2020 Task 9 (Patwa et al., 2020) with various classification and deep learning models in order to analyze the results and how such models contribute to advances in this task.

Dataset
In this paper, we used the dataset provided by Task 9: SentiMix in SemEval 2020. The corpus contains a total of 20000 tweets and is sub-divided into three sets (train, validation and test). Each set except the test set contains code-mixed tweets along with their corresponding sentiment labels. The code-mixed tweets are tokenized, and the tokens of each tweet are separated by a new line. Each token is manually annotated with a language identification tag: Hindi (Hin), English (Eng), or Other (O).

Split   Positive   Negative   Neutral   Total
Train   4634       4102       5264      14000
Valid   982        890        1128      3000
Test    1000       900        1100      3000
Total                                   20000

Table 1: Dataset Statistics.

Table 1 shows the distribution of sentiment classes in the SentiMix dataset. Table 2 shows some examples of code-mixed tweets from the SentiMix dataset. The first column contains the Hinglish tweet, the second column contains its English translation, and the third column contains the sentiment label.

Code-Mixed Tweet                       Approximate English Translation                           Sentiment Label
All the best Team India Jeet ke aana   Team India all the best, come back with win.              Positive
Aap bhi aisa drama kr do kam se km     You also do this drama and at least.                      Negative
Aisa PM naa hua hai aur naa hee hoga   Neither there has been a PM like him, nor there will be   Positive

Table 2: Examples of code-mixed tweets from the SentiMix dataset.

System Architecture
In this section, we present our models, which are trained and validated on the SentiMix dataset described in the previous section. We compare our approach with machine learning and neural network based baselines. The full code of the system architecture can be found on GitHub.

Preprocessing of Code-Mixed tweets
Our code-mixed data contains excessive noise in the form of punctuation, Uniform Resource Locators (URLs), a few Hindi words in Devanagari script, Twitter mentions, hashtags, stop words, etc. In the preprocessing step, we reduce the noise in the data by removing or normalizing the unnecessary tokens. Figure 1 explains the code-mixed dataset pre-processing pipeline. The input to the pipeline is a tokenized tweet and the output is a cleaned tweet.
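Since Figure 1 is not reproduced here, the following is only a minimal sketch of such a cleaning pipeline, not the exact steps of our system. The function name clean_tweet, the regular expressions, the order of the steps, and the rule of shortening repeated characters to two are all assumptions made for illustration; stop word removal is left out.

```python
import re

# All patterns below are illustrative assumptions, not the paper's exact rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
DEVANAGARI_RE = re.compile(r"[\u0900-\u097F]+")
ELONGATED_RE = re.compile(r"(.)\1{2,}")      # three or more repeats of one character
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(tokens):
    """Take a tokenized tweet and return a list of cleaned tokens."""
    text = " ".join(tokens).lower()
    text = URL_RE.sub(" ", text)             # drop URLs
    text = MENTION_RE.sub(" ", text)         # drop Twitter mentions
    text = text.replace("#", " ")            # keep the hashtag word, drop the symbol
    text = DEVANAGARI_RE.sub(" ", text)      # drop Devanagari script words
    text = ELONGATED_RE.sub(r"\1\1", text)   # goooood -> good, pyaaarr -> pyaarr
    text = NON_ALNUM_RE.sub(" ", text)       # drop punctuation and other symbols
    return text.split()


cleaned = clean_tweet(["Goooood", "morning", "@user", "https://t.co/abc", "#India", "pyaaarr", "!!"])
```

Shortening repeats to two characters rather than one keeps legitimate double letters (as in "good" or "pyaar") intact, at the cost of leaving some variants such as "pyaarr" unmerged.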

Classical Supervised Machine Learning Algorithms
To design the best system for sentiment analysis of code-mixed data, we begin our experiments with traditional machine learning algorithms: Support Vector Machines (SVM), a One-vs-rest classifier with Logistic Regression (OvRLR), a Random Forest Classifier (RFC) and a Multilayer Perceptron (MLP). The input to these methods is a single d dimensional feature vector for each code-mixed tweet.
We analyze the results of the above classical algorithms with two types of vectors: (i) word level term frequency-inverse document frequency (tf-idf) vectors and (ii) GloVe word embeddings. For the tokens in a code-mixed tweet, a feature vector is created by averaging their d dimensional GloVe embeddings; we also experimented with tf-idf weighted averaging. Each code-mixed tweet vector is constructed with one of these two averaging schemes. Empirically, we found that standard averaging of GloVe and tf-idf gave better results than normal tf-idf weighted averaging.
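The two averaging schemes can be sketched as follows. This is an illustration under stated assumptions rather than our exact implementation: the function name tweet_vector, the toy two-dimensional embeddings, the choice to skip tokens without an embedding, and the normalization by the sum of the weights are all hypothetical.

```python
def tweet_vector(tokens, embeddings, dim, weights=None):
    """Average the embeddings of the known tokens in a tweet.

    With weights=None this is a plain average. With a token -> tf-idf weight
    mapping it becomes a weighted average normalized by the sum of the weights.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in embeddings:
            continue                      # assumption: unknown tokens are skipped
        w = 1.0 if weights is None else weights.get(token, 0.0)
        weight_sum += w
        for i, value in enumerate(embeddings[token]):
            total[i] += w * value
    if weight_sum == 0.0:
        return [0.0] * dim                # no usable token: fall back to a zero vector
    return [value / weight_sum for value in total]


# Toy 2-dimensional embeddings and tf-idf weights, invented for the example.
toy_embeddings = {"good": [1.0, 0.0], "pyaar": [0.0, 1.0]}
toy_tfidf = {"good": 3.0, "pyaar": 1.0}

plain = tweet_vector(["good", "pyaar", "xyz"], toy_embeddings, dim=2)
weighted = tweet_vector(["good", "pyaar", "xyz"], toy_embeddings, dim=2, weights=toy_tfidf)
```

In the plain average both known tokens contribute equally, while the weighted version pulls the tweet vector toward the token with the higher tf-idf weight.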

Deep Neural Networks
In this subsection, we describe the character and word embedding based deep neural network called "Syntactic and Semantic LSTM (SS-LSTM)" that gave better predictions on our corpus. Initially, we tried Word2Vec (T. Mikolov et al., 2013), GloVe (J. Pennington et al., 2014), FastText (Bojanowski et al., 2016), and character embeddings for each word in the input code-mixed tweet. We trained a simple LSTM model on each of these embeddings to test their effectiveness for sentiment classification. FastText and character level embeddings gave slightly better results than the other embeddings. Based on these results, we designed the SS-LSTM architecture described below.

Character Level Embeddings
Character level embeddings use a one-dimensional convolutional neural network (1D-CNN) to find the numeric representation of words by looking at their character-level compositions. The 1D-CNN (Yoon Kim, 2014) is capable of handling unseen words and of extracting syntactic information from segments of the input. The character embedding step converts the tweet tokens into a d × T matrix, where d is the dimension of each word vector and T is the number of tokens in the code-mixed tweet.
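To make the idea concrete, the sketch below builds a word vector from characters with a single convolution followed by max-over-time pooling, in plain Python. It is a toy illustration and not our trained model: the sizes (8-dimensional character vectors, 16 filters of width 3), the ReLU activation, the zero vector for unknown characters, and the randomly initialised parameters are all assumptions.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_DIM, NUM_FILTERS, WIDTH = 8, 16, 3    # illustrative sizes, not from the paper

_rng = random.Random(0)
# Randomly initialised parameters stand in for trained ones.
CHAR_TABLE = {ch: [_rng.uniform(-1, 1) for _ in range(CHAR_DIM)] for ch in ALPHABET}
UNKNOWN = [0.0] * CHAR_DIM
FILTERS = [[[_rng.uniform(-1, 1) for _ in range(CHAR_DIM)] for _ in range(WIDTH)]
           for _ in range(NUM_FILTERS)]


def word_vector(word):
    """Embed the characters, slide each filter over them, then max-pool over positions."""
    chars = [CHAR_TABLE.get(ch, UNKNOWN) for ch in word.lower()]
    while len(chars) < WIDTH:              # pad short words so one full window exists
        chars.append(UNKNOWN)
    vector = []
    for filt in FILTERS:
        responses = []
        for start in range(len(chars) - WIDTH + 1):
            window = chars[start:start + WIDTH]
            score = sum(w * x
                        for filt_row, char_row in zip(filt, window)
                        for w, x in zip(filt_row, char_row))
            responses.append(max(score, 0.0))   # ReLU
        vector.append(max(responses))           # max-over-time pooling
    return vector


def tweet_matrix(tokens):
    """Return one vector per token: T rows of length d (the transpose of d x T)."""
    return [word_vector(token) for token in tokens]


matrix = tweet_matrix(["pyaar", "pyaarrrr", "ji"])
```

Because pooling takes the maximum over all character windows, a spelling variant such as "pyaarrrr" still contains every window of "pyaar" and therefore keeps its strongest filter responses, which is one reason such a model copes with elongated and unseen words.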

Word Level Embeddings
For word level embeddings we used Facebook's FastText. The main advantage of FastText embeddings is that they capture hidden knowledge about a language, such as word analogies and semantics. FastText also looks into the internal structure of words and enriches word vectors with subword information, which can be very useful for morphologically rich languages like Hindi.
We model the SentiMix task as a multi-class classification problem: given a code-mixed tweet, the model outputs the probabilities of it belonging to each of three output classes, Positive, Negative, and Neutral. The proposed system architecture (SS-LSTM) is shown in Figure 2. The input tweet is fed into the 1D-CNN and FastText. These two embedding models generate two d × T matrices, one from the 1D-CNN and the other from FastText. Here the dimension of each matrix is 256 × T. The two embedding matrices are passed to two LSTM layers: one LSTM layer uses the character level embeddings, whereas the other uses the FastText word embeddings. These two layers learn syntactic and semantic feature representations and encode sequential patterns in the tweet. Each LSTM layer outputs a vector of dimension 128 × 1.
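The subword information that FastText adds comes from character n-grams of each word. The sketch below only illustrates that idea and is not FastText itself: the helper names char_ngrams and ngram_overlap are hypothetical, and the n-gram range of 3 to 6 with "<" and ">" as word boundary markers follows the setting described by Bojanowski et al. (2016).

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word with boundary markers added."""
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.add(marked[start:start + n])
    return grams


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


variant_overlap = ngram_overlap("pyaar", "pyaarrrr")    # spelling variants of one word
unrelated_overlap = ngram_overlap("pyaar", "coming")    # unrelated words
```

Spelling variants share many n-grams, so their vectors are built from largely the same pieces, whereas unrelated words share none here.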

Model Architecture
The output of the character embedding LSTM layer is $C \in \mathbb{R}^{128 \times 1}$ and the output of the FastText word embedding LSTM layer is $W \in \mathbb{R}^{128 \times 1}$. These two feature representations are concatenated row-wise into the output vector $O \in \mathbb{R}^{256 \times 1}$. The output vector $O$ is passed to a fully connected network with one hidden layer, which models the interactions between these features and outputs a probability for each sentiment class. We used the Keras neural network library to implement this model.
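The fusion step can be traced with a small forward pass in plain Python. This is a sketch of the shapes only, not our Keras implementation: the hidden layer size of 64, the ReLU activation, and the random weights and inputs are assumptions, since the text above fixes only the 128, 128 and 256 dimensions and the three output classes.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights stand in for trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def classify(c, w, hidden_layer, output_layer):
    """Concatenate both LSTM outputs and map them to class probabilities."""
    o = c + w                                                # 128 + 128 = 256 features
    hidden = [max(h, 0.0) for h in dense(o, *hidden_layer)]  # ReLU (assumed)
    return softmax(dense(hidden, *output_layer))


rng = random.Random(0)
HIDDEN_UNITS = 64                             # assumed; not stated in the paper
hidden_layer = init_layer(rng, 256, HIDDEN_UNITS)
output_layer = init_layer(rng, HIDDEN_UNITS, 3)

c = [rng.uniform(-1, 1) for _ in range(128)]  # stand-in for the character LSTM output C
w = [rng.uniform(-1, 1) for _ in range(128)]  # stand-in for the FastText LSTM output W
probs = classify(c, w, hidden_layer, output_layer)  # Positive, Negative, Neutral
```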

Results
A summary of the results of the various techniques on the SentiMix test dataset is presented in Table 3. SS-LSTM gave the best f1-score for each sentiment class as well as the best average f1-score. Our results thus indicate that combining syntactic and semantic representations in SS-LSTM outperforms the individual LSTM-Character and LSTM-FastText embedding models.
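For reference, per-class and averaged f1-scores can be computed as in the sketch below. The function name f1_report and the toy labels are hypothetical, and the unweighted (macro) mean over the three classes is an assumption, since the kind of average is not specified above.

```python
def f1_report(y_true, y_pred, labels=("positive", "negative", "neutral")):
    """Return the f1-score of each class and their unweighted (macro) average."""
    report = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        report[label] = 2 * precision * recall / denom if denom else 0.0
    report["macro_avg"] = sum(report[label] for label in labels) / len(labels)
    return report


# Toy gold labels and predictions, invented for the example.
gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
report = f1_report(gold, pred)
```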

Machine learning models with tf-idf feature representations gave results close to the baseline. We observe that the tf-idf weighted average of GloVe performed better than the simple average of vectors. We used grid search (Srivastava et al., 2014) to find better hyper-parameters such as the number of LSTM layers, the learning rate, and the number of epochs. We used a GPU for training the deep learning models.

Conclusion
In this paper, we experimented on the code-mixed dataset with various machine learning and deep learning models. We see that the LSTM models performed far better than traditional ML methods. In the first phase of the SentiMix competition (development set), we were able to achieve a score of 0.6357. In the second phase (test dataset), our best score was only 0.6758. After the competition, we attained an f1-score of 0.6789 by changing a few parameters such as the learning rate and the number of LSTMs. So far, we have handled problems like unseen words, spelling variations, dataset imbalance, emojis, short forms of words, etc.
In future work, we plan to focus on issues like the free ordering of words in sentence constructions, short sentences with unclear semantic structure, etc. We would also like to explore more deep neural network architectures that can capture sentiment in code-mixed data.