LIMSI_UPV at SemEval-2020 Task 9: Recurrent Convolutional Neural Network for Code-mixed Sentiment Analysis

This paper describes the participation of LIMSI_UPV team in SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text. The proposed approach competed in SentiMix HindiEnglish subtask, that addresses the problem of predicting the sentiment of a given Hindi-English code-mixed tweet. We propose Recurrent Convolutional Neural Network that combines both the recurrent neural network and the convolutional network to better capture the semantics of the text, for code-mixed sentiment analysis. The proposed system obtained 0.69 (best run) in terms of F1 score on the given test data and achieved the 9th place (Codalab username: somban) in the SentiMix Hindi-English subtask.


Introduction
In this digital era, users express their personal thoughts and opinions regarding a wide range of topics on social media platforms such as blogs, micro-blogs (e.g., Twitter), and chats (e.g., WhatsApp and Facebook messages). Multilingual societies like India with a decent amount of internet penetration widely adopted such social media platforms. However, the regional language influences the proliferation of the Hindi-English Code-Mixed (CM) data. Sentiment analysis of these end-user data from social media is a crucial resource for commerce and governance. However, in contrast to the classical sentiment analysis methods, which were originally designed for dealing with well-written product reviews, CM texts from social media often contain misspellings (often intentional), badly cased words, letter substitutions, ambiguities, non standard abbreviations, improper use of grammar, etc.
CM poses several unseen difficulties to natural language processing (NLP) tasks such as word-level language identification, part-of-speech tagging, dependency parsing, machine translation and semantic processing. In the last few years, a number of workshops such as Linguistic Code-Switching Workshops 1 and shared tasks such as Mixed Script Information Retrieval (Banerjee et al., 2020) have been organized due to the emerging popularity of code-mixing. To promote research in this area, Task 9 of SemEval-2020 was devoted to CM sentiment analysis in Twitter. The goal of the task was to automatically classify the polarity of a given CM Twitter post into one of the three predefined categories: positive, negative and neutral. The CM languages are English-Hindi and English-Spanish; for a more detailed description of the task see (Patwa et al., 2020).
In this paper, we present a deep learning approach, using a Recurrent Convolutional Neural Network for the task of automatic CM sentiment classification of tweets.
The rest of the paper is structured as follows. Section 2 provides background in brief. Section 3 provides the system overview and Section 4 describes our approach in detail. In Section 5, we discuss the analysis and evaluation results for our system. We conclude our work in Section 6. This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http:// creativecommons.org/licenses/by/4.0/.
1 https://code-switching.github.io/2020/ 2 Background Sentiment classification is the task of detecting whether a textual item (e.g., a product review, a blog post, an editorial, etc.) expresses a POSITIVE or a NEGATIVE opinion in general or about a given entity, e.g., a product, a person, a political party, or a policy (Nakov et al., 2016). Classifying tweets according to sentiment has many applications in political science, social sciences, market research, and many others (Martínez-Cámara et al., 2014;Mejova et al., 2015). Although initially sentiment identification was focused on newswire text (Baccianella et al., 2010), later research turned towards social media (Rosenthal et al., 2015). Since 2013, a sentiment classification task on Twitter data have been organized in SemEval campaigns.
Most of the earlier approaches to this problem were based on hand crafted features and sentiment lexicons (Pak and Paroubek, 2010;Mohammad et al., 2013). These features were then used as input to classifiers (e.g., Support Vector Machines). However, such approaches required extensive domain knowledge, were laborious to define, and can lead to incomplete or over-specific features.
Recently, researchers pay their attention to sentiment polarity detection on CM data. However, a few research work have been carried out in particular Hindi-English CM data with different approaches: lexicon lookup (Sharma et al., 2015), sub-word with CNN-LSTM (Joshi et al., 2016), Siamese networks (Choudhary et al., 2018), dual Encoder Network with features (Lal et al., 2019). Lai et al. (2015) proposed Recurrent Convolutional Network for text classification which is a foundational task in many NLP applications. We followed this model in our task.

System overview
We are inspired by the model proposed in (Lai et al., 2015) particularly proposed for the text classification task. The proposed model takes sequence of CM words as input and provides sentiment polarity class as output. The recurrent structure of the proposed model captures the contextual information during the learning of the word representation, and the max-pooling layer identifies the key CM words. If T , s and θ denote a CM tweet made up of sequence of CM words (T = w 1 , w 2 , w 3 , . . . , w n ), the sentiment polarity class and parameters of the neural network respectively, p(s T, θ)) denotes the probability of the tweet T having sentiment polarity s, where s could be any one of sentiment polarity classes, i.e., s ∈{positive, negative, neutral}.
In this model, a CM word is represented by combining the word and its both side contexts. If c l (w i ) and c r (w i ) denote the left-side and right-side contexts of a CM word w i , a word is represented as a concatenation of the left-side context vector c l (w i ), the word embedding e(w i ) and the right-side context vector c r (w i ). x where, x i is the representation of the i th word (i.e., w i ) in T . Both the contexts (i.e., c l (w i ) and c r (w i )) are calculated as follows: Where, f : is a non-linear activation function; e(w i−1 ): the word embedding of word w i−1 ; W (l) : a matrix that transforms the hidden layer (context) into the next hidden layer; W (sl) : a matrix that is used to combine the semantic of the current word with the next word's left context.
In the forward scan, the recurrent structure of the model obtains all the c l of the CM tweet, whereas, it obtains all the c r in a backward scan of the CM tweet. After obtaining the x i for the word w i , a linear transformation together with the tanh activation function is applied to x i and the result is sent to the next layer: After calculating all of the representations of words, a max-pooling layer is applied. Hence, the pooling layer utilizes the output of the recurrent structure as the input. This layer attempts to find the most important latent semantic factors in the CM tweet: Finally, in the output layer, a softmax function is applied that provides the sentiment polarity probabilities: Figure 1: Recurrent Convolutional Neural Network (Lai et al., 2015) model The Figure 1 depicts the architecture of the proposed model. From Figure 1, we could see that c l (w n ) captures the left-side context, whereas, c l (w 1 ) captures the right-side context of a word w i .

Experimental setup
Data: For sentiment analysis on Hindi-English CM tweets, we used the dataset provided by the organizers of Task 9 at SemEval-2020. The training dataset consists of 14 thousand tweets. Whereas, the validation dataset as well as the test dataset contain 3 thousand tweets each. The details of the dataset are given in (Patwa et al., 2020). For this task, we did not use any external dataset.
Preprocessing: The CM Twitter data is not formal in nature. Also, often casing is not followed properly. As a result, it misleads the language processing task. Therefore, we converted the entire tweets into lower case. After observing the CM dataset, we found that the urls are not contributing any information. While writing, the writer of a tweet addresses a person by mentioning the person's name followed by @ or by twitter id which starts with @. Similarly, to specify a particular topic, the writer uses the topic name that starts with #. To capture the semantics and syntax of a tweet, we replaced the person and a topic being addressed with a MENTION and a TOPIC token respectively. The following steps were applied on each CM tweet: • Tokens were converted to lowercase.
• If a token is http or https, the following tokens are merged to identify as links and were deleted.
• If a token is @, the following token is merged with the first token (i.e., @) and the new token is replaced with a MENTION token. For example, the tokens @ and bomanirani are merged and replaced with a MENTION token.
• If a token is #, the following token is merged with the first token (i.e., #) and the new token is replaced with a TOPIC token. For example, the tokens # and LoveIsLove are merged and replaced with a TOPIC token.
• Emoji's with text were divided into two tokens. For example, he becomes he and .
• If a token contains more than one emoji, each emoji was considered as a token. For example, becomes and .
Embeddings: Following Collobert et al. (2011), a lot of authors argued that word embedding plays a vital role to improve natural language task performance. Hence, we experimented the use of word embeddings to improve the performance of our proposed models. Using the fastText (Bojanowski et al., 2017), we prepared two embedding models: Skip-gram and Cbow. After empirically evaluating the performance on validation set, the embeddings' dimensionality was set to 300 for all the embeddings. The embeddings are trained on training data using the parameters: lr=0.05, context window=5, epochs=5, minimal number of word occurences=5, dimensionality=300.
Experiment: We carried out two experiments with similar settings except different word embedding approaches: Skip-gram for SkipGRun, and Cbow for CbowRun.
Hyper-parameters: After evaluating the model performance on the validation data, the optimal values of the hyper-parameters were set. We used the following list of hyper-parameters: learning rate = 0.6, word embedding vector size = 300, hidden layers= 2, hidden layer size = 64, context vector size = 5, dropout rate = 0.1, optimizer = stochastic gradient descent, loss function= negative log likelihood, and batch size = 64.

Results and Analysis
As mentioned in the previous section, we carried out experiments with our proposed model using two settings. All the presented experiments are evaluated on the test data for the given task. The performance of the systems were evaluated using F1 averaged across the positive, negative and the neutral. The organizer baseline F1 scores for the validation and test data are 0.58 and 0.654 respectively. The details of the baseline are given in (Patwa et al., 2020). The obtained results with our submitted runs are given in Table 1. For SkipGRun, we achieved 0.6913 of F1 score with 0.6952 and 0.6893 of precision and recall respectively. The SkipGRun outperformed the CbowRun by around 0.40 in terms of F1 score. CbowRun outperformed the organizers' baseline by a slight margin, however, SkipGRun outperformed the baseline by around 0.4 in terms of averaged F score.
For SkipGRun, the confusion matrix and the performance of class-wise sentiment polarities are presented in Table 2 and Table 3, respectively. The SkipGRun is the best among the two submitted runs. From the confusion matrix (cf. Table 2), we can observe that the system is more successful in identifying the positive polarity class (F1 score 0.76) than the polarity classes: negative (F1 score 0.71) and neutral (F1 score 0.62). This run performed well on distinguishing a positive tweet from negative tweet and vice versa. Therefore, only 35 positive tweets were misclassified as negative and 24 negative tweets were misclassified as positive. However, it has a problem to separate positive and negative tweets from neutral. Hence, a lot of positive and negative tweets were misclassified as neutral. From Table 2, we could say that around 24% of positive tweets and 28% of negative tweets were identified as neutral. Similarly, a number of neutral tweets were misclassified as positive and negative. From Table 3, it is evident that the positive tweets were the easiest class to predict. The F1 scores were 0.76 (positive class), 0.71 (negative class) and 0.62 (neutral class).   Table 3: SkipGRun: Polarity class-wise performance Table 4 and Table 5 present the confusion matrix and the performance of class-wise sentiment polarities for CbowRun. Like SkipGRun, the CbowRun is also able to successfully identify the positive polarity class (F1 score 0.72) in comparison to others (cf. Table 4). For CbowRun, we observed similar characteristics of results like SkipGRun. Although the results obtained for negative tweets are almost similar with SkipGRun (F1 score 0.71) and CbowRun (F1 score 0.70), the difference in performance is notable for positive and neutral tweets. From   To get a deeper analysis of the results, we also performed the error analysis. We observed that the system could not identify the sentiment when Hindi lyrics are used in tweets. For example: RT MENTION Aankho ki hai ye khawaise ki chehre se teri na hate ... (id:36925, gold:positive, predicted:negative); RT MENTION Saare jahan se achha # Hindustan hamara Ham bulbulain hai iski yeh gulsitan hamara... (id:32707, gold:positive,predicted:neutral).
The system often failed to identify the sentiment, when a long tweet consists of a number of complete or incomplete sentence with mixed sentiments. The proposed system often failed to identify the tweets that have any punctuation or delimiter to separate the sentence clauses. For example: Jb Koi aapke liye kam krta h aapke liye 5sal din rat ek krta h aap pe vishvas krta h aapka vishvas jit ta h jis (id:31020, gold:neutral, predicted:positive) We observed that mostly the system misclassified the positive and negative tweets as neutral and vice versa. However, we observed that there are some tweets that may arise some arguments, such as MENTION Beautiful words and its one of my fav song (id:31662), MENTION Sir me from Bihar Love you You have Great job for Bjp continue We are with you (id:39755), etc. tagged as neutral and our system identified as positive. Identifying the polarity of an entity or entity's aspect in the tweets along with more training data could help to resolve these issues.

Conclusion
This paper describes the approach we proposed for SemEval-2020 Task 9: Sentiment Analysis for CM Social Media Text (SentiMix Hindi-English). In our approach, we pre-processed the CM tweets and proposed a Recurrent Convolutional Neural Network for the sentiment analysis of CM tweets. We submitted two runs and obtaining promising results: our best run obtained 0.691 of F1 averaged across the positives, negatives and the neutral. We observed that the proposed architecture occasionally strives to separate the positive and negative polarities from the neutral and vice versa.
For future work, we will explore the performance of our model with larger corpora against the testing set. Also, we would like to investigate other embedding choices such as BERT (Devlin et al., 2019). Moreover, due to the impact that irony and sarcasm have on sentiment analysis (Hernández Farıas and Rosso, 2016) it would be interesting to apply deep learning techniques to detect irony (Zhang et al., 2019) but in a code-mixed scenario.