IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment Classification Using Candidate Sentence Generation and Selection

Code-mixing is the phenomenon of using multiple languages in the same utterance. It is a frequently used pattern of communication on social media sites such as Facebook, Twitter, etc. Sentiment analysis of monolingual text is a well-studied task. Code-mixing adds to the challenge of analyzing the sentiment of text on various platforms such as social media, online gaming, forums, product reviews, etc. We present a candidate sentence generation and selection based approach on top of a Bi-LSTM based neural classifier to classify Hinglish code-mixed text into one of three sentiment classes: positive, negative, or neutral. The proposed candidate sentence generation and selection based approach shows an improvement in system performance as compared to the Bi-LSTM based neural classifier. The proposed method can be extended to other problems on code-mixed textual data, such as humor detection, intent classification, etc.


Introduction
Code-mixing is one of the most frequent styles of communication in multilingual communities, such as India. This pattern of communication on various platforms such as social media, online gaming, online product reviews, etc. makes it difficult to understand the sentiment of the text. Sentiment classification of code-mixed text is useful in scenarios such as socially or politically driven discussions, fake news propagation, etc. Some of the major challenges with text in a code-mixed language are:
• Ambiguity in language identification: is, me, to are some examples of words that are ambiguous to classify as English or Hindi without proper knowledge of the context.
• Spelling variations: e.g., jaldi, jldi, jldiii, ... are some variations of the word meaning hurry in English.
• Grammatically incorrect sentences: e.g., a sentence missing a question mark (?), apart from other modifications necessary to make its structure correct.
• Missing context: e.g., Note kr lijiye.. Bandi chal rahi h ;) is a code-mixed sentence in which demonetisation (notebandi) is the hidden context.
With the increasing popularity of code-mixing on social media platforms, interest in studying the various dynamics of code-mixing is growing rapidly. Multiple works on language identification (Barman et al., 2014; Das and Gambäck, 2014), POS tagging (Vyas et al., 2014; Ghosh et al., 2016), named entity recognition (Singh et al., 2018a; Singh et al., 2018b), etc. show the challenges and the opportunities with code-mixed data. Pang et al. (2008) present a survey of approaches to understanding opinions and sentiments on various platforms. Dos Santos and Gatti (2014) perform sentiment analysis of short text messages on two corpora from different domains and present their findings. Kouloumpis et al. (2011) present multiple experiments to understand the sentiment of Twitter messages using linguistic features and lexical resources. Sentiment analysis of code-mixed tweets using a sub-word level representation (Prabhu and Verma, 2016) in an LSTM can improve the performance of the system.
Contributions: We present a candidate sentence generation and selection based procedure on top of a Bi-LSTM neural classifier. We observe an increase in system performance using the proposed architecture as compared to the Bi-LSTM classifier.
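The spelling-variation challenge above can be illustrated with a simple character-repetition heuristic. This is a sketch for illustration only, not a component of the system described in this paper:

```python
import re

def normalize_spelling(token: str) -> str:
    """Collapse runs of three or more repeated characters to one,
    a common heuristic for Romanized Hindi spelling variations."""
    return re.sub(r"(.)\1{2,}", r"\1", token)

# "jldiii" reduces to "jldi"; tokens without long repeats are unchanged.
print(normalize_spelling("jldiii"))  # jldi
print(normalize_spelling("jaldi"))   # jaldi
```

Such normalization reduces sparsity in the vocabulary but cannot, by itself, unify variants like jaldi and jldi that differ by dropped vowels.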

Dataset
We use the dataset (Patwa et al., 2020) provided by the task organizers for building our system (Codalab username: vivek IITGN). Each sentence in the dataset has a sentiment label: positive, negative, or neutral. Table 1 shows the distribution of the sentences in the train, validation, and test datasets for each class. We have 15131, 3000, and 3000 sentences in the train, validation, and test sets, respectively.

On manual inspection of the dataset, we observe ambiguity in the annotation of the sentences. To examine this further, we extract the top 20 most frequently used words in the dataset. We remove the English stopwords, and we set a threshold of 4 characters on the length of the tokens to filter out the Romanized Hindi stopwords. Table 2 shows the percentage overlap of the 20 most frequent words of length more than four characters in the train, validation, and test sets. The high percentage overlap of the most frequent neutral words with the positive and negative words also indicates the presence of ambiguity as a challenge in the annotation. Ambiguity in the label of a sentence is one of the major challenges for understanding its sentiment. Figure 1 shows some example sentences in the training set with ambiguous sentiment labels. There could be multiple reasons for the ambiguity in the annotation of the sentences, such as hidden sarcasm, targeting of an individual or institution, unclear intent, etc. This leads to human bias due to the annotator's perception of the event or the individual in the sentence.

To preprocess the dataset, we remove hyperlinks, mentions, hashtags, emoticons, and special characters from the sentences, and we lowercase the sentences. To identify and remove the emoticons from the sentences, we use the emoji sentiment dataset1.
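The preprocessing steps described above can be sketched as follows. The exact regular expressions are assumptions, and the emoticon-removal step (based on the emoji sentiment dataset) is omitted for brevity:

```python
import re

def preprocess(sentence: str) -> str:
    """Strip hyperlinks, @-mentions, hashtags, and special
    characters, then collapse whitespace and lowercase."""
    sentence = re.sub(r"https?://\S+", " ", sentence)    # hyperlinks
    sentence = re.sub(r"[@#]\w+", " ", sentence)         # mentions and hashtags
    sentence = re.sub(r"[^a-zA-Z0-9\s]", " ", sentence)  # special characters
    return re.sub(r"\s+", " ", sentence).strip().lower()

print(preprocess("@user Note kr lijiye.. #demonetisation https://t.co/abc"))
# note kr lijiye
```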

Experiments
[Figure 1: Example sentences from the training set with ambiguous sentiment labels.
CODE-MIXED SENTENCE: Twitter k baghair apna roza mumkin nahi hota ? Apna chutiyaap dusron per thopna band karo Bhai ! https://t.co/APKD4G8lh0 ORIGINAL LABEL: Positive
CODE-MIXED SENTENCE: @JDeepDhillonz Ha ha ha isko issi baat ka darr the tabhi Congi se alliance ke peechey pada hua tha ! ORIGINAL LABEL: Negative
CODE-MIXED SENTENCE: @Shaan pathan 14 @DwivediAnukriti Ikk toh sarkar job ni de rhi or upper se apne india ke log kaam karna nahi chahat . . . https://t.co/zfkm4obLd6 ORIGINAL LABEL: Neutral]

Training code-mixed embeddings is challenging due to the scarcity of large-scale code-mixed corpora. We use GloVe embeddings (Pennington et al., 2014) for the English words, and we train embeddings for the Romanized Hindi words on the code-mixed sentences of the PHINC dataset (Srivastava and Singh, 2020). Initially, we train the system using a Bi-LSTM based neural architecture: an embedding layer followed by a Bi-LSTM layer, then two dense layers, and finally a softmax prediction over the three sentiment classes. For prediction on the test set, we pre-filter the sentences based on a list of abusive words. If a sentence contains any word from this list, we label that sentence as negative. In the pre-filtering process, we identify 123 sentences in the test set containing one or more of the abusive words from the list. After the pre-filtering step, we generate 15 candidate sentences for each of the remaining test instances using the Candidate Sentence Generation (CSG) procedure. We then select the best sentiment prediction for the sentence using the Candidate Sentence Selection (CSS) procedure. Algorithm 1 shows the CSG procedure, and Algorithm 2 shows the CSS procedure. Figure 2 shows the flow diagram of the proposed approach. In the CSG procedure, we try to confuse the model with nearly similar sentences carrying additional phrases.
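The abusive-word pre-filtering step described above can be sketched as below; the word list shown is a hypothetical stand-in for the actual list used:

```python
# Hypothetical stand-in for the abusive-word list used in pre-filtering.
ABUSIVE_WORDS = {"chutiyaap", "kamina"}

def prefilter_label(sentence: str):
    """Label a sentence negative if it contains any abusive word;
    otherwise return None to defer to the neural classifier."""
    tokens = sentence.lower().split()
    return "negative" if any(t in ABUSIVE_WORDS for t in tokens) else None
```

Sentences caught by this filter bypass the Bi-LSTM classifier and the CSG/CSS procedures entirely.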
We generate five sentences similar to the original code-mixed sentence for each of the three buckets (positive, negative, and neutral). We detect the degree of confusion in the sentiment prediction using the CSS procedure, keeping track of the degree of confusion for the sentences in each bucket. If the degree of confusion is significantly high, we change the model's previous prediction using the rules discussed in Algorithm 2.
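The interplay of the CSG and CSS procedures can be sketched together as follows. The phrase buckets, the confusion threshold, and the flip rule below are illustrative assumptions, not the actual contents of Table 3 or Algorithm 2:

```python
import random
from collections import Counter

# Hypothetical stand-ins for the phrase set P (Table 3 lists the actual set).
PHRASES = {
    "positive": ["bahut accha", "very nice", "great", "khushi", "wonderful"],
    "negative": ["bahut bura", "very bad", "terrible", "dukh", "awful"],
    "neutral":  ["aaj", "today", "waise", "filhaal", "by the way"],
}

def generate_candidates(sentence: str, per_bucket: int = 5):
    """CSG sketch: five near-duplicates per bucket (15 candidates
    total), each formed by appending a phrase to the sentence."""
    return {bucket: [f"{sentence} {p}" for p in random.sample(phrases, per_bucket)]
            for bucket, phrases in PHRASES.items()}

def select_label(original_label: str, candidate_labels: dict, threshold: int = 8):
    """CSS sketch: count candidate predictions that disagree with the
    original prediction (the 'degree of confusion'); if disagreement
    is high, switch to the majority candidate label."""
    flat = [lbl for labels in candidate_labels.values() for lbl in labels]
    confusion = sum(1 for lbl in flat if lbl != original_label)
    return Counter(flat).most_common(1)[0][0] if confusion >= threshold else original_label
```

For example, a sentence originally predicted neutral whose candidates are predominantly predicted positive would be flipped to positive under this rule.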

Results and Analysis
To evaluate the system performance, we use accuracy, precision, recall, and F-score as the evaluation metrics. We use the prediction by the Bi-LSTM classifier as the baseline. Table 4 shows the distribution of the successfully and unsuccessfully modified sentences for the final prediction by the Bi-LSTM + CSG + CSS model. We observe relatively more successful modifications for the neutral sentences to and from the positive and negative classes. This result can be attributed to the high overlap of the most frequent words in the neutral sentences with both the other classes (as discussed in Section 2). Table 5 shows the performance of the two models on the test dataset. We observe an increase in the system performance with the use of the CSG and CSS procedures on top of the Bi-LSTM classifier. Table 6 shows the system performance on the test dataset with the classwise F-score as the evaluation metric.

[Algorithm 1: Candidate Sentence Generation (CSG) procedure. 1: procedure CSG(CMsent) 2: Load the set of positive and negative phrases P (Table 3 shows the set P). ...]
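The classwise F-score reported in Table 6 is the harmonic mean of per-class precision and recall. A minimal sketch of its computation (the example labels are illustrative, not drawn from the dataset):

```python
def classwise_f_score(gold, pred, label):
    """Per-class F-score: harmonic mean of precision and recall
    computed for one sentiment class treated as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "negative"]
print(round(classwise_f_score(gold, pred, "positive"), 2))  # 0.67
```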