Zyy1510 Team at SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text with Sub-word Level Representations

This paper reports the zyy1510 team's work in the International Workshop on Semantic Evaluation (SemEval-2020) shared task on sentiment analysis for code-mixed (Hindi-English, English-Spanish) social media text. The task is to determine the polarity of a text by assigning it one of three labels: positive, negative, or neutral. To this end, we propose an ensemble of a word n-gram-based Multinomial Naive Bayes (MNB) model and an LSTM over sub-word level representations (Sub-word LSTM) to identify the sentiment of Hindi-English and English-Spanish code-mixed data. The ensemble combines the rich sequential patterns and the intermediate convolutional features captured by the LSTM model with the keyword polarity captured by the MNB model to obtain the final sentiment score. We tested our system on the Hindi-English and English-Spanish code-mixed social media data sets released for the task. Our model achieves an F1 score of 0.647 on the Hindi-English task and 0.682 on the English-Spanish task.


Introduction
Mixing languages, also known as code-mixing, is the norm in multilingual societies. Many multilingual people code-mix by using English-based speech types and inserting English into their main language (Patwa et al., 2020). They share their views on social media by combining local languages with English, which creates large amounts of code-mixed text such as Hindi-English and English-Spanish (Ramanarayanan and Suendermann-Oeft, 2017). Today, many organizations rely heavily on sentiment analysis of social media text to assess product performance and to take user feedback into account when upgrading to newer versions (Jhanwar and Das, 2018). Governments can likewise gauge people's emotions and learn their opinions on new policies.
Code-mixing (Vyas et al., 2014) is a relatively new field compared to the general field of sentiment analysis (Zhao et al., 2010). Code-mixed social media texts generally take three forms: i) Mixed script: a combination of native and Roman scripts; ii) Code-mixed script: native and English languages both written in Roman script; iii) Native script: local languages written in their native scripts.
This type of text differs considerably from traditional English text and needs to be handled differently (Prabhu et al., 2016). Beyond the challenges of general sentiment analysis, code-mixed text poses additional difficulties for natural language processing (NLP) tasks. Traditional NLP systems rely heavily on monolingual resources, which limits their ability to handle problems such as English-based speech input and word-level code-mixing (Patwa et al., 2020). Furthermore, there are several spelling variations when a phonetic language is written in Roman script (Jhanwar and Das, 2018). To address this, we preprocess the text and normalize irregular words: we remove noise, translate abbreviations into the appropriate regular words, and, when transliterating non-Roman-script code-mixed data into Roman script, apply a clustering algorithm to select the most suitable of the resulting variants.
In this work, we introduce a sentiment analysis (SA) system: an ensemble of a word n-gram-based Multinomial Naive Bayes (MNB) model and an LSTM over sub-word level representations (Sub-word LSTM) (Prabhu et al., 2016). The system was developed for the SemEval-2020 shared task on sentiment analysis for code-mixed (Hindi-English, English-Spanish) social media text, which aims to detect the sentiment polarity (Wang et al., 2018) of code-mixed text written in Hindi or Spanish mixed with English. The traditional MNB method captures low-level word groups of keywords to compensate for grammatical inconsistencies, while the Sub-word LSTM model encodes the rich sequential patterns in sparse and unstable text (Norouzi and Fleet, 2013). Our model achieves F1 scores of 0.647 on the Hindi-English task and 0.682 on the English-Spanish task. The implementation of our system is available on GitHub.

Related Work
Recently, research on emotion and mood analysis in text has become increasingly common, in part because of the availability of new sources of subjective information on the web. Ortony et al. (1987) is one of the earliest works in the area of sentiment classification; it is concerned with the classification and segregation of terms with emotional connotations. Solorio et al. (2014) and Ghosh et al. (2017) used machine learning methods to automatically extract sentiment (positive or negative) from Facebook posts. Shalini et al. (2018) addressed the performance of distributed representation methods for Bengali-English and Hindi-English sentiment analysis. Kannan et al. (2016) trained a Multinomial Naive Bayes classifier on n-gram and SentiWordNet features for Hindi-English and Bengali-English code-mixed data; they used a small SentiWordNet for English and Bengali but none for Hindi. Jhanwar and Das (2018) introduced an ensemble of a character tri-gram-based LSTM model and a word n-gram-based Multinomial Naive Bayes (MNB) model to classify the sentiment of Hindi-English code-mixed data. Ansari and Govilkar (2018) designed a system that automatically classifies transliterated (Romanized) Hindi and Marathi documents using supervised learning methods, namely k-nearest neighbors (KNN), Naive Bayes, and Support Vector Machine (SVM), together with ontology-based classification. Lal et al. (2019) presented a hybrid architecture for sentiment analysis of English-Hindi code-mixed data. Mandal et al. (2018) prepared gold-standard Bengali-English code-mixed data with language and polarity tags for sentiment analysis, and developed a hybrid system combining rule-based and supervised models for both languages.

Dataset
The International Workshop on Semantic Evaluation 2020 (SemEval-2020) recently conducted a shared task on sentiment analysis of transliterated social media text. The organizers provided code-mixed Hindi-English and Spanish-English data. Each training and validation tweet was labeled with one of three labels: positive, negative, or neutral. The test set was unlabeled. The data split details are shown in Table 1. Code-mixed data poses many inherent challenges, as described previously. Examples include abbreviations ('please' written as 'plz') and non-standard spellings (such as 'suppeerrrr' or 'timeeeeeee'). There are also several spelling variations when a phonetic language is written in Roman script, as illustrated in Table 2.

Word         Meaning   Variations
(Bahut)      more      Bahut, bohot, bohut
(mubaarak)   wishes    Mobarak, mubarak, mubark
(pyaar)      love      Pyaar, peyar, pyara, ..., piyaar, pyar
Table 2: Spelling variations of romanized words

Before training, we preprocessed the raw code-mixed data to reduce noise. We replaced each link with the token URL and removed punctuation, stop-words, and uninformative emoji. We also found that certain characters are repeated many times within a word. For instance, lol (laughing out loud) can be written as loool, looool, or looooool; we used a clustering algorithm to normalize such forms to lool in the pre-processing stage. We split hashtags into their component words (HappyBirthdaySonakshiSinha becomes Happy Birthday Sonakshi Sinha). In addition, we transliterated non-Roman-script code-mixed data (for example, the native-script word meaning 'wishes') into Roman script (mubarak), using the k-means clustering algorithm to select the most suitable variant (Fard et al., 2018). All text is converted to lower case and then fed to the classifier.
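The surface-level cleaning steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in our system: it uses plain regular expressions in place of the clustering step, it omits stop-word removal and transliteration, and the function name `preprocess` and all pattern names are hypothetical.

```python
import re

# All names below are illustrative assumptions, not part of the released system.
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
HASHTAG_RE = re.compile(r"#(\w+)")              # hashtag body without the '#'
CAMEL_RE = re.compile(r"(?<=[a-z])(?=[A-Z])")   # boundary between aB in CamelCase
REPEAT_RE = re.compile(r"(.)\1{2,}")            # a character repeated 3+ times
PUNCT_RE = re.compile(r"[^\w\s]")               # punctuation and emoji


def split_hashtag(match):
    """Split a CamelCase hashtag body into space-separated words."""
    return CAMEL_RE.sub(" ", match.group(1))


def preprocess(text):
    """Apply link replacement, hashtag splitting, and character normalization."""
    text = URL_RE.sub("URL", text)              # replace links with a placeholder
    text = HASHTAG_RE.sub(split_hashtag, text)  # HappyBirthday -> Happy Birthday
    text = REPEAT_RE.sub(r"\1\1", text)         # loooool -> lool (regex, not clustering)
    text = PUNCT_RE.sub(" ", text)              # drop punctuation and emoji
    return " ".join(text.lower().split())       # lower-case, collapse whitespace


cleaned = preprocess("Loooool plz see https://t.co/xyz #HappyBirthdaySonakshiSinha !!!")
print(cleaned)
```

Collapsing every run of three or more identical characters to two is a simple stand-in that happens to map loool, looool, and looooool to lool; it does not attempt to choose among spelling variants such as those in Table 2.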

System Description
Our proposed system architecture is shown in Figure 1. It is an ensemble of a word n-gram-based Multinomial Naive Bayes (MNB) model and an LSTM over sub-word level representations (Sub-word LSTM) that classifies Hindi-English and English-Spanish code-mixed data into one of the sentiment classes positive, negative, or neutral. After pre-processing a sentence, we generate its word-based uni-gram and bi-gram features and feed them to the MNB classifier, which outputs the probability of the sentence belonging to each class. In our deep learning model, we feed an embedding matrix of length 128 to the LSTM cell. We use the intermediate sub-word representations learned by the filters in the convolution operation; the LSTM propagates the useful information and produces the final score of the text, as illustrated in Figure 2. Sub-word level representations are a better unit of language than characters, because they can produce new lexical structures by combining characters that carry semantic weight. The information is then fed into a fully connected (FC) layer, which models the interactions between these features and the classes (Hochreiter and Schmidhuber, 1997). A softmax activation function outputs the class probabilities. We train this setup with Adamax (Kingma and Ba, 2014), a variant of Adam, as the optimizer. The optimal hyperparameter configuration of the model is shown in Table 3.
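To make the MNB branch concrete, the following is a minimal pure-Python sketch of a Multinomial Naive Bayes classifier over word uni-gram and bi-gram counts. It is an illustration under stated assumptions rather than our actual implementation: the class `TinyMNB`, the helper `word_ngrams`, the use of add-one smoothing, and the three toy training sentences and their labels are all invented for this example.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n_values=(1, 2)):
    """Return the word uni-grams and bi-grams of a whitespace-tokenized sentence."""
    tokens = text.lower().split()
    grams = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


class TinyMNB:
    """Hypothetical minimal Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = word_ngrams(text)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict_proba(self, text):
        """Return {class: probability} for one sentence."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(doc_count / total_docs)  # log prior
            for gram in word_ngrams(text):
                if gram in self.vocab:                # skip unseen n-grams
                    score += math.log((counts[gram] + 1) / denom)
            log_scores[label] = score
        # Normalize in log space for numerical stability.
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}


# Toy data invented for illustration only.
train_texts = ["movie bahut fadu hai", "bahut bakwas movie", "kal movie dekhenge"]
train_labels = ["positive", "negative", "neutral"]
model = TinyMNB().fit(train_texts, train_labels)
probs = model.predict_proba("fadu movie hai")
print(max(probs, key=probs.get), probs)
```

Because the model works on raw n-gram counts, a rare but strongly polar keyword seen only a few times in training can still shift the class probabilities, which is the property the ensemble relies on.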

Experiment Details
This section presents the results and compares them to several baselines. We submitted one run for each language pair: (1) Hindi-English and (2) Spanish-English. The final ranking of all participating systems is based on the F1 score averaged across the positive, negative, and neutral classes.
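As a worked illustration of the ranking metric, the sketch below computes a per-class F1 score and averages it over the three classes. It assumes an unweighted (macro) average, which may differ from the exact averaging used by the official scorer; the function name `macro_f1` and the small gold and predicted label lists are made up for the example.

```python
def macro_f1(gold, predicted, labels=("positive", "negative", "neutral")):
    """Unweighted mean of per-class F1 (an assumed reading of the task metric)."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # class never predicted correctly
    return sum(scores) / len(scores)


# Invented labels for illustration only.
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "negative", "positive", "positive"]
score = macro_f1(gold, pred)
print(round(score, 3))
```

In this toy case the positive class has F1 of 0.8, the negative class 1.0, and the neutral class 0.0, so a system that never predicts one class correctly is penalized heavily under this averaging.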
We experimented with an ensemble of a word n-gram-based Multinomial Naive Bayes (MNB) model and an LSTM over sub-word level representations (Sub-word LSTM) to identify the sentiment of Hindi-English and English-Spanish code-mixed data. For each class, we multiplied the output probabilities of the two models and chose the class with the highest combined probability. Table 4 shows the performance of our system on the Hindi-English and English-Spanish data.
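The combination rule can be illustrated with a short sketch that multiplies the per-class probabilities of the two models and picks the class with the largest product. The function name `ensemble_predict` and the probability values are hypothetical, and the renormalization step is added only so that the combined scores are easier to read.

```python
def ensemble_predict(mnb_probs, lstm_probs):
    """Multiply per-class probabilities and return (best_class, combined_scores)."""
    combined = {label: mnb_probs[label] * lstm_probs[label] for label in mnb_probs}
    total = sum(combined.values())
    if total > 0:
        # Renormalize for readability; this does not change the arg-max.
        combined = {label: value / total for label, value in combined.items()}
    best = max(combined, key=combined.get)
    return best, combined


# Invented model outputs for illustration only.
mnb_probs = {"positive": 0.5, "negative": 0.3, "neutral": 0.2}
lstm_probs = {"positive": 0.3, "negative": 0.2, "neutral": 0.5}
label, combined = ensemble_predict(mnb_probs, lstm_probs)
print(label, combined)
```

Here the LSTM alone would pick neutral, but the product rule picks positive because the MNB model is confident about it and the LSTM does not strongly disagree.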
We observed that MNB (Unigram+Bigram) performed better than SVM (Unigram+Bigram) on the sparse and inconsistent code-mixed data, especially for rare keywords such as 'fadu', meaning awesome in English (Jhanwar and Das, 2018), because the n-gram-based MNB model can successfully capture such unusual keywords. The Sub-word LSTM, in turn, extracts better sequence information from long sentences. This is why we decided to use the ensemble model.

Conclusion and Future Work
Social media is becoming increasingly influential in people's lives. People in different positions and occupations express their views and attitudes on various topics. Some researchers are fascinated by the ... In this paper, we proposed an ensemble model for sentiment analysis of Hindi-English and English-Spanish code-mixed data. In the future, we plan to incorporate emotional information into the system and to introduce newer architectures such as Transformers, BERT, and attention mechanisms.

Model                                Hindi-English   English-Spanish
... (Prabhu et al., 2016)            0.633           0.651
Subword-LSTM (Prabhu et al., 2016)   0.635           0.653
Ensemble (our system)                0.647           0.682
Table 4: Quantitative comparison of the models for Hindi-English and English-Spanish sentiment analysis of social media text.