BAKSA at SemEval-2020 Task 9: Bolstering CNN with Self-Attention for Sentiment Analysis of Code Mixed Text

Sentiment Analysis of code-mixed text has diverse applications in opinion mining, ranging from tagging user reviews to identifying social or political sentiments of a sub-population. In this paper, we present an ensemble architecture of a convolutional neural network (CNN) and a self-attention based LSTM for sentiment analysis of code-mixed tweets. While the CNN component helps in the classification of positive and negative tweets, the self-attention based LSTM helps in the classification of neutral tweets, because of its ability to identify the correct sentiment among multiple sentiment-bearing units. We achieved F1 scores of 0.707 (ranked 5th) and 0.725 (ranked 13th) on the Hindi-English (Hinglish) and Spanish-English (Spanglish) datasets, respectively. The submissions for the Hinglish and Spanglish tasks were made under the usernames ayushk and harsh_6, respectively.


Introduction
The research problem of Sentiment Analysis of Code-Mixed Social Media Text appeared as part of the SemEval Shared Challenge 2020 (Patwa et al., 2020). Mixing languages while writing text, also called code-mixing, is a typical pattern observed in almost all forms of communication, including social media text. We only focus on two popular bilingual code-mixing styles namely Hinglish and Spanglish.
Sentiment Analysis is a term broadly used to classify states of human affect and emotion. Interpreting code-mixed languages is difficult not only because the sentences may not fit a particular language model, but also because mixed text on social media usually contains noisy tokens such as hashtags and usernames.
In this paper, we present an ensemble of a CNN and a self-attention based LSTM, utilizing the XLM-R embeddings (Conneau et al., 2019). While CNNs have been used for sentiment analysis before (Wang et al., 2016; Yoon and Kim, 2017), none of the previous works have paired them with a self-attention based LSTM. We found that while the CNN component worked well for positive and negative tweets, the self-attention component worked better for neutral tweets, motivating an ensemble of the two. The implementation of our system is made available via GitHub 1 .

Related Work
Performing standard NLP tasks on code-mixed data has presented significant challenges. Vyas et al. (2014) attempted to find methods for POS tagging of code-mixed social media text.
Another work by Joshi et al. (2016) used CNNs to learn subword-level embeddings and then utilized these embeddings in a BiLSTM network to learn subword-level information from social media text. Subword-level representations are particularly important when dealing with noisy texts containing misspellings and inconsistent punctuation. However, this work does not capture information about word-level semantics.
More recent work by Lal et al. (2019) uses two parallel BiLSTMs, which they call the Collective and Specific Encoders, together with an additional feature network. This approach combines recurrent neural networks with attention mechanisms, which helps in evaluating the overall sentiment from the attention weights when a tweet contains a mixture of local sentiments.


Preprocessing
The tweets are originally provided in the Latin script with their corresponding language tags. Figure 1 shows an example Hinglish tweet after preprocessing, tokenized into the subword sequence [<cls>, '_watched', '_Para', 'site', '.', '_काफी', '_अच्छी', '_movie', '_थी', '_', 'imo', '!', '_#', '_review', <pad> (x135), <eos>]. Before feeding the tweets to any training stage, they are preprocessed using the following procedure (Figure 1):
1. Back-Transliteration: All the words with "Hindi" language tags are converted into Devanagari using phonetic transliteration; Google's Transliteration API 2 was used for this purpose. Words with "Spanish" language tags are not transliterated.
2. Noise removal: Usernames (annotated as @username), URLs, and emoticons present in the tweets are removed altogether, while hashtags (annotated as #hashtag) are left unchanged. We also experimented with replacing emoticons with their corresponding textual meaning, but removing them led to better performance.
3. Tokenization: After noise removal, tweets are tokenized into subwords using the XLM-R (Conneau et al., 2019) vocabulary and then converted into their corresponding IDs.
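As an illustration, the noise-removal step above can be sketched with simple regular expressions. The exact patterns and the helper name remove_noise are our assumptions for this sketch, not the system's actual implementation:

```python
import re

def remove_noise(tweet: str) -> str:
    """Strip usernames, URLs, and emoticons; keep hashtags (step 2 above)."""
    tweet = re.sub(r"@\w+", "", tweet)                      # usernames
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)     # URLs
    tweet = re.sub(r"[:;=8][-~']?[)(DPpO/\\|]", "", tweet)  # common ASCII emoticons
    return re.sub(r"\s+", " ", tweet).strip()               # collapse whitespace

print(remove_noise("@user loved #Parasite :) see https://t.co/abc"))
# -> loved #Parasite see
```

The cleaned string would then be passed to the XLM-R subword tokenizer to obtain the input IDs.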

Embedding layer
Since our data comprises code-mixed tweets, it was essential to use a multilingual model. For our proposed architecture, we used the XLM-R embeddings. XLM-R is a transformer-based masked language model trained on one hundred languages, using more than two terabytes of filtered CommonCrawl data (Conneau et al., 2019). The subword IDs from the preprocessing stage are fed to the XLM-R encoder. The final hidden state corresponding to each token is used for the classification task as input to the subsequent components (see Figure 2). The XLM-R encoder is fine-tuned during training to generate better encodings for the code-mixed text.
We also experimented with the Multilingual BERT (henceforth, M-BERT), released by Devlin et al. (2018). We found that XLM-R performed much better than M-BERT for our dataset.

Architecture
We propose an ensemble model comprising two main components.

CNN Classifier
The first component is a convolutional neural network (LeCun, 1989) (henceforth, CNN). CNNs, to some extent, take into account the ordering of the words and the context in which each word appears.
We generate the required features by passing the subword embeddings of a sentence into a 1-D CNN. We perform convolutions with 3 different filter sizes (2, 3, and 4) before adding a bias and applying a non-linear ReLU activation.
The idea behind using several filter sizes was to capture contexts of varying lengths. The convolution layer extracts local features around each word window, while the max-pooling layer extracts the essential features from each feature map. The XLM-R embeddings are passed through this component and, ultimately, through a softmax function to obtain the predictions of the first component. We call these predictions p_CNN.
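A minimal NumPy sketch of the convolutional component described above, using filters of widths 2, 3, and 4 with ReLU and max-pooling over time. The filter count, toy embedding dimension, and random (untrained) weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_features(X, filter_sizes=(2, 3, 4), n_filters=4):
    """X: (seq_len, emb_dim) subword embeddings. For each filter width,
    convolve over word windows, add a bias, apply ReLU, then max-pool
    over time; finally concatenate the pooled features."""
    seq_len, emb_dim = X.shape
    pooled = []
    for w in filter_sizes:
        W = rng.normal(size=(w * emb_dim, n_filters))  # random stand-in filters
        b = np.zeros(n_filters)
        windows = np.stack([X[i:i + w].ravel() for i in range(seq_len - w + 1)])
        fmap = np.maximum(windows @ W + b, 0.0)        # ReLU feature map
        pooled.append(fmap.max(axis=0))                # max-pooling over time
    return np.concatenate(pooled)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

X = rng.normal(size=(10, 8))           # 10 subwords, toy 8-dim embeddings
feats = conv1d_features(X)             # 3 filter sizes x 4 filters = 12 features
W_out = rng.normal(size=(feats.size, 3))
p_cnn = softmax(feats @ W_out)         # predictions over {neg, neu, pos}
```

In the actual system, the filters and the output layer would of course be learned jointly with the fine-tuned XLM-R encoder rather than sampled randomly.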

Self-Attention Classifier
The second component is a self-attention based classifier (See figure 4). It helps in choosing the overall sentiment when presented with a mixture of sentiments. We use soft-attention (Xu et al., 2015), a deterministic, differentiable attention mechanism, where a softmax gives the weights for each subword, and the output of the attention module is a weighted sum of hidden representations at each location.
The self-attention component comprises a BiLSTM (Hochreiter and Schmidhuber, 1997) layer, which takes as input the output of the XLM-R encoder. The hidden state obtained from the BiLSTM layer for each subword is used to calculate the attention scores.
Suppose a sequence is given by the subwords $(w_1, w_2, ..., w_n)$. Let the $i$-th forward hidden state in the BiLSTM be represented by $\overrightarrow{h_i}$ and the $i$-th backward hidden state by $\overleftarrow{h_i}$. The combined annotation $k_i$ is obtained by concatenating the two:

$k_i = [\overrightarrow{h_i} ; \overleftarrow{h_i}]$ (1)

yielding the annotation sequence $(k_1, k_2, ..., k_n)$.
The attention mechanism assigns a score $e_i$ to each subword $i$ in the sentence $S$, as given by (2):

$e_i = v^\top \tanh(W k_i + b)$ (2)

where $W$, $b$, and $v$ are learned parameters. The attention weight $a_i$ of each $k_i$ is then computed by normalizing the attention scores:

$a_i = \frac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)}$ (3)

We then calculate the sentence latent representation vector $h$ using equation (4):

$h = \sum_{i=1}^{n} a_i k_i$ (4)

The representation is thus a weighted combination of all the hidden states. The representation vector $h$ is then passed through a fully connected layer followed by a softmax to obtain predictions p_att. The predictions from the first and second components are aggregated (see Figure 5) using an element-wise product (denoted by ∘) to obtain the final predictions, p_final = p_CNN ∘ p_att. We experimented with other aggregation techniques, such as a linearly weighted average, but the element-wise product worked better.
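The attention computation and the ensemble product described above can be sketched in NumPy. The additive scoring form $e_i = v^\top \tanh(W k_i + b)$, the renormalization of the final product, and the random untrained parameters are our assumptions for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

def soft_attention(K, W, b, v):
    """K: (n, d) BiLSTM annotations k_i.
    Scores e_i = v . tanh(W k_i + b), weights a_i = softmax(e),
    sentence vector h = sum_i a_i k_i."""
    e = np.tanh(K @ W + b) @ v        # (n,) attention scores
    a = np.exp(e - e.max())
    a = a / a.sum()                   # normalized attention weights
    h = a @ K                         # weighted sum of annotations
    return h, a

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n, d = 6, 8                           # 6 subwords, toy 8-dim annotations
K = rng.normal(size=(n, d))
W, b, v = rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d)
h, a = soft_attention(K, W, b, v)

# Ensemble: element-wise product of the two components' prediction vectors.
p_cnn = softmax(rng.normal(size=3))
p_att = softmax(h @ rng.normal(size=(d, 3)))
p_final = p_cnn * p_att
p_final = p_final / p_final.sum()     # renormalize to a distribution
```

Note that the renormalization does not change the argmax, so the predicted class is the same whether or not p_final is rescaled.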

Data Description
We used the dataset provided by the organizers of Task-9 of SemEval 2020 (Patwa et al., 2020) for training both Hinglish and Spanglish models. The data has been annotated semi-automatically. The statistics of the dataset are shown in Table 1. The dataset for Hinglish is balanced while that of Spanglish is highly unbalanced. For hyperparameter tuning, we used the validation set provided by the organizers.

Experiments and Results
We first trained a vanilla CNN model on the provided dataset using the XLM-R embeddings. This model was often confused on neutral data points but worked well on positive and negative tweets. The self-attention model outperforms it on neutral data points, though it performs worse on the positive and negative samples. The good performance on neutral tweets can be attributed to the fact that they may contain multiple sentiment-bearing units, which the attention mechanism is capable of handling.
The complementary strengths of the CNN and the self-attention model were the primary motivation for using an ensemble of the two. The ensemble outperforms all our previous models, achieving a recall of 0.705 with an F1-score of 0.707 on the Hinglish test dataset and a recall of 0.696 with an F1-score of 0.725 on the Spanglish test dataset (see Table 2). The confusion matrices for the ensemble on both datasets are shown in Figures 6 and 7 (o: neutral, +: positive, -: negative). Our team was ranked 5th among 62 teams in Hinglish and 13th among 29 teams in Spanglish.
To visualize the sentence embeddings learned by the model for the Hinglish test dataset, we projected the sentence vectors obtained before the final fully connected layer onto a lower-dimensional subspace using the t-SNE algorithm (van der Maaten and Hinton, 2008) for both components (see Figure 8).
For CNN, the positive and negative tweets seem to form two distinct clusters, while the neutral tweets are scattered among them. In contrast, for the self-attention component, neutrals seem to form a distinct cluster, while the positive and negative classes are partially dispersed in a wide region. Thus, the two components, in a way, complement each other for better predictions over all the three classes.
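The projection step described above can be reproduced in outline with scikit-learn's TSNE. Here random vectors stand in for the actual sentence representations, and the dimensions and perplexity are illustrative choices:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(2)
# Stand-in for the sentence vectors taken before the final fully
# connected layer (60 hypothetical tweets, 64-dim representations).
sentence_vecs = rng.normal(size=(60, 64))

proj = TSNE(n_components=2, perplexity=20, init="random",
            random_state=0).fit_transform(sentence_vecs)
print(proj.shape)  # (60, 2) -- one 2-D point per tweet, ready to scatter-plot
```

Coloring each projected point by its gold label is what reveals the clustering behavior discussed above.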

Error Analysis
Most of the misclassifications made by our model occur on the following three types of tweets:
1. Neutral: Despite the improvement due to the self-attention classifier, the performance on neutral tweets still lags well behind that on positive and negative tweets.
2. Sarcastic: Sarcasm is the use of irony to mock or convey contempt. Tweets such as "Best wishes to pseudo atheist In new country in advance. Bon voyage" are challenging to classify because of their hidden context, and are falsely predicted as positive by our model.
3. Mildly negative: Owing to the large number of strongly abusive tweets in the data, some mildly negative ones like "South africa team bekar h jab tak ushme ABD villers na ho" ("the South Africa team is useless unless AB de Villiers is in it") are falsely predicted as neutral.