NLP-CIC at SemEval-2020 Task 9: Analysing Sentiment in Code-switching Language Using a Simple Deep-learning Classifier

Code-switching is a phenomenon in which two or more languages are used in the same message. Nowadays, it is quite common to find messages with mixed languages on social media. This phenomenon presents a challenge for sentiment analysis. In this paper, we use a standard convolutional neural network model to predict the sentiment of tweets that blend Spanish and English. Our simple approach achieved an F1 score of 0.71 on the test set of the competition. We analyze the capabilities of our best model and perform error analysis to expose important difficulties in classifying sentiment in a code-switching setting.


Introduction
The phenomenon of combining two or more languages in the same message is known as code-switching or code-mixing (Gumperz, 1982; Myers-Scotton, 1993). Code-switching is an indicator of bilingual competence (Hamers and Blanc, 1999), and it is also motivated by social and cultural factors such as social status, race, and age (Kim, 2006). Although this phenomenon has been studied extensively in linguistics (Pfaff, 1979; Poplack, 1980; Gumperz, 1982; Myers-Scotton, 1993; Milroy and Muysken, 1995; Lipski, 2005; Martínez, 2010; Auer, 2013), it is still challenging for machines to process mixed natural languages. Code-switching is especially common in social media posts and chats, such as those on Twitter, Facebook, or WhatsApp, which makes it more difficult to process the sentiment expressed in such content.
In this work, we present a Convolutional Neural Network (CNN) system to predict the sentiment of a given code-mixed tweet. The sentiment labels are either positive, negative, or neutral, and the languages involved are English and Spanish. Our best model utilizes only Spanish word embeddings from tweets (Deriu et al., 2017) and does not require manual feature engineering.

Related work
Sentiment analysis is a widely studied task in monolingual, multilingual and cross-lingual settings. Examples include monolingual opinion mining in a multilingual context (Boiy and Moens, 2009), multilingual sentiment analysis (Balahur and Turchi, 2014), and cross-lingual polarity detection (Demirtas and Pechenizkiy, 2013). Language-independent approaches to sentiment analysis include the use of emoticons (Davies and Ghahramani, 2011) or emoticons and noisy labels (Narr et al., 2012). Sentiment analysis has not been extensively studied on code-switched content. A possible reason is the paucity of large annotated data covering several language pairs or combinations. The Sentimix shared task is an effort to address this problem on the Spanish-English and Hindi-English language pairs for sentiment analysis.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
Analysing opinions in tweets that blend Spanish and English is a difficult task (Vilares et al., 2017). Vilares et al. (2017) obtained an accuracy of 59.34% using lexical, syntactic, and N-gram features. They concluded that the task is challenging because of the presence of noise, difficulties with language identification and POS tagging, and the lack of annotated code-mixed lexicons and a large dataset.
Another line of work related to code-switching is the development of contextual and static multilingual or cross-lingual text representations which cover multiple languages in the same vector space. Examples include LASER (Artetxe and Schwenk, 2019), MUSE, and multilingual BERT 1. These representations can be used to encode inputs for deep learning models. The effectiveness of these text representation approaches on code-switched texts remains an open question.

Methodology
We describe the Spanglish dataset and our submitted models.

The Sentimix Spanglish dataset
The Sentimix Spanglish dataset (Patwa et al., 2020) consists of a list of tweets blending English and Spanish text. The dataset is divided into train, development, and test sets holding 12002, 2998, and 3789 samples, respectively. The dataset provides detailed per-word annotations which, besides the English and Spanish tags, include named entities, in-word mixes, and ambiguous and foreign words. Most of the dataset is Spanish, which is the mode language for 65.1% of all tweets, whereas English is the mode language for only 20.5%. The language statistics are distributed proportionally across the three partitions of the dataset. With regards to sentiment labels, Table 1

Text normalization as pre-processing
We normalize the input text using Ekphrasis 2 to address the noise in social media text. Specifically, the normalization process consists of:
- Mapping URL, email, percent, money, phone, user, time, date, and numbers to a unique descriptive token.
- Labeling stylistic patterns such as uppercase, elongated, repeated, emphasized, and censored words. Examples of these can be seen in Table 3.
- Word transformations suitable for social media content, such as a word segmenter, spellchecker, and tokenizer.
We applied the normalization to the entire dataset and noticed that it had only a marginal influence on the Spanish text. We did not apply normalization to the Spanish part of the text.
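The kind of token mapping and style labeling described above can be illustrated with a small standard-library sketch. This is not Ekphrasis and does not reproduce its rules or exact output: the regular expressions, the `normalize` function, and the choice to collapse an elongated run to a single character are all assumptions made for illustration. Only the angle-bracket tag names follow the examples in Table 3.

```python
import re

# Hypothetical patterns: a rough approximation of mapping noisy spans to
# descriptive tokens. Ekphrasis covers many more cases (email, money, date...).
PATTERNS = [
    (re.compile(r"https?://\S+"), "<url>"),
    (re.compile(r"@\w+"), "<user>"),
    (re.compile(r"\b\d+(?:[.,]\d+)?\b"), "<number>"),
]
ELONGATED = re.compile(r"(\w)\1{2,}")  # same character repeated 3+ times


def normalize(text: str) -> str:
    """Replace noisy spans with tokens and annotate simple stylistic patterns."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    out = []
    for word in text.split():
        if word.startswith("<") and word.endswith(">"):
            out.append(word)  # already a descriptive token
        elif ELONGATED.search(word):
            # Simplification: collapse the repeated run to one character.
            out.extend([ELONGATED.sub(r"\1", word).lower(), "<elongated>"])
        elif len(word) > 1 and word.isalpha() and word.isupper():
            out.extend(["<allcaps>", word.lower(), "</allcaps>"])
        else:
            out.append(word.lower())
    return " ".join(out)
```

For instance, `normalize("@ana como no amarte VIERNES sooo good http://t.co/x 20")` yields `<user> como no amarte <allcaps> viernes </allcaps> so <elongated> good <url> <number>`. A real pipeline would rely on the library's spellchecker and word segmenter rather than these hand-written rules.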

Neural Network architectures
CNN model. We used a standard architecture (Kim, 2014) with standard hyperparameter values. It consists of a single convolution layer with filter sizes of 2, 3 and 4, each with 100 filters, followed by max-pooling and a dropout layer, and finally a fully connected layer which outputs the results. We used ReLU as the activation function, Adam as the optimizer, and cross-entropy loss as the optimization objective. The hyperparameters are: vocabulary size of 15000, batch size of 64, and dropout probability of 0.5. We predict using the best epoch result obtained out of 5 epochs.
Our embedding vectors were initialized using only 200-dimensional Spanish word embeddings 3 which were trained on a collection of tweets (Deriu et al., 2017). We did not use English embeddings because in previous experiments we noticed that they did not contribute enough to the performance. This is probably because Spanish is the majority language in the dataset, as mentioned before.
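To make the data flow of the CNN description concrete, the following is a minimal, framework-free sketch of the forward pass: embedding lookup, convolution with filter widths 2, 3 and 4, ReLU, max-over-time pooling, and a dense layer with softmax. It is not the submitted implementation. The sizes are toy values chosen so that plain Python loops stay fast, the weights are random, and biases, dropout, and training are omitted. All names are hypothetical.

```python
import math
import random

random.seed(0)

# Toy sizes (assumptions). The text reports a vocabulary of 15000,
# 200-dimensional embeddings and 100 filters per filter size.
VOCAB, EMB_DIM, N_FILTERS, N_CLASSES = 50, 8, 4, 3
FILTER_SIZES = (2, 3, 4)  # filter widths stated in the text


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


embedding = rand_matrix(VOCAB, EMB_DIM)
# One bank of filters per width; each filter spans k consecutive embeddings.
filters = {k: rand_matrix(N_FILTERS, k * EMB_DIM) for k in FILTER_SIZES}
dense = rand_matrix(N_CLASSES, N_FILTERS * len(FILTER_SIZES))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pooled_features(token_ids):
    """Convolution + ReLU + max-over-time pooling for every filter.

    Assumes the tweet has at least max(FILTER_SIZES) tokens (pad otherwise).
    """
    vectors = [embedding[t] for t in token_ids]
    features = []
    for k in FILTER_SIZES:
        # Flatten each window of k token vectors into one long vector.
        windows = [sum(vectors[i:i + k], []) for i in range(len(vectors) - k + 1)]
        for f in filters[k]:
            features.append(max(max(0.0, dot(f, w)) for w in windows))
    return features


def predict_proba(token_ids):
    """Dense layer + softmax over the three sentiment labels (inference only)."""
    feats = pooled_features(token_ids)
    logits = [dot(row, feats) for row in dense]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With the toy sizes above, `pooled_features` returns 3 x 4 = 12 values per tweet. With the reported configuration, the same pooling would give 3 x 100 = 300 features feeding the final layer.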
GRU model. This submission used a bidirectional Gated Recurrent Unit (GRU) 4 over English-Spanish aligned word embeddings of dimension 300. The hidden state of the GRU is of dimension 512. We used dropout with a probability of 0.1 on the outputs of the embedding and GRU layers. In addition, layer normalization is applied to the hidden representation generated by the GRU layer. A fixed representation for a given tweet is derived by taking the average of the backward and forward hidden representations of the GRU layer. This serves as input to a dense layer with a softmax activation function, which outputs a probability distribution over the three labels. We used AdamW as the optimizer to minimize the negative log-likelihood loss for 10 epochs, with a batch size of 256 and a learning rate of 0.001.
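The classification head described above (layer normalization, averaging of the two directions, dense layer with softmax) can be sketched as follows. The recurrent computation itself is not shown. The order of operations is an assumption, since the text does not say whether normalization comes before or after averaging. The layer normalization here has no learnable gain or bias, and all function names are hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tweet_representation(h_forward, h_backward):
    # Assumption: each direction is normalized first, then the two are averaged.
    f, b = layer_norm(h_forward), layer_norm(h_backward)
    return [(x + y) / 2 for x, y in zip(f, b)]


def classify(h_forward, h_backward, weights, bias):
    """Dense layer + softmax over the three labels."""
    rep = tweet_representation(h_forward, h_backward)
    logits = [sum(w * r for w, r in zip(row, rep)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)
```

Averaging, rather than concatenating, the two directions keeps the tweet representation at the hidden size (512 in the text) instead of doubling it.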

Results and discussion
We obtained a precision of 0.807 and a recall of 0.647 in the competition, from which an F1 score of 0.71 was calculated for the competition ranking 5 . Moreover, following the competition guidelines, we also report class-wise F1 scores for the test set, on which we achieved a lower score 6 . Under this metric, our aforementioned architecture achieved results shown at

Error analysis
Here we carry out an additional analysis of our best model (the CNN model) by removing the text normalization stage and running cross-validation. Moreover, we perform error analysis to find common situations in which our system fails to make the correct prediction. These analyses were done using the development set labels.
We found that removing the normalization step reduces the model performance by a few points. Specifically, the macro-F1 score without Ekphrasis normalization is 0.42, a performance reduction of 3%.
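The scores discussed so far involve two ways of aggregating F1: the competition score and the macro-averaged class-wise score. The sketch below shows how the two averaging schemes can diverge on an imbalanced label distribution. The per-class precision, recall, and support values are hypothetical and are not results from this paper. Whether the Codalab scorer uses exactly this weighted scheme is also an assumption.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def averaged_f1(per_class, weighted=False):
    """per_class maps label -> (precision, recall, support)."""
    scores = {c: f1(p, r) for c, (p, r, _) in per_class.items()}
    if not weighted:
        # Macro average: every class counts equally.
        return sum(scores.values()) / len(scores)
    # Weighted average: classes count in proportion to their support.
    total = sum(n for _, _, n in per_class.values())
    return sum(scores[c] * n for c, (_, _, n) in per_class.items()) / total


# Hypothetical per-class values, chosen only to illustrate the gap.
toy = {
    "positive": (0.8, 0.9, 600),
    "negative": (0.5, 0.4, 250),
    "neutral": (0.3, 0.2, 150),
}
macro = averaged_f1(toy)
weighted = averaged_f1(toy, weighted=True)
```

In this toy case the weighted score (about 0.66) is well above the macro score (about 0.51), because the majority class is also the easiest one. As a separate check, `f1(0.807, 0.647)` evaluates to roughly 0.718, in line with the reported competition score of 0.71.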
We present five categories that highlight some of the errors made by our best model. The analysis was carried out on a stratified sample of 300 examples from the development set. Due to space limitations, we show only a few examples of these categories, along with the gold label (L) and our prediction (P), in Table 3 7 .

Table 3
Tweet example | L | P | Category
<user> a claro este emoji :Unamused: es de pura alegria i forgot :Face with Tears of Joy: | T | N | Difficult
<user> espero con todas mis fuerzas , que fucking taylor se lleve el best album puneta | P | N | Difficult
estuve todo el dia muriendome de sueno . <time> y yo estoy active . <hashtag> mi dios esta pasao </hashtag> | N | T | Difficult
viendo walking dead . <repeated> . | T | N | Tendency
felizmente sacandome las cejas y papi entra a mi cuarto con el speech , solo pense en que si se me jodia una se iba a fucking formal <elongated> | P | N | Tendency
<user> las pastillas antiinflamatorias me estan jodiendo el estomago so tuve que parar de tomarmelas :Slightly Smiling Face: :Slightly Smiling Face: :Slightly Smiling Face: | P | N | Tendency
como no amarte <allcaps> viernes </allcaps> <hashtag> tgif </hashtag> outfit del dia blusa <number> black pant <number> envios <phone> <hashtag> com enzo el fin de semana </hashtag> ! <url> | N | P | Advertising
ganate almuerzo en tu trabajo , sacate un <hashtag> selfie </hashtag> y postealo en las redes con <hashtag> almuerzaconla gatita </hashtag> <url> | P | T | Advertising
chequea este outfit fresco y basico para la playa , desde ya preparate para este feriado ! <hashtag> salhuaclothing </hashtag> <url> | T | P | Advertising

3. https://www.spinningbytes.com/resources/wordembeddings/
4. We chose GRU over LSTM because the former has fewer parameters to learn.
5. https://competitions.codalab.org/competitions/20789#learn_the_details-result (Codalab user name of our team: ajason08)
6. The Codalab scorer is using a less strict metric, probably a weighted scheme.
7. This link shows the visual version of the emoji characters: https://arxiv.org/abs/2009.03397

We consider the following categories:
- Difficult: [...] style which leads to out-of-vocabulary issues, and the high level of common sense or pragmatic understanding probably involved. We also noticed that the effect of code-switching is often sarcastic.
- Negative tendency: the use of highly negative words (e.g. vulgar expressions) seems to bias the model towards predicting a negative sentiment; however, this only occurs in the English portion of the text. We notice it because some English words appear in the embedding vocabulary, probably from people doing Spanish-English code-switching.
- Advertising: these are tweets promoting some product or service. Surprisingly, most of them have a neutral or negative label. Conversely, our best model rates them mostly as conveying positive or neutral sentiment.
- Ambiguous labels: we hypothesize that the model gets confused when it finds samples with a very similar narrative but different labels. For example, the tweet "mami esta llorando con la voz kids :Face with Tears of Joy:" 8 refers to a hilarious situation and receives a "neutral" label. A second tweet, "llego a casa a encontrarme a mami llorando viendo inside out" 9 , refers to a very similar situation but receives a "positive" label. In both tweets, the sentiment stems from the relation between the adult and the emotive show. We think these instances are a source of confusion for the model.
- Doubtful labels: some tweet labels can be related to subjectivity in the annotation process. We consider that these examples are incorrectly labeled.
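The error analysis in this section relies on a stratified sample of 300 development examples. One way such a sample could be drawn is sketched below. The text does not state the stratification variable, so stratifying by gold sentiment label is an assumption, and `stratified_sample` is a hypothetical helper rather than the procedure actually used.

```python
import random
from collections import defaultdict


def stratified_sample(examples, n, seed=0):
    """Draw about n items from (text, label) pairs, proportionally per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in examples:
        by_label[item[1]].append(item)
    sample = []
    for items in by_label.values():
        # Quota proportional to the label's share; rounding may shift the
        # total by one or two items.
        quota = round(n * len(items) / len(examples))
        sample.extend(rng.sample(items, min(quota, len(items))))
    return sample
```

Because each label keeps its share of the data, error counts per category in the sample can be read as rough estimates for the whole development set.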

Conclusions
Code-switching is an interesting problem with an important presence in social media, which, combined with an informal writing style, increases the challenges for social media processing tasks such as sentiment analysis. We experimented with the Sentimix Spanglish dataset using a CNN model and only Spanish embeddings. We achieved a precision, recall, and F1 score of 0.80, 0.64, and 0.71, respectively. Our analyses suggest that a deep learning model can be easily biased by the presence of cue words, such as vulgar expressions, in sentiment analysis. We found that this occurs mostly when the cue word is in English. This observation requires a deeper analysis. We also highlight the need to address complex language usage such as informality and sarcasm.
Furthermore, we point out that subjectivity in the annotation of sentiment labels is a problem that deserves to be addressed. We plan to test contextual multilingual embeddings (e.g. BERT) and leverage the language tags and other non-linguistic constructs such as hashtags and emojis.