SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets

In this paper, we present the results of SemEval-2020 Task 9 on Sentiment Analysis of Code-Mixed Tweets (SentiMix 2020). We also release and describe our Hinglish (Hindi-English) and Spanglish (Spanish-English) corpora, annotated with word-level language identification and sentence-level sentiment labels. These corpora comprise 20K and 19K examples, respectively. The sentiment labels are Positive, Negative, and Neutral. SentiMix attracted 89 submissions in total: 61 teams participated in the Hinglish contest and 28 submitted systems to the Spanglish competition. The best performance achieved was a 75.0% F1 score for Hinglish and 80.6% F1 for Spanglish. We observe that BERT-like models and ensemble methods are the most common and successful approaches among the participants.


Introduction
The evolution of social media texts such as blogs, micro-blogs (e.g., Twitter), and chats (e.g., WhatsApp and Facebook messages) has created many new opportunities for information access and language technologies. However, it has also posed many new challenges making it one of the current prime research areas in Natural Language Processing (NLP).
Current language technologies primarily focus on English (Young, 2020), yet social media platforms demand methods that can also process other languages, as they are inherently multilingual environments. Moreover, multilingual communities around the world regularly express their thoughts on social media by alternating between different languages in the same utterance. This mixing of languages, also known as code-mixing or code-switching, is the norm in multilingual societies and is one of the many NLP challenges that social media has facilitated.

Code-Mixing Challenges
In addition to the writing aspects in social media, such as flexible grammar, permissive spelling, arbitrary punctuation, slang, and informal abbreviations (Baldwin et al., 2015; Eisenstein, 2013), code-mixing has introduced a diverse set of linguistic challenges. For instance, multilingual speakers tend to code-mix using a single alphabet regardless of whether the languages involved belong to different writing systems (i.e., language scripts). This behavior is known as transliteration, and code-mixers rely on the phonetic patterns of their writing (i.e., the actual sound) to convey their thoughts in the foreign language (i.e., the language adapted to a new script) (Sitaram et al., 2019). Another common pattern in code-mixing is the alternation of languages at the word level. This behavior often happens by inflecting words from one language with the rules of another language (Solorio and Liu, 2008). For instance, in the second example below, the word pushes is the result of conjugating the English verb push according to Spanish grammar rules for the present tense in third person (in this case, the inflection -es). The Hinglish example shows that phonetic Latin script typing is a popular practice in India, instead of using Devanagari script to write Hindi words. We capture both transliteration and word-level code-mixing inflections in the Hinglish and Spanglish corpora of this competition, respectively.
Aye/HI aur/HI enjoy/EN kare/HI
Eng. Trans.: come and enjoy
No/SP me/SP pushes/EN please/EN
Eng. Trans.: Don't push me, please
Considering the previous challenges, code-mixing demands new research methods where the focus goes beyond simply combining monolingual resources to address this linguistic phenomenon. Code-mixing poses difficulties in a variety of language pairs and on multiple tasks along the NLP stack, such as word-level language identification, part-of-speech tagging, dependency parsing, machine translation, and semantic processing (Sitaram et al., 2019). Conventional NLP systems rely heavily on monolingual resources to address code-mixed text, which limits their ability to handle issues such as phonetic typing and word-level code-mixing.

Code-Mixing as a Global Linguistic Phenomenon
Naturally, code-mixing is more common in geographical regions with a high percentage of bi- or multilingual speakers, such as Texas and California in the US, Hong Kong and Macao in China, many European and African countries, and the countries in South-East Asia. Multilingualism and code-mixing are also widespread in India, which has more than 400 languages (Eberhard et al., 2020), about 30 of which have more than 1 million speakers. Language diversity and dialect changes lead Indians to frequently change and mix languages, particularly in speech and social media contexts. As of 2020, Hindi and Spanish have over 630 million and over 530 million speakers (Eberhard et al., 2020), respectively, ranking them in 3rd and 4th place by number of speakers worldwide, which underscores the relevance of using these languages in our code-mixing competition.

SentiMix Overview
This paper provides an overview of the SemEval-2020 Task 9 competition on sentiment analysis of code-mixed social media text (SentiMix). Specifically, we provide code-mixed text annotated with word-level language identification and sentence-level sentiment labels (negative, neutral, and positive). We release our Hinglish (Hindi-English) and Spanglish (Spanish-English) corpora, which comprise 20K and 19K tweets, respectively. We describe general statistics of the corpora as well as the baseline for the competition.
We received 61 final submissions for Hinglish and 28 for Spanglish, for a total of 89 submissions. We also received 33 system description papers. We provide an overview of the participants' results and describe their methods at a high level. Notably, the majority of these methods employed BERT-like and ensemble models to reach competitive results, with the best performers reaching 75.0% and 80.6% F1 scores for Hinglish and Spanglish on held-out test data, respectively. We hope that this shared task will continue to draw the NLP community's attention to the linguistic code-mixing phenomenon.

Related Work
Linguists (Verma, 1976; Bokamba, 1988; Singh, 1985) studied the phenomena of code-mixing and intrasentential code-switching and found that processing code-mixed language is much more complicated than processing monolingual text. Code-mixing is often found on social media, which contains many nonstandard spellings and unnecessary capitalization (Das and Gambäck, 2014), making the task more difficult. Naturally, the difficulty increases as the amount of code-mixing increases. To quantify the level of code-switching between languages in a sentence, Gambäck and Das (2016) introduced a measure called the Code Mixing Index (CMI), which considers the number of tokens of each language in a sentence and the number of tokens where the language switches.
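To give a feel for how such an index behaves, the sketch below computes a simpler utterance-level form, 100 × (1 − max_i w_i / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the number of tokens in language i. This is the basic variant without the switch-point term, so it should not be read as the exact measure used to compute the values reported in this paper. The function name and the tag strings are illustrative assumptions.

```python
from collections import Counter


def cmi(tags, neutral=frozenset({"O"})):
    """Basic utterance-level Code-Mixing Index (no switch-point term).

    tags: one language tag per token; tags in `neutral` are treated as
    language-independent (assumed tag name "O").
    """
    n = len(tags)
    lang_counts = Counter(t for t in tags if t not in neutral)
    u = n - sum(lang_counts.values())  # language-independent tokens
    if n == u:  # no language-tagged tokens at all
        return 0.0
    dominant = max(lang_counts.values())
    return 100.0 * (1.0 - dominant / (n - u))


# Tags of the Hinglish example "Aye aur enjoy kare": three Hindi tokens, one English.
print(cmi(["HI", "HI", "EN", "HI"]))  # 25.0
```

A monolingual utterance scores 0, and the score grows as the non-dominant languages take a larger share of the tokens.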
Several researchers have attempted to detect sentiment in code-mixed text. Mohammad et al. (2013) used SVM-based classifiers to detect sentiment in tweets and text messages using semantic information. Bojanowski et al. (2017) proposed a skip-gram based word representation model that classifies the sentiment of tweets and provides an extensive vocabulary list for language. Giatsoglou et al. (2017) trained lexicon-based document vectors, word embeddings, and hybrid systems with the polarity of words to classify the sentiment of a tweet. Sharma et al. (2016) attempted shallow parsing of code-mixed data obtained from online social media, and word-level identification of code-mixed data has also been tried as a way to classify the sentiment. Some researchers also tried normalizing the text with lexicon lookup for sentiment analysis of code-mixed data (Sharma et al., 2015).
To advance research in code-mixed language processing, a few workshops have also been conducted. Four successful editions of Mixed Script Information Retrieval have been organized at the Forum for Information Retrieval Evaluation (FIRE) (SahaRoy et al., 2013; Sequiera et al., 2015; Banerjee et al., 2016). Three workshops on Computational Approaches to Linguistic Code-Switching (CALCS) have been conducted, which included shared tasks on language identification and Named Entity Recognition (NER) in code-mixed data (Solorio et al., 2014a; Molina et al., 2016; Aguilar et al., 2018). For our SentiMix Spanglish dataset, we adopt the SentiStrength (Vilares et al., 2015) annotation mechanism and conduct the annotation process over the unified corpus from the three CALCS workshops.

Task Description
Although code-mixing has received some attention recently, properly annotated data is still scarce. We run a shared task to perform sentiment analysis of code-mixed tweets crawled from social media. Each tweet is classified into one of three polarity classes: Positive, Negative, or Neutral. Each tweet also has word-level language marking. We release two datasets: Spanglish and Hinglish.
We used CodaLab to release the datasets and evaluate submissions. Initially, the participants had access only to the training and validation data. They could check their system's performance on the validation set on a public leaderboard. Later, a previously unseen test set was released, and the performance on the test set was used to rank the participants. Only the first three submissions on the test set by each participant were considered, to avoid over-fitting on the test set. The ranking was based on the best of the three submissions. There was no distinction between constrained and unconstrained systems, but the participants were asked to report what additional resources they used for each submitted run.
We release 20K labeled tweets for Hinglish and approximately 19K labeled tweets for Spanglish. In both datasets, in addition to the tweet-level sentiment label, each tweet also has word-level language labels. The detailed distribution is provided in Table 1. Some annotated examples are provided in Table 2. Although this task focuses on sentiment analysis, the data has word-level language marking and can be used for other NLP tasks.

Evaluation Metric
To evaluate the performance and rank the participants, we use the weighted F1 score on the test data, across the positive, negative, and neutral classes. The F1 score is calculated for each class, and the per-class scores are then averaged with weights given by the support (the number of true instances of each class). We use a weighted F1 score since the number of instances per class is not equal. In addition to the F1 score, we also calculate precision and recall for each class to better understand false positives and false negatives.
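The metric described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the official scoring script; the function names and label strings are assumptions. The result should match what scikit-learn's `f1_score` returns with `average="weighted"`.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")  # assumed label strings


def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_f1(y_true, y_pred, labels=LABELS):
    """Average of per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        support[label] / total * per_class_prf(y_true, y_pred, label)[2]
        for label in labels
    )


gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
print(round(weighted_f1(gold, pred), 4))  # 0.75
```

Because each class contributes in proportion to its support, the majority class has the largest influence on the final score.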

Dataset
The datasets consist of tweets labeled into one of the three classes:
• Positive (Pos): Tweets that express happiness, praise a person, group, country, or product, or applaud something. Hinglish example: "bholy bhayaa. Ufffff dil jeet liya ap ne. Love you imran bhai. Mind blowing ap ki acting hai." (bholy bhayaa, you won hearts. love you imran bhai your acting is mind blowing). Spanglish example: "We all here waiting pa ke juege mex :)" (We all here waiting for Mexico to play :)).
• Negative (Neg): Tweets that attack a person, group, product, or country, express disgust or unhappiness towards something, or criticize something. Hinglish example: "You efficiency of anchoring a program is continuously deteriorating. Ab to dekhne ki himmat hi nahi" (Your efficiency of anchoring is continuously deteriorating. Now can't even dare to watch it). Spanglish example: "Eres una cualkiera yes u are." (You are a tramp, yes you are.)
• Neutral (Neu): Tweets that state facts, give news, or are advertisements; in general, those that do not fall into the above two categories. Hinglish example: "Nahi wo is news ko defend kerne ki koshesh ker rhe hain h" (No, they are trying to defend this news). Spanglish example: "My phone looks ratchet todo crack" (My phone looks ratchet all crack).
Both the Hinglish and Spanglish datasets are released using the previous sentiment label scheme. However, each dataset has been annotated separately, as the studies were independent before the organization of this competition. We provide the data collection and annotation details in the following subsections.

Hinglish
Data Collection: First, we make a list of all the Hindi tokens from the dataset provided by Patra et al. (2018). From that list, we remove the tokens that are common to Hindi and English (for example, 'the' occurs in both languages). Then we use the Twitter API to crawl tweets that contain at least one word from the list. The list has 10,786 tokens. Some words from the list are: kuch, tu, gaya, raha, aaj, apne, tum, gaye, sath, etc.
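The two filtering steps can be sketched as follows. This is only an illustration of the logic: the actual crawl was done through the Twitter API, whereas the sketch filters tweets that are already in memory, and the function names, lowercasing, and whitespace tokenization are all assumptions.

```python
def build_query_list(hindi_tokens, english_vocab):
    """Drop tokens that also occur in English, keeping a sorted list of query words."""
    english = {w.lower() for w in english_vocab}
    return sorted({t.lower() for t in hindi_tokens} - english)


def has_query_word(tweet, query_words):
    """True if the tweet contains at least one word from the query list."""
    return any(token in query_words for token in tweet.lower().split())


# Toy inputs; the real list described above has 10,786 tokens.
queries = set(build_query_list(["kuch", "the", "aaj"], ["the", "enjoy"]))
print(sorted(queries))                                  # ['aaj', 'kuch']
print(has_query_word("Aaj match enjoy karo", queries))  # True
```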
Language and Sentiment Annotation: For word-level language marking, we use an automated tool released by Bhat et al. (2014). The tokens are labeled as HIN (Hindi), ENG (English), or O (other). For tweet-level sentiment labels, we relied on around 60 annotators who were bilingual or multilingual, proficient in Hindi, and had Hindi as their first or second language. Each tweet was shown to two annotators and was kept only if their annotations matched; otherwise it was discarded. The annotators used a simple website designed for this purpose. Each tweet was shown on a page that had a radio button for each label. The annotators first had to enter their unique id; they could then either select a sentiment option for a tweet and submit it or choose to skip the tweet.
Statistics: Table 1 gives the detailed class-wise distribution of the tweets. Although Neutral is the majority class for Hinglish, the dataset is not too imbalanced. The class-wise distribution is similar for all three splits. Table 2 shows some examples of tweets marked with language and sentiment tags. The average CMI values for the Hinglish train, validation, and test sets are 25.32, 25.53, and 25.13, respectively. The inter-annotator agreement is 55%.
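The selection rule for doubly annotated tweets amounts to a simple filter. The sketch below assumes a hypothetical mapping from tweet ids to the two annotators' labels; it is not the code behind the annotation website.

```python
def keep_agreed(pairs):
    """Keep only tweets whose two annotations match.

    pairs: dict mapping tweet id -> (label from annotator A, label from annotator B).
    Returns a dict mapping tweet id -> agreed label.
    """
    return {tweet_id: a for tweet_id, (a, b) in pairs.items() if a == b}


annotations = {
    "t1": ("positive", "positive"),
    "t2": ("neutral", "negative"),  # disagreement, so the tweet is discarded
    "t3": ("neutral", "neutral"),
}
print(keep_agreed(annotations))  # {'t1': 'positive', 't3': 'neutral'}
```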

Spanglish
Data Collection: We use the Spanish-English data from the CALCS workshops (Solorio et al., 2014b; Molina et al., 2016; Aguilar et al., 2018). In the first workshop (Solorio et al., 2014b), the data was collected by crawling tweets from specific locations with a strong presence of Spanish and English speakers (e.g., California and Texas). The collection process was conducted using common words from each language through the Twitter API. In the second workshop (Molina et al., 2016), the organizers provided a new test set collected with a more elaborate process. They selected big cities where bilingual speakers are common (e.g., New York and Miami). Then, they located Spanish radio stations that posted code-mixed tweets. Such radio stations led to users who also practice code-mixing. Similar to the third workshop (Aguilar et al., 2018), we take the CALCS data and extend it for sentiment analysis. It is worth noting that a large number of tweets in the corpora contain only monolingual text (i.e., no code-mixing). Considering that, and after merging the two corpora, we prioritize the tweets that show code-mixed text to build the SentiMix corpus. We ended up incorporating 280 monolingual tweets per language (English, Spanish) in the test set.
Annotation: Since we use the data from the previous CALCS workshops, we did not need to undergo the token-level annotation process for language identification (LID). We adopted the CALCS LID label scheme, which comprises the following eight classes: lang1 (English), lang2 (Spanish), mixed (partially in both languages), ambiguous (either one or the other language), fw (a language different from lang1 and lang2), ne (named entities), other, and unk (unrecognizable words). For the annotation of the sentiment labels, we follow the SentiStrength strategy (Thelwall et al., 2010; Vilares et al., 2015). That is, we provide positive and negative sliders to the annotators. Each slider denotes the strength of the corresponding sentiment, and the annotators can choose the level of sentiment they perceive in the text (see Figure 1). The range of the sliders is discrete and includes strengths from 1 to 5, with 1 being no strength (i.e., no positive or negative sentiment) and 5 the strongest level. Using two independent sliders allowed the annotators to process the positive and negative signals without excluding one from the other, letting them provide mixed sentiments for the given text (Berrios et al., 2015). Once the sentiment strengths were specified, we converted them into a 3-way sentiment scale (i.e., positive, negative, and neutral). We simply subtract the negative strength from the positive strength, and mark the text as positive if the result is greater than zero, negative if it is less than zero, and neutral otherwise.
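The conversion from the two slider values to the 3-way scale is small enough to write out directly. The function name and the range check are illustrative assumptions; the decision rule itself follows the description above.

```python
def to_three_way(pos_strength, neg_strength):
    """Map SentiStrength-style slider values (1 to 5 each) to a 3-way label."""
    for strength in (pos_strength, neg_strength):
        if strength not in range(1, 6):
            raise ValueError(f"strength must be an integer from 1 to 5, got {strength!r}")
    difference = pos_strength - neg_strength
    if difference > 0:
        return "positive"
    if difference < 0:
        return "negative"
    return "neutral"


print(to_three_way(4, 1))  # positive
print(to_three_way(1, 3))  # negative
print(to_three_way(3, 3))  # neutral
```

One consequence of this rule is that a text with equally strong positive and negative signals, such as (3, 3), receives the same neutral label as a text with no sentiment at all, (1, 1).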
We annotate each tweet with the help of three annotators from Amazon Mechanical Turk. We regulate the annotations by using quality questions within every assignment of a HIT (Human Intelligence Task). Every assignment has ten tweets: two of them are for quality control (i.e., the annotation is already known) and the other eight are the ones to annotate. The annotators had to get at least one quality control tweet right; otherwise the assignment (i.e., the ten tweets) was automatically rejected. Since the sentiment analysis task is arguably subjective, we provided multiple valid levels of strength for the quality control tweets. If an assignment was rejected, another annotator was automatically required to complete the HIT until three annotations were accepted. Also, we automatically approved HITs if their 3-way sentiment inter-annotator agreement was over 66%. Otherwise, we manually evaluated the annotations and decided whether to extend the assignments or mark the sentiment labels ourselves for the trivial cases. After merging the annotations, we made a pass over the data and manually corrected annotations that were unambiguously wrong.
Statistics: The Spanglish class-level distribution of the partitions appears in Table 1. Notably, the data is highly imbalanced towards the positive class, which covers about 56% of the entire Spanglish corpus, while the negative and neutral classes account for around 16% and 27%, respectively. The reason for this imbalanced distribution is that we did not collect the data following a sentiment-oriented crawling strategy (e.g., searching by sentiment-related keywords). Instead, we just extended the original corpus, which happens to be mostly positive. Our intention in proceeding this way was to enrich the original corpus annotations with sentiment-level labels. Moreover, the splits do not share the same distribution (i.e., development and test are more skewed than training) because we were annotating data on demand rather than having the entire corpus available at any stage of the competition. Some annotated examples are provided in Table 2. The average CMI values for the train, validation, and test sets are 21.84, 20.52, and 17.23, respectively.
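The quality-control rule for a single assignment can be sketched as below. The data layout is an assumption made for illustration: each answer is a (positive, negative) strength pair, and each quality control tweet comes with a set of accepted pairs, reflecting the multiple valid strength levels mentioned above. It is not the actual Mechanical Turk review code.

```python
def assignment_accepted(qc_answers, qc_valid, min_correct=1):
    """Accept an assignment if enough quality control tweets were answered acceptably.

    qc_answers: dict tweet id -> (positive strength, negative strength) given by the annotator.
    qc_valid:   dict tweet id -> set of accepted (positive, negative) pairs.
    """
    correct = sum(1 for tweet_id, answer in qc_answers.items() if answer in qc_valid[tweet_id])
    return correct >= min_correct


valid = {"qc1": {(4, 1), (5, 1)}, "qc2": {(1, 3), (1, 4)}}
print(assignment_accepted({"qc1": (5, 1), "qc2": (3, 1)}, valid))  # True: one of two is right
print(assignment_accepted({"qc1": (1, 1), "qc2": (3, 1)}, valid))  # False: rejected
```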

Baseline
We develop our baseline system using pre-trained multilingual BERT (M-BERT; Devlin et al., 2019). M-BERT was trained on the entire Wikipedia dumps of 104 languages, and the WordPiece (Wu et al., 2016) vocabulary of this model contains 110K sub-word tokens from these languages. To balance the risk of low-resource languages being either under-represented or over-fitted because of their small training resources, exponentially smoothed weighting was applied to the data during pre-training data creation and vocabulary creation. Although M-BERT was trained on monolingual data from different languages, it is capable of multilingual generalization in code-switching scenarios (Pires et al., 2019).
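The effect of exponentially smoothed weighting can be illustrated with a short sketch: each language's share of the data is raised to a power S below 1 and the results are re-normalized, which shifts sampling probability from large languages to small ones. The exponent 0.7 is the value documented for multilingual BERT; the corpus sizes and the function name below are made-up illustrations.

```python
def smoothed_sampling_probs(sizes, s=0.7):
    """Exponentiate each language's data share by s and re-normalize."""
    total = sum(sizes.values())
    powered = {lang: (n / total) ** s for lang, n in sizes.items()}
    norm = sum(powered.values())
    return {lang: p / norm for lang, p in powered.items()}


# Toy corpus sizes: a high-resource and a low-resource language.
probs = smoothed_sampling_probs({"en": 900, "is": 100})
print({lang: round(p, 3) for lang, p in probs.items()})
```

In this toy case the low-resource language is sampled more often than its raw 10% share, while the high-resource language is sampled less often than its raw 90% share.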
We use the Transformers (Wolf et al., 2019) library to implement our framework, and we fine-tune the pre-trained BERT-Base, Multilingual Cased model separately for each of the two language pairs. Based on our observations on the training split of each dataset, we set the maximum sequence length to 40 and 56 tokens for Spanglish and Hinglish, respectively. Then, we fine-tune the model for three epochs using the AdamW (Loshchilov and Hutter, 2019) optimizer (η = 2e-5).
Table 2: Examples of labeled tweets. Code-mixing often refers to the juxtaposition of linguistic units from two or more languages in a single conversation or sometimes even a single utterance. These examples emphasize the fact that people do not only do phrase- or tag-mixing, as was believed in the linguistic forum until now.
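The baseline fine-tuning setup described in this section can be sketched with the Transformers library as follows. Only the model, the sequence lengths, the number of epochs, the optimizer, and the learning rate come from the text; the class and function names, the use of the `Auto*` classes, and the padding strategy are assumptions, and details such as batch size or learning-rate schedule are not stated in the paper and are therefore left out.

```python
from dataclasses import dataclass

# Hugging Face hub identifier of BERT-Base, Multilingual Cased.
MODEL_NAME = "bert-base-multilingual-cased"


@dataclass(frozen=True)
class BaselineConfig:
    max_seq_length: int
    num_labels: int = 3          # negative, neutral, positive
    epochs: int = 3
    learning_rate: float = 2e-5


CONFIGS = {
    "spanglish": BaselineConfig(max_seq_length=40),
    "hinglish": BaselineConfig(max_seq_length=56),
}


def build(language):
    """Load tokenizer, model, and optimizer for one dataset (downloads weights)."""
    # Imported lazily so the configuration can be inspected without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    cfg = CONFIGS[language]
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=cfg.num_labels
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)
    return cfg, tokenizer, model, optimizer


def encode(tokenizer, tweets, cfg):
    """Tokenize a list of tweets, truncating and padding to the configured length."""
    return tokenizer(
        tweets,
        truncation=True,
        padding="max_length",
        max_length=cfg.max_seq_length,
        return_tensors="pt",
    )
```

A separate call to `build` for each dataset mirrors the fact that the model is fine-tuned independently for Hinglish and Spanglish.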

Participation and Top Performing Systems
We received an overwhelming response for both Hinglish and Spanglish: 61 teams submitted systems for Hinglish, 28 teams submitted systems for Spanglish, and 16 teams submitted to both. We received 33 system description papers in total. The embeddings and techniques used by the participants are tabulated in Table 5. The team names, Codalab names, and their corresponding description papers are provided in the Appendix (Table 6). We provide a summary of the top teams below (Codalab usernames are given in parentheses).

Top Hinglish Systems @ SentiMix
• KK2018 (kk2018) used pre-trained XLM-R, which was trained on 100 languages. They trained it with adversarial examples (inputs intentionally designed to cause the model to make a mistake). To create the adversarial examples, they used the formula proposed by Miyato et al. (2016), where the perturbation is created using the gradient of the loss function.
• MSR India (genius1237) used embeddings from XLM-R as inputs to a classification layer. They also did so with multilingual BERT.
• Reed (gopalvinay) fine-tuned BERT and claimed that pre-training of BERT is not of much use. They also tried bag-of-words based feedforward networks.
• BAKSA (ayushk) used XLM-R multilingual embeddings (a transformer-based masked language model trained on 100 languages) followed by an ensemble of CNN and self-attention architectures.
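To illustrate the gradient-based perturbation mentioned for KK2018 above, the sketch below uses the normalized form r = ε · g / ||g||, where g is the gradient of the loss with respect to the input. It is a toy example on a two-dimensional logistic classifier, not the team's system, which perturbs the inputs of a fine-tuned XLM-R model; all names and numbers are illustrative assumptions.

```python
import math


def loss_and_grad(x, w, y):
    """Logistic loss for a label y in {-1, +1} and its gradient w.r.t. the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = math.log1p(math.exp(-margin))
    scale = -y / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. (w . x)
    return loss, [scale * wi for wi in w]


def adversarial_perturbation(grad, epsilon):
    """Perturbation of L2 norm epsilon in the direction that increases the loss."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


x, w, y = [0.5, -1.0], [1.0, 2.0], 1
loss, grad = loss_and_grad(x, w, y)
r_adv = adversarial_perturbation(grad, epsilon=0.1)
x_adv = [xi + ri for xi, ri in zip(x, r_adv)]
adv_loss, _ = loss_and_grad(x_adv, w, y)
print(adv_loss > loss)  # True: the perturbed input is harder for the model
```

In adversarial training, the loss on such perturbed inputs is added to the training objective so that the model becomes more robust to small changes in its input representation.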

Top Spanglish Systems @ SentiMix
• XLP (LiangZhao) augmented the data using machine translation. They then used pre-trained embeddings released by Facebook Research (XLMs) (Lample and Conneau, 2019), followed by a CNN classifier or a linear classifier (fully connected layer). They optimized a weighted loss function based on the complexity of code-mixing.
• Voice@SRIB (asking28) applied multiple pre-processing steps and used an ensemble model combining CNN, self-attention, and LSTM-based models.
• Palomino-Ochoa (dpalominop) combined a transfer learning scheme based on ULMFit (Howard and Ruder, 2018) with the state-of-the-art language model BERT.
• HPCC-YNU (kongjun) used word and character embeddings as input to a BiLSTM with attention.
We report Precision (P), Recall (R), and F1 score for each class separately. In each column, the boldfaced score is the highest in that column.

Results and Analysis
In the previous section, we briefly described the top systems. Here, we group and summarize various techniques used by the systems (Codalab usernames are given in parentheses):
• Word Embedding: Participants explored three popular word embeddings: Word2Vec, GloVe, and FastText. Some participants used character embeddings. Participants also used additional resources to train their own embeddings.
• Special Mentions: Apart from common practices and architectures, quite a few participants explored interesting directions and added significant value to this endeavor. We strongly believe these directions need to be explored and discussed further:
XLP (LiangZhao) used cross-lingual embeddings, which could be an interesting approach for code-mixed language processing, where annotated data is scarce.
UPB (eduardgzaharia, clementincercel) used a capsule network with a biGRU and showed promising results. The use of capsule networks in NLP tasks needs further exploration.
ULD@NUIG (koustava) explored an interesting phoneme-based Generative Morphemes learning approach. Sub-word based embedding is an interesting new direction in the NLP community, but the question of which sub-word unit is best remains unresolved. A morpheme-based approach could be a good alternative, especially for code-mixed data with high spelling variation.
IIT Gandhinagar (vivek IITGN) tried a new direction by generating sentences using language modeling. Language modeling for code-mixed data is still an under-researched problem.
HPCC-YNU (kongjun) used a Bilingual Vector Gating Mechanism. Vector gating has had some success in document classification applications, but its use in other areas of NLP demands further exploration.
Will go (will go) used BERT and pseudo labeling. Pseudo labeling can be a useful strategy for code-mixed languages, especially when annotated data is scarce.
kk2018 (kk2018) reported unique ways to apply adversarial training to code-mixed data. They obtained very good results.
LIMSI UPV (somban) proposed a way to merge RNN and CNN architectures to improve sentiment analysis. This could be an interesting direction to explore in the future.
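Pseudo labeling, mentioned above for the Will go team, is commonly implemented by letting a trained model label unannotated text and keeping only its confident predictions as extra training data. The sketch below shows that generic selection step. It is not the team's implementation; the function name, the 0.9 threshold, and the toy classifier are assumptions.

```python
def pseudo_label(unlabeled, predict_proba, threshold=0.9):
    """Return (text, label) pairs for which the model's top probability meets the threshold.

    predict_proba: callable mapping a text to a dict of label -> probability.
    """
    selected = []
    for text in unlabeled:
        probs = predict_proba(text)
        label, confidence = max(probs.items(), key=lambda item: item[1])
        if confidence >= threshold:
            selected.append((text, label))
    return selected


def toy_model(text):
    """Stand-in for a trained classifier: confident only when it sees a smiley."""
    if ":)" in text:
        return {"positive": 0.95, "neutral": 0.04, "negative": 0.01}
    return {"positive": 0.40, "neutral": 0.35, "negative": 0.25}


tweets = ["We all here waiting pa ke juege mex :)", "My phone looks ratchet todo crack"]
print(pseudo_label(tweets, toy_model))
```

The selected pairs would then be added to the labeled training set before the model is trained again.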

Conclusion and Future Work
SentiMix, the sentiment analysis task for code-mixed tweets at SemEval 2020, received an overwhelming response for both Hinglish and Spanglish: 61 teams submitted systems for Hinglish and 28 teams submitted systems for Spanglish. The best performance achieved was a 75.0% F1 score for Hinglish and 80.6% for Spanglish. We received a total of 33 system description papers. BERT-like models were the most successful among participants. Although the SentiMix task mainly focused on sentiment analysis, the data will serve the NLP community and anyone interested in the code-mixing problem, both for these particular languages and in general. Properly annotated code-mixed data is still scarce. The success of SentiMix motivates us to go further and organize similar events in the future. We plan to add more languages, especially from regions that have a high percentage of bi- or multilingual speakers. We also plan to enrich our datasets with annotations for other tasks (NER, emotion recognition, translation, etc.). We strongly believe that code-mixing is a new horizon of interest in the NLP community and needs to be further explored in the future.