MSR India at SemEval-2020 Task 9: Multilingual Models Can Do Code-Mixing Too

In this paper, we present our system for the SemEval 2020 task on code-mixed sentiment analysis. Our system makes use of large transformer-based multilingual embeddings like mBERT. Recent work has shown that these models possess the ability to solve code-mixed tasks in addition to their originally demonstrated cross-lingual abilities. We evaluate the stock versions of these models for the sentiment analysis task and also show that their performance can be improved by using unlabelled code-mixed data. Our submission (username Genius1237) achieved the second rank on the English-Hindi subtask with an F1 score of 0.726.


Introduction
The task of identifying sentiment from text is extremely important in this age where large volumes of text content are being consumed via social media. The task becomes even more interesting when it comes to bilingual communities as these communities exhibit the phenomenon of code-mixing online (Rijhwani et al., 2017).
Existing approaches to tackling this problem have mainly been based on statistical methods (Vilares et al., 2016; Patra et al., 2018). These methods have used features like n-gram counts and TF-IDF vectors along with a linear classifier. There have been very few approaches to this problem using deep learning, as the amount of labelled code-mixed data available has always been limited. Methods like the one in Pratapa et al. (2018b) train word embeddings using unlabelled code-mixed data, the availability of which is not as problematic as that of labelled data, and use these embeddings along with a recurrent neural network based model.
Recent advancements in natural language processing have shown that large transformer-based models like BERT (Devlin et al., 2019), when pre-trained on large corpora, are easily adaptable for downstream tasks with small datasets. These models even perform well in a cross-lingual manner (Conneau et al., 2018) when pre-trained on corpora spanning multiple languages. Our experiments show that these multilingual models perform well even on code-mixing tasks, having had no exposure to any code-mixing during pre-training. We use such a system to solve the code-mixed sentiment analysis problem. We also show that its performance can be improved by using a combination of generated and real code-mixed text.
The rest of the paper is organized as follows. Section 2 talks about the dataset for the task and the pre-processing applied to it. Section 3 talks about the different systems we evaluated, with Section 3.3 in particular going into how we improved the multilingual models using code-mixed data. Section 4 describes the performance of the different models and Section 5 concludes our discussion.

Dataset

Language  Train  Dev   Test
En-Es     12002  2998  3789
En-Hi     14000  3000  3000

Table 1: Dataset statistics for the two language pairs.

We make use of the language identification tool by Gella et al. (2014) to identify the Romanized Hindi sections and transliterate them to Devanagari using the Bing Translator API. The language tags provided along with the data are not used. No other pre-processing is done to the data.
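The pre-processing step can be sketched as a token-level pipeline: identify which tokens are Romanized Hindi, then transliterate only those to Devanagari. A minimal toy version is below; the word list and transliteration table are tiny illustrative stand-ins for the Gella et al. (2014) language identifier and the Bing Translator transliteration API used in the actual system.

```python
# Toy sketch of the pre-processing pipeline: route Romanized Hindi tokens
# through transliteration while leaving English tokens untouched.
# HINDI_WORDS and TRANSLIT are hypothetical stand-ins for the real
# language-ID tool and transliteration API.

HINDI_WORDS = {"bahut", "accha", "hai"}                       # toy language-ID lexicon
TRANSLIT = {"bahut": "बहुत", "accha": "अच्छा", "hai": "है"}   # toy transliteration table

def preprocess(sentence: str) -> str:
    out = []
    for token in sentence.split():
        if token.lower() in HINDI_WORDS:          # "is this Romanized Hindi?"
            out.append(TRANSLIT[token.lower()])   # transliterate to Devanagari
        else:
            out.append(token)                     # keep English tokens as-is
    return " ".join(out)

print(preprocess("movie bahut accha hai"))  # → movie बहुत अच्छा है
```

In the real system, this routing decision is made by a trained word-level language identifier rather than a fixed lexicon.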

System Description
Figure 1 describes the model used for sentiment analysis. The model is a classification model that comprises a pretrained transformer-based multilingual embedding (like BERT) and a linear layer acting as a classification head. The embedding takes in a tokenized sentence and outputs a single embedding for that sentence. This embedding is then run through the linear layer, which outputs scores for each of the 3 classes. The entire system was implemented using the Huggingface Transformers library (Wolf et al., 2019). We experimented with different models for the embedding. We also experimented with different pooling techniques used to obtain the sentence embedding; these are detailed below. Finally, as a baseline, we report the results from the method in Pratapa et al. (2018b), using Word2vec embeddings trained on code-mixed data along with a BiLSTM.
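The classification head described above can be sketched in a few lines. The dimensions and weights below are toy values chosen for illustration; the real system feeds 768-dimensional mBERT/XLM-R sentence embeddings through this kind of linear layer.

```python
# Minimal sketch of the classification head: a sentence embedding from the
# multilingual encoder is mapped by one linear layer to scores for the
# 3 sentiment classes. Toy 4-d embedding and 3x4 weight matrix.

def linear_head(embedding, weights, bias):
    """scores[c] = bias[c] + sum_d weights[c][d] * embedding[d]"""
    return [b + sum(w_d * e_d for w_d, e_d in zip(w, embedding))
            for w, b in zip(weights, bias)]

emb = [0.5, -1.0, 0.25, 2.0]          # toy sentence embedding
W = [[1.0, 0.0, 0.0, 0.0],            # class 0 (e.g. negative)
     [0.0, 1.0, 0.0, 0.0],            # class 1 (e.g. neutral)
     [0.0, 0.0, 0.0, 1.0]]            # class 2 (e.g. positive)
b = [0.0, 0.0, 0.0]

scores = linear_head(emb, W, b)
predicted = max(range(3), key=lambda c: scores[c])  # argmax over class scores
print(scores, predicted)  # → [0.5, -1.0, 2.0] 2
```

During training, these scores would be passed through a softmax and optimized with cross-entropy loss against the gold sentiment labels.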

Multilingual Embeddings
Multilingual BERT (mBERT) (Devlin et al., 2019) is a transformer-based model that is pre-trained on a corpus comprising 104 languages. It performs well on cross-lingual tasks like XNLI and was taken as our baseline model. A more recent model, XLM-RoBERTa (XLM-R) (Conneau et al., 2019), has been shown to outperform BERT on many cross-lingual tasks. It differs from BERT in the type of tokenization it uses and the amount of data it is pre-trained on. Table 2 contains a list of differences between the two models. We use the bert-base-multilingual-cased model for BERT and the xlm-roberta-base model for XLM-R.

Sentence Embedding Technique
The aforementioned multilingual models output one embedding per input token. These need to be pooled together to obtain a sentence embedding to use for the sequence classification task. There have been multiple works proposing different methods to obtain a sentence embedding from BERT (Reimers and Gurevych, 2019; Wang and Kuo, 2020). The two most popular (and simplest) methods are performing average pooling over the embeddings of every token or using the embedding of the first token ([CLS] token in the case of BERT, <s> in the case of XLM-R). We evaluate and report the performance of both methods.
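The two pooling strategies can be illustrated on toy per-token embeddings (the encoder outputs one vector per token; the real vectors are 768-dimensional):

```python
# Sketch of the two sentence-embedding strategies compared in this work.

token_embeddings = [
    [1.0, 2.0],   # first token ([CLS] for BERT, <s> for XLM-R)
    [3.0, 4.0],
    [5.0, 6.0],
]

def first_token_pooling(embs):
    # Use the first token's embedding as the sentence representation.
    return embs[0]

def average_pooling(embs):
    # Element-wise mean over all token embeddings.
    n = len(embs)
    return [sum(e[d] for e in embs) / n for d in range(len(embs[0]))]

print(first_token_pooling(token_embeddings))  # → [1.0, 2.0]
print(average_pooling(token_embeddings))      # → [3.0, 4.0]
```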

Finetuning Multilingual Embeddings on Code-Mixed Data
There have been multiple works proposing techniques to create domain-specific versions of models like BERT (Sun et al., 2019; Alsentzer et al., 2019). Khanuja et al. (2020) showed that when models like mBERT are finetuned on synthetic and non-synthetic code-mixed data, they perform much better on downstream code-mixed tasks. Along these lines, we finetune both mBERT and XLM-R with code-mixed data on the masked language modeling task. We follow a two-stage curriculum, first finetuning on a large corpus of 2 million generated (synthetic) code-mixed sentences and then on a smaller corpus of 90,000 real (non-synthetic) code-mixed sentences. The curriculum followed and synthetic sentences generated are based on the technique in Pratapa et al. (2018a). We create one model each for En-Es and En-Hi, finetuned on code-mixed data from that pair. We call these Modified mBERT and Modified XLM-R.
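At the core of both curriculum stages is the masked-language-modeling objective: a fraction of tokens is replaced with a [MASK] symbol and the model is trained to predict the originals. The toy sketch below performs only the masking step; the 15% rate follows standard BERT practice and is an assumption here, not a value stated in this paper.

```python
# Toy sketch of MLM masking: replace ~15% of tokens with [MASK] and record
# the original tokens as prediction targets. Real MLM also sometimes keeps
# or randomizes masked tokens; this sketch shows only the basic idea.
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets.append(tok)      # the model must recover this token
        else:
            masked.append(tok)
            targets.append(None)     # no loss computed at this position
    return masked, targets

sentence = "yeh movie bahut acchi hai I loved it".split()
masked, targets = mask_tokens(sentence)
```

Finetuning first on the large synthetic corpus and only then on the small real corpus lets the model see plentiful (if imperfect) code-mixed text before adapting to the scarcer genuine distribution.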

Results and Analysis
The results are presented in Tables 3 and 4. Each table contains F1 scores averaged over 5 different seeds. For all runs, a batch size of 64 was used along with the Adam optimizer with a learning rate of 5e-5. Each batch was constructed to have an equal number of samples from all 3 classes. Training was performed for 10 epochs. Right away, we observe that the stock versions of mBERT and XLM-R, which are not exposed to any form of code-mixing during their pre-training, show impressive F1 scores. This is discussed further in Section 4.2. We present an analysis of the sentence embedding techniques first.
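The class-balanced batching described above can be sketched as follows. This is a toy reconstruction under stated assumptions (per-class pools sampled without replacement, 64 // 3 = 21 examples per class), not the exact sampler used in the experiments.

```python
# Sketch of class-balanced batching: each batch of 64 draws an (almost)
# equal number of examples from the 3 sentiment classes.
import random

def balanced_batch(pools, batch_size=64, seed=0):
    """pools: dict mapping class label -> list of examples."""
    rng = random.Random(seed)
    per_class = batch_size // len(pools)      # 64 // 3 = 21 per class
    batch = []
    for label, examples in pools.items():
        batch.extend((label, ex) for ex in rng.sample(examples, per_class))
    rng.shuffle(batch)                        # mix classes within the batch
    return batch

pools = {
    "negative": [f"neg{i}" for i in range(100)],
    "neutral":  [f"neu{i}" for i in range(100)],
    "positive": [f"pos{i}" for i in range(100)],
}
batch = balanced_batch(pools)
counts = {label: sum(1 for l, _ in batch if l == label) for label in pools}
print(counts)  # → {'negative': 21, 'neutral': 21, 'positive': 21}
```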

Sentence Embedding Methods
Both sentence embedding methods experimented with are shown as separate columns in Tables 3 and 4. Using average pooling does bring improvements in some cases, mainly on the Dev sets, but the corresponding Test set numbers are not better. The embedding of the first token ([CLS]/<s>) in the final layer is computed as a weighted sum over the embeddings of all the tokens in the (n-1)-th layer. Given such a mechanism, the embedding of the first token may be able to capture enough information over all the tokens of the sentence and is able to perform as well as the average pooling method for a simple sequence classification task. Our results are in line with those in Wang and Kuo (2020), where most simple downstream tasks do not see big differences in performance between the two embedding methods, with only more complex sentence similarity or probing tasks showing the average pooling method to perform better.
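The weighted-sum mechanism behind this argument can be shown in miniature. The sketch below ignores the value and output projections of real self-attention and uses toy, uniform attention logits, so it only illustrates the point that the first token's output mixes information from every token of the previous layer.

```python
# Toy illustration: the first token's final-layer embedding is a
# softmax-weighted sum over ALL token embeddings of the previous layer.
import math

prev_layer = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # (n-1)-th layer token embeddings
logits = [0.0, 0.0, 0.0]                           # toy attention scores (uniform)

weights = [math.exp(s) for s in logits]
total = sum(weights)
weights = [w / total for w in weights]             # softmax -> [1/3, 1/3, 1/3]

first_token_out = [sum(w * e[d] for w, e in zip(weights, prev_layer))
                   for d in range(2)]
print(first_token_out)  # each dimension mixes information from every token
```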

Finetuning on Code-Mixed Data
Both mBERT and XLM-R performing well on these tasks is impressive. Finetuning these models with code-mixed data improves upon the performance of the stock models. We observe an improvement in almost all cases, ranging from 1-5%. Our results resonate with those in Khanuja et al. (2020), suggesting that most code-mixed tasks can be solved by simply using multilingual embeddings like mBERT, finetuning them on any available code-mixed data if better performance is needed.

Class-Wise Performance Analysis
We take the best performing model on the test set (Stock XLM-R for both tasks) and analyse the class-wise precision, recall and F1-scores. These are depicted in Tables 5 and 6. Given that training was done with data balanced across the 3 classes, similar performance across them is expected. This is observed in the En-Hi task, with all 3 classes having precision and recall within a small range. Similar numbers are observed between the dev and test sets too. However, when it comes to the En-Es test set, there is a big gap between the classes. The precision value for the neutral class is extremely low, and this affects the overall F1 scores. Interestingly, this gap in scores is not present on the dev set, suggesting that there is some aspect of the test set that the model is unable to learn from the train set during training.

Conclusion
In this paper, we present our system for the SemEval 2020 task on code-mixed sentiment analysis. We make use of multilingual models like mBERT and show that they work well for code-mixing tasks. The best performance is extracted from these models by finetuning them on code-mixed data and using this version instead of their stock versions. We also find that for simple sequence classification tasks, the choice of sentence embedding technique does not have a significant impact on the result. There are multiple paths for further exploration of this work. While finetuning mBERT on code-mixed data, we created one model per language pair and used a relatively small amount of data (compared to the amount of data BERT is pretrained on). Both of these could be looked into: creating a single model for multiple language pairs, and using much more data for this purpose. In this process, one may be able to obtain a universal model that works for a large number of code-mixed pairs in addition to the large number of languages that mBERT already supports.