XLP at SemEval-2020 Task 9: Cross-lingual Models with Focal Loss for Sentiment Analysis of Code-Mixing Language

In this paper, we present an approach to sentiment analysis of code-mixed language on Twitter, as defined in SemEval-2020 Task 9. Our team (referred to as LiangZhao) employs different multilingual models with a weighted loss that focuses on the complexity of code-mixing in a sentence; the best model achieved an F1 score of 0.806 and ranked 1st in the Sentimix Spanglish subtask. We analyze the performance of the method and demonstrate each component of our architecture.


Introduction
Sentiment analysis is the area of research concerned with the automatic comprehension of subjective information in user-generated data, which helps to capture views on certain topics. The rise of social media such as micro-blogs (e.g., Twitter) and the trend toward global communication have accelerated the use of multilingual expressions, drawing attention to code-mixing behavior (Patwa et al., 2020). For developing cross-lingual encoders that can encode any sentence into a shared embedding space through monolingual transfer learning, multilingual extensions of pretrained encoders (Lample et al., 2019) have been shown to be effective.
Code-mixed text is more complicated than cross-lingual text, so it is crucial to consider the complexity of texts written in several different languages, because different types of integration correlate with different social contexts (Gualberto A. et al., 2016). Sometimes a user may post in a non-native language with grammar mistakes, or may prefer to express sentiment in the native language. This phenomenon has encouraged researchers to analyze sentiment in multilingual code-mixed texts. Spanish and English share many words with Latin roots, but sometimes words with the same origin take a separate path in each language, or words with different origins resemble each other by coincidence yet have different meanings. For example, éxito in Spanish means success; it resembles exit in English but differs in meaning and sentiment. In this task, the number of words from each language varies dramatically from sentence to sentence. Intuitively, the language that has a bigger presence in the tweet would carry the sentiment of the sentence. To tackle the problem, we adopt a focal loss computed from the ratio of each language in the code-mixed text.
The rest of the paper is structured as follows: Section 2 provides the detailed implementation method. Section 3 presents the experiment settings as well as the results and performance of our models. Concluding remarks and future directions of our work are summarized in Section 4.
2 Implementation details
2.1 Preprocessing
Deep learning models normally have a simple data processing pipeline, but the data in this task is very messy. We therefore use a more detailed method tailored to the characteristics of code-mixed data (URLs, emoji, hash symbols, etc.). First, user mentions and URLs are removed because they are not useful for sentiment prediction. Special tokens such as "RT", which marks a re-tweet, are also deleted. Moreover, we remove the hash symbol from hashtags, as it can be problematic for tokenizers. Non-text symbols such as emoji and emoticons are transformed to text using the Python emoji library (emoji, 2019) and an emoticon dictionary from Wikipedia (List of emoticons, 2020), respectively. Next, all characters are converted to lowercase and stop words are removed. Afterwards, we employ fastBPE to generate and apply BPE codes, obtaining the post-BPE vocabulary from the vocabulary of the XLM model for 100 languages, including Hindi, Spanish, and English. The sentence length is limited to 256, which is enough for nearly all of the tweets after processing.
Figure 1: Illustration of our model
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
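The cleaning steps described in the preprocessing paragraph can be sketched roughly as follows. This is only an illustration using the standard library: the lookup tables `EMOTICONS` and `STOP_WORDS` and the function name `clean_tweet` are hypothetical stand-ins, emoji conversion (done with the emoji library in the paper) is left out, and the BPE step and length limit are not shown.

```python
import re

# Hypothetical, tiny lookup tables for illustration only. The paper relies on
# the `emoji` Python library and a Wikipedia emoticon list instead.
EMOTICONS = {":)": "smile", ":(": "sad", ":D": "laugh"}
STOP_WORDS = {"the", "a", "an", "el", "la", "de"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"\bRT\b")


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps in the order given in the text."""
    text = URL_RE.sub(" ", text)       # drop URLs
    text = MENTION_RE.sub(" ", text)   # drop user mentions
    text = RT_RE.sub(" ", text)        # drop the re-tweet marker
    text = text.replace("#", "")       # keep the hashtag word, drop the symbol
    tokens = []
    for tok in text.split():
        tok = EMOTICONS.get(tok, tok)  # emoticon -> text, before lowercasing
        tok = tok.lower()
        if tok not in STOP_WORDS:
            tokens.append(tok)
    return " ".join(tokens)
```

Emoticons are mapped before lowercasing so that case-sensitive forms such as ":D" are still matched.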

Data augmentation
To obtain more training data, and based on the statistics of the dataset, we use machine translation (Sennrich et al., 2016) to generate additional text and boost performance. The original code-mixed text is translated into the target language, Spanish, and both the source sentences and the translated sentences are then mixed to train a model.
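As a minimal sketch of this mixing step, the function below appends a translated copy of each training example to the original data. The `translate` callable is a hypothetical placeholder for whatever machine translation system is used, and two choices are assumptions of this sketch rather than details from the paper: labels are carried over unchanged, and translations identical to the source are skipped.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (tweet text, sentiment label)


def augment_with_translation(
    data: List[Example],
    translate: Callable[[str], str],
) -> List[Example]:
    """Return the original examples plus a translated copy of each one.

    `translate` is a hypothetical stand-in for a machine translation system
    mapping a code-mixed tweet to Spanish; labels are assumed to be unchanged.
    """
    augmented = list(data)
    for text, label in data:
        translated = translate(text)
        if translated != text:  # assumption: skip copies that add nothing new
            augmented.append((translated, label))
    return augmented
```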

Tested architectures
2.3.1 Pre-trained Models for Feature Encoding
To extract effective representation features of a tweet, two state-of-the-art pre-trained sentence embedding models are used. Details are given below.
• XLMs: We use the pretrained embeddings made available by Facebook Research (Lample et al., 2019), which are trained in an unsupervised way that relies only on monolingual data, and which support 100 languages including English and Spanish. After an XLM model is fine-tuned on the training corpus, it is still able to make accurate predictions at test time on code-mixed languages for which there is not enough training data. This approach is usually referred to as "zero-shot cross-lingual classification". Based on the pretrained XLM model, the sentence is indexed by the vocabulary and then fed independently into the pretrained transformer model, which is also optimized during training. A single column of the last hidden layer of the transformer is used as the sentence representation and fed into a projection layer that applies a linear transformation. For the CNN model, in contrast, all columns of the last hidden layer are used as the sentence embedding.
• MUSE: MUSE provides multilingual embeddings based on fastText (Conneau et al., 2017), available in different languages, in which words are mapped into the same vector space across languages. We use the average of the representations of all words in a sentence, which is also updated during training.
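The averaging used for the MUSE sentence representation can be illustrated with a small sketch. The function name `average_embedding` and the handling of out-of-vocabulary words (ignored, with a zero vector if no word is known) are assumptions made here for illustration; the fine-tuning of the embeddings during training is not shown.

```python
from typing import Dict, List


def average_embedding(
    tokens: List[str],
    vectors: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the word vectors of all known tokens in a sentence."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Assumption: fall back to a zero vector when no token is in the vocabulary.
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]
```

Because the multilingual vectors share one space, English and Spanish words in the same tweet can be averaged together directly.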

Output layer
Two output layers are examined with MUSE and XLM respectively: CNN-based and linear-layer-based.
• Linear Classifier: The pretrained embeddings are fed directly into a linear layer (also referred to as a fully connected layer), followed by a softmax to obtain the final predictions.
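A linear classifier of this kind can be sketched in a few lines of plain Python. The function names and the explicit weight matrix layout (one row per class) are illustrative assumptions; in practice a deep learning framework would provide these layers.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_classifier(
    features: List[float],
    weights: List[List[float]],  # one row of weights per class
    bias: List[float],
) -> List[float]:
    """Fully connected layer followed by softmax, returning class probabilities."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)
```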

Optimized loss
As analyzed above, non-native English speakers may misuse English because of cultural differences, a limited vocabulary, and so on. A monolingual corpus in Spanish will therefore express sentiment more accurately than a multilingual one. On the other hand, the quality of monolingual samples may be reduced by errors in the augmented data produced by translation. From an analysis of the training and test corpora, we also found that the percentage of each language (e.g., English and Spanish) is biased. According to the statistics, Spanish words are about twice as frequent as English words in the training dataset, and almost three times as frequent in the validation and test data. The test data also contain 560 monolingual sentences, of which half are in English and the other half in Spanish. In this case, the model is prone to learn unbalanced semantic information.
To benefit from these samples and focus the model on the majority language, we weight the loss L_W according to the complexity of code-mixing (Gambäck et al., 2014). In the formula, β is the percentage of Spanish words in a sentence, CE is the original cross entropy, γ > 0, and α is a constant positive scaling factor. To better explore the trend, the weighted loss with different hyper-parameters is shown in Figure 2.
γ is a focusing parameter that controls the loss. Larger values of γ correspond to larger losses for sentences with low code-mixing complexity. When γ < 1, the model tends to learn from the multilingual data; conversely, when γ > 1, it is more likely to learn from the monolingual data. When γ equals 1, the loss reduces to the default cross entropy.
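The exact formula is not reproduced in this text, so the sketch below uses a hypothetical weighting, L_W = α · γ^|2β − 1| · CE, chosen only because it matches the qualitative behavior described here: it equals the plain cross entropy at γ = 1 (with α = 1), gives nearly monolingual sentences (β close to 0 or 1) a larger loss when γ > 1, and a smaller one when γ < 1. It should not be read as the authors' actual loss; all function names are illustrative.

```python
import math
from typing import List


def cross_entropy(probs: List[float], target: int) -> float:
    """Cross entropy of one example given predicted class probabilities."""
    return -math.log(probs[target])


def mixing_weight(beta: float, alpha: float = 1.0, gamma: float = 1.0) -> float:
    """ASSUMED weight alpha * gamma ** |2 * beta - 1| (not the paper's formula).

    |2 * beta - 1| is 0 for an evenly mixed sentence and 1 for a monolingual one.
    """
    if alpha <= 0 or gamma <= 0:
        raise ValueError("alpha and gamma must be positive")
    return alpha * gamma ** abs(2.0 * beta - 1.0)


def weighted_loss(
    probs: List[float],
    target: int,
    beta: float,
    alpha: float = 1.0,
    gamma: float = 1.0,
) -> float:
    """Cross entropy scaled by the assumed code-mixing weight."""
    return mixing_weight(beta, alpha, gamma) * cross_entropy(probs, target)
```

Under this assumed form, γ = 0.25 down-weights a fully Spanish sentence by a factor of four relative to an evenly mixed one, which is one way of steering training toward highly code-mixed data.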

Experiments and Results evaluation
3.1 Dataset
The Spanish subtask of SemEval-2020 Task 9 is to predict the sentiment of a given code-mixed tweet. The sentiment labels are positive, negative, or neutral, and the code-mixed languages are English and Spanish. Besides the sentiment labels, language labels at the word level are also provided. The word-level language tags are en (English), spa (Spanish), hi (Hindi), mixed, and univ (e.g., symbols, @ mentions, hashtags).
Table 1: Performance metrics of different models on the validation and test sets. The validation F1 scores are averaged over ten runs with different random seeds and are used to choose hyper-parameters; the test scores are generated by using the trained model to predict on the released labeled test data.
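The word-level language tags make it straightforward to compute the share of Spanish words in a tweet, the quantity β used by the weighted loss. The sketch below is one possible way to do this; counting only tokens tagged en or spa in the denominator, and returning 0.5 when neither occurs, are assumptions of this illustration and are not stated in the paper.

```python
from typing import List


def spanish_ratio(tags: List[str]) -> float:
    """Share of Spanish words among the tokens tagged `en` or `spa`."""
    n_spa = sum(1 for t in tags if t == "spa")
    n_en = sum(1 for t in tags if t == "en")
    total = n_spa + n_en
    if total == 0:
        # Assumption: treat tweets without en/spa tokens as evenly mixed.
        return 0.5
    return n_spa / total
```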

Experiment Setup and Results
Hyper-parameter optimization is performed using a simple grid search. All models are trained for 10 epochs with a batch size of 8 and an initial learning rate of 0.000005 using the Adam optimizer. Dropout with a probability of 0.5 is applied to the linear layers. Unless otherwise stated, default settings are used for the other parameters. In searching for the optimal architecture and parameters, we experimented with a CNN and a fully connected layer (marked as FC), each combined with MUSE and XLM.
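A grid search over α and γ with scores averaged across random seeds could look roughly like the sketch below. The training settings in `CONFIG` are taken from the text, while `train_and_eval` is a hypothetical callable that trains one model and returns its validation F1 score; the candidate grids themselves are not specified in the paper and must be supplied by the caller.

```python
import itertools
import statistics
from typing import Callable, Dict, Sequence, Tuple

# Training settings as stated in the text.
CONFIG = {
    "epochs": 10,
    "batch_size": 8,
    "learning_rate": 5e-6,
    "optimizer": "adam",
    "dropout": 0.5,
}


def grid_search(
    alphas: Sequence[float],
    gammas: Sequence[float],
    seeds: Sequence[int],
    train_and_eval: Callable[[float, float, int], float],
) -> Tuple[Tuple[float, float], Dict[Tuple[float, float], float]]:
    """Return the (alpha, gamma) pair with the best seed-averaged score."""
    results = {}
    for alpha, gamma in itertools.product(alphas, gammas):
        scores = [train_and_eval(alpha, gamma, seed) for seed in seeds]
        results[(alpha, gamma)] = statistics.mean(scores)
    best = max(results, key=results.get)
    return best, results
```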
To explore and compare the parameters α and γ, Figure 3 shows an obvious increasing tendency of the F1 score until α > 1.5 when γ <= 1.0. The score is highest at γ = 0.25 and second highest at γ = 1.0, which indicates that the optimal parameters favor data with a high level of code-mixing. Based on the validation results, we therefore select the best model among those trained with γ = 0.25 or γ = 1.0. The scores are summarized in Table 1. The XLM model with a fully connected layer performed best when γ = 0.25, and from its class-wise scores we conclude that the model performs best on positive samples and worst on neutral samples. This result may be caused by the unbalanced distribution of the data and the complexity of code-mixing, for example if positive sentiment is expressed mainly in one specific language. The CNN-based model did not show a significant increase in performance compared to the linear classifier.
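For reference, class-wise F1 scores of the kind discussed here can be computed from gold and predicted labels as in the following sketch. The function name is illustrative, and how the class-wise scores are averaged into a single number is not stated in this text, so no averaging is shown.

```python
from typing import Dict, List, Sequence


def f1_per_class(
    gold: List[str],
    pred: List[str],
    labels: Sequence[str],
) -> Dict[str, float]:
    """F1 score for each label, using F1 = 2TP / (2TP + FP + FN)."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores
```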

Conclusion
In this paper, we have introduced a novel approach that trains different multilingual models with a weighted loss focused on the complexity of code-mixed sentences, for the sentiment analysis task in SemEval-2020. The method is effective in situations where the distribution of different languages is unbalanced, and it gives better control over the language preference for sentiment according to how strongly the languages are mixed. Moreover, we conclude that the quality of the word representations used has a significant impact on the performance of a model. The results indicate the potency of XLM for code-mixed classification, leading to a 4-5% increase in F1 score compared to MUSE. In the future, we will continue to optimize the model and also try ensemble models.