Palomino-Ochoa at SemEval-2020 Task 9: Robust System Based on Transformer for Code-Mixed Sentiment Classification

We present a transfer learning system to perform a mixed Spanish-English sentiment classification task. Our proposal uses the state-of-the-art language model BERT and embed it within a ULMFiT transfer learning pipeline. This combination allows us to predict the polarity detection of code-mixed (English-Spanish) tweets. Thus, among 29 submitted systems, our approach (referred to as dplominop) is ranked 4th on the Sentimix Spanglish test set of SemEval 2020 Task 9. In fact, our system yields the weighted-F1 score value of 0.755 which can be easily reproduced — the source code and implementation details are made available.


Introduction
Sentiment Analysis is one of the most active research areas in Natural Language Processing (NLP), web mining and social media analytics (Tang et al., 2016). In particular, its polarity detection task has been extensively researched since 2002 (Liu, 2012). Polarity detection involves to determine whether a given text, containing an opinion, is positive, negative or neutral. In order to solve this task, several dictionary-based methods were proposed in the past (Liu, 2012). However, Machine Learning (ML) approaches have been the ones with more impact regarding the-state-of-the-art (Liu, 2015). Moreover, traditional ML approaches based on feature engineering and modelling have been consistently improved by Deep Learning methods (Zhang et al., 2018).
As a rule, to apply Deep Learning in sentiment analysis input text should be encoded in some way, for instance using word embeddings. Word embeddings are useful because allow us to encode semantic similarities among words. One can obtain a word vector representation by training a large corpus using algorithms such as Word2vec (Mikolov et al., 2013), Glove (Pennington et al., 2014) and FastText (Bojanowski et al., 2017), to name a few.
Nowadays, this kind of text representation has evolved to a language model encoding. The idea is to use language context in order to better encode words and characters. Overall, the aim of this encoding is to transfer the knowledge embodied in the language model to address a specific task.
Thus, in this paper we focus on transfer learning: we make use of a pre-trained language model and then we applied it in sentiment classification (The aim is to predict the correct sentiment classification of a given code-mixed tweet). In order to do so, our work combines two powerful frameworks: the Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder, 2018) and BERT (Devlin et al., 2018).
ULMFiT is the basis for our transfer learning strategy and it has proven to have an impressive performance on several English text classification tasks-It also has obtained optimal results for the Spanish language (Palomino and Ochoa-Luna, 2019). On the other hand, BERT is currently the-state-of-the-art language model for NLP. Hence, we use a reduced multilingual language model (BERT) which is further fine-tuned (ULMFiT) in order to classify code-mixed tweets. We evaluate our system on the Sentimix Spanglish test set of SemEval 2020 Task 9. Our submission (referred to as dpalominop) obtained the weighted-F1 value of 0.755 and was ranked 4th among 29 teams.

Background
The aim of the task 9 (Patwa et al., 2020) is to predict the correct sentiment classification of a given code-mixed tweet. Each sentiment label can be one of three options: positive, negative and neutral.
There are two code-mixed languages provided on this track: English-Hindi and English-Spanish. In this work we are only approached the code-mixed English-Spanish task.
Train, evaluation and test datasets are available and contain tweets in CoNLL-U format (Buchholz and Marsi, 2006). Thus, every tweet word is tagged accordingly: en (English), spa (Spanish), hi (Hindi), mixed and univ (e.g. symbols, @ mentions, hashtags, etc). The whole tweet is tagged with the corresponding sentiment label. During the competition only train and evaluation datasets were labeled.

System Overview
In this section, we present the system design choices that allow us to predict the sentiment of a given code-mixed tweet.
Overall, our strategy is as follows, due to a small dataset is provided we plan to use a pre-trained language model. By doing so, we aim at extracting features and context words from a large corpus in order to "transfer" this knowledge to our small dataset. Consequently, we only should perform a fine-tuning step regarding the task at hand.
Several language models have been proposed in the last years (McCann et al., 2017;Peters et al., 2018;Howard and Ruder, 2018), but the one with the highest impact in the-state-of-the-art has been BERT (Devlin et al., 2018). BERT is a powerful language model that uses a bidirectional representation from unlabeled data by jointly conditioning on both left and right context words. In order to train this model, Devlin et al. (2018) propose a novel training task called Masked Language Model (MLM). In this task, some input tokens are randomly masked in order to predict vocabulary identifiers using context, i.e., remaining words of the sentence. In addition, the work proposes the Next Sentence Prediction (NSP) training task that jointly pre-trains text-pairs representations. To do so, the whole text is parsed in several batches of two consecutive phrases. Then, the first one is used to predict the second one. With these two training tasks, BERT has outperformed previous state of the art results on several NLP challenges.
Transfer learning approaches based on language models for NLP tasks have been proposed in the past (Peters et al., 2018;Howard and Ruder, 2018;Devlin et al., 2018). Arguably, the simplest proposal for performing text classification using language models is the Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder, 2018). Such work was extended to Spanish sentiment analysis with remarkable results (Palomino and Ochoa-Luna, 2019).
Our proposal uses a base BERT language model and embed it within a ULMFiT transfer learning pipeline. The resulting system is depicted in Figure 1.

Classification Layer
ULMFiT fine-tuning Pre-training Task Layer Swapping last layer Figure 1: General system pipeline. W x is the input to the system and O x is the output before the pre-training task layer (MLM / NSP).
The components are described as follows: 1. BERT is the base language model (LM) which is trained on a general domain corpus to capture general features of the language through several layers. Regarding the original ULMFiT pipeline, the LSTM-based LM has been changed by a BERT LM.
2. The last layer used in BERT to perform NSP pre-training is changed by a single classification layer with the same number of outputs as labels provided on the SEMEVAL challenge (Patwa et al., 2020).
3. Fine-tuning is performed as in the original ULMFiT pipeline. The target task (sentiment analysis) is tuned through gradual unfreezing, discriminative fine-tuning (Discr), and slanted triangular learning rates (STLR) (Howard and Ruder, 2018). The aim is to preserve low-level representations and adapt to high-level ones. In our context, the sentiment analysis classifier is fine-tuned using the provided labeled English-Spanish tweets.
So far, we have presented the main pipeline of our system, but research on BERT models have produced several variants such as BASE, LARGE and MULTILINGUAL which rely on the number of parameters, layers and languages, accordingly. As discussed in the original paper (Devlin et al., 2018), the best choice to use in small and medium datasets is the BASE version because the lower number of parameters to fine-tune.
However, the code-mixed tweets aimed to classify are written in English and Spanish. In this sense, the best choice would be a multilingual version but, experimental results have shown that a multilingual LM performs worst than a single LM 1 .
Regarding those results, our design choice has been to use a multilingual model with few languages (including at least English and Spanish) which we referred to as reduced multilingual.
In the next section we describe the pre-processing data, system configuration as well as the choice of the language model used in this challenge.

Experimental Setup
A complete description about hardware and software requirements for reproducing this paper is detailed in this section. In addition, we show the hyper-parameters tuned during experimentation, e.g. the learning rate that allow us to converge without overfitting and regularization.

Technical Resources
All experiments were carried out on Jupyter notebooks running Python 3.7 kernel and PyTorch 1.3.1. For a detailed explanation about dependencies, please refer to the public project repository 2 .

Datasets
The data is splitted as follows. The training dataset: 12002 labeled tweets, the validation dataset: 2998 labeled tweets and the testing dataset: 3789 unlabeled tweets. Each sentiment label can be one of three options: positive, negative and neutral.

Pre-Processing
All the datasets were pre-processed according to the following rules: 1. Every tweet structure was converted to plain text and tagging each word.
2. Text was converted to lowercase and every accent mark was removed.
3. Repeated characters were replaced to single characters. 4. User references, hashtags and useless spaces were removed.

Pre-trained Language Model
In order to accomplish the constraints presented in section 3, the pre-trained language model used was bert-base-multilingual-uncased-sentiment 3 which was published in public official repository of Huggingface Co.

Fine-tuned Language Model
The main hyper-parameters used through the fine-tuning process are:

Results
Results for SemEval 2020 Task 9 Competition are reported in Table 1. Our submission (referred to as dpalominop) was ranked 4th among 29 systems (weighted-F1 score).

Team
Score 1  Several techniques were tested before finding our solution. Since we have used a transfer learning approach, one first challenge was to find out the language model that best fit our multilingual constraints. After a thorough analysis, we decided to use a multilingual language model (bert-base-multilingualuncased-sentiment). This model employs a small number of languages in its pre-training process. Furthermore, several pre-processing data and fine-tuning techniques were also tested.
We have also performed some ablation experiments regarding our proposal steps so as to understand the impact of the design choices in our results. This analysis is presented in Table 2. In short, although data pre-processing and fine-tuning increase performance, the language model choice have achieved the greatest results.

Weighted-F1
Our proposal Using reduced multilingual BERT base 0.695 w/pre-processing data 0.713 w/pre-processing data + ulmfit fine-tuning 0.755 Changing pre-trained language model Using Google multilingual BERT base 0.613 w/pre-processing data 0.636 w/pre-processing data + ulmfit fine-tuning 0.681 Table 2: Ablation using different pre-trained language models. "w/" denotes "with".

Conclusion
We have presented a transfer learning approach to tackle a mixed Spanish-English sentiment classification task. In order to do so, we have combined a transfer learning scheme based on ULMFiT with the-state-ofthe-art language model BERT. That approach allowed us to be ranked 4th on the Sentimix Spanglish test set of SemEval 2020 Task 9. Furthermore, we have demonstrated that a reduced multilingual language model performs better than the one supporting several languages. Moreover, we have discussed the impact of a correct fine-tuning using a discriminative process on each layer regardless that some of them remain frozen during training.