eMLM: A New Pre-training Objective for Emotion Related Tasks

BERT has been shown to be extremely effective on a wide variety of natural language processing tasks, including sentiment analysis and emotion detection. However, the proposed pre-training objectives of BERT do not induce any sentiment- or emotion-specific biases into the model. In this paper, we present Emotion Masked Language Modeling, a variation of Masked Language Modeling aimed at improving the BERT language representation model for emotion detection and sentiment analysis tasks. Using the same pre-training corpora as the original model, Wikipedia and BookCorpus, our BERT variation improves the downstream performance on four emotion detection and sentiment analysis tasks by an average of 1.2% in F1. Moreover, our approach shows increased performance in our task-specific robustness tests.


Introduction
Language models have been studied extensively in the NLP community (Dai and Le, 2015; Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019), with approaches attaining state-of-the-art results on multiple token-level or sentence-level tasks. BERT (Devlin et al., 2019) is a pre-trained language model which proposed a new pre-training objective inspired by the Cloze task (Taylor, 1953) and enables the training of a deep bi-directional transformer network. This objective, called Masked Language Modeling (MLM), is applied to large amounts of unlabeled data from Wikipedia and BookCorpus to produce powerful universal language representations. However, the pre-training does not take into account the downstream task on which the model will be applied.
In this paper, we posit that we can leverage the characteristics of a downstream task to design better task-tailored pre-training objectives. Concretely, we induce information from emotion or sentiment lexicons into our BERT pre-training objective to improve the performance on tasks from sentiment analysis and emotion detection.
There are numerous studies that focus on emotion detection (Demszky et al., 2020; Desai et al., 2020; del Arco et al., 2020; Sosea and Caragea, 2020; Majumder et al., 2019; Mohammad and Kiritchenko, 2018; Abdul-Mageed and Ungar, 2017; Mohammad and Kiritchenko, 2015; Mohammad, 2012; Strapparava and Mihalcea, 2008) and sentiment analysis (Yin et al., 2020; Tian et al., 2020; Phan and Ogunbona, 2020; Zhai and Zhang, 2016; Chen et al., 2016; Liu, 2012; Glorot et al., 2011; Pang and Lee, 2005). Various lexicons have been used to improve model performance on these tasks. For instance, Katz et al. (2007) used occurrences of emotion words to identify various emotion types in news headlines. Moreover, emotion lexicons have been used to produce important features that can be fed into a machine learning algorithm to improve the performance on emotion detection tasks (Mohammad, 2012; Sykora et al., 2013; Khanpour and Caragea, 2018; Biyani et al., 2014). In this paper, in contrast, instead of leveraging these lexicons to design features, we use them to obtain language representations that are more suitable for emotion and sentiment tasks.
To this end, we introduce Emotion Masked Language Modeling (eMLM), a new pre-training objective for BERT (Devlin et al., 2019) aimed at improving the BERT performance on tasks related to sentiment analysis and emotion detection. Inspired by the well-known Masked Language Modeling objective, eMLM adds only a few simple, yet powerful changes. Instead of uniformly masking the tokens in the input sequence, eMLM leverages lexicon information and assigns higher masking probabilities to words that are more likely to be important in sentiment or emotion contexts. To enable a fair comparison with the vanilla BERT model, we train the eMLM BERT model in the same fashion as the vanilla BERT, pre-training on Wikipedia and BookCorpus (Zhu et al., 2015). To our knowledge, we are the first to study different masking probabilities for the BERT pre-training procedure guided by sentiment and emotion lexicons. Similar to our work, some studies have focused on incorporating sentiment information into pre-trained language models. For example, Yin et al. (2020) built an attention network on top of BERT to predict sentiment labels of phrase nodes obtained through a constituency parse tree. On the other hand, Tian et al. (2020) designed various pre-training objectives, such as masking and predicting all words from a pre-defined small set of seeds, and predicting an aspect-sentiment pair or the polarity of words. In contrast, we leverage information from available sentiment and emotion lexicons.
We show the feasibility of our approach by testing eMLM on two sentiment analysis benchmark datasets and two emotion detection datasets. These datasets span diverse domains, such as movie reviews, online health communities, and Reddit discussions, enabling a comprehensive analysis of eMLM.
Our contributions are as follows: 1) We introduce a new pre-training objective for BERT (leveraging available lexicons), aimed at producing better task-guided universal representations for downstream tasks from sentiment analysis and emotion detection. We offer the pre-trained model as an easy way to leverage our approach in downstream applications. 2) We show the efficacy of our approach by testing our method on four benchmark datasets for emotion and sentiment and obtain an average improvement in F1 score of 1.2%. 3) We verify the robustness of our model in the face of input perturbations, which occur frequently in informal contexts (e.g., due to misspellings).

Proposed Approach
Background Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) is a pre-trained language model trained on large amounts of unlabeled data using two objectives: 1) Masked Language Modeling (MLM) randomly masks 15% of the tokens in a sequence, followed by a supervised prediction of the masked tokens; 2) Next Sentence Prediction (NSP) predicts in a binary fashion whether two sentences follow each other. By using these two tasks on large-scale data repositories such as BookCorpus (800M words) (Zhu et al., 2015) and Wikipedia (2,500M words), BERT produces powerful universal language representations, applicable to a wide range of tasks, such as sentiment analysis, question answering, and commonsense reasoning.
However, to be used on various downstream tasks, BERT has to undergo a task-specific fine-tuning step (Devlin et al., 2019), where the contextualized embeddings are adapted to the target task. We posit that we can improve the downstream performance by focusing on the target task in the pre-training phase as well. Specifically, we focus on sentiment analysis and emotion detection, and show that task-guided unsupervised pre-training helps the performance considerably.
Masking Emotion Words We now introduce Emotion Masked Language Modeling (eMLM), a variation of MLM targeted at inducing emotion- or sentiment-specific biases in the BERT pre-training phase. Specifically, unlike BERT, which uses a uniform probability (15%) to mask the tokens in an input sentence, we assign higher probabilities to tokens that are emotionally rich words from an available lexicon L. We denote this probability by k, which is a hyperparameter in our eMLM method. Our masking process can be summarized as follows. Given an input sentence S: 1) We extract the words that belong to the lexicon L and denote them by E; 2) We set the masking probability of these words as P(w_e) = k, ∀ w_e ∈ E; 3) To ensure we mask 15% of the words in total, we lower the masking probability of the non-emotionally-rich words using the following formula: P(w_n) = (0.15 · |S| − k · |E|) / (|S| − |E|), ∀ w_n ∈ S \ E, where | · | represents the size of a set. We show examples of how our masking probabilities change from MLM to eMLM in Table 1. For instance, in the first example, there are two emotion words, perfect and hope, and we use a masking probability of k = 0.50. While the probabilities of these two words are set to 50%, the non-emotionally-rich word probability is lowered from 15% to 9% to keep the expected number of masked words constant. The rest of the training process is the same as the original BERT pre-training. That is, we train our BERT model from scratch using eMLM and NSP on the same datasets: Wikipedia and BookCorpus. We mention that we use whole word masking, both for eMLM and MLM (i.e., we mask all the subtokens corresponding to a word).
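The masking procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: function names such as emlm_masking_probs are our own, the sketch operates on whole words rather than subword tokens, and a real pre-training pipeline would additionally apply whole-word masking over subtokens.

```python
import random

MLM_RATE = 0.15  # overall expected fraction of masked tokens, as in BERT


def emlm_masking_probs(tokens, lexicon, k=0.5):
    """Assign per-word masking probabilities for eMLM.

    Emotion/sentiment words found in the lexicon get probability k;
    the remaining words get a lowered probability so that the
    expected fraction of masked words stays at 15%.
    """
    emotion_idx = {i for i, t in enumerate(tokens) if t.lower() in lexicon}
    n, e = len(tokens), len(emotion_idx)
    if e == 0 or e == n:
        return [MLM_RATE] * n  # fall back to uniform MLM masking
    # Solve k*|E| + p*(|S|-|E|) = 0.15*|S| for the non-emotion probability p
    p_rest = max((MLM_RATE * n - k * e) / (n - e), 0.0)
    return [k if i in emotion_idx else p_rest for i in range(n)]


def mask_tokens(tokens, lexicon, k=0.5, mask_token="[MASK]"):
    """Apply eMLM masking to a list of words."""
    probs = emlm_masking_probs(tokens, lexicon, k)
    return [mask_token if random.random() < p else t
            for t, p in zip(tokens, probs)]
```

For a 14-word sentence with two lexicon words and k = 0.50, the non-emotion probability comes out to (0.15 · 14 − 0.5 · 2) / 12 ≈ 9%, matching the example discussed above.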

Experiments and Results
In this section, we first describe our experimental setup (§3.1), then present our datasets and lexicons (§3.2), and finally discuss the results that contrast eMLM with the original BERT MLM (§3.3).

Experimental Setup
We use various benchmark datasets from sentiment analysis and emotion detection to test our eMLM approach. For every dataset considered, we use the provided training, validation, and test splits.
To assess statistical significance, we fine-tune each model 10 times with different random seeds and report the average F1 score. We investigate various masking probabilities k, ranging from 0.2 to 1.0, and find that 0.5 works best in our setting. For low values around 0.2, we notice that the performance is similar to that of the original BERT, while for high values (closer to 1.0), the performance is negatively affected.

Datasets and Lexicons
We test our models on various benchmark datasets described below.
Stanford Sentiment Treebank (SST) (Socher et al., 2013) SST contains 11,855 sentences from movie reviews, annotated with five sentiment labels: negative, somewhat negative, neutral, somewhat positive, and positive. First, we consider the binarized dataset, called SST-2, where the examples with the negative and somewhat negative labels are merged into a negative class, and the examples with the somewhat positive and positive labels are merged into a positive class (with the neutral class removed). Second, we consider the SST fine-grained version (SST-5), which uses all five labels.
GoEmotions (Demszky et al., 2020) is a sentence-level multilabel dataset of 58,000 comments curated from Reddit and annotated with 27 emotion categories and the neutral class.
CancerEmo (Sosea and Caragea, 2020) is a sentence-level multilabel dataset of 8,500 sentences collected from an Online Health Community for people suffering from diseases such as cancer, labeled with the eight basic emotions of Plutchik (1980).
We analyze the behavior of eMLM in diverse environments: sentiment analysis or emotion detection, various data platforms (e.g., Reddit, OHCs), and varying emotion or sentiment granularity (from two classes to as many as 28 classes).
Lexicons As mentioned above, our eMLM focuses on emotionally rich words from a lexicon.
In this paper, we use EmoLex (Mohammad and Turney, 2013), a lexicon of 6,000 words associated with the eight Plutchik basic emotions (Plutchik, 1980) (sadness, anger, joy, surprise, anticipation, trust, fear, disgust) and 5,555 words associated with the positive and negative sentiments. We consider the sentiment and emotion words separately to analyze the impact of each on the performance of eMLM. We denote the approach which masks the emotion-revealing words by eMLM (E), and the sentiment-revealing words by eMLM (S).
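As an illustration, the split into the eMLM (E) and eMLM (S) word sets could be obtained as follows. This is a hypothetical sketch, not the authors' code; it assumes the tab-separated word/category/flag layout commonly distributed for the NRC lexicon, which may differ from the exact file used in the paper.

```python
SENTIMENT_CATEGORIES = {"positive", "negative"}


def load_emolex(path):
    """Split a lexicon file into emotion words (used by eMLM (E)) and
    sentiment words (used by eMLM (S)).

    Assumes one association per line: word<TAB>category<TAB>flag,
    where flag "1" means the word carries that category.
    """
    emotion_words, sentiment_words = set(), set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) != 3 or parts[2] != "1":
                continue  # skip malformed lines and zero-flag entries
            word, category = parts[0], parts[1]
            if category in SENTIMENT_CATEGORIES:
                sentiment_words.add(word)
            else:
                emotion_words.add(word)
    return emotion_words, sentiment_words
```

Either returned set can then be passed as the lexicon L that drives the higher masking probability k during pre-training.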

Results on Sentiment Analysis
We show the results of our approaches on SST in Table 2. First, we observe that eMLM (E) and eMLM (S) improve upon the vanilla BERT model on both tasks, with eMLM (E) obtaining as much as a 1.7% improvement in F1. Interestingly, eMLM (E) outperforms eMLM (S), suggesting that masking finer-granularity emotion words in eMLM produces better representations for the task. At the same time, eMLM (E) achieves better performance on the fine-grained SST-5 task, where the improvements over the vanilla BERT are considerable.

Results on Emotion Detection
We show the results of eMLM on the GoEmotions dataset in Table 3 and observe that, similar to sentiment analysis, eMLM (E) is the best performing approach, improving upon vanilla BERT by 1.4% in F1. We show the results on CancerEmo in Table 4 and observe the same pattern: eMLM (E) consistently outperforms the other approaches. We see improvements as high as 4% on Joy and 2% on Sadness. Overall, eMLM (E) obtains a 1.7% F1 improvement over the vanilla BERT model.

Discussion
The presented results demonstrate the feasibility of our proposed approach. Our BERT model trained using the eMLM objective produces high-quality contextualized embeddings for downstream tasks spanning sentiment analysis and emotion detection. Moreover, our methods incur no additional computational cost over the original BERT (Devlin et al., 2019) and undergo the same amount of pre-training. We also tried combining and masking both sentiment and emotion words; however, we did not see any performance improvements. As a step forward, we are interested in gaining more insights into the differences between eMLM (E) and the vanilla BERT model. We study this in the robustness context in the next section, and analyze how our models behave in the face of various input perturbations (i.e., noise).

Varying the Emotion Masking Probability k
To offer additional insights into our eMLM approach and show the impact of the sentiment- or emotion-rich word masking probability on downstream tasks, we show the results obtained using various values of k in Table 5. First, we note that using a slightly lower probability of 0.30 still adds improvements to our model on three of the considered datasets. In contrast, too high of a probability hurts the F1 performance. Concretely, using k = 0.90, our eMLM approach decreases the F1 compared to the vanilla BERT by 1% on CancerEmo, 5% on GoEmotions, and 1% on SST-2.

Robustness Test
It has been shown that neural models are often sensitive to various input perturbations (Niu et al., 2020; Belinkov and Bisk, 2018). In this section, we aim to investigate the robustness of our proposed approach in the face of input noise. We focus on the following two questions: 1) Does eMLM improve the robustness of the model? 2) What type of input noise is successful in misleading our model?
We study these questions on the SST-5 sentiment analysis task using the framework introduced by Hsieh et al. (2019). We explore three ways to generate input perturbations and verify their "success." We say a perturbation is "successful" on a model M for an example e if 1) The model M classifies e correctly and 2) The model M misclassifies the example e when noise is applied to it. Naturally, the lower the perturbation success rate, the more robust a model is. The perturbations that we considered are as follows: 1. Random (Alzantot et al., 2018) replaces one word from the input sentence with a random word from the vocabulary. For a word, we repeat this process 100 times. If at least one of the replacements leads to an incorrect prediction, the perturbation is deemed to be successful.
2. LIST (Alzantot et al., 2018) replaces each word (one at a time) in the input text with a synonym. The input perturbation is successful if at least one replacement leads to an incorrect prediction.

3. EmoWord: If there is an emotion word in the input sentence, we zero out that word; otherwise, we zero out a random word from the input sequence.
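The EmoWord perturbation and the "success" criterion above can be sketched as follows. This is a hypothetical re-implementation, not the authors' code: the predict argument stands in for a fine-tuned classifier, and the [PAD] placeholder is an assumed stand-in for zeroing out a word.

```python
import random


def emoword_perturb(tokens, lexicon, zero_token="[PAD]"):
    """EmoWord perturbation: zero out one emotion word if the sentence
    contains any; otherwise zero out a random word."""
    tokens = list(tokens)
    emotion_idx = [i for i, t in enumerate(tokens) if t.lower() in lexicon]
    if emotion_idx:
        target = random.choice(emotion_idx)
    else:
        target = random.randrange(len(tokens))
    tokens[target] = zero_token  # stand-in for zeroing the word embedding
    return tokens


def is_successful(predict, tokens, label, lexicon):
    """A perturbation 'succeeds' if the model classifies the clean input
    correctly but misclassifies the perturbed input."""
    return (predict(tokens) == label
            and predict(emoword_perturb(tokens, lexicon)) != label)
```

The success rate of a perturbation over a test set is then the fraction of examples for which is_successful returns True; lower is better (more robust).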

Results
We show the results of the robustness tests for the vanilla BERT and the eMLM approach in Table 6. First, EmoWord is the most successful perturbation, being twice as effective as the other methods. Second, we observe that Random and LIST obtain the same success rates for both the BERT and eMLM approaches. However, on EmoWord, our eMLM approach is considerably more robust, outperforming the vanilla BERT model by 4.4%. We argue that this is a byproduct of the eMLM training procedure, which focuses on predicting emotion words in the pre-training step.

Conclusion
In this paper, we introduced a new BERT pre-training objective suited for sentiment analysis and emotion detection tasks. We showed that the approach is feasible: it needs no additional pre-training compared to the vanilla BERT, and improves the performance by 1.2% F1 on average on various tasks. Our analysis also suggests that eMLM is more robust in the face of input perturbations. As future work, we note that our approach is general, and we plan to leverage different lexicons outside the sentiment analysis and emotion detection domains to investigate whether the model generalizes well to other domains (e.g., financial).
We also plan to study whether our method is effective for non-English languages. Finally, we note that there exist lexicons that assign to words not only their emotion, but also their emotion intensity (Mohammad, 2018). Therefore, we plan to investigate whether associating the masking probability with the emotion intensity (i.e., assigning a higher probability to a more intense word) would further improve the performance.