SpanEmo: Casting Multi-label Emotion Classification as Span-prediction

Emotion recognition (ER) is an important task in Natural Language Processing (NLP), due to its high impact in real-world applications from health and well-being to author profiling, consumer analysis and security. Current approaches to ER mainly classify emotions independently, without considering that emotions can co-exist. Such approaches overlook potential ambiguities in which multiple emotions overlap. We propose a new model, "SpanEmo", which casts multi-label emotion classification as span-prediction and can help ER models learn associations between labels and words in a sentence. Furthermore, we introduce a loss function focused on modelling multiple co-existing emotions in the input sentence. Experiments performed on the SemEval2018 multi-label emotion data set over three languages (i.e., English, Arabic and Spanish) demonstrate our method's effectiveness. Finally, we present different analyses illustrating the benefits of our method in terms of improving model performance and learning meaningful associations between emotion classes and words in the sentence.


Introduction
Emotion is essential to human communication, thus emotion recognition (ER) models have a host of applications from health and well-being (Alhuzali and Ananiadou, 2019; Aragón et al., 2019) to consumer analysis (Alaluf and Illouz, 2019; Herzig et al., 2016) and user profiling (Volkova and Bachrach, 2016; Mohammad and Kiritchenko, 2013), among others. Interest in this area has given rise to new NLP approaches aimed at emotion classification, including single-label and multi-label emotion classification. Most existing approaches for multi-label emotion classification (Ying et al., 2019; Baziotis et al., 2018; Yu et al., 2018; Badaro et al., 2018; Mulki et al., 2018; Mohammad et al., 2018) do not effectively capture emotion-specific associations, which can be useful both for prediction and for learning associations between emotion labels and words in a sentence. In addition, standard approaches in emotion classification treat individual emotions independently. However, emotions are not independent; a specific emotive expression can be associated with multiple emotions. The existence of associations/correlations among emotions has been well-studied in psychological theories of emotion, such as Plutchik's wheel of emotions (Plutchik, 1984), which introduces the notion of mixed and contrastive emotions. For example, "joy" is close to "love" and "optimism", rather than to "anger" and "sadness".

Table 1: Example sentences with multiple ground truth (GT) emotion labels.

#  | Sentence                                                                    | GT
S1 | well my day started off great the mocha machine wasn't working @ mcdonalds. | anger, disgust, joy, sadness
S2 | I'm doing all this to make sure you smiling down on me bro.                 | joy, love, optimism

Consider S1 in Table 1, which contains a mix of positive and negative emotions, although it is more negatively oriented. This can be observed clearly via the ground truth labels assigned to this example: the first part of the sentence expresses a positive emotion (i.e., joy), while the other part expresses negative emotions. For example, clue words like "great" are more likely to be associated with "joy", whereas "wasn't working" is more likely to be associated with negative emotions. Learning such associations between emotion labels and words in a sentence can help ER models predict the correct labels. S2 further highlights that certain emotions are more likely to be associated with each other. Mohammad and Bravo-Marquez (2017) also observed that negative emotions are highly associated with each other, while less associated with positive emotions. Based on these observations, we seek to answer the following research questions: i) how to enable ER models to learn emotion-specific associations by taking into account label information, and ii) how to benefit from the multiple co-existing emotions in a multi-label emotion data set with the intention of learning label correlations. Our contributions are summarised as follows: I. a novel framework casting the task of multi-label emotion classification as a span-prediction problem. We introduce "SpanEmo" to train the model to take into consideration both the input sentence and a label set (i.e., emotion classes), selecting a span of emotion classes from the label set as the output. The objective of SpanEmo is to predict emotion classes directly from the label set and capture associations corresponding to each emotion. II. a loss function modelling multiple co-existing emotions for each input sentence.
We make use of the label-correlation aware (LCA) loss (Yeh et al., 2017), originally introduced by Zhang and Zhou (2006). The objective of this loss function is to maximise the distance between positive and negative labels, which is learned directly from the multi-label emotion data set. III. a large number of experiments and analyses at both the word- and sentence-level, demonstrating the strength of SpanEmo for multi-label emotion classification across three languages (i.e., English, Arabic and Spanish).
The rest of the paper is organised as follows: section 2 describes our methodology, while section 3 discusses experimental details. We evaluate the proposed method and compare it to related methods in section 4. Section 5 reports on the analysis of results, while section 6 reviews related work. We conclude in section 7.
Methodology

Framework

Figure 1 presents our framework (SpanEmo). Given an input sentence and a set of classes, a base encoder was employed to learn contextualised word representations. Next, a feed-forward network (FFN) was used to project the learned representations into a single score for each token. We then used the scores for the label tokens as predictions for the corresponding emotion labels. The green boxes at the top of the FFN illustrate the positive label set, while the red ones illustrate the negative label set for multi-label emotion classification. We now turn to describing our framework in detail.

Our Method (SpanEmo)
Let {(s_i, y_i)}_{i=1}^{N} be a set of N examples with corresponding emotion labels from C classes, where s_i denotes the input sentence and y_i ∈ {0, 1}^{|C|} represents the label set for s_i. As shown in Figure 1, both the label set and the input sentence were passed into the BERT encoder (Devlin et al., 2019). The encoder received two segments: the first corresponds to the set of emotion classes, while the second refers to the input sentence. The hidden representations H_i ∈ R^{T×D} for each input sentence and label set were obtained as follows:

H_i = Encoder([CLS] c_1 ... c_{|C|} [SEP] s_i),

where {[CLS], [SEP]} are special tokens, c_1 ... c_{|C|} are the emotion class tokens and |C| denotes the number of emotion classes. Feeding both segments to the encoder has a few advantages. First, the encoder can interpolate between emotion classes and all words in the input sentence. Second, a hidden representation is generated for both words and emotion classes, which can be further used to assess whether the encoder learns associations between the emotion classes and words in the input sentence. Third, SpanEmo is flexible because its predictions are produced directly from the first segment, corresponding to the emotion classes.
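The two-segment layout can be sketched in plain Python. This is a minimal, hypothetical illustration of the input arrangement only; in the actual model, BERT's WordPiece tokeniser produces the tokens, and the class names shown are the SemEval2018 English labels.

```python
# Hypothetical sketch: segment one carries the emotion class names,
# segment two carries the input sentence.
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
            "optimism", "pessimism", "sadness", "surprise", "trust"]

def build_input(sentence, classes=EMOTIONS):
    """Lay out '[CLS] c_1 ... c_|C| [SEP] sentence [SEP]' for the encoder.

    Returns the token sequence and the positions of the class tokens,
    whose hidden states are later scored as span predictions.
    """
    tokens = ["[CLS]"] + list(classes) + ["[SEP]"] + sentence.split() + ["[SEP]"]
    label_positions = list(range(1, 1 + len(classes)))  # one slot per emotion
    return tokens, label_positions

tokens, label_positions = build_input("well my day started off great")
```

Because every class occupies a fixed position in the first segment, the model's span predictions map one-to-one onto the original emotion labels.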
We further introduced a feed-forward network (FFN) consisting of a non-linear hidden layer with a Tanh activation, f_i(H_i), as well as a position vector p_i ∈ R^D, which was used to compute a dot product between the output of f_i and p_i. As our task involved multi-label emotion classification, we added a sigmoid activation to determine whether class i was the correct emotion label or not. It should be mentioned that the use of the position vector is quite similar to how start and end vectors are defined in transformer-based models for question answering. Finally, the span-prediction tokens were obtained from the label segment and then compared with the ground truth labels, since there is a 1-to-1 correspondence between the label tokens and the original emotion labels.
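The scoring step, sigmoid(p · tanh(W h + b)) per token, can be sketched as follows. This is a toy illustration with hand-picked dimensions and weights; in practice the hidden size and parameters come from the trained model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def span_scores(hidden_states, W, b, p):
    """Score each token as sigmoid(p . tanh(W h + b)).

    hidden_states: list of token vectors (T x D); W: D' x D weight rows;
    b: bias of length D'; p: position vector of length D'.
    """
    scores = []
    for h in hidden_states:
        # Non-linear hidden layer with Tanh activation.
        z = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
             for row, bias in zip(W, b)]
        # Dot product with the position vector, squashed to a probability.
        scores.append(sigmoid(sum(zi * pi for zi, pi in zip(z, p))))
    return scores

# Toy example: 3 tokens with D = 2, projected to D' = 2.
H = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
W, b, p = [[0.1, 0.2], [0.3, -0.1]], [0.0, 0.0], [1.0, -1.0]
probs = span_scores(H, W, b, p)  # one probability per token
```

Only the scores at the label-token positions are kept as the model's predictions; the remaining token scores are discarded.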

Label-Correlation Aware (LCA) Loss
Following Yeh et al. (2017), we employed the label-correlation aware loss, which takes a vector of true binary labels y and a vector of predicted probabilities ŷ as input:

L_LCA(y, ŷ) = (1 / (|y^0| |y^1|)) Σ_{(p,q) ∈ y^1 × y^0} exp(ŷ_q − ŷ_p),

where y^0 denotes the set of negative labels, y^1 denotes the set of positive labels, and ŷ_p represents the p-th element of ŷ. The objective of this loss function is to maximise the distance between positive and negative labels by implicitly retaining the label-dependency information. In other words, the model should be penalised when it predicts a pair of labels that should not co-exist for a given example.
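A direct translation of this loss into plain Python might look as follows (a sketch for a single example; the degenerate case with no positive or no negative labels is simply returned as zero here):

```python
import math

def lca_loss(y_true, y_prob):
    """Label-correlation aware loss for one example.

    Averages exp(y_hat_q - y_hat_p) over all (positive p, negative q)
    label pairs, so ranking positives confidently above negatives
    drives the loss towards zero.
    """
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    if not pos or not neg:
        return 0.0
    total = sum(math.exp(y_prob[q] - y_prob[p]) for p in pos for q in neg)
    return total / (len(pos) * len(neg))

good = lca_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2])
bad = lca_loss([1, 0, 1, 0], [0.1, 0.9, 0.2, 0.8])
assert good < bad  # correct ranking of labels yields a lower loss
```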

Training Objective
To model label correlation, we combined the LCA loss with binary cross-entropy (BCE) and trained them jointly. This aimed to help the LCA loss focus on maximising the distance between positive and negative label sets, while at the same time taking advantage of the BCE loss through maximising the probability of the correct labels. We experimentally observed that training our approach jointly with these two loss functions produced the best results. The overall training objective was computed as follows:

L = (1 − α) L_BCE + α L_LCA,

where α ∈ [0, 1] denotes the weight used to control the contribution of each loss to the overall objective.
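The joint objective can be sketched as below. This is an illustrative per-example implementation, not the paper's PyTorch code; α = 0.2 is the value selected on the validation set in section 5.4.

```python
import math

def bce_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the label set."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical safety
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def lca_loss(y_true, y_prob):
    """Label-correlation aware loss: mean exp(y_hat_neg - y_hat_pos)."""
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    if not pos or not neg:
        return 0.0
    pairs = [math.exp(y_prob[q] - y_prob[p]) for p in pos for q in neg]
    return sum(pairs) / len(pairs)

def joint_loss(y_true, y_prob, alpha=0.2):
    """(1 - alpha) * BCE + alpha * LCA, as in the training objective."""
    return (1 - alpha) * bce_loss(y_true, y_prob) + alpha * lca_loss(y_true, y_prob)
```

Setting α = 0 recovers plain BCE training; α = 1 trains with the LCA loss alone, matching the two bounds analysed in section 5.4.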

Implementation Details
We used PyTorch (Paszke et al., 2017) for implementation and ran all experiments on an Nvidia GeForce GTX 1080 with 11 GB memory. We trained BERT BASE using the open-source HuggingFace implementation (Wolf et al., 2019). For the Arabic experiments, we chose "bert-base-arabic" developed by Safaya et al. (2020), while for Spanish we selected "bert-base-spanish-uncased" developed by Cañete et al. (2020). All three models were trained with the same hyper-parameters and a fixed initialisation seed, including a feature dimension of 768, a batch size of 32, a dropout rate of 0.1, an early-stop patience of 10 and a maximum of 20 epochs. Adam (Kingma and Ba, 2014) was selected for optimisation, with a learning rate of 2e-5 for the BERT encoder and 1e-3 for the FFN.
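For reference, the reported settings can be collected into a configuration sketch. All values are taken from the text above; the seed value itself is not reported, so it is deliberately omitted here.

```python
# Hyper-parameters as reported in the implementation details.
CONFIG = {
    "hidden_dim": 768,          # BERT-base feature dimension
    "batch_size": 32,
    "dropout": 0.1,
    "early_stop_patience": 10,
    "max_epochs": 20,
    "lr_encoder": 2e-5,         # Adam learning rate for the BERT encoder
    "lr_ffn": 1e-3,             # Adam learning rate for the FFN head
    "alpha": 0.2,               # LCA/BCE mixing weight (see section 5.4)
}
```

The two learning rates reflect the common fine-tuning practice of updating the pre-trained encoder more conservatively than the randomly initialised head.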
It should be mentioned that we tuned our method only on the validation set and further report on the analysis of the effect of parameter α in section 5.4.

Data Set and Task Settings
In this work, we chose the SemEval2018 data set (Mohammad et al., 2018) for our multi-label emotion classification experiments; it is based on labelled data from tweets in English, Arabic and Spanish. The data was originally partitioned into three sets: training (Train), validation (Valid) and test (Test). Following the metrics in Mohammad et al. (2018), we evaluate using micro F1-score, macro F1-score and the Jaccard index score. To pre-process the data, we utilised a tool designed for the specific characteristics of Twitter, i.e., misspellings and abbreviations (Baziotis et al., 2017). The tool offers different functionalities, such as tokenisation, normalisation, spelling correction and segmentation. We used it to tokenise the text, convert words to lower case, and normalise user mentions, URLs and repeated characters.
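A rough approximation of these normalisation steps is sketched below. This is an illustration only; the actual tool (ekphrasis) performs far richer tokenisation, spelling correction and word segmentation, and the placeholder tokens `<user>` and `<url>` are assumptions about its output format.

```python
import re

def normalise_tweet(text):
    """Lower-case text and normalise mentions, URLs and repeated characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "<url>", text)   # normalise URLs
    text = re.sub(r"@\w+", "<user>", text)          # normalise user mentions
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)      # cap character runs at two
    return text

normalise_tweet("@Bob check https://x.co soooo")
```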

Multi-label Emotion Classification
We compared the performance of SpanEmo to some baseline as well as state-of-the-art models on all three languages. For experiments related to English, we selected seven models, while we chose three models for both Arabic and Spanish. We also include the results of BERT BASE .

English
English models include JBNN (He and Xia, 2018), DATN (Yu et al., 2018), NTUA (Baziotis et al., 2018), RERc (Zhou et al., 2018), BERT BASE +DK (Ying et al., 2019), BERT BASE -GCN (Xu et al., 2020) and LEM (Fei et al., 2020). JBNN introduces a joint binary neural network, which focuses on learning the relations between emotions based on the theory of Plutchik's wheel of emotions (Plutchik, 1980), and then performs multi-label emotion classification by integrating these label relations into the loss function. DATN proposes a dual attention transfer network to improve multi-label emotion classification with the help of sentiment classification, while NTUA was the top-ranked model in the SemEval2018 competition, relying on different pre-training and fine-tuning strategies. RERc defines a ranking emotion-relevant loss focused on incorporating emotion relations into the loss function to improve both emotion prediction and the ranking of relevant emotions. Both BERT BASE +DK and BERT BASE -GCN utilise the same encoder as our own, with the former considering additional domain knowledge (DK) and the latter capturing emotion relations through a Graph Convolutional Network (GCN). LEM introduces a latent emotion memory network, in which the latent emotion module learns emotion distributions via a variational autoencoder, while the memory module captures features corresponding to each emotion.

Arabic
Arabic models consist of EMA (Badaro et al., 2018), Tw-StAR (Mulki et al., 2018) and HEF (Alswaidan and Menai, 2020). EMA is the best performing model from the SemEval2018 competition on this set. It utilises various pre-processing steps (e.g. diacritics removal, normalisation, emoji transcription and stemming), as well as different classification algorithms. The Tw-StAR model applies some pre-processing steps and then uses TF-IDF to learn features for a Support Vector Machine. HEF is based on a hybrid neural network, including different word embeddings (e.g. Word2Vec, GloVe, FastText) as well as variants of recurrent neural networks, such as Long Short-Term Memory and Gated Recurrent Units.

Spanish
Spanish models comprise Tw-StAR (Mulki et al., 2018), ELiRF (González et al., 2018) and MILAB (Mohammad et al., 2018). The ELiRF model applies some pre-processing steps while also adapting the tweet tokeniser for Spanish tweets. MILAB is the best performing model from the SemEval2018 shared task on this set. Table 4 presents the performance of our proposed approach (SpanEmo) on all three languages, in terms of micro F1-score (miF1), macro F1-score (maF1) and Jaccard index score (jacS), and compares it to the baseline and state-of-the-art models discussed in section 3.3.

As shown in Table 4, our method outperformed all models across all three languages, as well as on almost all metrics, with improvements of up to 1-1.3% for English, 1.9-3.6% for Arabic and 6.3-9.2% for Spanish. This demonstrates the utility and advantages of SpanEmo, as well as of the label-correlation aware loss, for improving the performance of multi-label emotion classification in English, Arabic and Spanish.
Based on the empirical results reported in Table 4, the following observations can be made. First, incorporating the relations between emotions into the models tends to lead to higher performance, especially for macro F1-score. For example, both DATN and LEM learn emotion-related features and achieve better performance than NTUA and BERT BASE +DK. Additionally, ELiRF makes use of various sentiment/emotion features (i.e., learned from lexica) and yielded the best performance among the three compared models. This corroborates our earlier hypothesis that learning emotion-specific associations is crucial for improving performance. Although BERT BASE +DK adopts the same encoder as our own and adds domain knowledge, our method still performs strongly, especially for the macro F1- and Jaccard scores, with improvements of up to 2.9% and 1%, respectively. In short, capturing emotion-specific associations, as well as integrating the relations between emotions into the loss function, helped SpanEmo achieve the best results compared with all models on almost all metrics.

Ablation Study
To understand the effect of our framework, we undertook an ablation study of model performance under three settings: firstly, the model was trained only with the BCE loss; secondly, it was trained only with the LCA loss; and thirdly, it was trained without the label segment. The third setting is equivalent to training the model as a simple multi-label classification task, considering only the input sentence. When SpanEmo was trained without the LCA loss, the results dropped by 1-2% for the macro F1- and Jaccard scores. In addition, the results of SpanEmo dropped by 1-2% for two metrics apart from the macro F1-score when trained without the BCE loss. However, the removal of the label segment led to a much larger drop of 3-6%. The same patterns were also observed in the Arabic and Spanish experiments. This supports our earlier hypothesis that casting the task of multi-label emotion classification as span-prediction is beneficial for improving both the representation and the performance of multi-label emotion classification.

Prediction of Multiple Emotions
We additionally validated the effectiveness of our method in learning multiple co-existing emotions on the English, Arabic and Spanish sets. Table 6 presents the results, including BERT BASE . Since BERT BASE is trained only with the binary cross-entropy (BCE) loss, we also include the results of our method trained only with this loss function. SpanEmo handled multi-label emotion classification much better than BERT BASE and still achieved consistent improvements as the number of co-existing emotions increased, showing the usefulness of our method in learning multiple emotions. Improvement can clearly be observed for the English and Arabic experiments, but not as much for Spanish. This may be attributed to the high percentage of single-label data, which is around 40% for Spanish, while it is lower for both English and Arabic. Notably, SpanEmo can be used without the LCA loss and still obtain decent performance. Nevertheless, training our method jointly with the LCA loss leads to better results.
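The Jaccard index used throughout these comparisons can be sketched as follows: the intersection over union of the predicted and gold label sets per example, averaged over the data set. This is a generic sketch, not the paper's evaluation code.

```python
def jaccard_score(y_true, y_pred):
    """Multi-label Jaccard index averaged over examples.

    y_true, y_pred: lists of binary label vectors of equal length.
    """
    per_example = []
    for gold, pred in zip(y_true, y_pred):
        g = {i for i, v in enumerate(gold) if v}
        p = {i for i, v in enumerate(pred) if v}
        union = g | p
        # Convention: an empty union (no gold and no predicted labels)
        # counts as a perfect match.
        per_example.append(1.0 if not union else len(g & p) / len(union))
    return sum(per_example) / len(per_example)

score = jaccard_score([[1, 1, 0], [0, 1, 0]],
                      [[1, 0, 0], [0, 1, 0]])  # (1/2 + 1/1) / 2 = 0.75
```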

Word-Level
In this section, we present the top 10 words learned by SpanEmo for each emotion class by extracting the learned representations for each emotion class and all words in every input sentence, and then computing the similarity between them via cosine similarity. Finally, we performed this operation on all inputs in the SemEval2018 English validation set and then sorted all words for each emotion class in ascending order. Table 7 presents the top-10 words per emotion class. As shown in Table 7, the words discovered by our framework are indicative of the corresponding emotion. This helps to show that SpanEmo learns meaningful associations between emotion classes and words automatically, which can be beneficial for feature extraction and learning. Additionally, SpanEmo demonstrated that it can learn diverse words as well as shareable words across some emotions. For example, the words {pissed, wrath, smashed} are associated with both anger and disgust, demonstrating the ability of SpanEmo to learn the relations between emotions.

Emotion       | Top 10 Words
anger         | anger, pissed, wrath, idiots, dammit, kicking, irritated, thrown, smashed, complain
anticipation  | prediction, planning, mailsport, assumptions, upcoming, waiting, route, waited, frown, ideas
disgust       | disgusting, smashed, gross, hate, pissed, wrath, dirty, awful, vile, dumb
fear          | nervous, fear, terror, frightening, afraid, frown, panic, terrifying, scary, dreading
joy           | happy, excitement, joyful, congratulations, glad, delightful, excited, adorable, amusing, smiling
love          | love, sweetness, loved, hug, mate, lucky, carefree, shine, care, gracious
optimism      | optimism, integrity, salvation, persevere, perspective, bright, effort, faith, glad, lord
pessimism     | hopeless, frown, disappointed, weary, dread, despair, depressing, chronic, suicide, pain
sadness       | sadness, frown, depressing, saddened, hurt, disappointed, weary, upset, sorrow, hate
surprise      | stunned, awestruck, shocking, awe, mailsport, buster, genuinely, curious, hardly, believing
trust         | integrity, shine, respect, courage, sign, effort, confident, faith, easy, kindness

Sentence-Level
We visualised an example from the English validation set annotated with four emotions, which were anger, disgust, pessimism and sadness. Our goal was to determine whether by adding emotion classes to the example, SpanEmo could learn their associations to each other (i.e., associations between emotion classes and words in the example).
To compute the similarity between emotion classes and words in the example, we followed the same process discussed in section 5.2.1. Figure 2 presents the results, where lighter cells indicate higher similarity and darker cells indicate lower similarity. As shown in Figure 2, the learned representations capture the association between the correct emotion label set and every token in the example. Interestingly, we can also observe that the word "happy" is usually expressed as a positive emotion, but, in this context, the word becomes negative and the model learns this contextual information. Moreover, the phrase "about to join the police academy" is associated with "anticipation", which makes sense although this class is not part of the correct label set. This demonstrates the utility and advantages of our approach, not only in deriving associations reported in the annotations, but also in providing us with a mechanism to explore additional information beyond them.

Label Correlations
Since one of the research questions in this paper was to learn the multiple co-existing emotions from a multi-label emotion data set, we analysed the learned emotion correlations from SpanEmo and compared them to those adopted from the ground truth labels in the SemEval2018 validation set. Figure 3 presents the two emotion correlations as obtained from the ground truth labels and from the predicted labels, respectively.
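The two correlation matrices can be computed with plain Pearson correlation over the binary label columns, as sketched below on toy labels (the real matrices are computed over the SemEval2018 validation labels and the model's predictions, respectively).

```python
import math

def label_correlations(Y):
    """Pearson correlation between label columns of a binary matrix Y.

    Y: list of examples, each a binary label vector of length C.
    Returns a C x C matrix of pairwise correlations.
    """
    n, C = len(Y), len(Y[0])
    means = [sum(row[j] for row in Y) / n for j in range(C)]

    def corr(a, b):
        cov = sum((row[a] - means[a]) * (row[b] - means[b]) for row in Y)
        sa = math.sqrt(sum((row[a] - means[a]) ** 2 for row in Y))
        sb = math.sqrt(sum((row[b] - means[b]) ** 2 for row in Y))
        return cov / (sa * sb) if sa and sb else 0.0

    return [[corr(i, j) for j in range(C)] for i in range(C)]

# Toy labels: columns 0 and 1 always co-occur; column 2 is their opposite.
Y = [[1, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]
M = label_correlations(Y)
```

On these toy labels, the first two columns correlate perfectly while each anti-correlates with the third, mirroring the positive-versus-negative emotion pattern discussed below.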
It can be observed that Figure 3(a), obtained from the ground truth labels, confirms what the emotion annotations have revealed. Figure 3(b), which was learned by SpanEmo, also highlights that negative emotions are positively correlated with each other, and negatively correlated with positive emotions. For example, "anger" and "disgust" share almost the same patterns, which is consistent with the studies of Mohammad and Bravo-Marquez (2017) and Agrawal et al. (2018), both of which report that these two negative emotions are easily confused with each other. This is not surprising, as their manifestation in language is quite similar in terms of the use of similar words/expressions. We also noted this finding when analysing the top-10 key words learned by SpanEmo in section 5.2.1. In short, taking emotion correlations into account is crucial for multi-label emotion classification in addressing the ambiguity characteristic of the task, especially for emotions that are highly correlated.

Influence of Parameter (α)
SpanEmo was trained jointly with the BCE and LCA losses, combined via a weight α, whose impact on the results is presented in Figure 4. It should be mentioned that this analysis was performed on the validation set of the SemEval2018 data set. The lower bound (i.e., 0.0) indicates that the model was trained only with the BCE loss, whereas the upper bound (i.e., 1.0) indicates that it was trained only with the LCA loss. As the value of α increased from 0.0 to 0.5, the results first improved considerably and then gradually deteriorated, apart from the macro F1-score. The BCE loss favoured the micro F1- and Jaccard scores, whereas the LCA loss favoured the macro F1-score. However, integrating LCA with BCE balanced the results across all three metrics, resulting in strong performance. The best results were achieved on almost all metrics when α was set to 0.2. Thus, we set α to 0.2 for all experiments reported in this paper.

Related Work
There is a large body of NLP literature on emotion recognition (Mohammad and Alm, 2015). Earlier studies focused on lexicon-based approaches, which make use of words and their corresponding labels to identify emotions in text, e.g. NRC (Mohammad and Turney, 2013) and EmoSenticNet (Poria et al., 2014). Other methods treat emotion recognition as a supervised learning task, in which a learner (e.g. a linear-classifier-based method) is trained on the features of labelled data to classify inputs into one label (Bostan and Klinger, 2018; Liew et al., 2016; Tang et al., 2013; Wang et al., 2012; Aman and Szpakowicz, 2007). More recently, several neural network models have been developed for this task, obtaining competitive results on different emotion data sets. Some of these models focus on single-label emotion classification, in which only a single label is assigned to each input (Islam et al., 2019; Xia and Ding, 2019; Alhuzali et al., 2018b,a; Agrawal et al., 2018; Saravia et al., 2018; Felbo et al., 2017; Abdul-Mageed and Ungar, 2017). Other models have been proposed for multi-label emotion classification, in which one or more labels are assigned to each input (see the detailed description in section 3.3).
Our work is motivated by research focused on learning features corresponding to each emotion, as well as on incorporating the relations between emotions into a loss function (Fei et al., 2020; He and Xia, 2018). Our work differs from these two works in the following ways: i) our method learns features related to each corresponding emotion without relying on any external resources (e.g. lexicons); ii) we further integrated the relations between emotions into the loss function by taking advantage of the label co-occurrences in a multi-label emotion data set, so that our approach does not rely on any particular theory of emotion; iii) we empirically evaluated our method on three languages, demonstrating that it is language-agnostic. In contrast to previous research, we focus on both learning emotion-specific associations and integrating the relations between emotions into the loss function.

Conclusion
We have proposed a novel framework, "SpanEmo", which casts multi-label emotion classification as a span-prediction problem. We demonstrated that our proposed method outperforms prior approaches reported in the literature on three languages (i.e., English, Arabic and Spanish). Our empirical evaluation and analyses also demonstrated the utility and advantages of our method for multi-label emotion classification, specifically the addition of emotion classes to the input sentence, which helped the model learn emotion-specific associations and increased its performance. Finally, training our method jointly with the LCA loss led to better results, showing the benefits of integrating the relations between emotions into the loss function.
The standard approach to multi-label emotion classification often focuses on modelling individual emotions independently. In this respect, most existing methods do not take label dependencies into account while learning emotion-specific associations. However, we demonstrated the effectiveness of adding label information to the input sentence when training SpanEmo, helping it achieve better performance and capture emotion correlations. We hope that this study will inspire the community to further investigate the vital role of learning label dependencies and associations corresponding to each emotion.