Distributed Representations of Emotion Categories in Emotion Space

Human beings usually divide emotions into discrete categories, but it is difficult to clearly distinguish and define the boundaries between them. Existing studies on emotion detection usually focus on improving prediction performance and represent emotions with one-hot vectors. However, one-hot representations ignore the relations between emotions. In this article, we first propose a general framework to learn distributed representations for emotion categories in emotion space from a given emotion classification dataset. Furthermore, based on the soft labels predicted by a pre-trained neural network model, we derive a simple and effective algorithm. Experiments validate that the proposed representations in emotion space express emotion relations much better than word vectors in semantic space.


Introduction
In the past decades, many tasks have been proposed in the field of text emotion analysis. The most fundamental one is the emotion classification task (Alm et al., 2005). Building on emotion classification, many new tasks have been proposed from different considerations. Lee et al. (2010) proposed the task of emotion cause extraction, which aims at predicting the cause of a given emotion in a document. Based on the emotion cause extraction task, Xia and Ding (2019) introduced the emotion-cause pair extraction task for the purpose of extracting the potential pairs of emotions and corresponding causes in a document. Jiang et al. (2011) proposed a target-dependent emotion recognition task, which aims at predicting the sentiment with the given query. To express the intensity of a specific emotion in text, Mohammad and Bravo-Marquez (2017) proposed the emotion intensity detection task. However, all the above tasks treat emotions as independent and represent them with one-hot vectors, which ignores the underlying emotion relations.
Based on existing emotion detection tasks, many efforts have been made to achieve better performance (Danisman and Alpkocak, 2008; Xia et al., 2011; Kim, 2014; Xia et al., 2015; Li et al., 2018; Zong et al., 2019) and many datasets have been introduced to train and evaluate the corresponding models (Ghazi et al., 2015; Mohammad et al., 2018; Liu et al., 2019). The vast majority of existing emotion annotation work assumes that the emotions are orthogonal to each other and represents the emotion categories with one-hot vectors (Mohammad, 2012; Gui et al., 2016; Klinger et al., 2018). In fact, the boundaries as well as the relations among emotion categories are not clearly distinguished and defined.
Typical word embedding learning algorithms use only the contexts of words and ignore the sentiment of texts (Turian et al., 2010; Mikolov et al., 2013). To encode emotional information into word embeddings, sentiment embeddings and emotion(al) embeddings have been proposed (Tang et al., 2014; Yu et al., 2017; Xu et al., 2018). Tang et al. (2015) proposed a learning algorithm dubbed sentiment-specific word embedding (SSWE). Agrawal et al. (2018) proposed a method to learn emotion-enriched word embedding (EWE). However, all the above algorithms represent emotions in semantic space rather than emotion space. As shown in Table 1, each emotion category represented in semantic space reflects a piece of semantic information rather than a specific emotional state. In this work, we regard each emotion category as a specific emotional state in emotion space and represent each emotion category with a point in emotion space. Further experiments show that our representations in emotion space can express emotion relations much better than word vectors in semantic space.

Table 1
Semantic Space | Emotion Space
Each word corresponds to a point in semantic space. | Words cannot be represented in emotion space.
Emotional states cannot be represented in semantic space. | Each emotional state corresponds to a point in emotion space.
Each emotion category is encoded with a piece of specific semantic information. | Each emotion category is encoded with a specific emotional state.
From the perspective of psychology, some studies have discussed the complexity of the human emotional state (Russell, 1980; Griffiths, 2002; Fontaine et al., 2007; Clark, 2010) and the shared psychological features across emotions (Fehr and Russell, 1984; Mauss and Robinson, 2009; Campos et al., 2013). However, psychological research mainly focuses on the human emotional state itself and does not pay attention to the emotion relations hidden in text. As there are many emotion detection tasks and corresponding datasets in the NLP field, it is meaningful to investigate what relations among emotion categories are hidden in corpora. In this paper, we detect the underlying relations among emotion categories labeled in corpora from the perspective of NLP.
Distributed representations of emotion categories in emotion space can also benefit NLP applications. Take depression recognition as an example. Depression is a serious mood disorder that manifests as a complex emotional state (Blatt, 2004; Beck et al., 2014). Most existing emotion taxonomies or datasets do not contain depression as a specific category. In this article, we generate the latent encoding for each emotion category. Based on psychological research (Rottenberg, 2005; Joormann and Stanton, 2016) on the relations between depression and existing emotion categories, we can predict the distributed representation of depression in text even if there are no samples annotated as depression.
The main contributions of this work are summarized as follows:
• A general framework to learn distributed emotion representations from an emotion classification dataset is proposed for the first time. Based on soft labels predicted by the pre-trained neural network model, a simple and effective approach is derived. As far as we know, this is the first work to learn the distributed representations for emotion categories in emotion space rather than semantic space.
• Experiments have been conducted to validate the effectiveness of our emotion representations. The results show that our emotion representations in emotion space can express emotion relations much better than word vectors, and are competitive with human results.
• Emotion similarities across datasets have been detected to validate the quality of our emotion representations across corpora. The results show that our representations yield consistent emotion similarities across datasets, although the datasets are created for a variety of domains and applications.

Related Work
[...] Reddit comments for emotion prediction. However, all the above datasets are annotated with discrete basic emotion categories, which means the emotion categories are represented with one-hot vectors. One-hot representations ignore the underlying relations among emotion categories. In this work, the underlying emotion relations contained in the datasets are revealed with our emotion representations.
Soft Labels: Hinton et al. (2015) observed that it is easier to train a classifier using the soft targets output by a trained classifier as target values than using manual ground-truth labels. Phuong and Lampert (2019) provided insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers. Szegedy et al. (2016) proposed a label smoothing mechanism that encourages the model to be less confident by smoothing the initial one-hot labels. Imani and White (2018) investigated why converting hard targets to soft labels improves model performance in supervised learning. Zhao et al. (2020) proposed a robust training method for machine reading comprehension by learning soft labels. In this work, soft labels output by the trained neural network model are used to generate distributed representations for emotion categories.

Methodology
In this section, we describe how to learn the distributed representations for emotion categories. First, a general framework is proposed. Then, a simple and effective algorithm is derived based on the soft labels from a pre-trained neural network model. After that, we extend our method to multilabel datasets. Finally, the detailed steps of the algorithm are listed.

The General Framework
As shown in Table 2, the four instances from the SemEval-2007 task 14 dataset (Strapparava and Mihalcea, 2007) are annotated with both emotion categories and valence values. Although both instance 1 and instance 2 are labeled with the joy category, their valence values are very different, which means there is a big difference between their emotional states. In fact, the emotion in instance 1 seems to be more excited while the emotion in instance 2 seems to be more hopeful. On the other hand, instances 3 and 4 are annotated with the same valence value but are assigned to different categories. Fontaine et al. (2007) also found that the emotional state is high-dimensional and that the valence-arousal-dominance representation model is not sufficient to describe it.
The above examples show that the emotional states contained in different documents are not exactly the same, even if the documents are annotated with the same emotion category or valence value. In this work, we regard the set of text emotional states as an emotion space. The emotion contained in a specific document corresponds to a specific emotional state, which in turn corresponds to a point in the space. As a result, documents annotated with the same emotion category probably correspond to different emotional states and points in the space, which means the emotion category is a random variable rather than a specific vector in the space.
For category K, we define x as a sample annotated with category K and $V_K$ as the distributed representation of category K. Let V(x) be the distributed representation of sample x and p(x) be the probability density of sample x. Let Ω be the integral domain of x. We further use $L(V_K, V(x))$ as the distance function between $V_K$ and V(x). To obtain a good distributed representation for category K, we minimize the expectation of L. Thus, we obtain the following formula for the distributed representation of category K:
$$V_K = \arg\min_{V} \int_{\Omega} L(V, V(x))\, p(x)\, dx \tag{1}$$

A Simple Method
Although we cannot directly obtain the exact probability distribution of each emotion category in emotion space, there are many available emotion classification datasets, in which the instances can be regarded as samples of the corresponding annotated emotion categories. For emotion dataset D and emotion category K, we use all samples annotated as category K in the dataset to estimate the distribution of category K. Thus, we can rewrite formula 1 as:
$$V_K = \arg\min_{V} \sum_{x \in S_K} L(V, V(x)) \tag{2}$$
where $S_K$ is the set of all instances labeled with category K in dataset D.
In this paper, we use squared Euclidean distance as the distance metric between two representations. Therefore, formula 2 can be simplified as follows:
$$V_K = \arg\min_{V} \sum_{x \in S_K} \lVert V - V(x) \rVert_2^2 \tag{3}$$
By solving formula 3, we have:
$$V_K = \frac{1}{N_K} \sum_{x \in S_K} V(x) \tag{4}$$
where $N_K$ is the size of $S_K$. We have thus derived that the distributed representation of emotion category K is exactly the average of the distributed representations of all instances labeled as category K in dataset D.
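The step from formula 3 to formula 4 can be spelled out as follows. This is the standard least-squares argument; the name J(V) for the objective is introduced here only for illustration and is not notation from the paper.

```latex
\begin{align*}
J(V) &= \sum_{x \in S_K} \lVert V - V(x) \rVert_2^2, \\
\nabla_V J(V) &= 2 \sum_{x \in S_K} \bigl(V - V(x)\bigr)
              = 2 N_K V - 2 \sum_{x \in S_K} V(x), \\
\nabla_V J(V) = 0 \;&\Longrightarrow\; V_K = \frac{1}{N_K} \sum_{x \in S_K} V(x).
\end{align*}
```

Because J is strictly convex in V, this stationary point is the unique minimizer.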
We now discuss how to obtain the distributed representation for the instances in the dataset. As shown in Figure 1, the output of the neural network model is a soft label regardless of the specific architecture of the model. It has been verified that soft labels output by the trained model tend to have higher entropy and contain more information than manual one-hot labels (Hinton et al., 2015; Phuong and Lampert, 2019). Inspired by previous work on soft labels, we directly take the soft labels output by the trained neural network model as the distributed representation of the input instance. As a result, the dimension of $V_K$ is equal to the number of categories annotated in dataset D.
We denote the soft labels output by the trained neural network model for the input instance x as f(x). Thus, we derive a simple method to calculate the distributed representation for category K:
$$V_K = \frac{1}{N_K} \sum_{x \in S_K} f(x) \tag{5}$$
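A minimal sketch of this single-label averaging step is given below, assuming the trained model exposes raw logits that are turned into soft labels with a softmax. The function names (`softmax`, `category_representations`) and the toy logits are hypothetical and only serve as an illustration; they are not taken from the paper.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax; turns model logits into soft labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def category_representations(soft_labels: np.ndarray,
                             labels: np.ndarray,
                             num_categories: int) -> np.ndarray:
    """Row k is the mean soft label of all instances annotated with category k."""
    reps = np.zeros((num_categories, soft_labels.shape[1]))
    for k in range(num_categories):
        members = soft_labels[labels == k]
        if len(members) > 0:  # a category without instances keeps a zero row
            reps[k] = members.mean(axis=0)
    return reps


# Hypothetical logits for four instances over three categories.
logits = np.array([[2.0, 0.5, 0.1],
                   [1.5, 1.0, 0.2],
                   [0.1, 0.3, 2.2],
                   [0.4, 1.9, 0.6]])
labels = np.array([0, 0, 2, 1])  # annotated single labels
V = category_representations(softmax(logits), labels, num_categories=3)
```

Each row of `V` is an average of probability vectors, so it is itself a probability vector over the categories.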

How to Deal with Multilabel Data?
In some corpora, instances are annotated with multiple emotion categories (Strapparava and Mihalcea, 2007;Demszky et al., 2020). To deal with multilabel instances, we regard each multilabel instance as multiple single label instances with weights summing to 1, and the weight of each single label data is set to the reciprocal of the number of the annotated labels. For example, suppose document D is labeled with category A and B. We regard D as two half instances, one half is labeled with category A and the other half is labeled with category B.
Let Y(x) denote the set of annotated labels of sample x and |Y(x)| denote the size of Y(x). Taking the above document D as an example, Y(D) is equal to {A, B} and |Y(D)| is equal to 2, as there are two labels contained in Y(D). Therefore, we obtain the following formula for the distributed representation of category K:
$$V_K = \frac{\sum_{x \in S_K} w_K(x)\, f(x)}{\sum_{x \in S_K} w_K(x)} \tag{6}$$
where $w_K(x)$ is equal to 1/|Y(x)|, which is the weight of instance x in category K, and $S_K$ is the set of instances whose label set Y(x) contains K.
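As a small worked example of this weighting, consider two hypothetical instances over two categories A and B. The soft labels below are made-up numbers chosen only to illustrate the arithmetic of the weight-normalized average.

```latex
\begin{align*}
&Y(x_1) = \{A, B\},\quad f(x_1) = (0.6,\, 0.4),\quad w_A(x_1) = w_B(x_1) = \tfrac{1}{2}; \\
&Y(x_2) = \{A\},\quad f(x_2) = (0.9,\, 0.1),\quad w_A(x_2) = 1; \\
&V_A = \frac{\tfrac{1}{2}(0.6,\, 0.4) + 1 \cdot (0.9,\, 0.1)}{\tfrac{1}{2} + 1}
     = \frac{(1.2,\, 0.3)}{1.5} = (0.8,\, 0.2), \\
&V_B = \frac{\tfrac{1}{2}(0.6,\, 0.4)}{\tfrac{1}{2}} = (0.6,\, 0.4).
\end{align*}
```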

Algorithm
In this part, we describe the algorithm for learning the Distributed Representations for Emotion Categories (DREC). First, we go through every instance in the dataset and accumulate, for each category, the total weight and the weighted sum of the soft labels output by the trained model. Then, the weighted sum is divided by the total weight to obtain the final distributed representation of each emotion category. The detailed steps are stated in Algorithm 1.

Algorithm 1: DREC
Input: instances T^(n) with annotated label sets Y^(n), n = 1, ..., N; trained model f; number of categories C
Output: representations V_i, i = 1, ..., C
for i = 1 to C do
    V_i ← 0; W_i ← 0
end for
for n = 1 to N do
    for each j in Y^(n) do
        SL ← f(T^(n))    // soft labels
        V_j ← V_j + SL / |Y^(n)|
        W_j ← W_j + 1 / |Y^(n)|
    end for
end for
for i = 1 to C do
    V_i ← V_i / W_i
end for
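The accumulate-then-normalize procedure described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function name `drec`, its signature, and the lookup-table stand-in for the trained model are assumptions, and categories that never occur are simply left as zero vectors.

```python
import numpy as np
from typing import Callable, Sequence


def drec(instances: Sequence,
         label_sets: Sequence[Sequence[int]],
         soft_label_fn: Callable[[object], np.ndarray],
         num_categories: int) -> np.ndarray:
    """Weighted average of soft labels per category (multilabel aware)."""
    V = np.zeros((num_categories, num_categories))  # weighted sums of soft labels
    W = np.zeros(num_categories)                    # total weight per category
    for instance, labels in zip(instances, label_sets):
        if not labels:
            continue  # skip unlabeled instances
        soft = np.asarray(soft_label_fn(instance), dtype=float)
        weight = 1.0 / len(labels)  # reciprocal of the number of labels
        for j in labels:
            V[j] += weight * soft
            W[j] += weight
    seen = W > 0  # avoid dividing by zero for categories without instances
    V[seen] = V[seen] / W[seen][:, None]
    return V


# Hypothetical stand-in for a trained model: a lookup table of soft labels.
fake_soft_labels = {"d1": [0.6, 0.4], "d2": [0.9, 0.1]}
V = drec(["d1", "d2"], [[0, 1], [0]], lambda d: fake_soft_labels[d], num_categories=2)
```

The soft label is computed once per instance and reused for each of its labels, which gives the same result as recomputing it inside the inner loop.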

Experiments
In order to validate the intrinsic quality of our emotion representations, we conducted three experiments in this section. First, an arrangement experiment is conducted to show the emotion distribution. Then, relations between different emotion taxonomies are detected in a mapping experiment. Finally, the emotion representations extracted from various corpora are compared to show the consistency of our approach across corpora.

Datasets
We use four datasets to detect emotion relations. The detailed information of each dataset is described as follows:
GoEmotions: GoEmotions consists of 58k English Reddit comments extracted from popular English subreddits (Demszky et al., 2020), multilabel annotated with 27 emotion categories, which were proposed by Cowen and Keltner (2017). GoEmotions was created for the purpose of building a large dataset with a large number of positive, negative, and ambiguous emotion categories. The detailed emotion categories are shown in Table 3.
AffectiveText: AffectiveText consists of 1250 instances from the domain of news headlines (Strapparava and Mihalcea, 2007). The dataset is multilabel annotated. It contains six emotion categories (anger, disgust, fear, joy, sadness and surprise) as well as valence.
ISEAR: ISEAR is created from questionnaires by Scherer and Wallbott (1994). Each instance is annotated with only one label. There are seven emotion categories contained in ISEAR: anger, disgust, fear, guilt, joy, sadness, and shame.
GoEmotions is used to conduct the first two experiments (arrangement and mapping), and the above four datasets are used to validate our representations across corpora in the last experiment.

Model Settings
Any model whose outputs are soft labels can be employed to learn the distributed representations for emotion categories. In our experiments, TextCNN (Kim, 2014), BiLSTM (Schuster and Paliwal, 1997) and BERT (Devlin et al., 2019) are used as the training models. For comparison, experiments on word embedding learning algorithms are conducted to show emotion relations in semantic space. For a specific emotion category, we use its word embedding as its representation in semantic space. 100-dimensional GloVe vectors (Pennington et al., 2014) are used as the word vectors in TextCNN and BiLSTM. The detailed model settings are listed as follows:
TextCNN: The convolutional kernels are divided into three groups with heights 3, 4 and 5; the kernel width is 100, which is equal to the dimension of the word vectors. There are 32 channels in each group. Batch size and learning rate are set to 16 and 0.001, respectively.
BiLSTM: There is only one layer in this model. Batch size and learning rate are set to 16 and 0.001, respectively, which are the same as for TextCNN. There are 32 neurons in the hidden layer in each direction.
BERT: A BERT-based model is used in this experiment. A fully connected layer is added on top of the pre-trained model. Batch size and learning rate are set to 8 and 2e-5, respectively, for fine-tuning.

Arrangement
As shown in Table 3, the emotion categories are divided into three groups corresponding to the positive, negative, and ambiguous emotions, a grouping defined by the creators of GoEmotions (Demszky et al., 2020).
We run the experiments 10 times with the same model and different initial parameters, and the average representations are used to produce the following results. After the final emotion representations are obtained, we reduce their dimension to two with singular value decomposition (Wall et al., 2003) to better understand the arrangement of emotion categories in emotion space. The two-dimensional average vectors are displayed in Figure 2. Three color-shape pairs, red-circle, gray-square and black-triangle, correspond to positive, negative and ambiguous emotions, respectively. Figure 2 (a)-(c) displays the results of three word embedding algorithms (GloVe, SSWE and EWE). The word vectors of emotion terms are distributed relatively randomly in semantic space, and there are no clear linear boundaries among positive, negative and ambiguous emotions.
As shown in Figure 2 (d)-(f), in emotion space there are obvious boundaries among positive, negative and ambiguous emotions, regardless of the model used. The two blue dashed lines separate each type of emotion category from the others, which means that different types of emotion categories are linearly separable from each other in emotion space. The ambiguous emotions are located between positive and negative emotions in Figure 2 (d)-(f), which shows that our representations in emotion space can better describe the relative relation between ambiguous emotions and the others. In addition, the arrangements of emotions in Figure 2 (d) and (e) are very similar, which means TextCNN and BiLSTM have similar emotion relation extraction capabilities.
From this experiment, we can conclude that similar emotions are more likely to cluster together in emotion space than in semantic space, which further demonstrates that our representations can express emotion relations much better than word vectors.
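The dimensionality reduction used in this experiment can be sketched with NumPy as below. The text only states that singular value decomposition is used, so the details here are assumptions: the representations are not centered before the decomposition, the function name `reduce_to_2d` is hypothetical, and the matrix of representations is made-up toy data.

```python
import numpy as np


def reduce_to_2d(reps: np.ndarray) -> np.ndarray:
    """Coordinates of each row along the two leading singular directions."""
    # reps = U @ diag(S) @ Vt; keeping the first two components gives 2-D points.
    U, S, Vt = np.linalg.svd(reps, full_matrices=False)
    return U[:, :2] * S[:2]


# Hypothetical representations of four categories (one row per category).
reps = np.array([[0.70, 0.05, 0.20, 0.05],
                 [0.05, 0.65, 0.05, 0.25],
                 [0.25, 0.05, 0.65, 0.05],
                 [0.05, 0.30, 0.05, 0.60]])
coords = reduce_to_2d(reps)  # one (x, y) point per category, ready for plotting
```

Scaling the left singular vectors by the singular values is equivalent to projecting the rows of `reps` onto the first two right singular vectors.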

Mapping
Demszky et al. (2020) manually mapped these 27 emotion categories to Ekman's basic emotions (Ekman, 1992). In this experiment, we automatically generate these mapping relations based on the proposed distributed representations of emotion categories.
In this experiment, we take Ekman's basic emotions as target emotions and the remaining 21 categories as source emotions. For each source emotion, we select the most similar one from the target emotions as its mapping result. The calculation formula is as follows:
$$\hat{e} = \arg\max_{e_t} \, \mathrm{sim}(V_{e_s}, V_{e_t}) \tag{7}$$
where $e_t$ is an emotion category in the target emotions, $e_s$ is an emotion category in the source emotions and $\hat{e}$ is the mapping result of $e_s$. sim is the similarity function, for which cosine similarity is selected here. The emotion representations are calculated 10 times with the same model and different initial parameters, and the average results are used to conduct this experiment.
Table 4 shows the mapping results with different models. We also calculate the results of word vectors for comparison. The manual results are taken as the gold answers. GloVe correctly maps 3 out of 21 emotions, which is comparable to a random result. By encoding emotional information into word representations, SSWE (Tang et al., 2015) maps 10 emotions correctly and EWE (Agrawal et al., 2018) maps 7 emotions correctly. The results indicate that although sentiment embedding (SSWE) and emotion embedding (EWE) map more emotions correctly than typical word embedding (GloVe), SSWE and EWE still mismatch more than half of the source emotions, as they are constructed in semantic space.
In emotion space, our emotion representations correctly map 18 out of 21 emotions, which is much better than the results in semantic space. The scores show that our emotion representations can describe emotion relations much better than word vectors. Detailed mapping results for each emotion can be seen in Table 4. The results of TextCNN and BiLSTM are exactly the same, which is consistent with their similar arrangements in emotion space in the first experiment. BERT maps disapproval to disgust while the others map it to anger. The most confusing emotions are caring and embarrassment: the human annotators map them to joy and sadness respectively, while our representations in emotion space map them to sadness and disgust.
The inconsistency between emotion space and the human results for these two emotions (embarrassment and caring) shows the complexity of emotion relations. An existing psychological study (Scherer, 2005) shows that embarrassment is close to both sadness and disgust, which means sadness and disgust can both be regarded as the mapping result for embarrassment. As for caring, it has been discussed (Scherer et al., 2013) that caring is a positive emotion in nature but is accompanied by the occurrence of negative events.
The mapping results of the three models are roughly the same as the human-provided mapping results, which shows our emotion representations are effective. However, when a certain emotion has high similarities to multiple emotions (such as embarrassment to disgust and sadness), there may exist some differences between different mapping results. In other words, there are no absolutely correct mapping results for all emotions, which further indicates the relations among emotions are indeed complex.

Table 4: Mapping results of source emotions to Ekman's basic emotions. GloVe, SSWE and EWE are in semantic space; TextCNN, BiLSTM and BERT are in emotion space.
Source Emotions | Human | GloVe | SSWE | EWE | TextCNN | BiLSTM | BERT
admiration | joy | disgust | anger | anger | joy | joy | joy
amusement | joy | anger | joy | disgust | joy | joy | joy
annoyance | anger | anger | disgust | anger | anger | anger | anger
approval | joy | fear | disgust | fear | surprise | surprise | joy
caring | joy | anger | anger | anger | sadness | sadness | sadness
confusion | surprise | anger | joy | anger | surprise | surprise | surprise
curiosity | surprise | fear | surprise | surprise | surprise | surprise | surprise
desire | joy | fear | joy | joy | joy | joy | joy
disappointment | sadness
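The mapping rule used in this experiment (for each source emotion, pick the target emotion with the highest cosine similarity) can be sketched as follows. The function names and the four toy representations are hypothetical and chosen only to illustrate the rule; they are not values from the experiments.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two representation vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def map_to_targets(reps: dict, sources: list, targets: list) -> dict:
    """Map each source emotion to its most similar target emotion."""
    return {
        s: max(targets, key=lambda t: cosine_similarity(reps[s], reps[t]))
        for s in sources
    }


# Hypothetical representations over four categories (toy values).
reps = {
    "joy":       np.array([0.70, 0.05, 0.20, 0.05]),
    "anger":     np.array([0.05, 0.65, 0.05, 0.25]),
    "amusement": np.array([0.25, 0.05, 0.65, 0.05]),
    "annoyance": np.array([0.05, 0.30, 0.05, 0.60]),
}
mapping = map_to_targets(reps, sources=["amusement", "annoyance"],
                         targets=["joy", "anger"])
# mapping == {"amusement": "joy", "annoyance": "anger"}
```

Ties, if any, are resolved in favor of the target listed first, since `max` returns the first maximal element.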

Emotion Relations across Corpora
Due to the deviations in different corpora (such as data source bias and annotation bias), there may exist some differences in emotion relations between different corpora. In this part, we analyze the difference in emotion relations across corpora. BERT is chosen as the training model here to eliminate the potential impact caused by models. For each dataset, the experiments are repeated 10 times with the same model and different initial parameters, and the average results are reported here. There are five emotion categories (anger, disgust, fear, joy and sadness) shared by the four datasets. These five shared emotions are basic emotion categories in many emotion taxonomy theories (Ekman, 1992; Harmon-Jones et al., 2016; Cowen and Keltner, 2017). As a result, the cosine similarities among these emotion categories, shown in Figure 3, are not high. For each dataset, no cosine similarity is greater than 0.3 except the similarity between anger and disgust.
On the other hand, the datasets are created based on different annotation standards from different domains. Thus, for a specific emotion pair, the similarities across datasets may be quite different. However, the relative magnitude of similarities is consistent across datasets. For each dataset, there is a moderate similarity between anger and disgust (ranging from 0.52 to 0.65) while the similarities among the remaining emotion pairs are relatively small (ranging from 0.04 to 0.30).
In order to quantitatively measure the consistency of emotion relations in different datasets, Pearson correlation coefficients between cosine similarities across datasets are calculated as shown in Table 5. The Pearson correlation coefficients among datasets are pretty high (ranging from 0.867 to 0.949), which indicates the underlying emotion relations are quite similar across datasets even if they are created in different domains.
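The consistency measure described above can be sketched as follows: compute the cosine similarity of every pair of shared emotions within each dataset, then take the Pearson correlation between the two resulting vectors. The function names and the random stand-in representations are assumptions for illustration; the rows of both matrices are assumed to list the shared emotions in the same order.

```python
import numpy as np
from itertools import combinations


def pairwise_cosine(reps: np.ndarray) -> np.ndarray:
    """Cosine similarity of every unordered pair of rows, in a fixed order."""
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sims = unit @ unit.T
    return np.array([sims[i, j] for i, j in combinations(range(len(reps)), 2)])


def consistency(reps_a: np.ndarray, reps_b: np.ndarray) -> float:
    """Pearson correlation between the pairwise similarities of two datasets."""
    return float(np.corrcoef(pairwise_cosine(reps_a), pairwise_cosine(reps_b))[0, 1])


# Random stand-ins for the five shared emotions in two datasets; the two
# datasets may have different numbers of categories (7 and 6 here).
rng = np.random.default_rng(0)
reps_a = rng.random((5, 7))
reps_b = rng.random((5, 6))
r = consistency(reps_a, reps_b)
```

Because the similarities are computed within each dataset, the representations from different datasets do not need to have the same dimension.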
In this experiment, we detect emotion relations across corpora. The results reveal that there is a

Conclusion and Future Work
In this paper, we argued that the emotion categories are not orthogonal to each other and the relations among emotion categories are very complex. We proposed a general framework to learn the distributed representation for each emotion category in emotion space from a given emotion dataset. Then, a simple and effective algorithm was also derived based on the soft labels predicted by the pre-trained neural network model. We conducted three experiments to validate the effectiveness of our emotion representations and the experimental results demonstrated that our representations in emotion space can express emotion relations much better than representations from word embeddings.
There are three avenues of future work we would like to explore. First, the distributed representations for emotion categories are derived from a specific emotion classification dataset. It would be interesting to build a universal emotion representation that is independent of any specific corpus. Second, the computation of our emotion representations relies on the soft labels predicted by the neural network model, and we would like to investigate a more general method in the future. Finally, we would like to explore more NLP applications of our emotion representations, such as improving the performance of emotion classification models and studying emotion spaces across languages.