Guilt by Association: Emotion Intensities in Lexical Representations

What do linguistic models reveal about the emotions associated with words? In this study, we consider the task of estimating word-level emotion intensity scores for specific emotions, exploring unsupervised, supervised, and self-supervised methods of extracting emotional associations from pretrained vectors and models. Overall, we find that linguistic models carry substantial potential for inducing fine-grained emotion intensity scores, showing a far higher correlation with human ground truth ratings than state-of-the-art emotion lexicons based on labeled data.


Introduction
There has been substantial research on methods to label words with associated emotions. Crowdsourcing approaches have been used to compile databases to study the nexus between them (Mohammad and Turney, 2013; Mohammad, 2018; Shoeb and de Melo, 2020). Another strategy, adopted by the well-known DepecheMood (Staiano and Guerini, 2014) and DepecheMood++ (Araque et al., 2018) lexicons, is to apply simple statistical methods to emotionally tagged data crawled from specific online sources.
In this paper, we consider the question: What do linguistic models reveal about the emotions associated with words? Word vectors (Mikolov et al., 2013; Pennington et al., 2014) have often been evaluated on standard word relatedness benchmarks. In this paper, we instead explore to what extent they encode emotional associations.
Earlier methods (Strapparava and Mihalcea, 2008; Mac Kim et al., 2010) used corpus statistics in tandem with dimensionality reduction techniques for emotion prediction. Shoeb et al. (2019) considered word embeddings trained on emojis for emotion analysis. Rothe et al. (2016) proposed predicting sentiment polarity ratings from word vectors, while other studies (Buechel and Hahn, 2018b; Buechel et al., 2018) predicted valence, arousal, and dominance using supervised deep neural networks. Khosla et al. (2018) proposed incorporating valence, arousal, and dominance ratings as additional signals into dense word embeddings. Buechel and Hahn (2018a) showed that emotion intensity scores can be predicted based on a lexicon providing valence-arousal-dominance ratings. The task of emotional association of words has been studied for other languages as well. Sidorov et al. (2012) present a dataset of Spanish words labeled with Ekman's six emotions, while others explored cross-lingual propagation from one language to another (Abdaoui et al., 2017) or to hundreds of other languages (Ramachandran and de Melo, 2020). We show that we can make use of linguistic models to obtain high correlations with emotion intensities without any need for manual ratings. While this work focuses on English, the methodology can be applied to data in any language.
Overview. We introduce the task and experimental setup in Section 2. Section 3 then presents our techniques to address this task based on linguistic models. We provide detailed empirical evaluation results in Section 4. Finally, Section 5 concludes the paper by discussing the relevance of our experiments.
Contribution. Overall, our intriguing finding in this paper is that pretrained linguistic models allow us to predict much more accurate emotion associations than state-of-the-art emotion lexicons. We show that this holds even without any supervision, while different kinds of supervised setups yield even better results.

Emotion Intensities as Associations
In our study, we seek to predict the emotions associated with individual words. Such predictions can also be useful for sentence-level predictions, as considered in Section 4.2, but that is not the primary focus of this paper.
While many emotion lexicons only provide binary emotion labels, it is clear that words may exhibit different degrees of association with an emotion. For instance, the word party suggests a greater degree of joy than the word sea, although the latter may also be associated with joy.
Emotion inventory and ground truth. The NRC Emotion/Affect Intensity Lexicon (Mohammad, 2018) provides ground-truth human ratings of emotion intensity for English words. The latest release of the data covers the eight basic emotions proposed in Plutchik's wheel of emotions (Plutchik, 1980), i.e., anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. However, the techniques we consider in our paper apply broadly to any inventory of discrete emotions.
The ratings in the dataset are scaled to [0, 1] such that a score of 1 signifies that a word "conveys the highest amount of" a particular emotion, while 0 corresponds to the lowest amount. These ratings were solicited via crowdsourcing and then standardized using best-worst scaling.
Benchmark data. In our work, instead of viewing the resource as a lexicon that provides emotion intensity scores, we propose to treat it as providing gold standard associations between pairs of words, one of the two words being an emotion name. Thus, we derive a semantic association benchmark similar to widely used word relatedness benchmarks such as the RG-65 (Rubenstein and Goodenough, 1965) and WS353 (Finkelstein et al., 2001) sets.
In order to more fairly evaluate resources that only provide unigrams, we disregard bigrams in the lexicon, resulting in a total of 9,706 word pairs. To facilitate an evaluation of unsupervised and supervised techniques, for each emotion, we split the corresponding data into training/validation/test portions with a ratio of 64%/16%/20%. This results in a total of 7,762 pairs in the training sets, and 1,944 pairs in the test sets. The remaining 1,552 instances serve as validation data. We adopt these data splits for all of the experiments in this paper and also make them available online for reproducibility. 1
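A plain shuffled split along these proportions might look as follows; this is an illustrative sketch only, as the exact sampling procedure behind the released splits is not specified here:

```python
import random

def split_pairs(pairs, seed=0, train_frac=0.64, val_frac=0.16):
    """Shuffle word-emotion pairs and split them 64%/16%/20%
    into training/validation/test portions."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train_frac)
    n_val = int(len(pairs) * val_frac)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])

# Toy example: 100 hypothetical (word, emotion, intensity) entries
pairs = [("w%d" % i, "joy", i / 100) for i in range(100)]
train, val, test = split_pairs(pairs)
print(len(train), len(val), len(test))  # 64 16 20
```

In practice, the split would be performed separately for each of the eight emotions so that every emotion contributes equally to each portion.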

Unsupervised Prediction
While past work on emotion intensity prediction has considered this an entirely separate task, we here consider emotional intensity scoring as similar in nature to regular lexical associations between words. Given two words w_1, w_2, the cosine similarity cos(v_{w_1}, v_{w_2}) of their corresponding word vectors v_{w_1}, v_{w_2} is expected to reflect their association, as most word vector spaces capture semantic relatedness (Hill et al., 2014). Hence, given a targeted vocabulary word w and an emotion e under consideration, we simply compute

σ_u(w, e) = cos(v_w, v_{w_e})    (1)

as the emotion intensity score. Here, we assume there is a single word w_e denoting e (e.g., joy).
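This cosine-based scoring can be sketched as follows, using toy 3-dimensional vectors purely for illustration (real experiments would use pretrained embeddings):

```python
import numpy as np

def emotion_intensity(vectors, word, emotion):
    """Unsupervised intensity score: cosine similarity between the
    word's vector and the vector of the word naming the emotion."""
    v_w, v_e = vectors[word], vectors[emotion]
    return float(np.dot(v_w, v_e) / (np.linalg.norm(v_w) * np.linalg.norm(v_e)))

# Toy vectors for illustration only
vectors = {
    "party": np.array([0.9, 0.1, 0.2]),
    "sea":   np.array([0.3, 0.8, 0.1]),
    "joy":   np.array([1.0, 0.2, 0.1]),
}
print(emotion_intensity(vectors, "party", "joy")
      > emotion_intensity(vectors, "sea", "joy"))  # True
```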

Supervised Prediction
If a training set is available, then for each emotion e covered by it, we train a regression model f_e(v_w | θ_e) parametrized by θ_e that allows us to predict the emotion intensity for e, given the vector of a word as input. Finally, we simply define

σ(w, e) = f_e(v_w | θ_e)

to route each prediction to the relevant f_e, i.e., given a target emotion e, we simply invoke the right model f_e. In our experiments, we consider two kinds of models. The first is a feedforward neural network with a ReLU-activated hidden layer of 64 neurons, a dropout rate of 0.2 applied to the hidden layer, and a single output neuron that predicts the intensity of the emotion. The loss function is the mean squared error of the prediction with respect to the ground truth. The second is a support vector regression (SVR) model with an RBF kernel.
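As an illustrative sketch of the per-emotion regression setup, the SVR variant can be trained with scikit-learn; the synthetic data, dimensionality, and default hyperparameters below are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in "word vectors": 5-d, with the intensity label
# determined by the first dimension, squashed into (0, 1)
X = rng.normal(size=(200, 5))
y = 1.0 / (1.0 + np.exp(-X[:, 0]))

# One regressor f_e per emotion; a single emotion e is shown here
f_e = SVR(kernel="rbf")
f_e.fit(X[:160], y[:160])

# Predicted intensities should correlate with the held-out targets
preds = f_e.predict(X[160:])
r = np.corrcoef(preds, y[160:])[0, 1]
print(preds.shape, round(float(r), 2))
```

The feedforward network variant would replace the SVR with a 64-unit ReLU hidden layer, dropout of 0.2, and a mean-squared-error objective.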

Self-Supervised Prediction
Finally, we propose a hybrid self-supervised technique that relies on supervised learning but does not require any pre-existing training data. Instead, we induce training data from the predictions of our unsupervised model. For each emotion e, we first identify the set T_e^+ of the k words in the vocabulary V with the highest intensity predictions, and the set T_e^- of the k words with the lowest intensity predictions.
These predictions are acquired using Eq. 1, i.e., our unsupervised technique. Subsequently, for each e, the set T_e^+ ∪ T_e^- along with the corresponding labels is used to train a supervised model f̂_e(v_w | θ̂_e) as above in Section 3.2. Finally, the overall model again just involves invoking the relevant f̂_e with regard to the desired emotion e: σ_s(w, e) = f̂_e(v_w | θ̂_e).
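The construction of the self-supervision sets (the top-k and bottom-k words by unsupervised score) can be sketched as follows; whether the induced labels are the unsupervised scores themselves or binary targets is not spelled out above, so the scores are assumed here:

```python
import numpy as np

def pseudo_labels(unsup_scores, k):
    """Build self-supervision data from unsupervised scores:
    the k highest-scoring words and the k lowest-scoring words
    are kept; all other words are discarded. Each kept word is
    labeled with its unsupervised score (an assumption)."""
    order = np.argsort(unsup_scores)      # ascending
    low, high = order[:k], order[-k:]
    idx = np.concatenate([high, low])
    return idx, unsup_scores[idx]

# Toy unsupervised scores for a 10-word vocabulary, k = 2
scores = np.array([0.9, 0.1, 0.5, 0.8, 0.2, 0.6, 0.95, 0.05, 0.4, 0.7])
idx, labels = pseudo_labels(scores, k=2)
print(sorted(idx.tolist()))  # [0, 1, 6, 7]
```

The resulting (vector, label) pairs would then be fed to the same regression models as in the supervised setting.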

Main Results
We proceed with a detailed evaluation assessing to what extent a number of different linguistic resources and models capture emotion intensity information. For this, we consider the ground truth benchmark data introduced in Section 2 and compare the results obtained using different approaches with the test set ratings in terms of Pearson correlation coefficients. The emotion intensity score for any out-of-vocabulary word is taken to be 0.0, indicating that the data does not make a prediction about such words. The results are given in Table 1. The "Overall" column reports the correlation with the union of all word-emotion pairings in the ground truth test sets. As we have an equal number of word-emotion pairs for each emotion, this serves as an aggregate measure of the result quality. For the baselines with fewer than eight emotions, the results are computed by averaging over just the subset of results for labels that are covered. A multi-way statistical significance analysis of the "Overall" results using t-tests shows that all unsupervised, self-supervised, and supervised word vector methods except for unsupervised BERT are statistically significantly better than all lexicon baselines at significance level 0.001.
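A minimal sketch of this evaluation protocol, with Pearson correlation against gold test ratings and out-of-vocabulary words scored as 0.0 (the toy data and helper names are illustrative):

```python
import numpy as np

def evaluate(predictions, gold_pairs):
    """Pearson correlation of predicted vs. gold intensities;
    out-of-vocabulary (word, emotion) pairs receive 0.0."""
    preds = np.array([predictions.get((w, e), 0.0) for w, e, _ in gold_pairs])
    gold = np.array([s for _, _, s in gold_pairs])
    return float(np.corrcoef(preds, gold)[0, 1])

# Toy gold (word, emotion, intensity) triples and model predictions;
# "storm" is out of vocabulary and thus scored 0.0
gold_pairs = [("party", "joy", 0.8), ("sea", "joy", 0.3), ("storm", "fear", 0.7)]
predictions = {("party", "joy"): 0.75, ("sea", "joy"): 0.2}
r = evaluate(predictions, gold_pairs)
print(round(r, 2))
```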
We compare a series of baselines along with our unsupervised, self-supervised, and regular supervised methods proposed in Section 3.
Baselines. We first evaluate existing state-of-the-art emotion lexicons. These provide either real-valued intensity scores or binary association scores (treated as 0.0 or 1.0). We found that the scores that the lexicons provide exhibit very low correlation with the ground truth. In the case of EmoLex (Mohammad and Turney, 2013), this is because it merely provides binary labels, not intensities.
For DepecheMood (Staiano and Guerini, 2014), DepecheMood++ (Araque et al., 2018), and EmoWordNet (Badaro et al., 2018), we conjecture that the data-driven automated techniques used to create them based on coarse-grained document-level labels do not result in word-level scores of the same sort as those solicited from human raters. The additional labels in DepecheMood and DepecheMood++ (e.g., Amused) may carry some information on some of the labels in our ground truth data (e.g., Joy). However, we find that mapping these emotions to a target emotion results in less accurate emotion association scores. Thus, we disregarded any labels not in the ground truth.
Unsupervised method. We next evaluate various pretrained word vector models, including word2vec trained on Google News (Mikolov et al., 2013), GloVe 840B trained on Common Crawl (Pennington et al., 2014), and fastText trained on Wikipedia (Joulin et al., 2017). We find that these outperform the emotion lexicons by a substantial margin. They also outperform BERT, for which we use the pretrained BERT-base uncased model (Devlin et al., 2019) and consider mean-pooled word-piece final layer output embeddings as word-level vectors.
Even higher correlation can be attained with additional post-processing. One example is the counter-fitted vectors (Mrksic et al., 2016), obtained by taking the PARAGRAM-SL999 vectors by Wieting et al. (2015) and optimizing them using synonymy and antonymy constraints.
The best results are obtained using AffectVec (Raji and de Melo, 2020), which post-processes the same vectors using not only synonymy and antonymy constraints, but also sentiment polarity ones. Sentiment polarity evidently helps to better distinguish different emotional associations.
In contrast, the emotion-enriched word vectors (EWE) by Agrawal et al. (2018) do not perform well for word intensity prediction.
Supervised method. For supervised methods, we report the mean correlation over 20 runs of the learning algorithm. The supervised models were chosen based on the proposed methods in the WASSA-2017 challenge (Mohammad and Bravo-Marquez, 2017), with minor simplifications to avoid overfitting, given the small size of our training set.
As expected, the models succeed in learning emotional associations from word vectors with a higher correlation than unsupervised prediction. As this technique does not rely on cosine similarities, post-processing proves fruitless, and very strong results are obtained using GloVe and fastText embeddings.
Self-supervised method. Unsupervised prediction is less accurate than supervised prediction, but is applicable to arbitrary emotions without any need for pre-existing emotion-specific training data. The self-supervised approach shares this advantage of not requiring training data and can thus likewise be applied to arbitrary emotions.
To create the training labels for the self-supervised model, we select the top 100 (k = 100) highest and lowest intensity predictions from our unsupervised model. This results in 200 automatically induced training instances per emotion. For BERT, the vocabulary of fastText serves as the candidate set of words to calculate the top k predictions.
As expected, without access to gold standard training data, self-supervision is unable to compete with the supervised approach. However, for a given set of word vectors, especially for BERT, self-supervised learning mostly surpasses the unsupervised approach despite drawing on it for training, owing to its ability to selectively pick out the most pertinent cues from the vectors.

Unsupervised Sentence Classification
Experimental Setup. We additionally explore unsupervised sentence-level emotion classification on the GoEmotions dataset (Demszky et al., 2020) with its fine-grained inventory of 28 different emotion labels. Given an input document D, we choose the label

arg max_{e ∈ Σ} ∑_{w ∈ D} γ_{w,D} σ_u(w, e),

where Σ is the set of labels, γ_{w,D} denotes the TF-IDF score of w in D, and σ_u(w, e) are word-emotion scores. For the latter, we use either our unsupervised word scoring (Eq. 1) or a baseline lexicon, where any emotion not covered by a resource is assumed to have intensity scores of 0.0. An exception is made for the neutral label, which we choose if the average prediction score across labels in Σ is ≤ 0.03, based on the reasoning that, similar to the annotation instructions in the original paper, documents with low or conflicting scores should be labeled neutral.
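This labeling rule can be sketched as follows; the toy word-emotion scores and the degenerate uniform stand-in for TF-IDF weights are illustrative assumptions, not values from the paper:

```python
def classify(document, labels, sigma_u, tfidf, neutral_threshold=0.03):
    """Unsupervised sentence labeling: pick the emotion e maximizing
    the TF-IDF-weighted sum of word-emotion scores sigma_u(w, e);
    fall back to "neutral" when the mean score across labels is low."""
    words = document.lower().split()
    scores = {e: sum(tfidf(w, document) * sigma_u(w, e) for w in words)
              for e in labels}
    if sum(scores.values()) / len(labels) <= neutral_threshold:
        return "neutral"
    return max(scores, key=scores.get)

# Toy word-emotion scores (illustrative only)
sigma = {("great", "joy"): 0.9, ("party", "joy"): 0.8, ("scary", "fear"): 0.85}

def sigma_u(w, e):
    return sigma.get((w, e), 0.0)

def tfidf(w, document):
    # Uniform weights as a trivial stand-in for real TF-IDF scores
    return 1.0 / len(document.split())

labels = ["joy", "fear"]
print(classify("what a great party", labels, sigma_u, tfidf))  # joy
print(classify("the weather today", labels, sigma_u, tfidf))   # neutral
```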
Results. The evaluation results in Table 2 show that existing emotion lexicons are outperformed even by the random baseline of choosing a label arbitrarily. This is because they do not cover enough emotions to support this task-the GoEmotions dataset includes many fine-grained emotion labels such as remorse, gratitude, and caring. Emotion lexicons are often used for unsupervised analysis, but evidently can only work well when the target emotion inventory matches that of the lexicon. In contrast, with several vector models, our unsupervised approach is able to greatly outperform all baselines, achieving substantially better results than chance. For reference, we also show the results of the fully supervised BERT model from Demszky et al. (2020), fine-tuned on the GoEmotions training data, while our vector-based models are entirely unsupervised. Note that the task is fairly challenging, since many of the 28 emotions are easy to confuse, e.g., fear, nervousness, embarrassment, disappointment, disapproval, etc.

Conclusion
In this paper, we show that pretrained linguistic models readily provide higher-quality emotional intensity information than state-of-the-art emotion lexicons, despite not being trained on emotion-labeled data, particularly if the representation space is post-processed. We find that a regular supervised variant obtains very high correlations, while a self-supervised variant that does not require gold standard training data is able to outperform the unsupervised method and can likewise be applied to arbitrary emotion labels. Overall, our results confirm that linguistic models have remained underexplored for word-level emotion intensity assessment. Our training/validation/test splits and newly induced lexicons created using our method are available from http://emotion.nlproc.org/ to promote further research in this area.

Broader Impact
Assessing the emotions evoked by a text has important applications, such as discovering disappointed customers on social media, designing conversational agents that emulate human empathy by responding in a more appropriate way, as well as numerous forms of digital humanities analyses. However, there are also grave concerns and risks when using automated techniques to assess the emotional impact of text. Clearly, consent to assess the text must have been granted, and malicious as well as potentially harmful applications must be avoided. Beyond this, even for applications deemed beneficial, there is a risk that inaccurate assessments may lead to undesirable outcomes. The empirical results in this paper show that many commonly used resources and techniques have a low correlation with human ratings, and even the best BERT-based learning approaches are highly imperfect, as they may latch on to superficial lexical cues and mere correlations that may exhibit substantial biases. This is discussed further by Mohammad (2020). As a result, prior to any potential use of such techniques, a thorough analysis of potential application-specific risks must be conducted.