Multi-Task Learning and Adapted Knowledge Models for Emotion-Cause Extraction

Detecting what emotions are expressed in text is a well-studied problem in natural language processing. However, research on finer grained emotion analysis such as what causes an emotion is still in its infancy. We present solutions that tackle both emotion recognition and emotion cause detection in a joint fashion. Considering that common-sense knowledge plays an important role in understanding implicitly expressed emotions and the reasons for those emotions, we propose novel methods that combine common-sense knowledge via adapted knowledge models with multi-task learning to perform joint emotion classification and emotion cause tagging. We show performance improvement on both tasks when including common-sense reasoning and a multitask framework. We provide a thorough analysis to gain insights into model performance.


Introduction
Utterance- and document-level emotion recognition has received significant attention from the research community (Mohammad et al., 2018; Poria et al., 2020a). Given the utterance "Sudan protests: Outrage as troops open fire on protestors", an emotion recognition system will be able to detect that anger is the main expressed emotion, signaled by the word "outrage". However, the semantic information associated with expressions of emotion, such as the cause (the thing that triggers the emotion) or the target (the thing toward which the emotion is directed), is important to provide a finer-grained understanding of the text that might be needed in real-world applications. In the above utterance, the cause of the anger emotion is the event "troops open fire on protestors", while the target is the entity "troops" (see Figure 1).

* Work done during an internship with Amazon AI.
Research on finer-grained emotion analysis, such as detecting the cause for an emotion expressed in text, is in its infancy. Most work on emotion-cause detection has utilized a Chinese dataset where the cause is always syntactically realized as a clause and thus was modeled as a classification task (Gui et al., 2016). However, recently Bostan et al. (2020) and Oberländer and Klinger (2020) argued that in English, an emotion cause can be expressed syntactically as a clause (as troops open fire on protestors), noun phrase (1,000 non-perishable food donations), or verb phrase (jumped into an ice-cold river), and thus we follow their approach of framing emotion cause detection as a sequence tagging task.
We propose several ways in which to approach the tasks of emotion recognition and emotion cause tagging. First, these two tasks should not be independent; because the cause is the trigger for the emotion, knowledge about what the cause is should narrow down what emotion may be expressed, and vice versa. Therefore, we present a multi-task learning framework to model them jointly. Second, considering that common-sense knowledge plays an important role in understanding implicitly expressed emotions and the reasons for those emotions, we explore the use of commonsense knowledge via adapted knowledge models (COMET, Bosselut et al. (2019)) for both tasks. A key feature of our approach is to combine these adapted knowledge models (i.e., COMET), which are specifically trained to use and express commonsense knowledge, with pre-trained language models such as BERT, (Devlin et al., 2019).
Our primary contributions are three-fold: (i) an under-studied formulation of the emotion cause detection problem as a sequence tagging problem; (ii) a set of models that perform the emotion classification and emotion cause tagging tasks jointly while using common-sense knowledge (subsection 4.2) with improved performance (section 6); and (iii) analysis to gain insight into both model performance and the GoodNewsEveryone dataset that we use (Bostan et al., 2020) (section 7).
However, comparatively few researchers have looked at the semantic roles related to emotion such as the cause, the target, or the experiencer, with few exceptions for Chinese (Gui et al., 2016; Chen et al., 2018; Wei et al., 2020; Ding et al., 2020), English (Ghazi et al., 2015; Kim and Klinger, 2018; Bostan et al., 2020), and Italian (Russo et al., 2011). We highlight some of these works here and draw connections to our work. Most recent work on emotion-cause detection has been carried out on a Chinese dataset compiled by Gui et al. (2016). This dataset characterizes the emotion and cause detection problems as a clause-level pair extraction problem; i.e., of all the clauses in the input, one is selected to contain the expression of an emotion, and one or more (usually one) are selected to contain the cause of that emotion. Many publications have used this corpus to develop novel and effective model architectures for the clause-level classification problem (Chen et al., 2018; Wei et al., 2020; Ding et al., 2020). The key difference between this work and ours is that we perform cause detection as a sequence-tagging problem: the cause may appear anywhere in the input, and may be expressed as any grammatical construction (a noun phrase, a verb phrase, or a clause). Moreover, we use common-sense knowledge for both tasks (emotion and cause tagging), through the use of adapted language models such as COMET.

Figure 1: An example of the semantic roles annotated by Bostan et al. (2020).
For English, several datasets have been introduced (Kim and Klinger, 2018; Ghazi et al., 2015; Bostan et al., 2020; Poria et al., 2020b), and emotion cause detection has been tackled either as a classification problem or as a sequence tagging or span detection problem (Kim and Klinger, 2018; Ghazi et al., 2015; Poria et al., 2020b). We particularly note the work of Oberländer and Klinger (2020), who argue for our problem formulation of cause detection as sequence tagging rather than as a classification task, supported by empirical evidence on several datasets including the GoodNewsEveryone dataset (Bostan et al., 2020) we use in this paper. One contribution we bring compared to these models is that we formulate a multi-task learning framework to jointly learn the emotion and the cause span. Another contribution is the use of common-sense knowledge through the use of adapted knowledge models such as COMET (both in the single models and the multi-task models). Ghosal et al. (2020) have very recently shown the usefulness of common-sense reasoning for the task of conversational emotion detection.

Data
For our experiments, we use the GoodNewsEveryone corpus (Bostan et al., 2020), which contains 5,000 news headlines labeled with emotions and semantic roles such as the target, experiencer, and cause of the emotion, as shown in Figure 1. We focus on the emotion detection and cause tagging tasks in this work. To our knowledge, GoodNewsEveryone is the largest English dataset labeled for both of these tasks.
In our experiments, we limit ourselves to the data points for which a cause span was annotated (4,798). We also note that this dataset uses a 15-way emotion classification scheme, an extended set including the eight basic Plutchik emotions as well as additional emotions like shame and optimism. While a more fine-grained label set is useful for capturing subtle nuances of emotion, many external resources focus on a smaller set of emotions. We also note that the label distribution of this dataset heavily favors the more basic emotions, as shown in Figure 2. Therefore, for our work, we choose to limit ourselves to the six Ekman emotions (anger, fear, disgust, joy, surprise, and sadness). We also choose to keep positive surprise and negative surprise separated, to avoid severely unbalancing the label distribution for our experiments. We randomly split the remaining data (2,503 data points) into 80% train, 10% development, and 10% test.

Models
An important feature showcased by the GoodNewsEveryone dataset is that causes of emotions can be expressed through different syntactic constituents such as clauses, verb phrases, or noun phrases. Thus, we approach the cause detection problem as a sequence tagging problem using the IOB scheme (Ramshaw and Marcus, 1995): C = {I-cause, O, B-cause}. Our approach is supported by very recent results by Oberländer and Klinger (2020), who show that modeling emotion cause detection as a sequence tagging problem is better suited than modeling it as clause classification, although not much current work has yet adopted this formulation. We tackle the emotion detection task as a seven-way classification task with E = {anger, disgust, fear, joy, sadness, negative surprise, positive surprise}.
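Concretely, a gold cause span can be projected onto IOB tags as in the following minimal pure-Python sketch. The function name and whitespace tokenization are illustrative only; the actual pipeline operates on WordPiece tokens.

```python
def iob_tags(tokens, cause_span):
    """Label each token with B-cause / I-cause / O for a given cause span.

    Illustrative sketch: marks the first occurrence of the span;
    all other tokens receive O.
    """
    tags = []
    n = len(cause_span)
    i = 0
    while i < len(tokens):
        # Check whether the cause span starts at position i.
        if n > 0 and tokens[i:i + n] == cause_span:
            tags.append("B-cause")
            tags.extend(["I-cause"] * (n - 1))
            i += n
        else:
            tags.append("O")
            i += 1
    return tags

tokens = "Outrage as troops open fire on protestors".split()
tags = iob_tags(tokens, "troops open fire on protestors".split())
# tags: O O B-cause I-cause I-cause I-cause I-cause
```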

Single-Task Models
As a baseline, we train single-task models for each of emotion classification and cause span tagging. We use a pre-trained BERT language model (Devlin et al., 2019), which we fine-tune on our data, as the basis of this model. Our preprocessing strategy for all of our models uses the pre-trained BERT vocabulary and WordPiece tokenizer from Huggingface (Wolf et al., 2020). Therefore, for a sequence of n WordPiece tokens, our input to the BERT model is a sequence of n + 2 tokens, X = [[CLS], x_1, x_2, ..., x_n, [SEP]], from which BERT produces hidden states h_1, ..., h_n. For emotion classification, we pool these hidden states and allow hyperparameter tuning to select the best pooling type: selecting the [CLS] token, or attention as formulated by Bahdanau et al. (2015):

    v = Σ_i α_i h_i,   where α_i = exp(W_a h_i + b_a) / Σ_{j=1}^{n} exp(W_a h_j + b_a)    (1)

for trainable weights W_a ∈ R^{1×d_BERT} and b_a ∈ R^1. Then, the final distribution of emotion scores is calculated by a single dense layer and a softmax:

    e = softmax(W_e v + b_e)    (2)

with e ∈ R^{|E|} and for trainable parameters W_e ∈ R^{|E|×d_BERT} and b_e ∈ R^{|E|}. For cause tagging, a tag probability distribution is calculated directly on each hidden state:

    c_i = softmax(W_c h_i + b_c)    (3)

with c_i ∈ R^{|C|} and for trainable parameters W_c ∈ R^{|C|×d_BERT} and b_c ∈ R^{|C|}. We refer to both of these single-task models as BERT; if the task is not clear from the context, we will refer to the emotion detection model as BERT_E and the cause tagging model as BERT_C.
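The attention pooler of Equation 1 can be sketched in plain Python with toy dimensions. The names and the scalar parameters here are illustrative; the actual model applies this over BERT hidden states with learned weights.

```python
import math

def attention_pool(hidden, w_a, b_a):
    """Pool per-token hidden states into one vector via learned attention.

    hidden: list of token vectors h_i (each a list of floats)
    w_a, b_a: toy attention parameters (a weight vector and a scalar bias)
    Returns the pooled vector v = sum_i alpha_i * h_i and the weights alpha.
    """
    # Scalar attention score per token: W_a h_i + b_a.
    scores = [sum(w * h for w, h in zip(w_a, hi)) + b_a for hi in hidden]
    # Numerically stable softmax over the sequence.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [x / z for x in exps]
    d = len(hidden[0])
    pooled = [sum(alphas[i] * hidden[i][k] for i in range(len(hidden)))
              for k in range(d)]
    return pooled, alphas
```

With a large score on the first token, nearly all attention mass lands there, so the pooled vector is dominated by that token's representation.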
Our training loss for emotion classification as well as emotion cause tagging is the mean negative log-likelihood (NLL) loss per minibatch of size b:

    NLL_emo = -(1/b) Σ_{j=1}^{b} Σ_k y_jk log e_jk    (4)

    NLL_cause = -(1/b) Σ_{j=1}^{b} Σ_i Σ_k y_ijk log c_ijk    (5)

where j is the index of the sentence in the minibatch, k is the index of the label being considered (emotion labels for NLL_emo and IOB tags for NLL_cause), i is the index of the i-th token in the j-th sentence in the minibatch, y_jk ∈ {0, 1} is the gold probability of the k-th emotion label for the j-th sentence, y_ijk ∈ {0, 1} is the gold probability of the k-th cause tag for the i-th token in the j-th sentence, and e_jk and c_ijk are the output probabilities of the k-th emotion label and of the k-th cause tag for the i-th token, both for the j-th sentence.
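The two losses can be sketched numerically as follows. The helper names are hypothetical, and gold labels are given as indices rather than one-hot vectors, which is equivalent since y is an indicator.

```python
import math

def nll_emotion(gold, probs):
    """Mean NLL over a minibatch for emotion classification (Eq. 4).

    gold[j]: index of the gold emotion label for sentence j
    probs[j][k]: model probability of label k for sentence j
    """
    return -sum(math.log(p[g]) for g, p in zip(gold, probs)) / len(gold)

def nll_cause(gold, probs):
    """Mean NLL for cause tagging (Eq. 5): token-level NLL summed over
    each sentence, then averaged over the minibatch.

    gold[j][i]: index of the gold IOB tag for token i of sentence j
    probs[j][i][k]: model probability of tag k for that token
    """
    total = 0.0
    for sent_gold, sent_probs in zip(gold, probs):
        total -= sum(math.log(p[g]) for g, p in zip(sent_gold, sent_probs))
    return total / len(gold)
```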

Multi-Task Models
Our hypothesis is that the emotion detection and cause tagging tasks are closely related and can inform each other; therefore we propose three multi-task learning models to test this hypothesis. For all multi-task models, we use the same base architecture (BERT) as the single models. Additionally, for these models, we combine the losses of both tasks and weight them with a tunable lambda parameter:

    L = λ · NLL_emo + (1 − λ) · NLL_cause

using NLL_emo and NLL_cause from Equation 4 and Equation 5.
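The combined objective is a straightforward interpolation of the two task losses; a one-line sketch (function name illustrative):

```python
def multitask_loss(nll_emo, nll_cause, lam):
    """Weighted multi-task loss: L = lam * NLL_emo + (1 - lam) * NLL_cause.

    lam = 1 trains only the emotion head; lam = 0 only the cause head;
    intermediate values (tuned as a hyperparameter) balance the two.
    """
    return lam * nll_emo + (1 - lam) * nll_cause
```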
Multi. The first model, Multi, is the classical multi-task learning framework with hard parameter sharing, where both tasks share the same BERT layers. Two dense layers for emotion classification and cause tagging operate at the same time from the same BERT layers, and we train both of the tasks simultaneously. That is, we simply calculate our emotion scores e and cause tag scores c from the same set of hidden states H.
We further develop two additional multi-task models with the intuition that we can design more explicit and concrete task dependencies than simple parameter sharing in the representation layer.
Multi_{C→E}. We assume that if a certain text span is given as the cause of an emotion, it should be possible to classify that emotion correctly while looking only at the words of the cause span. Therefore, we propose the Multi_{C→E} model, whose architecture is illustrated in Figure 3a. This model begins with the single-task cause detection model, BERT_C, which produces a probability distribution P(y_i|x_i) over IOB tags for each token x_i, where P(y_i|x_i) = c_i from Equation 3. Then, for each token, we calculate the probability that it is part of the cause as P(Cause|x_i) = P(B|x_i) + P(I|x_i) = 1 − P(O|x_i). We feed the resulting probabilities through a softmax over the sequence and use them as an attention distribution over the input tokens in order to pool the hidden representations and perform emotion classification: attention is computed as in Equation 1, with α_i = exp(P(Cause|x_i)) / Σ_{j=1}^{n} exp(P(Cause|x_j)), and emotion classification as in Equation 2. For the Multi_{C→E} model, we apply teacher forcing at training time, and the gold cause spans are used to create the attention weights before emotion classification (which means that P(Cause|x_i) ∈ {0, 1}). At inference time, the model uses the predicted cause span instead.
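The conversion from per-token IOB distributions to attention weights can be sketched as follows (hypothetical names; under teacher forcing, the O-probabilities are exactly 0 or 1, so cause tokens receive uniformly higher weight than non-cause tokens):

```python
import math

def cause_attention(tag_probs):
    """Attention weights for the Multi_C->E model.

    tag_probs[i]: dict mapping IOB tags to probabilities for token i.
    P(Cause|x_i) = 1 - P(O|x_i), then softmax over the sequence.
    """
    cause_p = [1.0 - p["O"] for p in tag_probs]
    m = max(cause_p)
    exps = [math.exp(c - m) for c in cause_p]
    z = sum(exps)
    return [x / z for x in exps]
```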
Multi_{E→C}. Next, we hypothesize that knowledge of the predicted emotion should help us identify salient cause words. The Multi_{E→C} model first performs emotion classification, which results in a probability distribution over predicted emotion labels, as in the BERT_E model and Equation 2. We additionally keep an emotion embedding matrix E, where E[i] is a learnable representation of the i-th emotion label (see Figure 3b) with dimension d_e (in our experiments, we set d_e = 300). We use the predicted label probabilities e to calculate a weighted sum of the emotion embeddings, i.e., M = Σ_i e_i · E[i]. We then concatenate M to the hidden representation of each token and perform emotion cause tagging with a final dense layer, i.e., c_i = softmax(W_c [h_i; M] + b_c), where ; is the concatenation operator and W_c ∈ R^{|C|×(d_BERT + d_e)} and b_c ∈ R^{|C|} are trainable parameters. In the Multi_{E→C} model, we again do teacher forcing and use the gold emotion labels before doing the sequence tagging for cause detection (i.e., e is a one-hot vector where the gold emotion label has probability 1 and all other emotion labels have probability 0). At inference time, the model will use the predicted emotion distribution instead.
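The emotion-conditioning step reduces to a probability-weighted sum of embedding rows. A toy sketch (hypothetical names, tiny dimensions); with a one-hot teacher-forced distribution it simply selects the gold emotion's embedding:

```python
def emotion_context(e_probs, emb):
    """Compute M = sum_i e_i * E[i] for the Multi_E->C model.

    e_probs[i]: probability of emotion label i
    emb[i]: embedding row for emotion label i (list of floats)
    """
    d = len(emb[0])
    return [sum(e_probs[i] * emb[i][k] for i in range(len(emb)))
            for k in range(d)]
```

The returned vector M is then concatenated to every token's hidden state before the cause tagging layer.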

Adapted Knowledge Models
Recent work has shown that fine-tuning pre-trained language models such as GPT-2 on knowledge graph tuples such as ConceptNet (Li et al., 2016) or ATOMIC (Sap et al., 2018) allows these models to express their implicit knowledge directly (Bosselut et al., 2019). These adapted knowledge models (e.g., COMET (Bosselut et al., 2019)) can produce common-sense knowledge on-demand for any entity, relation or event.
Considering that common-sense knowledge plays an important role in understanding implicitly expressed emotions and the reasons for those emotions, we explore the use of common-sense knowledge for our tasks, in particular the use of COMET adaptively pre-trained on the ATOMIC event-centric knowledge base. ATOMIC's event relations include "xReact" and "oReact", which describe the feelings of certain entities after the input event occurs. For example, ATOMIC's authors present the example of <PersonX pays PersonY a compliment, xReact, PersonX will feel good>. xReact refers to the feelings of the primary entity in the event, and oReact refers to the feelings of others (in this instance, oReact yields "PersonY will feel flattered"). For example, using the headline "Sudan protests: Outrage as troops open fire on protestors", COMET-ATOMIC outputs that PersonX feels justified, PersonX feels angry, Others feel angry, and so on (Figure 4). To use this knowledge model in our task, we modify our approach by reframing our single-sequence classification task as a sequence-pair classification task (for which BERT can be used directly). We feed our input headlines into COMET-ATOMIC (using the model weights released by the authors), collect the top two outputs for xReact and oReact using beam search decoding, and then feed them into BERT alongside the input headlines, as a second sequence after the [SEP] token. That is, our input to BERT is now X = [[CLS], x_1, x_2, ..., x_n, [SEP], z_1, z_2, ..., z_m, [SEP]], where z_i are the m WordPiece tokens of our COMET output and are preprocessed in the same way as x_i. We hypothesize that, since pre-trained BERT is trained with a next sentence prediction objective, expressing the COMET outputs as a grammatical sentence will help BERT make better use of them, so we formulate this second sequence as complete sentences (e.g., "This person feels... Others feel...") (Figure 4).
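The construction of the sequence-pair input can be sketched at the token level as follows. The function name is hypothetical and the tokens shown are whole words; real inputs are WordPiece-tokenized, and the verbalization follows the "This person feels... Others feel..." pattern described above.

```python
def build_input(headline_tokens, x_react_feelings, o_react_feelings):
    """Assemble the BERT sequence-pair input with COMET outputs
    verbalized as sentences in the second segment (toy sketch)."""
    second = (["This", "person", "feels"] + x_react_feelings + ["."]
              + ["Others", "feel"] + o_react_feelings + ["."])
    return ["[CLS]"] + headline_tokens + ["[SEP]"] + second + ["[SEP]"]

tokens = build_input(["Sudan", "protests"], ["angry"], ["angry"])
# -> [CLS] Sudan protests [SEP] This person feels angry . Others feel angry . [SEP]
```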
This approach allows us to incorporate information from COMET into all our single- and multi-task BERT-based models; the example shown in Figure 4 is our Multi_{C→E} model. We refer to the COMET variants of these models as BERT_COMET for the single-task models and as Multi_COMET, Multi_COMET_{C→E}, and Multi_COMET_{E→C} for the three multi-task models.

Experimental Setup
Evaluation Metrics For emotion classification, we report macro-averaged F1 and accuracy. For cause tagging, we report exact span-level F1 (which we refer to as span F1), as developed for named entity recognition (e.g., Tjong Kim Sang and De Meulder (2003)), where a span is marked as correct if and only if its type and span boundaries match the gold exactly [4].
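Exact span F1 can be computed from IOB sequences as in the minimal sketch below (names illustrative; single span type, as in our setting):

```python
def spans(tags):
    """Extract (start, end) cause spans from an IOB tag sequence.

    end is exclusive; a span begins at B-cause and extends over I-cause.
    """
    out, start = [], None
    for i, t in enumerate(tags):
        if t == "B-cause":
            if start is not None:
                out.append((start, i))
            start = i
        elif t == "O":
            if start is not None:
                out.append((start, i))
                start = None
    if start is not None:
        out.append((start, len(tags)))
    return out

def span_f1(gold_tags, pred_tags):
    """Exact span-level F1: a predicted span counts as correct only if
    its boundaries match a gold span exactly."""
    g, p = set(spans(gold_tags)), set(spans(pred_tags))
    if not g or not p:
        return 0.0
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)
```

Note that a prediction that overlaps the gold span but misses a boundary token scores zero under this metric, which makes span F1 a strict measure.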
Training and Hyperparameter Selection The classification layers are initialized randomly from a uniform distribution over [−0.07, 0.07] [5], and all the parameters are trained on our dataset for up to 20 epochs, with early stopping based on the performance on the validation data (macro F1 for emotion, span F1 for cause). All models are trained with the Adam optimizer (Kingma and Ba, 2015). We highlight again that for our Multi_{C→E} and Multi_{E→C} models, we use teacher forcing during training to avoid cascading training error. Because the subset of the data we use is relatively small, we follow current best practices for dealing with neural models on small data and select hyperparameters and models using the average performance of five models with different fixed random seeds on the development set. We then base our models' performance on the average of the results from these five runs (e.g., reported emotion F1 is the average of the emotion F1 scores for each of our five runs). For our joint models, since our novel models revolve around using one task as input for the other, we separately tune two sets of hyperparameters for each model, one based on each of the single-task metrics, yielding, for example, one Multi model optimized for predicting emotion and one optimized for predicting cause. The hyperparameters we tune are dropout in our linear layers, initial learning rate of the optimizer, COMET relation type, lambda weight for our multi-task models, and the type of pooler for emotion classification (enumerated in subsection 4.1).

[4] Our cause tagging task has only one type, "cause", as GoodNewsEveryone is aggregated such that each data point has exactly one emotion-cause pair. We note that this problem formulation leaves open the possibility of multiple emotion-cause pairs.
[5] The default initialization from the gluon package: https://mxnet.apache.org/versions/1.7.0/api/python/docs/api/gluon/index.html

Results
We present the results of our models in Table 1 [6]. We see that the overall best model for each task is a multi-task adapted knowledge model, with Multi_COMET_{C→E} performing best for emotion (a statistically significant improvement over BERT by the paired t-test, p < 0.05) and Multi_COMET performing best for cause. These results seem to support our two hypotheses: 1) emotion recognition and emotion cause detection can inform each other, and 2) common-sense knowledge is helpful to infer the emotion and the cause for that emotion expressed in text. Specifically, we notice that Multi_{C→E} alone does not outperform BERT on either cause or emotion, but Multi_COMET_{C→E} outperforms both BERT and Multi_{C→E} on both tasks. For cause, we also see additional benefits of common-sense reasoning alone: BERT_COMET outperforms BERT (multi-task modeling alone, Multi, also outperforms BERT for this task) and Multi_COMET outperforms Multi. These results speak to the differences between the two tasks, suggesting that common-sense reasoning, which aims to generate implicit emotions, and cause information may be complementary for emotion detection, but that for cause tagging, common-sense reasoning and given emotion information may overlap. The common-sense reasoning we have used in this task (xReact and oReact from ATOMIC) is expressed as possible emotional reactions to an input situation, so this makes intuitive sense.

[6] Oberländer and Klinger (2020) report an F1 score of 34 in this problem setting on this dataset, but on a larger subset of the data (as they do not limit themselves to the Ekman emotions), and so we cannot directly compare our work to theirs.
Finally, we also present per-emotion results for our best model for each task (Multi_COMET_{C→E} for emotion and Multi_COMET for cause) against the single-task BERT baselines in Figure 5 and Figure 6; these per-emotion scores are again the average performance of models trained with each of our five random seeds. We see that each task improves on a different set of emotions: for emotion classification, Multi_COMET_{C→E} consistently improves over BERT by a significant margin on joy and to a lesser extent on anger and sadness. Meanwhile, for cause tagging, Multi_COMET improves over BERT on anger, disgust, and fear, while yielding very similar performance on the rest of the emotions.

Analysis and Discussion
In order to understand the impact of common-sense reasoning and multi-task modeling for the two tasks, we provide several types of analysis in addition to our results in section 6. First, we include examples of our various models' outputs showcasing the impact of our methods (subsection 7.1). Second, we carry out an analysis of the dataset, focusing on the impact of label variation among multiple annotators on the models' performance (subsection 7.2).

Example Outputs
We provide some example outputs from our systems for both cause and emotion in Table 2; the various Multi models have been grouped together for readability and because they often produce similar outputs, but the outputs for every model are available in the appendix. In the first example, the addition of COMET to BERT informs the model enough to choose the gold emotion label; in the third and fourth, either COMET or multi-task learning is enough to help the model select key words that should be included in the cause (return and triple shooting). We also particularly note the second example, in which multi-task learning is needed for both the BERT and BERT_COMET models to be able to correctly predict the gold emotion. This suggests that both common-sense reasoning and emotion classification may carry overlapping useful information for cause tagging, while for emotion, different instances may be helped more by different aspects of our models.

Label Agreement
Input (gold emotion)                                                        | BERT              | Multitask         | BERT_COMET        | Multitask_COMET
Mexico reels from shooting attack in El Paso (fear)                         | negative surprise | negative surprise | fear              | fear
Insane video shows Viking Sky cruise ship thrown into chaos at sea (fear)   | negative surprise | fear              | negative surprise | fear

Input (gold cause highlighted in yellow in the original)                    | BERT                                      | Multitask / BERT_COMET / Multitask_COMET
Durant could return for Game 3 (positive surprise)                          | for game                                  | could return for game
Dan Fagan: Triple shooting near New Orleans School yet another sign of
city's crime problem (negative surprise)                                    | school yet another sign of city's crime   | : triple shooting near new orleans school yet another sign of city's crime

Table 2: Example outputs from our systems. For each example, the gold cause is highlighted in yellow and the gold emotion is given under the text; the first two examples give our models' emotion outputs; the latter two, their causes. Joined cells show that multiple models produced the same output. To make this table easier to read, "Multitask" here may refer to Multi, Multi_{E→C}, or Multi_{C→E} (details on selection and results for each individual model are available in the appendix; most multi-task models gave similar outputs).

Upon inspection of the GoodNewsEveryone data, we discover significant variation in the emotion labels produced by annotators, as cautioned by the authors in their original publication [7]. From our inspection of the development data, we see recurring cases where different annotators give directly opposing labels for the same input, depending on how they interpret the headline and whose emotions they choose to focus on. For example, our development set includes the following example:

Simona Stuns Serena at Wimbledon: Game, Set and "Best Match" for Halep. The gold adjudicated emotion label for this example is negative surprise, but annotators actually included multiple primary and secondary emotion labels including joy, negative surprise, positive surprise, pride, and shame, which can be understood as various emotions felt by the two entities participating in the event (Simona Halep and Serena Williams). For this input, COMET suggests xReact may be happy or proud and oReact may be happy; these reactions are likely most appropriate for tennis player Simona Halep, but not the only possible emotion that can be inferred from the headline. Inspired by the variation in the data, we also compute the models' accuracy using the human annotations that did not agree with the gold (i.e., a predicted emotion label is correct if it was suggested by a human annotator but was not part of a majority vote to be included in the gold). We denote this ¬Gold, and we compare the performance of our models with respect to Gold and ¬Gold. We present the results of this analysis in Table 3. In this table, a higher ¬Gold accuracy means that the model is more likely to produce emotion labels that were not the gold but were suggested by some annotator. First of all, we note that all models have a relatively high ¬Gold accuracy (about half the magnitude of their gold accuracy); we believe this reflects the wide variety of annotations given by the annotators. We see a tradeoff between the Gold and ¬Gold accuracy, and we note that generally the single-task models have higher ¬Gold accuracy and the COMET-enhanced multi-task models have higher Gold accuracy. This suggests that our language models have general knowledge about emotion already, but that applying common-sense knowledge helps pare down the space of plausible outputs to those that are most commonly selected by human annotators.
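The Gold and ¬Gold accuracies can be sketched as follows (hypothetical names; annotator_labels[j] is the set of labels suggested by any annotator for example j):

```python
def gold_and_notgold_accuracy(preds, golds, annotator_labels):
    """Compute (Gold accuracy, notGold accuracy) over a label set.

    A prediction is notGold-correct if it differs from the adjudicated
    gold label but was suggested by at least one annotator.
    """
    n = len(preds)
    gold_acc = sum(p == g for p, g in zip(preds, golds)) / n
    notgold_acc = sum(p != g and p in a
                      for p, g, a in zip(preds, golds, annotator_labels)) / n
    return gold_acc, notgold_acc
```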
Recall that this dataset was annotated by taking the most frequent of the annotator-provided emotion labels. Further, since the multi-task models have higher Gold accuracy and lower ¬Gold accuracy than the single-task models, this suggests that also predicting the cause of an emotion causes the model to narrow down the space of possible emotion labels to only those that are most common.

Conclusions and Future Work
We present a common-sense knowledge-enhanced multi-task framework for joint emotion detection and emotion cause tagging. Our inclusion of common-sense reasoning through COMET, combined with multi-task learning, yields performance gains on both tasks including significant gains on emotion classification. We highlight the fact that this work frames the cause extraction task as a span tagging task, allowing for the future possibility of including multiple emotion-cause pairs per input or multiple causes per emotion and allowing the cause to take on any grammatical role. Finally, we present an analysis of our dataset and models, showing that labeling emotion and its semantic roles is a hard task with annotator variability, but that commonsense knowledge helps language models focus on the most prominent emotions according to human annotators. In future work, we hope to explore ways to integrate common-sense knowledge more innately into our classifiers and ways to apply these models to other fine-grained emotion tasks such as detecting the experiencer or the target of an emotion.

Ethical Considerations
Our intended use for this work is as a tool to help understand emotions expressed in text. We propose that it may be useful for things like product reviews (where producers and consumers can rapidly assess reviews for aspects of their products to improve or expand), disaster relief (where those in need of help from any type of disaster can benefit if relief agents can understand what events are causing negative emotions, during and after the initial disaster), and policymaking (where constituents can benefit if policymakers can see real data about what policies are helpful or not and act in their interests). These applications do depend on the intentions of the user, and a malicious actor may certainly misuse the ability to (accurately or inaccurately) detect emotions and their causes. We do not feel it responsible to publicly list the ways in which this may happen in this paper. We also believe that regulators and operators of this technology should be aware that it is still in its nascent stages and does not represent an infallible oracle; the predictions of this and any model should be reviewed by humans in the loop, and we feel that general public awareness of the limitations and mistakes of these models may help mitigate any possible harm. If these models are inaccurate, they will output either the incorrect emotion or the incorrect cause; blindly trusting the model's predictions without examining them may lead to unfair consequences in any of the above applications (e.g., failure to help someone whose text is misclassified as positive surprise during a natural disaster, or a worsened product or policy if causes are incorrectly predicted). We additionally note that in its current form, this work is intended to detect the emotions that are expressed in text (headlines), and not those of the reader.
We concede that the data used in this work consists of news headlines and may not be the most adaptable to the use cases we describe above; we caution that models trained on these data will likely require domain adaptation to perform well in other settings. Bostan et al. (2020) report that their data comes from the Media Bias Chart, which reports that their news sources contain a mix of political views, rated by annotators who also self-reported a mix of political views. We note that these data are all United States-based and in English. Bostan et al. (2020) do sub-select the news articles according to impact on Twitter and Reddit, which have their own user-base biases, typically towards young, white American men; therefore, the data is more likely to be relevant to these demographics. The language used in headlines will likely most resemble Standard American English as well, and therefore our models will be difficult to use directly on other dialects and vernaculars.