Improving Empathetic Response Generation by Recognizing Emotion Cause in Conversations

Current approaches to empathetic response generation focus on learning a model to predict an emotion label and generate a response based on this label, and have achieved promising results. However, the emotion cause, an essential factor for empathetic responding, is ignored. The emotion cause is a stimulus for human emotions, and recognizing it helps to better understand human emotions and to generate more empathetic responses. To this end, we propose a novel framework that improves empathetic response generation by recognizing the emotion cause in conversations. Specifically, an emotion reasoner is designed to predict a context emotion label and a sequence of emotion cause-oriented labels, which indicate whether each word is related to the emotion cause. We then devise both hard and soft gated attention mechanisms to incorporate the emotion cause into response generation. Experiments show that incorporating emotion cause information improves the performance of the model on both emotion recognition and response generation.


Introduction
In recent years, open-domain dialogue systems have become increasingly ubiquitous and have been extensively leveraged for mental healthcare and entertainment (Oh et al., 2017; Zhou et al., 2020; Sharma et al., 2020). In part, this progress is driven by advances in neural response generation models (Vinyals and Le, 2015; Li et al., 2016a,c; Gao et al., 2019a,b), which have shown success in generating fluent and relevant responses given a wide variety of user inputs. However, people can still feel a clear gap between humans and machines when conversing with them. One of the primary reasons is that existing dialogue systems lack emotion understanding and empathy (Rashkin et al., 2019). Empathetic responding is a desirable communicative skill that makes communication in daily conversations more natural (Callender, 2015). Table 1 shows an example of empathetic responding from the empathetic-dialogues dataset (Rashkin et al., 2019).

* Equal Contribution † Corresponding author
A speaker talks about a situation that happened to him/her related to a feeling of loneliness, and a listener needs to respond with an appropriate emotion. Empathy is therefore important in conversations. However, endowing dialogue systems with the capability of emotion understanding and empathetic responding is challenging.
Most of the existing approaches improve empathetic response generation from two directions. The first usually promotes the model's emotion understanding (Lubis et al., 2018; Rashkin et al., 2019; Lin et al., 2019; Li et al., 2020b). In this line of work, models are often trained to predict an emotion state of the speaker and generate a response based on the emotion state. The second focuses on improving the response generation strategy (Welivita and Pu, 2020; Shin et al., 2020; Majumder et al., 2020). For example, Shin et al. (2020) proposes to use the look-ahead of user emotion to model empathetic response generation and improves the empathetic responding model via Reinforcement Learning. Majumder et al. (2020) presents an approach to mimic the emotion of the speaker while accounting for their affective polarity.
However, both kinds of existing methods only consider using the surface information of emotions such as emotion labels to improve the quality of generated responses. The emotion cause, an essential factor for empathetic responding, is ignored.
We argue that such surface information of emotions is not sufficient for empathetic responding. The model can better understand human emotions and respond empathetically if it has the ability to perform reasoning about emotions in conversations, which means it needs to identify the cause of a certain emotion. For example, in Table 1, given the dialogue context, we need to recognize not only the emotion "lonely" of the speaker, but also the emotion cause behind the emotion. We can see that the speaker is lonely due to the event "... all friends live ... different country". Here, we could infer that the speaker's emotion is caused by the first utterance containing the aforementioned event. With such deep emotional information, we can generate more relevant and empathetic responses.
To this end, we propose a novel framework to improve empathetic response generation by endowing the empathetic dialogue model with the ability to reason about human emotions in conversations. Specifically, our model is able to identify the cause behind the emotions in addition to the types of emotions. Our framework involves two components, an emotion reasoner and a response generator. The emotion reasoner first performs a context-level emotion prediction and a word-level emotion cause detection, providing emotional information for response generation. The response generator then makes use of such deep emotional information to generate empathetic responses. To incorporate emotion cause information into the response generator, we devise a gated attention mechanism and explore both hard and soft gating strategies to allow the model to focus more on words related to the emotion cause. For model training, we use multi-task learning to build the connection between the emotion reasoner and the response generator.
Our contributions can be summarized as follows:
• An emotion reasoner is designed to recognize the context emotion of the speaker and the emotion cause behind the emotion, providing deep emotional information for response generation. To the best of our knowledge, this is the first work that investigates the emotion cause in empathetic response generation.
• To incorporate emotion cause into response generation, we devise a gated attention mechanism and explore both hard and soft gating strategies, which allow the model to focus on emotion cause related words.
• Experimental results show that our proposed models benefit from the emotion cause and significantly outperform other compared methods, resulting in more empathetic responses.

Related Work
In recent years, neural approaches to open-domain dialogue systems have achieved great progress (Serban et al., 2016; Wolf et al., 2019; Zhang et al., 2020b; Zhou et al., 2020; Wang et al., 2021). In particular, incorporating personality and emotional features can make dialogue systems more human-like. Emotion-aware response generation aims at generating responses corresponding to specific emotions, and several methods have been proposed to tackle this task (Zhou et al., 2018; Colombo et al., 2019; Song et al., 2019; Shen and Feng, 2020; Majumder et al., 2021). Empathetic response generation is a sub-task of emotion-aware response generation. Rashkin et al. (2019) first proposes a standard benchmark that contains large-scale empathetic conversations. Lin et al. (2020) adapts GPT2 (Radford et al., 2019) to generate empathetic responses via transfer learning and further improves its response quality via active learning and negative training. Welivita and Pu (2020) develops a taxonomy of empathetic listener intents, built by human judges, to generate more controlled and interpretable responses. Shin et al. (2020) utilizes reinforcement learning to improve the empathetic responding model, in which the model is rewarded with an estimated user sentiment look-ahead. Lin et al. (2019) models empathy in conversations through a Mixture of Experts and obtains the final output based on the emotion distribution. Majumder et al. (2020) argues that empathetic response generation can mimic the emotion of the speaker, and introduces an emotion stochastic sampling strategy during training. Li et al. (2020b) leverages multi-type knowledge to enrich the dialogue history so that the model can accurately perceive and respond to implicit emotions. Li et al. (2020a) exploits user feedback and multi-granularity emotion, and introduces an adversarial learning framework to capture the nuances of user emotion.

Figure 1: Architecture of the proposed framework. Our framework contains two components: an emotion reasoner (a) and a response generator (b). The emotion reasoner is used to predict a context emotion label and locate words related to the emotion cause, based on the dialogue context. The response generator makes use of the emotional information obtained from the emotion reasoner to generate the response. Specifically, a gated attention mechanism is designed to incorporate emotion cause information into the response generator.

Emotion cause extraction (ECE) aims at exploring the reason for an emotion change, i.e., what causes a certain emotion. Early studies first defined it as a word-level and a clause-level task, respectively. Gui et al. (2016) proposes the first open dataset for ECE, which serves as a standard benchmark up to now. Xia and Ding (2019) reforms ECE into the emotion-cause pair extraction task. Similar to ECE, Poria et al. (2020) first introduces the task of recognizing emotion cause in conversations.

Task Formulation
We formulate the task of empathetic response generation as follows. Given a dialogue context M = {U_1, U_2, ..., U_L} of L utterances, each utterance U_i = {w^i_1, w^i_2, ..., w^i_K} consists of K tokens. Following previous work (Lin et al., 2019; Shin et al., 2020), we concatenate the L utterances together as input. Specifically, we separate utterances by [SEP] tokens and insert a special token [CLS] at the start of the sequence to form an input sequence X = {x_0, x_1, ..., x_N} (see Figure 1 for an example). Given the input sequence X, our goal is to generate an empathetic response Y = {y_1, y_2, ..., y_M} that is emotionally appropriate and relevant to the dialogue context.
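The concatenation scheme above can be sketched as follows. This is an illustrative sketch: the special-token strings and the assumption that utterances are already tokenized are ours, not the authors' implementation.

```python
def build_input_sequence(utterances):
    """Concatenate dialogue utterances into one token sequence.

    A [CLS] token is prepended and consecutive utterances are separated
    by [SEP], mirroring the input format described above.
    """
    tokens = ["[CLS]"]
    for i, utt in enumerate(utterances):
        tokens.extend(utt)
        if i < len(utterances) - 1:
            tokens.append("[SEP]")
    return tokens
```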

Approach
Our framework, which explicitly considers the emotion cause for empathetic response generation, is shown in Figure 1. It contains two components: an emotion reasoner and a response generator. The emotion reasoner is used to predict a context emotion label and locate words related to the emotion cause, based on the dialogue context. The response generator is responsible for incorporating the information obtained from the emotion reasoner and then generating the response. Below, we first introduce how we construct training samples for emotion cause detection, and then describe the two components in detail.

Emotion Cause Annotation
Since we do not have readily available data with emotion cause information on the empathetic dialogue dataset, we leverage an existing emotion cause detection model (Poria et al., 2020) to identify emotion causes at the utterance level in conversations. The model is trained on an open-domain emotional dialogue dataset, RECCON (Poria et al., 2020). Given a dialogue context consisting of L utterances and a context emotion label, the goal of the emotion cause detection model is to identify which utterances in the dialogue context contain the emotion cause. Note that an emotion may have multiple cause-correlated utterances.
To verify the transfer performance of the detection model on the empathetic dialogue dataset used in our work, we randomly selected 100 dialogue samples from the test set and asked 3 human annotators to assign a label ∈ {0, 1} to each utterance in the dialogue context, indicating whether it is a cause-correlated utterance. The final verdict on each sample is determined by majority voting. On these annotated samples, the emotion cause annotation model achieves an accuracy of 89%, indicating that the annotation model has reliable performance.
In our work, we use an emotion reasoner to perform a word-level emotion cause detection. To achieve this, we automatically assign each word in the dialogue context with a binary label. If the word is in a causal utterance, we annotate it with 1, otherwise 0.
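A minimal sketch of this label expansion, assuming the utterance-level detector's output is a list of binary flags (the handling of separator tokens is omitted for brevity, and the function name is ours):

```python
def word_level_cause_labels(utterances, causal_flags):
    """Expand utterance-level cause annotations to word-level binary labels.

    causal_flags[i] == 1 means utterance i was marked cause-correlated by
    the utterance-level detector; every word in that utterance inherits
    label 1. The leading label corresponds to [CLS], which the paper
    always labels with 1.
    """
    labels = [1]  # [CLS]
    for utt, flag in zip(utterances, causal_flags):
        labels.extend([flag] * len(utt))
    return labels
```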

Emotion Reasoner
The emotion reasoner aims to recognize a context emotion given a dialogue context, as well as the cause behind the emotion. It can be decomposed into two tasks: context emotion prediction and emotion cause detection.

Context Emotion Prediction: The context emotion prediction is a classification problem, aiming at predicting a context emotion label ε based on the dialogue context. Specifically, given an input sequence X, we first construct a representation for each word by summing the corresponding word and position embeddings. The word representations are then fed into a transformer encoder to obtain a sequence of contextualized word representations V = {v_{x_0}, v_{x_1}, ..., v_{x_N}}. The context emotion distribution is finally computed based on the representation v_{x_0} of the first special token ([CLS]) as follows:

P(ε | X) = softmax(W_e v_{x_0} + b_e),

where W_e and b_e are trainable parameters.

Emotion Cause Detection: In our work, we perform word-level emotion cause detection, which can provide word-level emotional features for response generation. We formulate emotion cause detection as a sequence labeling problem, where each word in the sequence is labeled with an emotion cause-oriented label ∈ {0, 1}, indicating whether the word is related to the emotion cause. Note that the [CLS] token is always labeled with 1. The sequence of emotion cause-oriented labels will later be used as gating controllers that select the emotion cause-related words in the input sequence for the response generator to attend to. Formally, given an input sequence X = {x_0, x_1, ..., x_N}, the output of this task is a sequence of emotion cause-oriented labels C = {c_0, c_1, ..., c_N}. We compute the probability of the i-th word being related to the emotion cause with a linear layer coupled with a softmax function:

P(c_i | X) = softmax(W_c v_{x_i} + b_c),

where W_c and b_c are trainable parameters.
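The two prediction heads can be illustrated with a small NumPy sketch. Shapes and names mirror the notation above, but this is an assumption-laden illustration, not the authors' implementation:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reasoner_heads(V, W_e, b_e, W_c, b_c):
    """Prediction heads of the emotion reasoner (illustrative sketch).

    V: (N+1, d) contextualized word representations from the encoder,
    with V[0] corresponding to the [CLS] token. Returns the context
    emotion distribution and a per-word binary cause distribution.
    """
    p_emotion = softmax(V[0] @ W_e + b_e)  # (n_emo,)
    p_cause = softmax(V @ W_c + b_c)       # (N+1, 2), per-word labels
    return p_emotion, p_cause
```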
To jointly model context emotion prediction and emotion cause detection, the objective is formulated as:

P(ε, C | X) = P(ε | X) ∏_{i=0}^{N} P(c_i | X).

The parameters of the emotion reasoner can be learned by optimizing a negative log-likelihood (NLL) loss defined as:

L_emo = -log P(ε | X) - Σ_{i=0}^{N} log P(c_i | X).

Response Generator
With the predicted context emotion ε and the emotion cause-oriented labels C obtained from the emotion reasoner, the response generator aims to generate an empathetic response Y = {y_1, ..., y_M} that is emotionally appropriate and relevant to the dialogue context by maximizing the probability P(Y | X, ε, C). The basis of our response generator is a Transformer network, which consists of an encoder and a decoder. Next, we describe how we incorporate the emotional information, including the context emotion ε and the emotion cause-oriented labels C, into the response generator.

Input Representation: To fuse the context emotion label ε into the response generator, we leverage trainable emotion embeddings E_ε ∈ R^{n_emo × d_model} to represent each context emotion label, where n_emo = 32. Each input word of the encoder and the decoder is then represented as the sum of three embeddings: a word embedding E_w, a positional embedding E_p, and an emotion embedding E_ε. We feed the representations of the input sequence X into the encoder to obtain contextualized word representations H = {h_{x_0}, h_{x_1}, ..., h_{x_N}}, which provide context information for the decoder.
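The input representation can be sketched with simple NumPy embedding tables; note that the emotion embedding row is shared (broadcast) across all positions, and the table names and shapes here are assumptions:

```python
import numpy as np

def input_representation(token_ids, emotion_id, E_w, E_p, E_e):
    """Sum of word, positional, and emotion embeddings per position.

    E_w: (vocab, d) word embeddings; E_p: (max_len, d) positional
    embeddings; E_e: (n_emo, d) emotion embeddings. The emotion row
    selected by emotion_id is broadcast over all positions.
    """
    n = len(token_ids)
    return E_w[np.asarray(token_ids)] + E_p[:n] + E_e[emotion_id]
```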
Applications of Attention in Transformer: As proposed by Vaswani et al. (2017), a multi-head attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. Multi-head attention is typically used in two different ways: (1) both the encoder and the decoder contain "Self Attention" layers, where the queries, keys, and values come from the output of the previous layer in the encoder/decoder; (2) in a "Cross Attention" layer, the queries come from the previous self-attention layer, and the output H of the encoder is used as the keys and values. The "Cross Attention" layer is only used by the decoder.

Gated Attention Mechanism: In our work, to leverage emotion cause information, we devise a "Gated Attention" layer on top of the cross attention layer in the decoder, where the queries come from the cross attention layer and the keys and values come from the output H of the encoder. The gated attention mechanism utilizes a sequence of gates G = {g_0, g_1, ..., g_N} to dynamically select elements related to the cause from the input; the decoder is then forced to pay more attention to these selected elements, which carry important information about the context emotion. We will later describe how we obtain the sequence of gates G. For a single-head attention layer of the l-th block in the decoder, the gated attention weight a^(l)_i for the i-th position is computed as:

a^(l)_i = g_i exp(q_l^T h_{x_i}) / Σ_{j=0}^{N} g_j exp(q_l^T h_{x_j}),

where g_i is the gate for the i-th position, q_l is the output of the l-th cross attention layer, and h_{x_i} is the contextualized word representation at the i-th position. The sequence of gates G forces the decoder to pay more attention to important words from the input. A straightforward way is to use a binary gate g_i ∈ {0, 1} to decide whether the decoder should attend to the i-th word. For positions with g_i = 1, the attention weights a^(l)_i are non-zero; for positions with g_i = 0, we have a^(l)_i = 0.
We refer to this gating strategy as "hard gating strategy". However, the hard gating strategy is rather rigid. If the model chooses the wrong words, then important information will be ignored. An alternative method is to use "soft gating strategy" where each gate g i is a continuous value ranging from 0 to 1, indicating how much information of the contextualized word representations at i-th position should be used. The soft gating strategy is more flexible compared with the hard gating strategy. In our work, we explore both soft and hard gating strategies.
Next, we introduce how we compute the sequence of gates G. In the soft gating strategy, the i-th gate g_i in G is defined as g_i = P(c_i = 1 | v_{x_i}), i.e., the probability that the i-th word is related to the emotion cause; the gate value is therefore continuous, ranging from 0 to 1.
In the hard gating strategy, the i-th gate g_i ∈ {0, 1} is a binary label obtained as g_i = c_i. Since sampling discrete labels is non-differentiable and blocks backpropagation, we resort to the Gumbel-Softmax trick (Jang et al., 2017), a procedure for sampling a categorical one-hot value from the Gumbel distribution instead of sampling directly from the categorical distribution.
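The gating computation can be illustrated with a single-head sketch. The sqrt(d) scaling and multi-head projections of the full Transformer, as well as the Gumbel-Softmax machinery used during training, are omitted here; this is an illustration of the weight formula, not the authors' code:

```python
import numpy as np

def gated_attention_weights(q, H, g):
    """Single-head gated attention weights.

    q: (d,) query from the cross-attention layer; H: (N+1, d) encoder
    outputs; g: (N+1,) gates (binary for hard gating, probabilities in
    [0, 1] for soft gating). Positions with g_i = 0 receive exactly
    zero attention weight.
    """
    scores = H @ q
    scores = scores - scores.max()   # numerical stability
    weights = g * np.exp(scores)
    return weights / weights.sum()
```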
The final loss for the response generator is the NLL of the target response:

L_gen = -Σ_{t=1}^{M} log P(y_t | y_{<t}, X, ε, C).

Model Training
Our proposed approach consists of two components: the emotion reasoner and the response generator. To better explore their interaction, we solve both tasks together by multi-task learning. The full loss over the two tasks is computed as:

L = L_gen + L_emo.

We pretrain the emotion reasoner using the objective defined in Eq. 4 before jointly training the two components.
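A minimal sketch of the combined multi-task objective. The mixing weight `lam` is an assumption; the paper does not state how the generation loss and the reasoner loss are balanced:

```python
def total_loss(loss_gen, loss_reasoner, lam=1.0):
    """Multi-task objective: generation NLL plus the emotion reasoner's
    joint NLL. `lam` is a hypothetical balancing coefficient."""
    return loss_gen + lam * loss_reasoner
```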

Dataset
We use empathetic-dialogues (Rashkin et al., 2019) for experiments. The dataset comprises 24,850 open-domain multi-turn conversations between two participants. Each conversation is grounded in a situation description and a fine-grained emotion; there are 32 emotion categories in total. We use the 8:1:1 train/valid/test split following the original dataset definition.

Comparison Methods
The following models are selected as baselines: 1) MoEL (Lin et al., 2019): A transformer-based model that softly combines the outputs of multiple emotion-specific decoders according to emotion distributions. 2) MIME (Majumder et al., 2020): Another transformer-based model, which considers emotion clustering and emotional mimicry. Besides, it also introduces sampling stochasticity during training.
3) EmpDG (Li et al., 2020a): An adversarial model which applies two discriminators to interact with user feedback. It exploits both coarse-grained dialogue-level and fine-grained token-level emotions for generation. 4) MK-EDG (Li et al., 2020b): A context-enhanced empathetic dialogue generator that leverages multi-type external knowledge and emotional signal distilling for response generation. We explore our model with both the hard and the soft gating strategy, as introduced in Sec 4.3, denoted as Ours(Hard) and Ours(Soft). Detailed information about the implementations is covered in Appendix A.

Evaluation metrics
Automatic Evaluation: Four kinds of automatic metrics are applied for evaluation: 1) BLEU (Papineni et al., 2002) calculates the co-occurrence frequency of n-grams between candidates and references. Following MIME and MoEL, we use BLEU-4. 2) BERTScore (Zhang et al., 2020a) uses embeddings from pre-trained language models to compute a weighted cosine similarity between the reference and the generated sentence. We report matching precision, recall, and F1 score (P_BERT, R_BERT, and F_BERT) in our experiments. 3) Dist-{1,2} (Li et al., 2016b) are diversity metrics that measure text diversity by calculating the proportion of distinct n-grams in the text. 4) To evaluate the model's capability for emotion understanding, we adopt emotion classification accuracy (Accuracy).

Human Ratings: Evaluating open-domain dialogue systems is challenging due to the lack of reliable automatic evaluation metrics (Gao et al., 2021), so human judgments are necessary. Following previous work, we randomly sample 100 dialogues and the corresponding responses generated by the different models, and then ask 5 professional annotators to give each response a rating score on the Fluency, Relevance, and Empathy aspects. Each aspect is rated on a scale from 1 to 5, where 1, 3, and 5 indicate unacceptable, moderate, and excellent performance, respectively. To keep the compared methods anonymous, the response order in each sample is fully shuffled.

Human A/B Test: A human A/B test is also conducted. We re-sample another 100 samples and form them into A-vs-B pairs, where A is our model and B is a baseline. Another 3 annotators are asked to choose the better response for each instance; they can also choose Tie if both are good or both are bad. To ensure fairness, each group of the A/B test uses a distinct dialogue context.

Experimental Results

Main results
Automatic Evaluation: As can be seen from the table, our proposed models Ours(Hard) and Ours(Soft) have a clear advantage over the baseline models on all metrics except P_BERT. This demonstrates that our model generates more appropriate and informative responses by recognizing the emotion cause in conversations. We also observe that the difference in performance between Ours(Soft) and Ours(Hard) is not significant, yet each has its own focus: Ours(Soft) outperforms Ours(Hard) on BLEU and BERTScore, while Ours(Hard) performs better on the Dist-1 and Dist-2 ratios. It seems that Ours(Soft) sacrifices diversity for gains in relevance.

Human Evaluation: Table 3 presents all the results in terms of human ratings of Fluency, Relevance, and Empathy. We observe in Table 3 that Ours(Soft) and Ours(Hard) significantly outperform most of the baselines on all three criteria, achieving the best and second-best results, respectively. This indicates that trying to recognize the emotion cause in conversations is beneficial for improving emotional understanding and generating more empathetic responses. Besides, the soft gating mechanism achieves better performance than the hard gating mechanism. This can be explained by the fact that the hard gating mechanism is rigid in controlling information, so there is a chance that important information will be ignored; the soft gating mechanism, on the other hand, controls the information flow more flexibly.
The A/B test results shown in Table 3 also confirm that the responses generated by Ours(Hard) and Ours(Soft) are preferred by humans over those of the other baseline models, and that the results of Ours(Soft) are slightly better than those of Ours(Hard).
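As a reference for the diversity scores discussed above, the Dist-n metric (Li et al., 2016b) can be sketched in a few lines, assuming whitespace-tokenized responses:

```python
def distinct_n(texts, n):
    """Dist-n diversity metric: the ratio of unique n-grams to total
    n-grams over a collection of tokenized responses."""
    total, unique = 0, set()
    for tokens in texts:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```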

Ablation Analysis
Example 1
Emotion: Excited
Context:
Speaker: i am so happy it's a boy
Listener: you just had a baby?! i didn't even know you were expecting!
Speaker: he'll be here in about 3 months
EmpDG: that is a big deal with.
MoEL: that is really nice of you!
MIME: i am sure you were able to get the kids.
MK-EDG: i bet he was so excited
Ours(Hard): i am happy for you. he must be a good boy!
Ours(Soft): that's so nice. i bet you are going to be a great parent.

Example 2
Emotion: Impressed
Context:
Speaker: my friend purchased a new car. the ride is sweet!
Listener: nice what kind of car is it they purchased
Speaker: they ended up getting a honda prelude.
EmpDG: wow that is so exciting!
MoEL: oh wow! that is amazing!
MIME: i hope you get your friends.
MK-EDG: wow that is a lot of money
Ours(Hard): that's great. i hope you will have one soon!
Ours(Soft): wow, that's cool. i hope they get a great vehicle.

Table 6: Example responses generated by different models, as well as a visualization of the emotion cause detection in our models. Words highlighted in blue are the results of emotion cause detection for Ours(Hard), while words highlighted in purple are the results of emotion cause detection for Ours(Soft). Darker color indicates a higher probability that the word is related to the emotion cause.

In order to assess contributions toward the final performance, we perform a series of ablation studies for Ours(Hard) and Ours(Soft).

Effect of Emotion Label: To assess the contribution of the predicted emotion label that is incorporated into the response generator, we remove the emotion embedding from the input of the encoder and decoder in the response generator. As shown in Table 4, for both models, removing the emotion embedding causes performance degradation, and Ours(Soft) suffers a more pronounced degradation in terms of BLEU and BERTScore. This indicates that the information provided by the emotion label is important for improving response quality.

Effect of Multi-task Learning: Multi-task learning is used to build the connection between the emotion reasoner and the response generator. As shown
in Table 4, the two models trained with multi-task learning achieve better performance in response generation than the two models trained without it. At the same time, we find that multi-task training is not very helpful for emotion recognition, where the models only obtain a small improvement.

Effect of Emotion Cause: To investigate the impact of the emotion cause on emotion recognition and empathetic response generation, we remove the emotion-cause-related parts from the emotion reasoner and the response generator at the same time: the emotion reasoner only performs the emotion recognition task, and we remove from the response generator the gated attention mechanism that incorporates emotion cause information. Looking at Table 4, we can clearly see that removing the emotion cause part causes a significant decrease in the performance of both models on both response generation and emotion recognition. In particular, the accuracy of emotion recognition drops from 42.4% to 38.5%. This indicates that the emotion cause plays an important role in promoting the understanding of emotions, confirming our insights about the emotion cause. The gated attention mechanism can be seen as a denoising technique that allows the model to acquire important information relatively easily.

Emotion Cause vs. Emotion Lexicon: The emotion lexicon also plays an important role in sentiment analysis and empathetic response generation (Li et al., 2020b). To further demonstrate the superiority of the emotion cause, we compare the importance of the emotion cause with that of the emotion lexicon. Similarly, we assign a label to each word in the input sequence using NRC-VAD, indicating whether the word is an emotion word. The emotion reasoner then performs both emotion recognition and emotion lexicon detection, and the information is used for response generation.
The results shown in Table 4 indicate that the information provided by the emotion cause is more useful for helping the model understand emotions and the dialogue context than surface emotional information such as the emotion lexicon.

Case Study
We also present some example responses generated by our models and the baseline models in Table 6. As shown in the first example, Ours(Hard) does a good job of identifying words that are relevant to the emotion cause. In addition, both Ours(Hard) and Ours(Soft) generate responses that are more empathetic and contextually relevant to the conversation than those of the other baseline models. In the second example, Ours(Soft) again succeeds in locating the words associated with the emotion cause. The responses generated by Ours(Hard) and Ours(Soft) are more informative and have a richer expression of affection, while the responses generated by the other models are monotonous and lack empathy.

Conclusion
In this paper, we presented a novel framework that incorporates emotion cause information into empathetic response generation. Our approach consists of an emotion reasoner and a response generator. The emotion reasoner first predicts a context emotion label and locates the words in the dialogue context that are associated with the emotion cause. The response generator then generates a response using the predicted context emotion label and the emotion cause information. To incorporate the emotion cause information into response generation, we devise a gated attention mechanism and explore both hard and soft gating strategies. Automatic and manual evaluations show that our proposed models can generate more meaningful and empathetic responses.

Ethical Considerations
The empathetic-dialogues dataset (Rashkin et al., 2019) used in our paper was annotated through Amazon Mechanical Turk, which protects the privacy of real users. Besides, we ensure anonymity in the emotion cause annotation of this dataset and in the human evaluation process. We believe our research work meets the ethics requirements of EMNLP.

A Implementation Details
Our models are implemented using PyTorch and Texar-PyTorch, a modularized, versatile, and extensible toolkit for machine learning and text generation tasks. We use 300-dimensional word embeddings and a hidden size of 300 throughout our experiments. The word embeddings are initialized with pre-trained GloVe vectors. The transformer encoder of the emotion reasoner has one layer and one attention head, and we remove the position embeddings in the emotion reasoner. A Transformer network with 6 layers and 8 attention heads is used for the response generator. We train our models using Adam optimization with a learning rate of 0.0005, and the maximum number of tokens per batch is set to 8192. Early stopping is applied during training. Training takes about 2 hours for around 80 epochs on a single Tesla V100 GPU. All results of the different methods are generated with top-K sampling, with K set to 3 in our experiments.
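The top-K decoding described above can be sketched in pure Python. Representing `logits` as a token-to-score dict is an illustrative simplification of our setup, not the actual decoding code:

```python
import math
import random

def top_k_sample(logits, k=3, rng=random):
    """Top-K sampling: keep the k highest-scoring tokens, renormalize
    their probabilities with a stable softmax, and sample one token.
    `logits` maps token -> unnormalized score."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    exps = [math.exp(v - top[0][1]) for _, v in top]  # stable softmax over top-k
    total = sum(exps)
    r = rng.random() * total
    acc = 0.0
    for (tok, _), e in zip(top, exps):
        acc += e
        if r < acc:
            return tok
    return top[-1][0]  # guard against floating-point rounding
```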

B Results
In our experiments, we repeated 5 runs with different seeds (1024, 2048, 3170, 4096, and 5120) and averaged the results. The full results of the different methods are presented in Table 7.