Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible Knowledge Selection

Empathy, which is widely used in psychological counselling, is a key trait of everyday human conversations. Equipped with commonsense knowledge, current approaches to empathetic response generation focus on capturing the implicit emotion within the dialogue context, where the emotion is treated as a static variable throughout the conversation. However, emotions change dynamically between utterances, which makes it difficult for previous works to perceive the emotion flow and predict the correct emotion of the target response, leading to inappropriate responses. Furthermore, simply importing commonsense knowledge without harmonization may trigger conflicts between knowledge and emotion, which confuse the model into choosing incorrect information to guide the generation process. To address the above problems, we propose a Serial Encoding and Emotion-Knowledge interaction (SEEK) method for empathetic dialogue generation. We use a fine-grained encoding strategy that is more sensitive to the emotion dynamics (emotion flow) in conversations to predict the emotion-intent characteristics of the response. Besides, we design a novel framework to model the interaction between knowledge and emotion to generate more sensible responses. Extensive experiments on EMPATHETICDIALOGUES demonstrate that SEEK outperforms the strong baselines in both automatic and manual evaluations.


Introduction
Enriching dialogue systems with human characteristics and capabilities is a hotspot in the human-like dialogue system research area. Empathy, which is used extensively in psychological counselling (Sharma et al., 2021; Liu et al., 2021; Sharma et al., 2020), is a key trait of everyday human conversations. In contrast to generating responses with
controlled emotions (Zhou et al., 2018; Zheng et al., 2021), the key to an empathetic dialogue system is to understand the user's emotions and generate appropriate responses. Several works concentrate on improving empathetic models' ability to capture contextual emotions by emotion mimicry (Majumder et al., 2020), feedback-based adversarial generation (Li et al., 2019), or a mixture of experts (Lin et al., 2019). On the other hand, Sabour et al. (2021) and Li et al. (2020) introduce commonsense knowledge into empathetic models so as to better perceive implicit semantic information and generate more informative and empathetic responses. However, the existing works all perform dialogue-level emotional perception (Lin et al., 2019; Majumder et al., 2020; Li et al., 2019; Sabour et al., 2021; Li et al., 2020). Since emotions change dynamically throughout conversations, this coarse modeling at the dialogue level (recognizing the emotion of the whole conversation context) cannot capture the process of emotional dynamics and makes it difficult to predict response emotions. Welivita and Pu (2020) have studied the shifting pattern of utterances and drawn two graphs showing the most common emotion-intent flow patterns (with a frequency ≥ 5) throughout the first four dialogue turns, as well as the global exchange trends of emotion-intent between speakers and listeners in the EMPATHETICDIALOGUES dataset. For instance, in the first case illustrated in Fig.
1, the speaker's emotion shifts from afraid at the beginning of the conversation to an embarrassed self-deprecation about a previous experience of fearing heights (sharing such a funny story). Accordingly, the dialogue agent should express the same self-deprecating sentiment as the gold response. Nevertheless, the baseline models have difficulty capturing subtle changes in the speaker's emotions and can only provide responses according to the fear detected. Moreover, merely introducing knowledge without making emotionally logical choices may lead to logical conflicts between knowledge and emotion in the generated responses. As illustrated in the second case in Fig. 1, the CEM (Sabour et al., 2021) model chooses the wrong knowledge and is unable to give empathetic responses with nostalgic overtones, which brings knowledge and emotion into conflict.
To this end, we propose a Serial Encoding and Emotion-Knowledge interaction (SEEK) method for empathetic dialogue generation. To achieve a more fine-grained perception of emotional dynamics, we use an utterance-level encoding strategy that is more sensitive to the emotion flow in conversations and able to predict the emotion characteristics of the response. We further introduce two new emotion-intent identification tasks to understand contextual emotion and predict the emotional and intentional traits of responses. For the problem of conflicts between knowledge and emotions, we also design a framework modeling the process of bi-directional interaction between them. Extensive experimental results on the utterance-level annotated EMPATHETICDIALOGUES (ED) dataset (Welivita and Pu, 2020) demonstrate that SEEK outperforms the strong baselines on both automatic and manual evaluation metrics. Our contributions are summarized as follows: • To the best of our knowledge, our work is the first to model the emotion flow that captures the process of emotional dynamics in the task of empathetic dialogue generation. In addition to the coarse emotion at the dialogue level, we introduce fine-grained emotions at the utterance level.
• By modelling the bi-directional interactive selection process between commonsense knowledge and emotions, we have improved not only the ability to recognize contextual emotions, but also the ability to filter out unreasonable external knowledge, allowing the model to generate more sensible empathetic responses.
• The automatic and manual evaluation on the annotated ED dataset shows that our proposed model is superior to the strong baselines and capable of generating more diverse and sensible empathetic responses.

Related Work
In order to control the emotion of the generated response, which is one of the fundamental characteristics of daily conversation, plenty of approaches (Zhou et al., 2018; Zheng et al., 2021; Zhong et al., 2019; Shen and Feng, 2020; Liang et al., 2021) view the target emotion as guiding information for the models' generator. Contrary to controlling the emotion of the target response, the task of empathetic dialogue generation requires that models learn a proper emotion with which to express empathy. Numerous researchers have attempted to improve dialogue models' ability to respond empathetically. Rashkin et al. (2019) proposed a benchmark and dataset to build and evaluate empathetic dialogue generation models. Lin et al. (2019) learned a precise emotion distribution of the response based on a mixture of experts. Majumder et al. (2020) split the emotions into two classes and designed a framework to mimic the target emotion within a class. Li et al. (2019) utilized user feedback to build a multi-resolution adversarial training framework. In addition, Kim et al. (2021) and Kim et al. (2022) focused on the keywords and emotion causes of the dialogue history to better understand context-level emotion and recognize feature transitions between utterances. Several datasets (Liu et al., 2021; Welivita et al., 2021) for empathetic dialogue generation have also been published for further research. However, most current approaches do not pay enough attention to the emotion flow of conversations.
Commonsense knowledge is widely used to build dialogue systems. Zhong et al. (2021a) utilize a commonsense knowledge graph to obtain candidate words for generation. Sabour et al. (2021) adopt COMET (Bosselut et al., 2019), a pre-trained language model, to generate commonsense inferences for retrieving implicit information from the dialogue context. In addition, Li et al. (2020) construct a graph-based framework to encode a context-knowledge graph retrieved from a commonsense knowledge base. The knowledge introduced into these models might become a trigger of logical conflicts due to the absence of harmonized selection.

Task Formulation
The task of empathetic dialogue generation is to generate empathetic responses based on the historical context. Given a dialogue D, the context and the target response are denoted as C = [C_1, ..., C_{N-1}] and Y respectively, with an emotion label e_c for the whole context. Additionally, a sequence of emotion-intent labels EI = [ei_1, ..., ei_{N-1}, ei_Y] is given for the corresponding utterances in D, covering 32 emotion categories and 9 common intent classes. Our goal is to generate the next utterance Y, which is fluent, coherent with the context, and expresses empathy with the speaker's situation and feelings.
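The notation above can be summarized in a small container type. This is an illustrative sketch only; the field names below are our own, not identifiers from the paper or dataset release:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Dialogue:
    context: List[str]           # utterances C_1 .. C_{N-1}
    target: str                  # gold response Y
    dialogue_emotion: str        # e_c, one of the 32 emotion categories
    utterance_labels: List[str]  # ei_1 .. ei_{N-1}, ei_Y (32 emotions + 9 intents)

d = Dialogue(
    context=["I almost fell off a ladder today.", "Oh no, are you okay?"],
    target="Haha yes, just my pride was hurt!",
    dialogue_emotion="afraid",
    utterance_labels=["afraid", "questioning", "joking"],
)
```

Note that the label sequence has one entry per context utterance plus one for the target response, which is what the response emotion-intent prediction task supervises.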

Utterance and Knowledge Encoder
Utterance Encoding: To get a precise representation of each utterance, we first encode the context at the utterance level to extract the contextual information. We employ a Transformer (Vaswani et al., 2017) encoder. The embedding of the input is the sum of the word embedding, positional embedding, and dialogue state embedding. Following previous work, we prepend the utterance u_i with the [CLS] token to obtain the utterance input. The embedding is then fed into the Transformer, and we obtain the representation:

H_{U_i} = Encoder(E(u_i)),

where H_{U_i} ∈ R^{L_n×d}, L_n is the length of the utterance, and d is the hidden size of the encoder. We take the representation of [CLS] to represent the utterance:

h_{U_i} = H_{U_i}[0].

Knowledge Encoding: In order to generate high-quality commonsense inferences for the corresponding context, we utilize COMET (Bosselut et al., 2019), a GPT (Radford et al., 2018) language model pre-trained and fine-tuned on ATOMIC (Sap et al., 2019), to generate five types of commonsense knowledge: the effect on the person (xEffect), the reaction of the person speaking the corresponding sentence (xReact), the intent before the person speaks (xIntent), what the person needs (xNeed), and what the person wants after speaking the sentence (xWant). Appending these five special relation tokens to the utterance and feeding them into COMET, we obtain five commonsense inference texts for each relation of the input utterance and concatenate them to K_i. Similarly, we encode the knowledge text using the same Transformer encoder and average the encoded hidden states via mean pooling (Zhong et al., 2021b):

k_i = MeanPooling(Encoder(E(K_i))). (3)
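A rough sketch of this pipeline in plain Python. The relation-token query format and the two pooling helpers below are our assumptions for illustration, not the paper's exact implementation; hidden states are plain lists standing in for encoder output tensors:

```python
RELATIONS = ["xEffect", "xReact", "xIntent", "xNeed", "xWant"]

def comet_queries(utterance):
    # One generation query per COMET relation; the model would decode five
    # inference texts per relation, which are concatenated into K_i.
    return [f"{utterance} {rel} [GEN]" for rel in RELATIONS]

def cls_pool(hidden_states):
    # h_{U_i}: the encoder state of the prepended [CLS] token (position 0).
    return hidden_states[0]

def mean_pool(hidden_states):
    # k_i: average of the encoded knowledge token states over the length axis.
    length = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[j] for h in hidden_states) / length for j in range(dim)]
```

The key design point is that utterances and knowledge texts share one encoder, but utterances are pooled via [CLS] while knowledge texts are mean-pooled.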

Emotion Flow Perceiver
Treating the emotional understanding of each utterance as a tagging task, we use a Bi-LSTM to model the emotion dynamics and the interactions between different utterances for the contextual understanding process.
The input of the Bi-LSTM is the concatenation of the encoded utterance and knowledge representations:

Û_i = Bi-LSTM([h_{U_i}; k_i] W_a),

where W_a ∈ R^{2d×d} is a trainable weight and Û_i ∈ R^{2d} represents the processed utterance representation.
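A minimal sketch of this serial pass, with a plain elementwise tanh recurrence standing in for the LSTM cells and the W_a input projection assumed to have been applied beforehand; it only illustrates how per-utterance state flows in both directions:

```python
import math

def bi_recurrence(inputs):
    # inputs: one pre-projected vector per utterance, in dialogue order.
    def run(seq):
        h = [0.0] * len(seq[0])
        out = []
        for x in seq:
            # Simplified recurrent cell: carry state across utterances.
            h = [math.tanh(0.5 * hp + xi) for hp, xi in zip(h, x)]
            out.append(h)
        return out
    fwd = run(inputs)
    bwd = list(reversed(run(list(reversed(inputs)))))
    # Û_i ∈ R^{2d}: concatenation of forward and backward states.
    return [f + b for f, b in zip(fwd, bwd)]
```

In the actual model the recurrence is a learned LSTM, but the structural point is the same: each Û_i mixes information from earlier and later utterances, which is what lets the tagger see the emotion flow.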

Fine-grained Emotion Recognition
For a better understanding of the conversation, we pass Û_i through a tagging classifier to produce a fine-grained emotion-intent tagging distribution P_tag ∈ R^t:

P_tag^i = softmax(Û_i W_t),

where t is the number of emotion-intent categories.
We train the tagging module with the cross-entropy loss between the predicted distribution and the ground truth label for a conversation context:

L_tag = -Σ_{i=1}^{N-1} log(P_tag^i(ei_i)). (6)
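The tagging classifier and its summed cross-entropy loss can be sketched in plain Python; the logits below stand in for the linear layer's output per utterance, and the toy label space replaces the real 41-way emotion-intent space:

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def tagging_loss(utterance_logits, gold_labels):
    # Sum of per-utterance cross-entropy over the emotion-intent categories.
    loss = 0.0
    for logits, y in zip(utterance_logits, gold_labels):
        p_tag = softmax(logits)
        loss -= math.log(p_tag[y])
    return loss
```

A confident correct prediction drives the per-utterance term toward zero, while a uniform distribution over t classes contributes log t.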

Response Emotion-Intent Prediction
The shift in emotion and intent in empathetic dialogue conforms to an intuitive pattern.We use the attention mechanism to learn the shift pattern of emotion and intent between utterances.
ĥ_pre = Attention(Û_1, ..., Û_{N-1}),
P_pre = softmax(ĥ_pre W_p),

where ĥ_pre ∈ R^{2d} is the representation of the predicted emotion-intent characteristic of the response, and W_p ∈ R^{2d×t} is the weight matrix for the linear layer. P_pre denotes the predicted distribution of the emotion-intent of the target response, and t is the number of emotion and intent categories.
During training, we then minimize the cross-entropy loss between the predicted emotion-intent distribution P_pre and the ground truth label ei_Y of the target response:

L_pre = -log(P_pre(ei_Y)). (8)
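A sketch of the attention-based prediction. The dot-product scorer and parameter shapes below are our assumptions (the paper does not spell out the exact attention form); everything is plain lists for clarity:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict_response_ei(utterance_states, score_vec, class_weights):
    # Attention over Û_1 .. Û_{N-1}: score each utterance, normalize, and
    # take the weighted sum as ĥ_pre, the predicted response characteristic.
    scores = [sum(si * ui for si, ui in zip(score_vec, u)) for u in utterance_states]
    alpha = softmax(scores)
    h_pre = [sum(a * u[j] for a, u in zip(alpha, utterance_states))
             for j in range(len(utterance_states[0]))]
    # Linear layer + softmax over the t emotion-intent categories.
    logits = [sum(w * h for w, h in zip(col, h_pre)) for col in class_weights]
    return h_pre, softmax(logits)
```

The attention weights are what let the model learn shift patterns: utterances late in the flow can dominate ĥ_pre when the scorer learns to favor them.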

Dialogue Emotion Recognition
The sequence of utterance representations not only carries the contextual information of the utterances themselves but also indicates the emotional trait of the whole dialogue. Similarly, we employ the attention mechanism to summarize the holistic emotion label based on the sequence:

h_dia = Attention(Û_1, ..., Û_{N-1}),
P_dia = softmax(h_dia W_d),

where h_dia ∈ R^{2d}, and W_d ∈ R^{2d×q} is the weight matrix for the linear layer. P_dia is the distribution of the dialogue emotion, and q is the number of available emotion categories.
The ground truth label of the dialogue emotion is denoted as e^*. The cross-entropy loss utilized to optimize the process of summarizing the conversational emotion is calculated by:

L_dia = -log(P_dia(e^*)). (10)

Knowledge Selecting Decoder
Merely introducing commonsense knowledge into empathetic models without making an emotionally logical selection is not ideal. Sabour et al. (2021) select commonsense inferences with an implicit procedure. In contrast, our method models the process of bi-directional interaction between the emotion and knowledge of the corresponding utterances in the conversation.
We adopt s layers of Cross-Attention Transformer to harmonize emotion and knowledge. Since the utterance representation sequence [Û_1, Û_2, ..., Û_{N-1}] has passed through the three emotion tasks, it contains the emotional characteristics of the corresponding utterances. The inputs of the Cross-Attention Knowledge Selector are the utterance representation sequence, acting as the query vectors, and the encoded knowledge texts generated by the COMET model, K = [K_1, ..., K_{N-1}], acting as the key and value vectors. The hidden representation of the selected knowledge is:

S_i = Cross-Attention(Û_i, H_{K_i}, H_{K_i}),

where S ∈ R^{L_s×d}, L_s is the maximum length of the knowledge text, and d is the hidden size of the model. Afterward, we average the harmonized knowledge via mean pooling (Zhong et al., 2021b):

s̄_i = MeanPooling(S_i).

We take the Transformer Decoder as the backbone of the Decoder. We concatenate the averaged harmonized knowledge s̄ and the predicted response representation ĥ_pre to obtain a mixture of these two types of information to represent the [SOS] token:

h_{SOS} = [s̄; ĥ_pre] W_k,

where W_k ∈ R^{2d×d} is the weight matrix for the linear layer.
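A single-query, single-head sketch of the selector's scaled dot-product cross-attention. The real module stacks s Transformer layers with learned query/key/value projections; this stripped-down version only shows how the emotion-laden utterance vector re-weights knowledge token states:

```python
import math

def cross_attention(query, keys, values):
    # query: one utterance vector Û_i (length d); keys/values: knowledge
    # token states H_{K_i}. Returns the emotion-conditioned knowledge summary.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of knowledge values: tokens compatible with the utterance's
    # emotional characteristics receive higher weight.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]
```

Because the query already encodes emotion-intent information, knowledge tokens that conflict with the detected emotion receive low attention weight, which is the "harmonization" the section describes.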

Training Objectives
During the training process, we minimize three classification losses and a response generation loss. The classification losses are weighted equally:

L_cls = L_tag + L_pre + L_dia.

In order to improve the diversity of the generated responses, we adopt Frequency-Aware Cross-Entropy (FACE) (Jiang et al., 2019) as an additional loss to penalize high-frequency tokens, similar to Sabour et al. (2021):

L_gen = -Σ_{t=1}^{T} Σ_{c_i ∈ V} w_i δ_t(c_i) log(P(c_i | C, y_{<t})),

where w_i is the frequency weight of the i-th token in the vocabulary V, c_i represents a candidate token in the vocabulary, and δ_t(c_i) is an indicator function denoting whether c_i equals the ground truth token y_t.
Lastly, all the parameters of our proposed model are jointly trained and optimized by minimizing the weighted sum of the three mentioned losses:

L = α L_cls + β L_nll + γ L_gen,

where α, β, and γ are hyper-parameters used to balance the three losses. In our experiments, we set α = 1, β = 1, and γ = 1.5.
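The weighted combination and a FACE-style token weighting can be sketched as follows. The frequency weighting below is a simplified linear variant with an assumed slope `a`, not the exact normalization of Jiang et al. (2019), and the loss names mirror the α, β, γ weighting described above:

```python
def face_weights(token_counts, a=-0.5):
    # FACE-style frequency weighting (simplified): a token's weight shrinks
    # linearly with its relative corpus frequency, so frequent tokens are
    # penalized and rare tokens keep a weight near 1.
    total = sum(token_counts)
    rel = [c / total for c in token_counts]
    peak = max(rel)
    return [a * (r / peak) + 1.0 for r in rel]

def total_loss(l_cls, l_nll, l_gen, alpha=1.0, beta=1.0, gamma=1.5):
    # Weighted sum of classification, NLL, and frequency-aware generation losses.
    return alpha * l_cls + beta * l_nll + gamma * l_gen
```

With the defaults above, the diversity-oriented generation term is weighted 1.5x relative to the other losses, matching the hyper-parameters reported in the experiments.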
Experimental Setup

Dataset
Our experiments are conducted on the utterance-level annotated EMPATHETICDIALOGUES (ED) dataset (Rashkin et al., 2019; Welivita and Pu, 2020). ED is a large-scale multi-turn dialogue dataset that contains 25k empathetic conversations between a speaker and a listener. ED provides 32 evenly distributed emotion labels that are common in daily chats. However, the emotion labels of the ED dataset are at the context level; there are no explicit signals for utterance-level emotions. Welivita and Pu (2020) annotated the ED dataset with 41 new categories of utterance-level emotional and intentional labels, which provide fine-grained information about the empathetic dialogues in the ED dataset.

Baselines
We select several strong baseline models for comparison.

Implementation Details
We implement our model in PyTorch (Paszke et al., 2019) and use the Adam (Kingma and Ba, 2015) optimizer. We use 300-dimensional pre-trained GloVe vectors (Pennington et al., 2014) to initialize the word embeddings, which are shared between the encoder and the decoder. During the training stage, the learning rate is initialized to 0.0001 and varied following Vaswani et al. (2017). Our model is trained on one NVIDIA GeForce RTX 3090 GPU with a batch size of 32 and an early stopping strategy. Other settings, such as the dropout rate and maximum decoding steps, are kept the same as in Sabour et al. (2021). Training SEEK takes about 3 hours for around 27,000 iterations.

Automatic Evaluation
Since Liu et al. (2016) showed that automatic metrics based on word overlap, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), can be improper for evaluating dialogue systems, we adopt Perplexity (PPL) and Distinct-n (Dist-n) (Li et al., 2016) as the main automatic metrics of generation quality. For conversational emotion recognition and our two newly introduced tasks, fine-grained emotion-intent tagging and response emotion-intent prediction, we report dialogue emotion accuracy (DE Acc.), utterance emotion-intent accuracy (UEI Acc.), and response emotion-intent accuracy (REI Acc.).
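Dist-n is simple enough to pin down exactly: the ratio of unique n-grams to total n-grams over all generated responses. A reference implementation over whitespace-tokenized responses might look like:

```python
def distinct_n(responses, n):
    # Dist-n: number of unique n-grams divided by total n-grams across
    # all generated responses; higher means more diverse generations.
    ngrams = []
    for r in responses:
        toks = r.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)
```

A model that keeps repeating safe generic phrases ("I am sorry to hear that") collapses the unique-n-gram count and therefore scores low on Dist-1 and Dist-2.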
To examine whether SEEK can generate more sensible responses with fine-grained emotion recognition, we compare the performance of our model with the strong baselines. As shown in Table 1, the diversity scores (Dist-1 and Dist-2) of SEEK outperform all of the baselines, which indicates that our model can generate more informative responses based on external knowledge. We attribute this improvement to the knowledge selector and the predicted emotion of the target responses: the cross-attention mechanism helps select related knowledge based on the contextual information of the utterances, and the predicted vector provides additional information to the generation process.
To examine whether SEEK better understands dialogue emotion, we list the accuracy of the baselines and our proposed model. Remarkably, SEEK surpasses all of the baselines by a large margin; we attribute this performance increase to the two fine-grained tasks we introduced: the better the model comprehends the utterances in a dialogue, the higher accuracy it achieves. In terms of the two new accuracy scores, UEI Accuracy and REI Accuracy, SEEK reaches satisfying performance, given that these two tasks have 41 categories.

Human Evaluation
Following previous works, we conduct a human evaluation based on three aspects. Coherence (Coh.): how relevant is the response to the context? Empathy (Emp.): how well does the model understand the speaker's situation and emotional characteristics, and does it respond empathetically enough or give suggestions? Fluency (Flu.): how grammatical is the generated response? We randomly choose 100 dialogues and assign the responses generated by the models to three crowd-sourced workers for evaluation. Each aspect is rated on a scale of 1 to 5. Moreover, considering the variation between individuals, we conduct an additional human A/B test to directly compare our method with the baselines. Three professional annotators score questionnaires of response pairs, choosing one of the responses presented in random order, or selecting "Tie" when the quality of the provided sentences is difficult to distinguish. As the results of the human rating and A/B test in Table 3 and Table 4 show, SEEK outperforms the baselines in all three aspects.

Ablation Studies
To study the effect of tasks and modules employed in our model, we remove the newly introduced tasks and the interaction process between emotion and knowledge.Additionally, we replace the knowledge type and encoding strategy respectively.The results are demonstrated in Table 2.
Removing the tasks of fine-grained Utterance Emotion-Intent tagging and Response Emotion-Intent prediction (w/o Utter, w/o Res, and w/o Utter & Res) causes a drop in the accuracy of dialogue emotion recognition and in generative quality, as these variants lose the fine-grained understanding of the dialogue and the ability to predict the emotion-intent characteristics of the target response.
The margin between the variant without emotional harmonization of the knowledge (w/o Emo) and SEEK proves the importance of the interaction between knowledge and emotion-intent in the Knowledge Selection module of our model. The variant without knowledge (w/o Know) indicates the importance of external knowledge for the diversity of the generated responses.
Moreover, the decreased performance from replacing the type of knowledge (+ Others Know) and the encoding strategy (+ Context Enc) shows the superiority of our method. Using the Others type of knowledge in our model rather than PersonX results in a considerable decrease in all performance, which indicates that the PersonX type of commonsense helps the model understand the utterances more effectively. The encoding strategy employed by the baselines (as the variant + Context Enc uses) emphasizes an overall understanding of the whole conversation, ignoring an accurate grasp of the utterances, which leads to a decline in performance.

Table 5: Two cases of generated responses by SEEK and the baselines. We annotate each turn with the emotional or intentional labels at the end of the utterances. The words relevant to the predicted labels in SEEK's response are highlighted in red.
Remarkably, the UEI Accuracy of w/o Utter and the REI Accuracy of w/o Res are higher than those of SEEK. This is possibly due to noise in the utterance labels of the annotated ED dataset and the subtle differences between intent categories (e.g. agreeing and acknowledging, counselling and questioning), which means the classification supervision signal on the utterances or the response sharpens the input vector of the attention module and loses some information about the other classes. This loss of information in the hidden states may confuse the other classifier and lead to a decrease in accuracy. In any case, although there is a trade-off between these two tasks, they can simultaneously improve the model's ability to generate more sensible empathetic responses by modeling the emotion flow.

Case Study
The first case in Figure 1 illustrates how emotion shifts during a multi-turn conversation. To better compare the generated responses of our model and the baselines, we show two generated results in Table 5. In the first case, the baselines fail to give responses with nostalgic overtones, similar to the commonsense knowledge shown in Figure 1, where CEM chooses the wrong knowledge and generates a response with a happy emotion and the intent to have fun. On the contrary, SEEK successfully gives a response with more sensitive and accurate emotional perception. Similarly, in the second case, all of the baselines generate responses based on the explicit emotion guilty, without the more accurate fine-grained understanding. Unlike the baselines, SEEK responds sensitively with a sympathizing intent.
We further draw a heat map to illustrate the cross-attention weights of commonsense knowledge in a certain case. Detailed information about that case and its analysis is given in Appendix A.

Conclusion
In this paper, we study the task of empathetic dialogue generation. The strong baselines ignore the emotion flow of conversations. We therefore proposed a Serial Encoding and Emotion-Knowledge interaction (SEEK) method for empathetic dialogue generation, which predicts the correct emotion of the target response by perceiving the emotion flow of the context and harmonizes commonsense knowledge with fine-grained emotions to avoid conflicts. Experiments on the utterance-level annotated EMPATHETICDIALOGUES show that our model outperforms the baselines, and the ablation studies indicate that all the components of our model, the encoding strategy, and the commonsense knowledge contribute to its performance.
In the future, we will focus on further applications (e.g. providing online emotional aid) of empathetic systems and try to improve the generalization capability of our model on other datasets.

Limitations
The limitations of our work mainly come from the shortage of datasets for the task of empathetic dialogue generation. Although several large-scale datasets have been released recently (Liu et al., 2021; Welivita et al., 2021), most research can only be carried out on the English corpus EMPATHETICDIALOGUES. Another limitation is the problem of evaluation metrics. As mentioned in Liu et al. (2016), the scores of standard automatic evaluation metrics are not consistent with human evaluation results. The lack of task-specific automatic metrics makes evaluating empathetic dialogue generation troublesome.

Ethical Considerations
The data (Rashkin et al., 2019; Welivita and Pu, 2020) used in our work is all drawn from open-source datasets. The conversations in the dataset revolve around given emotions and are carried out by employed crowd-sourced workers, with no personal privacy issues involved.

A More Cases
To show the knowledge selection process of our proposed model, we present the attention weights over the commonsense knowledge in Table 6. We first obtain the weight matrix from the Cross-Attention outputs and look up the words in the knowledge text by the indices of high-value elements. To directly show the selection process, we mark the knowledge words based on their color in the heat map we drew: the higher the weight of a knowledge word, the darker the blue that marks it in the table.
In this case, the context is mainly about a couple of parents asking for the gender of their baby in a hospital, and the COMET model generates 25 commonsense inferences in total based on it. The speaker reacts excitedly to learning the gender of their baby, which implies something to celebrate, and SEEK chooses the correct knowledge and expresses congratulations.

Figure 1: Two cases of multi-turn empathetic dialogues. The first case shows the speaker's emotion going from fear at the beginning of the conversation to an embarrassed self-deprecation, ending with a happy mood. The second case shows that CEM chooses the wrong knowledge, leading to an inappropriate response.

Figure 2: An overall architecture of our proposed model.
At the training stage, we prepend the target response u_N = [y_1, ..., y_T] with the [SOS] token and get the final input of the Decoder Y = [[SOS], y_1, ..., y_T]. The training loss is the standard negative log-likelihood (NLL) loss on the target response u_N:

L_nll = -Σ_{t=1}^{T} log(P(y_t | C, y_{<t})).

Table 1: Automatic evaluation results of baselines and our model. The improvement of SEEK over four strong baselines is statistically significant (paired t-tests with p-values < 0.05).

Table 2: Ablation study of our proposed model SEEK. The best results are marked in bold.
of the context graph constructed on external knowledge. The knowledge-enriched context graph contains emotional dependencies, which helps to understand the emotional characteristics of conversations. CEM: Sabour et al. (2021) use COMET to generate commonsense knowledge based on the last utterance said by the speaker in the dialogue. The authors use five specific prefixes (xIntent, xEffect, xWant, xNeed, xReact) to obtain five types of knowledge corresponding to the last utterance. The model can generate more informative empathetic responses.
Context (case 1):
Speaker: I love YouTube. I've been listening to all my classic tracks. Tupac forever. (Nostalgic)
Listener: I love me some Tupac. Real talk. (Acknowledging)
Speaker: I started out with One Hit Wonders but ended up at Pac. I miss my youth lol. (Nostalgic)

Context (case 2):
Speaker: Yeah about 10 years ago I had a horrifying experience. It was 100% their fault, but they hit the water barrels and survived. They had no injuries, but they almost ran me off the road. (Guilty)