Improving Empathetic Dialogue Generation by Dynamically Infusing Commonsense Knowledge

In empathetic conversations, individuals express their empathy towards others. Previous work has mainly focused on generating empathetic responses by utilizing the speaker's emotion. In addition, external commonsense knowledge has been applied to enhance the system's understanding of the speaker's situation. However, given an event, a commonsense knowledge base contains various relations, which can confuse the dialogue system and lead to inconsistencies among the emotion, the generated response, and the speaker's contextual information. To this end, we propose a novel approach for empathetic response generation that incorporates an adaptive module for commonsense knowledge selection, ensuring consistency between the generated empathetic responses and the speaker's situation. The selected knowledge is used to refine the commonsense cognition and empathy expression of the generated responses. Experimental results show that our approach significantly outperforms baseline models in both automatic and human evaluations, generating more coherent and empathetic responses. Moreover, case studies highlight the interpretability of knowledge selection in the responses and the effectiveness of the adaptive module in our model. Code: https://github.com/Hanscal/DCKS.


Introduction
Empathy is a desirable human ability in our daily conversations. It is known as a complex multidimensional construct encompassing social, cognitive, and emotional processes, which enables us to experience the emotions of others through various emotional stimuli and to understand their implicit mental states (Davis, 1983; Zheng et al., 2021). Previous research (Rashkin et al., 2019; Lin et al., 2019; Majumder et al., 2020; Li et al., 2021b) has explored dialogue systems that enhance empathy in open-domain conversation. To generate empathetic responses, one line of growing interest incorporates commonsense knowledge into conversation modeling (Ghosal et al., 2020; Zhou et al., 2021; Sabour et al., 2021). Yet, understanding the speaker's emotion and showing a contextually appropriate comprehension of her/his situation remain challenges in empathetic conversations. When interacting with a dialogue system, speakers are not expected to explicitly share all the information about their situation and how they may feel. As humans, we use our commonsense knowledge to make connections between what is explicitly mentioned and what is implied. Hence, to address these issues, some prior works (Zhou et al., 2018b; Wu et al., 2020) implement external knowledge to identify the speaker's situation, acknowledge the speaker's status, and bring diversity to the generated responses.
However, a straightforward knowledge-merging method confuses the system and deteriorates response consistency. This is demonstrated in Figure 1, where irrelevant knowledge (Need) may potentially form empathetic responses that conflict with the information about the speaker's emotion (content): the speaker expresses satisfaction with her/his experience, which provides potential informative cognitions based on one unified commonsense. We can assume that if the most appropriate commonsense cognition (Intent) is selected with respect to the emotion status, the generated response shows better consistency and empathy. Therefore, we believe that dialogue systems with rectified knowledge, which aims at unifying the contextual emotion, lead to more consistent and empathetic responses.
In this paper, we address the task of empathetic dialogue generation by dynamically infusing commonsense knowledge. Such additional commonsense knowledge is used to improve the cognitive understanding of the speaker's situation and feelings, and thus enhance the empathy expression in the generated responses. Meanwhile, the dynamic selection stage avoids confusing the dialogue system with knowledge and enhances the consistency of the response with the context history. Our main contributions are summarized as follows:
• We introduce a novel approach that incorporates inferred commonsense knowledge to enhance empathetic response generation.
• We propose an effective knowledge selection paradigm that dynamically selects the commonsense knowledge most relevant to the speaker's cognitive empathy. To the best of our knowledge, this is the first work to study dynamic commonsense knowledge selection for empathetic dialogue generation.
• Experiments show that, by incorporating the selected commonsense, our model generates more empathetic and interpretable responses than previous methods.
Related Works

Empathetic Dialogue Generation
In recent years, research on implementing empathy in open-domain dialogue systems and generating empathetic responses has gained considerable attention. Rashkin et al. (2019) consider a richer and evenly distributed set of emotions and release the EmpatheticDialogues dataset, in which a listener responds empathetically to a speaker who is in an emotional situation. Ghosal et al. (2020) demonstrate that detecting the speaker's emotion is an essential part of generating empathetic responses. Prior studies on emotion-related conversational systems mainly focused on rule-based systems, which rely heavily on hand-crafted features (Zhou and Wang, 2018; Zhou et al., 2018a). Recently, many neural emotional dialogue generation approaches have been explored to control the emotional expression in the target response (Lin et al., 2019; Majumder et al., 2020). However, Li et al. (2021a) reveal that conventional empathetic conversation systems face an emotional inconsistency problem as they strive to produce emotionally rich responses based on predefined user-input emotions.

Connecting Knowledge and Dialogue
Leveraging knowledge from commonsense knowledge bases has been shown to yield a better understanding of the implied emotions within the context (Tu et al., 2022; Lee et al., 2022). ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019) are two such knowledge bases. ConceptNet consists of 36 relations focusing mostly on taxonomic, lexical, and physical commonsense knowledge. Distinguished from ConceptNet, ATOMIC consists of 9 relations that cover social commonsense knowledge, including event-centered causes and effects as well as person-related mental states. Both Zhou et al. (2018b) and Zhang et al. (2019) introduce knowledge triplets from ConceptNet into open-domain response generation. Recently, Li et al. (2022) and Zhong et al. (2021) exploit ConceptNet to enhance emotion reasoning for response generation. Ghosal et al. (2020) utilize ATOMIC in emotional dialogue modeling for emotion identification. Sabour et al. (2021) leverage commonsense from ATOMIC to improve the understanding of the speaker's situation and feelings. Hence, enabling dialogue systems to leverage commonsense and to derive implications from the speaker's explicit statements is highly beneficial for more empathetic responses. In this work, we focus on the task of empathetic dialogue generation on the EmpatheticDialogues dataset and pay attention to social commonsense knowledge from ATOMIC. For each event, we use the social relations in ATOMIC to infer commonsense knowledge about the person involved in the event. We adopt COMET (Bosselut et al., 2019) to generate commonsense sentences for the given events. This model is pre-trained on triplets from ATOMIC and then fine-tuned on ATOMIC-2020 (Hwang et al., 2021), making it more suitable for inferring knowledge about events unseen in the original ATOMIC dataset of everyday events.

Methodology
Our proposed model is built upon a Transformer-based pre-trained language model to generate the listener's utterance. Each conversation is processed in three stages: contextual probing, the contextual unification workspace, and the knowledge-aware decoder. The overview of our model is illustrated in Figure 2.

Task Formulation
The task requires a dialogue model to play the role of the listener and generate empathetic responses. Formally, let U = [u_1, u_2, ..., u_{n−1}] denote a dialogue history of n−1 utterances, and let K denote the set of commonsense knowledge, where each k_i is an empathetic commonsense inference. Our goal is to generate a response Y using the dialogue history U and the commonsense knowledge K as input. The model consists of a dialogue history encoder to encode U, a knowledge encoder to encode K, and a decoder that incorporates the dialogue history, dynamically selects knowledge, and generates the response.

Contextual Probing
To obtain semantic representations of the dialog history and the knowledge from ATOMIC, we divide the context probing part into context encoding and knowledge acquisition.

Context Encoding
We concatenate the utterances in the dialogue history and prepend a special token [CLS] to obtain the dialogue context input U = [CLS] ⊕ u_1 ⊕ u_2 ⊕ ... ⊕ u_{n−1}, where ⊕ is the concatenation operation. Then, we use the final hidden representation of [CLS] as the representation of the whole sequence.
We use the encoder part of BART to acquire the contextual representation. The sequence U is fed into the encoder, and the hidden states of the encoded tokens are

z_ctx = Enc_ctx(U),

where z_ctx ∈ R^{L×d}, L is the length of the sequence, and d is the hidden size of the context encoder.
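As a minimal illustration of this input construction (whitespace tokenization for brevity; the actual model uses the BART subword tokenizer, and the function name here is illustrative):

```python
def build_context_input(utterances, cls_token="[CLS]"):
    # Prepend [CLS], then concatenate all utterances in order (the ⊕ operation).
    tokens = [cls_token]
    for u in utterances:
        tokens.extend(u.split())
    return tokens

ctx = build_context_input(["I stayed home", "Oh no , why ?"])
# ctx[0] is "[CLS]"; its final encoder hidden state represents the whole sequence.
```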

Knowledge Acquisition
In ATOMIC, six relations can be inferred for the person X involved in an event, such as the effect of the event on X (xEffect), X's reaction to the event (xReact), and X's intent before the event (xIntent). For each relation, we concatenate the generated commonsense inferences to obtain its commonsense sequence CS_r = cs_r^1 ⊕ cs_r^2 ⊕ ... ⊕ cs_r^k, which represents the knowledge regarding the speaker's dialogue state (i.e., emotion and situation). Accordingly, similar to the previous section, we prepend [CLS] to the sequences, denoted as E_CSr, which are then fed to five separate commonsense knowledge encoders, as shown in the contextual probing part of Figure 2:

Z_r = Enc_r(E_CSr),

where Z_r ∈ R^{l_r×d} and l_r is the length of the commonsense inference sequence.
Then, we utilize the hidden vector of [CLS] as the representation of each relation, and obtain the fused representation z_r = Average(Z_r[0]) ∈ R^d over all relations by averaging.
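This fusion step can be sketched as follows (illustrative function name; each [CLS] vector stands in for Z_r[0]):

```python
import numpy as np

def fuse_relation_representations(relation_cls_states):
    # relation_cls_states: dict mapping relation name -> [CLS] vector Z_r[0], shape (d,)
    # Returns z_r = Average(Z_r[0]) over all relations.
    stacked = np.stack(list(relation_cls_states.values()))  # (num_relations, d)
    return stacked.mean(axis=0)                             # (d,)

z_r = fuse_relation_representations({
    "xReact": np.array([1.0, 3.0]),
    "xWant": np.array([3.0, 1.0]),
})
```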

Contextual Unification Workspace
To better leverage the hidden representations from knowledge acquisition and context encoding, we apply the workspace module to unify contextual information according to the emotion label. The workspace consists of two parts: emotion classification, which identifies the speaker's status, and adaptive knowledge selection, which excludes irrelevant knowledge representations.

Emotion Classification
In contrast to concatenating the representations at the sequence level, we fuse the additional knowledge with the context representation by point-wise addition:

z_f = z_ctx[0] + z_r.

To acquire a more accurate prediction of the speaker's emotion, given that we are provided with an emotion label e for each conversation, we use the fused representation of knowledge and context to perform emotion classification. We pass z_f through a linear layer g_θ, followed by a softmax operation, to produce the emotion category distribution P_emo ∈ R^q, where q is the number of available emotion categories:

P_emo = softmax(g_θ(z_f)),

where θ ∈ R^{d×q} is the weight matrix of the linear layer. During training, we optimize these weights by minimizing the cross-entropy (CE) loss between the emotion category distribution P_emo and the ground-truth label e: L_emo = − log(P_emo(e)). (5)
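A minimal numpy sketch of the fusion and classification step (unbatched and with illustrative names; the model applies the linear layer g_θ inside the network):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def emotion_loss(z_ctx_cls, z_r, theta, gold_label):
    # Fuse knowledge and context by point-wise addition, then classify.
    z_f = z_ctx_cls + z_r               # fused representation, shape (d,)
    p_emo = softmax(z_f @ theta)        # emotion distribution P_emo, shape (q,)
    loss = -np.log(p_emo[gold_label])   # cross-entropy with gold label e
    return loss, p_emo

# With zero weights the distribution is uniform over q = 3 emotions.
loss, p = emotion_loss(np.array([1.0, 2.0]), np.array([0.5, 0.5]),
                       np.zeros((2, 3)), gold_label=0)
```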

Adaptive Knowledge Selection
We present a knowledge selection method by which the decoder can adaptively choose commonsense representations based on the emotion classification results. Given the set of knowledge representations Z = {Z_r[0]}, the goal is to choose the most appropriate knowledge relations that are consistent with the context representation vector z_ctx. Through this selection paradigm, the irrelevant relations, which would potentially confuse the generated response, are eliminated, boosting the performance of the dialogue system. Inspired by Global Workspace Theory in cognitive science (Blum and Blum, 2022; Baars, 1993), where contextual coordination is realized by eliminating irrelevant cognition, we use the emotion label as the coordination of context and the loss L_emo(g_θ(z), g_θ(z_ctx)) from the supervised evaluation to eliminate irrelevant cognition. The knowledge selection mechanism is divided into two stages, competition and broadcasting:
• During the competition stage, we recursively exclude irrelevant knowledge representations based on the emotion status. Specifically, at iteration m, we choose max_{z∈Z} {L_emo(g_θ(z), g_θ(z_ctx))} as the most irrelevant knowledge representation. To model the influence of knowledge exclusion, we leverage a nonlinear regression method (Xu and Xuan, 2019; Shen et al., 2022) to calculate the dynamics G = ∇_θ f ∈ R^{d×q} of the aforementioned max loss; please refer to the Appendix for technical details. After the last iteration, the remaining knowledge representation, as the winner of the competition, is applied to acknowledge the unified speaker emotion status.
• In the broadcasting stage, the winner of the competition stage is applied to unify the combined representation in the decoder. Specifically, we realize this stage by adding the dynamics of the selection process to rectify the knowledge representation. Thus, the generated response is less affected by unrelated information from the knowledge encoder in the contextual probing module.
We provide Algorithm 1 in the Appendix to detail the exclusion method. Figure 3 displays how the workspace process refines the knowledge representation.
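A minimal sketch of the competition stage (illustrative helper names; for simplicity the gold emotion label is used as the target here, whereas the model scores each relation against the context prediction g_θ(z_ctx)):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def emo_loss(z, theta, gold):
    # Emotion-classification loss for one knowledge representation z.
    return -np.log(softmax(z @ theta)[gold])

def competition(knowledge, theta, gold, keep=1):
    # knowledge: dict mapping relation name -> representation Z_r[0], shape (d,)
    # At each iteration, drop the relation with the highest loss,
    # i.e. the "most irrelevant" knowledge w.r.t. the emotion status.
    pool = dict(knowledge)
    while len(pool) > keep:
        worst = max(pool, key=lambda r: emo_loss(pool[r], theta, gold))
        del pool[worst]
    return pool

winner = competition(
    {"xIntent": np.array([5.0, 0.0]), "xNeed": np.array([0.0, 5.0])},
    theta=np.eye(2), gold=0,
)
```
The surviving relation ("the winner of competition") is the one whose representation best predicts the target emotion; the broadcasting stage then uses it to rectify the decoder-side representation.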

Knowledge-Aware Decoder
Generally, not all knowledge contributes to the generation of the response, so the model should have the ability to select knowledge. Instead of performing knowledge selection in the encoding phase, we leave it to the decoding phase. As shown in the right part of Figure 2, a knowledge-aware cross-attention block is introduced to select knowledge dynamically. The selected knowledge is fed to the context-knowledge refiner, which assists in response generation: the fused knowledge is taken as the input of this block, and the output of this block is then refined to exploit the knowledge contributions.

Knowledge Refiner
In order to refine the context and knowledge contributions in each layer, we replace the residual addition with a refine gate after the knowledge-aware attention block. Denoting h_k as the output of the knowledge-aware attention block and h_c as the residual from the previous block, the output of the refiner can be expressed as

g = σ(w^T [h_k ⊕ h_c]),
R_f(h_k, h_c) = LN(g ⊙ h_k + (1 − g) ⊙ h_c),

where LN is a linear layer, h_k is the rectified knowledge representation, w ∈ R^{2d} is a learnable parameter, and σ denotes the sigmoid function.
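A minimal numerical sketch of this gate (omitting the linear layer LN and batching; the function name is illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine_gate(h_k, h_c, w):
    # h_k: rectified knowledge representation; h_c: residual from previous block.
    # w has shape (2d,), so the gate is a scalar in (0, 1).
    g = sigmoid(np.concatenate([h_k, h_c]) @ w)
    return g * h_k + (1.0 - g) * h_c

# With zero gate weights g = 0.5, so the refiner averages the two streams.
out = refine_gate(np.array([1.0, 3.0]), np.array([3.0, 1.0]), np.zeros(4))
```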

Response Generation
Lastly, the target response Y = [y_1, y_2, ..., y_T] of length T is generated by the decoder token by token, using the embeddings of the tokens generated so far and the commonsense-refined contextual representation R_f(h_k, h_c), which fuses information from both the context and the commonsense inferences. We adopt the standard negative log-likelihood (NLL) loss on the target response Y:

L_gen = − Σ_{t=1}^{T} log P(y_t | U, K, y_{<t}). (9)
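The NLL computation can be sketched as follows (toy probabilities, not model outputs; averaging over tokens is a common normalization choice):

```python
import numpy as np

def nll_loss(token_probs, target_ids):
    # token_probs: (T, V) array of distributions P(y_t | U, K, y_<t)
    # target_ids: gold token ids of the response Y
    return -np.mean([np.log(token_probs[t, y]) for t, y in enumerate(target_ids)])

# Two decoding steps, each assigning probability 0.5 to the gold token.
loss = nll_loss(np.array([[0.5, 0.5], [0.5, 0.5]]), [0, 1])
```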

Training Objectives
All the parameters of our proposed model are trained and optimized based on the weighted sum of the two aforementioned losses:

L = L_gen + γ L_emo,

where γ is a hyper-parameter that controls the influence of these losses. In our experiments, we set γ = 1.

Datasets
We conduct our experiments on EmpatheticDialogues, a large-scale multi-turn dataset containing 25k empathetic conversations between crowdsourcing workers. The dataset also provides an emotion label for each conversation from a total of 32 available emotions.

Baselines
We select the following baseline models for comparison on EmpatheticDialogues: (1) Transformer (Vaswani et al., 2017): an original Transformer trained to optimize the NLL loss. (2) Multi-TRS (Rashkin et al., 2019): a multitask variation of the Transformer trained to jointly optimize an additional cross-entropy loss for emotion classification together with the NLL loss. (3) MoEL (Lin et al., 2019): a Transformer-based model that uses 32 emotion-specific decoders to generate a response, where each decoder is optimized to respond appropriately to one emotion. (4) MIME (Majumder et al., 2020): another Transformer-based model that mimics the context emotion to a varying degree, considering its negative and positive emotions, and then generates an empathetic response based on the blend of these two emotions. (5) EmpDG (Li et al., 2021a): a multi-resolution adversarial framework that applies an empathetic generator to produce empathetic responses and an interactive discriminator to ensure that the generated responses are consistent with the context and also empathetic. (6) CEM (Sabour et al., 2021): an empathetic generation approach that leverages commonsense to draw more information about the speaker's situation and uses this additional information to further enhance the empathy expression in generated responses.

Implementation Details
We implement all models using PyTorch and use the encoder and decoder from the base version of BART. We use the Adam optimizer with an initial learning rate of 5e-5 for 5 epochs. The batch size is 16. The maximum sequence lengths of the source and target are 256 and 64, respectively. We use the same 8:1:1 train/valid/test split provided by Rashkin et al. (2019). In each experiment, we apply an early-stopping mechanism to prevent the model from overfitting, and then report the results of the optimal model on the test set. All training and testing were performed on a 32GB Tesla V100 GPU.

Automatic Evaluation
We employ Perplexity (PPL), corpus-level BLEU (B-n), sentence-level ROUGE (R-n), and Distinct-n (Dist-n) as our main automatic metrics. Perplexity reflects the model's confidence in its candidate responses, with higher confidence resulting in a lower PPL; it evaluates the general quality of the generated responses. Responses with higher BLEU and ROUGE are closer to the ground truth. Distinct-n measures the proportion of unique n-grams in the generated responses and is commonly used to evaluate generation diversity.
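Dist-n can be computed as follows (a common formulation; minor details such as tokenization may differ from the actual evaluation scripts):

```python
def distinct_n(responses, n):
    # Dist-n: proportion of unique n-grams among all generated n-grams.
    ngrams = []
    for r in responses:
        toks = r.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

d1 = distinct_n(["a b a b"], 1)  # 2 unique unigrams out of 4
d2 = distinct_n(["a b c"], 2)    # both bigrams are unique
```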
In addition, since our proposed model and most baseline models perform emotion classification as part of their training process, we also report the prediction accuracy (Acc).

Human Evaluation
Following the methods in CEM, we conduct an aspect-based pairwise preference test. That is, for a given context, we pair our model's response with a response from a baseline and ask annotators to rate each response on four aspects: 1) Coherence (Coh.): which response is more coherent in content and relevant to the context; 2) Empathy (Emp.): which response shows more understanding of the speaker's situation and presents a more appropriate emotion; 3) Informativeness (Inf.): which response conveys more information about the context; 4) Continuity (Con.): which response makes the speaker more willing to continue the conversation. We randomly sample 100 response pairs and shuffle the response order in each sample. Crowdsourcing workers annotate each pair on a scale of 1 to 5.

Automatic Evaluation Results
Table 1 reports the results on automatic metrics. Our model achieves the lowest perplexity, approximately 56% lower than CEM, which suggests that the overall quality of our generated responses is higher than that of the baselines. In addition, our model considerably outperforms the baselines in terms of Dist-n, BLEU-n, and ROUGE-n, which highlights the diversity of the responses and the relevance between the generated responses and the speaker's situation. In terms of emotion classification, our model achieves much higher accuracy than the baselines, nearly 34% higher than CEM, which suggests that the adaptive selection of commonsense knowledge is pivotal for detecting the speaker's emotion.
Table 2 reports the evaluation results with low-resource training sets, and we make the following observations: (1) In the full-data scenario, our model achieves state-of-the-art performance by infusing commonsense knowledge, which demonstrates the importance of knowledge in dialogue generation. Moreover, reducing the number of training samples affects model performance only moderately: even the model using 1/4 of the data attains PPL, BLEU-n, ROUGE-n, and Dist-n values close to those of the model using the full data. (2) In the 1/8 training data scenario, our model achieves performance comparable to the baselines even though they leverage all the training data. (3) Responses generated by our model have higher Dist-n in low-resource scenarios, which means that our model can better exploit information from multiple pieces of knowledge and generate more diverse text.

Ablation Studies
We conduct ablation studies to verify the effectiveness of each component in terms of emotion classification and generation performance. Specifically, we design three variants concerning the workspace, knowledge, and context. Note that since the workspace depends on knowledge and context, when the knowledge or context module is removed, the workspace is removed by default:
1. w/o Adapter: the workspace mechanism used for adaptive commonsense knowledge selection is removed, and the emotion classification is based on the unselected commonsense representations;
2. w/o Knowledge: the commonsense knowledge representation used for emotion classification is removed (Equation 6), and the hidden representation of the [CLS] token from the encoded context is used for emotion classification;
3. w/o Context: the context representation used for emotion classification is removed (Equation 6), while the affective and cognitive commonsense knowledge representations are kept.
The results are shown in Table 1. We observe that removing the workspace module results in lower classification accuracy as well as lower BLEU-n and ROUGE-n, and removing the commonsense knowledge also hurts emotion classification accuracy. These observations suggest that information about both the speaker's emotion and their situation is necessary for correctly identifying their feelings, and that dynamic knowledge selection leverages the knowledge contribution to the cognition of the response.
Removing these components also leads to lower Dist-n scores and higher perplexity, which indicates their effectiveness in generating more diverse and higher-quality responses.

Human Evaluation Results
Table 4 reports the human rating results. We observe that responses from our model are more contextually coherent than those from the baselines. Moreover, with the enhancement of commonsense knowledge, the responses from our model convey more specific and informative content. It is worth noting that, for continuity, our model significantly outperforms all the baselines, which suggests that the generated responses may increase the speaker's engagement and thus enable a more intimate emotional expression.

Gold
You do? That's good, friends can be terrible people to lend too.

Qualitative Studies
Case Study Table 3 shows cases from EmpatheticDialogues, from which we can see that the responses of our method outperform the baselines.
We analyze these cases with respect to the four factors evaluated by humans. In terms of Coherence and Informativeness, our responses are more coherent in content and consistent with the context information. For instance, in case one, being aware of the selected knowledge 'To be home', our method mentions this phrase in the response, so that the response better acknowledges the speaker's intention, whereas the other methods fail to generate consistent responses: MoEL and CEM miss the implication that the speaker is alone at home. The workspace module improves Empathy and Continuity by selecting the most influential commonsense with respect to the context. In both cases, the selected knowledge corresponds to the speaker's situation, which produces a more meaningful response by showing care for the speakers.

Efficacy of Knowledge Selection
The selection process illustrates that the most irrelevant knowledge is identified and eliminated at each iteration. By combining the dynamics from the selection process in the refiner, the generated sentence gradually focuses on the speaker's emotion status, so that our method provides a more interpretable knowledge selection process for the dialogue system. Figure 4 shows the characteristics of the knowledge selection process. It indicates that the workspace module tends to select knowledge inferred from the relation xReact. Since xReact reflects the speaker's reaction to the context, our adaptive selection method potentially provides consistency between context and knowledge.

Conclusions
In this paper, we improve empathetic dialogue generation by dynamically infusing commonsense knowledge to promote the understanding of the speaker's situation and feelings, which leads to more consistent and empathetic responses. The automatic and human evaluations demonstrate the effectiveness of our approach in generating high-quality empathetic responses.

Limitations
One limitation of this work is the set of metrics employed in the automatic evaluation. These metrics mainly focus on the quality of the generated responses and the accuracy of emotion recognition, while automatic evaluation lacks a comprehensive method to evaluate empathy. Another limitation comes from the use of a dataset designed for open-domain dialogue systems, so the responses generated by the proposed framework are not task-oriented.
In the future, we will build empathetic dialogue generation datasets with diverse and task-oriented responses, and develop metrics to evaluate the understanding of the speaker's situation.

A The Details of Cognition Dynamics
Our goal is to calculate the effect of a knowledge representation on the predictions of the linear transformation function g_θ in the workspace module. The influence of excluding an irrelevant knowledge representation can be interpreted as the change of θ with respect to L_emo. In order to eliminate the most irrelevant knowledge representation, we take max(•) over the loss functions with respect to z ∈ Z. However, it is challenging to calculate the gradient when we apply max(•) to a group of loss functions, because this function is non-differentiable. Thus, we first bring differentiability to max_{z∈Z} {L_emo(g_θ(z), g_θ(z_ctx))}.
To simplify notation, the objective function is set as

Φ(θ) = max_j f_j(θ),

where f_j denotes the loss function with respect to the knowledge representation z_j, each f_j is differentiable, and g_θ is the parametric linear layer. Calculating the gradient of θ then turns into the following discrete mini-max problem:

min_θ max_j f_j(θ).

In order to smooth the objective function Φ during iteration m, we linearize f_j at θ_m and obtain the convex approximation of Φ as

Φ̃(θ) = max_j { f_j(θ_m) + ∇f_j(θ_m)^T (θ − θ_m) }. (12)
The linearization term smooths the max(•) function. The next step is to find a descent direction that minimizes Φ̃. However, Φ̃ is not strictly convex with respect to θ, so the algorithm may not reach a global minimum. Hence a regularization term ∥θ − θ_m∥² is added to find a stable descent direction. Denoting the descent direction δ = θ − θ_m, the discrete mini-max problem is now equivalent to

min_δ max_j { f_j(θ_m) + ∇f_j(θ_m)^T δ } + (1/2)∥δ∥². (13)

Problem (13) is a semi-definite quadratic programming (QP) problem, since we choose the ℓ2 norm as the regularization term. When the number of data points in a subgroup is large, widely used QP algorithms, such as the active-set method, are time-consuming. Thus we turn to the dual problem.
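The dual approach can be sketched numerically as follows (a toy projected-gradient solver with illustrative names; the paper does not prescribe this particular solver). Here the rows of G are the gradients ∇f_j(θ_m), flattened into vectors, f collects the losses f_j(θ_m), and the dual variable λ lives on the probability simplex:

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex {λ ⪰ 0, Σλ = 1}.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > (css - 1.0))[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def descent_direction(f, G, steps=500, lr=0.1):
    # Solve max_{λ in simplex} λ^T f - 0.5 * ||G^T λ||^2
    # by projected gradient ascent; return δ = -G^T λ.
    lam = np.full(len(f), 1.0 / len(f))
    for _ in range(steps):
        grad = f - G @ (G.T @ lam)  # gradient of the dual objective in λ
        lam = project_simplex(lam + lr * grad)
    return -G.T @ lam

# With a single loss term, λ = 1 and δ is just the negative gradient.
delta = descent_direction(np.array([0.0]), np.array([[1.0, 2.0]]))
```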
Consider the Lagrangian of problem (13),

L(δ, λ) = λ^T (f + Gδ) + (1/2)∥δ∥², with λ ⪰ 0 and λ^T e = 1, (14)

where f = [f_1(θ_m), ..., f_J(θ_m)]^T, the rows of G are the gradients ∇f_j(θ_m)^T, and e is the all-ones vector. By the strong duality theorem, the minimum of the original problem is equal to the maximum of the dual problem under specific constraints:

min_δ max_{λ⪰0, λ^T e=1} L(δ, λ) = max_{λ⪰0, λ^T e=1} min_δ L(δ, λ). (15)

Note that if 1 − λ^T e ≠ 0, the objective function would be −∞; thus we must have 1 − λ^T e = 0 when the maximum is attained. Setting the gradient of the inner minimization term to zero yields the solution δ = −G^T λ. Substituting this back, the maximization term reduces to

max_{λ⪰0, λ^T e=1} { λ^T f − (1/2)∥G^T λ∥² }. (19)

Suppose λ* is the solution of this QP problem; then δ = −G^T λ* is the solution of problem (13). Thus, we obtain δ as the change induced by eliminating the irrelevant knowledge representation z. By adding δ to the refiner in the decoder module, the final generated response is less affected by the irrelevant knowledge. The effect of δ is demonstrated by the generated responses in

C Additional Case Study
We provide additional qualitative studies complementing Section 4.6. They include: 1) an ablation study of our cognition dynamics (Table 5); 2) additional case studies of generated responses from EmpatheticDialogues (Table 6); 3) a stackplot of the knowledge selection process for all the emotions in EmpatheticDialogues (Figure 5).

Gold
Sorry to hear that. I found that it helps to have multiple savings accounts, so whenever you get a certain amount of money (like 5, 10) save it and be strict not to spend it.

Context
Speaker: I went to visit a friend one day and her boyfriend made a pass at me. He actually suddenly kissed me, as I was waiting for my friend to return. I could not believe his nerve! Listener: Oh no, what did you do? Speaker: I quietly pushed him away, as I did not want my friend to hear. I was so shocked and upset, since he knew I wouldn't want to make a scene and took advantage.

MoEL
That is so nice of her.

MIME
That is nice of you.

EmpDG
That is a lot of people are not. I am sure you will be able to get another one.

Figure 1 :
Figure 1: The framework of our proposed empathetic dialogue generation. The listener acknowledges the speaker's feeling with adaptive commonsense selection and responds with respect to the speaker's emotion status.

Figure 2 :
Figure 2: The architecture of our framework. It consists of three modules: (1) Contextual Probing enhances dialogue comprehension with commonsense knowledge; (2) the Workspace adaptively modifies the cognition of the speaker's status; (3) the Knowledge-Aware Decoder generates empathetic responses.

Figure 3 :
Figure 3: Illustration of the workspace mechanism. During the competition stage, at each iteration the most irrelevant knowledge, for example 'react', is deleted from the set of knowledge representations, denoted by Z − I. The dynamics of the deletion is δ. During the broadcasting stage, the knowledge-aware representation h_k is refined by the dynamics δ.
Models
(1) Transformer: An original Transformer, which is trained to optimize the NLL loss.
(2) Multi-TRS (Rashkin et al., 2019): A multitask variation of the Transformer, trained to jointly optimize the NLL loss and an additional cross-entropy loss for emotion classification.
(3) MoEL (Lin et al., 2019): A Transformer-based model that uses 32 emotion-specific decoders to generate a response, so that each decoder is optimized to respond appropriately to one emotion.
(4) MIME (Majumder et al., 2020): Another Transformer-based model that mimics the context emotion to a varying degree, considering its negative and positive emotions, and then generates an empathetic response based on the blend of these two emotions.
(5) EmpDG (Li et al., 2021a): A multi-resolution adversarial framework that applies an empathetic generator to produce empathetic responses and an interactive discriminator to ensure that the generated responses are consistent with the context and are also empathetic.
(6) CEM (Sabour et al., 2021)
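The multitask objective used by Multi-TRS above can be sketched as follows. This is an illustrative combination of the two losses, assuming an unweighted sum; the exact weighting, tensor shapes, and padding convention are assumptions, not the paper's setup:

```python
import torch
import torch.nn.functional as F

def multitask_loss(lm_logits, target_ids, emo_logits, emo_label, pad_id=0):
    """Jointly optimize response generation (NLL) and emotion classification."""
    # NLL loss over the response tokens, ignoring padding positions
    nll = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),  # (batch*seq, vocab)
        target_ids.reshape(-1),                     # (batch*seq,)
        ignore_index=pad_id,
    )
    # Additional cross-entropy loss for emotion classification
    emo = F.cross_entropy(emo_logits, emo_label)
    return nll + emo
```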

Figure 4: Stack plot of the knowledge selection process.

Figure 5: Stack plot of the knowledge selection process.

Table 1: Results of automatic evaluation. A * represents the adaptive knowledge selection method in the workspace module.
Context: I ended up staying at home for the holidays.
Context: My friend borrowed quite a lot of money from me. I really do believe he will repay me.
Selection Process: xReact → xIntent → xEffect → xWant → xNeed
Selected Knowledge: xNeed: To ask for a loan; To get a loan; To ask him to repay; To ask for money

Table 3: Case study of generated responses from EmpatheticDialogues. The responses with a yellow background color demonstrate awareness of the emotion and the selected knowledge.

Table 4: Results of human evaluation. We report the average scores of four aspects. Fleiss' kappa of the results is 0.36, which constitutes a fair level of agreement.
The results of the ablation study are shown in Table 5, where we also display how the elimination of irrelevant knowledge boosts performance. The inputs are: the set of knowledge representations Z = {Z_r[0]}, Z_r[0] ∈ R^d; a linear layer g_θ, θ ∈ R^{d×q}; the context representation vector z_ctx ∈ R^d from the dialogue history encoder; and the objective function of emotion classification L_emo.
• Competition Stage: while
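Using the symbols listed above, the emotion-classification objective that drives the selection can be sketched as a single linear projection of the context vector. The dimensions below are placeholders and the bias-free layer is an assumption:

```python
import torch
import torch.nn.functional as F

# Hypothetical dimensions: d-dim representations, q emotion classes
d, q = 300, 32
g_theta = torch.nn.Linear(d, q, bias=False)  # theta in R^{d x q}

z_ctx = torch.randn(1, d)        # context vector from the dialogue history encoder
emo_logits = g_theta(z_ctx)      # logits over the q emotion classes
L_emo = F.cross_entropy(emo_logits, torch.tensor([5]))  # emotion-classification loss
```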

Table 5: Ablation study of our cognition dynamics. A * represents the adaptive knowledge selection method in the workspace module. Selection Process illustrates that at each iteration, the most irrelevant knowledge is selected and eliminated.

Context: Speaker: My family and I are going on vacation in a few weeks. We rented a Largo Beachfront Condo and I can not wait! Listener: Oh I love the beach!! Which beach are you going to go to? Speaker: We are going to be staying at Miramar Beach in Florida, about an hour from Panama City Beach. The view is amazing from our balcony and we are so ready!
Selection Process: xIntent → xEffect → xReact → xNeed → xWant
Selected Knowledge: xWant: To enjoy the view; To enjoy the beach; To enjoy the day; To have fun
Emotion: Anticipating
Ours: Sounds like a lot of fun. I have never been there, but I will check it out!
Gold: I bet it is tough. A regular income is nice but the freedom you get from working for yourselfs is better.

Context: Speaker: My parents keep taking my money, that is not nice of them. Listener: Why do they keep taking it?? Speaker: They need it for other stuff. I never get a chance to save any of it.
MoEL: That is great! I hope you find a good job!
MIME: That is a good thing to do.
EmpDG: That is a very nice thing to do.
CEM: Oh no! I am sorry to hear that.
Selection Process: xReact → xEffect → xIntent → xWant → xNeed
Selected Knowledge: xNeed: To get rid of it; To get rid of something; To buy it; To have it
Emotion: Annoyed
Ours: I am sorry to hear that. Maybe you can talk to them about it.
CEM: That is terrible. Did you find out who did it?
Selection Process: xIntent → xEffect → xNeed → xWant → xReact
Selected Knowledge: xReact: Angry; Upset; Sad; Annoyed; Scared
Emotion: Angry
Ours: I would have done the same thing.

Context: Speaker: My mother recently went to the bank for me. I love her so much. Listener: That is really nice of her. It is nice to have someone to help you do things sometimes. Speaker: Nice people are the best. Empathy is something we all need more of!
MoEL: That is very nice of you. It is always nice to have family.
MIME: That is a good thing to do.
EmpDG: That is a very nice thing to have a good mom.
CEM: That is nice of you.
Selection Process: xNeed → xWant → xIntent → xReact → xEffect
Selected Knowledge: xEffect: Is loved; Is respected; Is liked
Emotion: Caring
Ours: I agree with you there. It is nice to have someone to lean on.
Gold: Yes. Too many people these days are mean to one another!

Table 6: Additional case study of generated responses from EmpatheticDialogues.