CARE: Causality Reasoning for Empathetic Responses by Conditional Graph Generation

Recent approaches to empathetic response generation incorporate emotion causalities to enhance comprehension of both the user's feelings and experiences. However, these approaches suffer from two critical issues. First, they only consider causalities between the user's emotion and the user's experiences, and ignore those among the user's experiences. Second, they neglect the interdependence among causalities and reason them independently. To solve the above problems, we aim to reason all plausible causalities interdependently and simultaneously, given the user's emotion, dialogue history, and future dialogue content. Then, we infuse these causalities into response generation for empathetic responses. Specifically, we design a new model, i.e., the Conditional Variational Graph Auto-Encoder (CVGAE), for the causality reasoning, and adopt a multi-source attention mechanism in the decoder for the causality infusion. We name the whole framework CARE, short for CAusality Reasoning for Empathetic conversation. Experimental results indicate that our method achieves state-of-the-art performance.


Introduction
Empathy is the capability to perceive, understand, and respond to another individual's feelings, experiences, and situation (Paiva et al., 2017; Decety and Jackson, 2004). It is composed of two aspects (Davis, 1983): (i) affection, i.e., emotion understanding and appropriate emotional reaction (Hoffman, 2001), and (ii) cognition, i.e., comprehension and reasoning of the other's experiences and situation (Preston and De Waal, 2002).
Earlier work on empathetic response generation merely pays attention to affection (Lin et al., 2019; Majumder et al., 2020; Li et al., 2020a). Consequently, their models lack understanding of the user's experiences, resulting in very weak empathy.

Figure 1: Causality reasoning results of GEE MIME (Kim et al., 2021) and our proposed method in a real case. Arrows indicate relations from cause to effect, while strikeout arrows indicate no causal relations. GEE MIME detects only direct causes and effects of the user's emotion independently, while ours extends the causality scope and reasons causalities interdependently. (Our response in this case: "That is so cool. I bet he is a great player.")
Most recent studies begin to consider both affection and cognition by incorporating emotion causes and effects (Wang et al., 2021; Gao et al., 2021; Kim et al., 2021; Sabour et al., 2022). Despite notable improvement, their methods suffer from two critical problems. First, they only consider causalities between the user's emotion and the user's experiences, which are just part of cognition. Causalities between experiences also contribute to the comprehension of experiences. In the case in Figure 1, although brother does not cause impressive directly, it is the subject causing what impresses the user. Therefore, brother should be included in the causal information for response generation.
Second, these methods reason causalities independently and ignore the interdependence among them, leading to low-fidelity causality detection. As shown in Figure 1, GEE MIME, one of these methods, fails to reason killing → impressive, since killing by itself is ordinarily the cause or effect of a negative emotion. However, this causality is reasonable when other causalities, including game → killing, are considered simultaneously, as our proposed method does. Due to the above two problems, these previous methods often misunderstand the feelings and experiences of the user, impeding empathetic expression in responses.
To solve these problems, we propose to reason all plausible causalities, i.e., causalities stated explicitly in the dialogue history and probably appearing in the future dialogue, interdependently and simultaneously by formulating the reasoning as a conditional graph generation task. Specifically, we aim to generate a causal graph containing all plausible causalities, conditioned on the user's emotion, dialogue history, and predicted future dialogue content. Inspired by the Variational Graph Auto-Encoder (VGAE) (Kipf and Welling, 2016), we design a Conditional Variational Graph Auto-Encoder (CVGAE), which uses latent variables for conditional structure prediction, to accomplish the causality reasoning. Accordingly, the model is expected to gain a deeper understanding of the user's feelings and experiences. In addition, some feelings and experiences that are not explicitly stated in the dialogue history but contribute to response generation can be inferred in this process, as shown in Figure 1.
In this paper, we propose a novel empathetic response generation model, called CARE (CAusality Reasoning for Empathetic conversation). CARE reasons all plausible causalities with CVGAE, and infuses them into response generation via a multi-source attention mechanism in the decoder. In addition, we adopt multi-task learning to integrate causality reasoning and response generation during training. The experimental results on the EMPATHETICDIALOGUES (Rashkin et al., 2019) benchmark suggest that our method improves the model's understanding of the user's feelings and experiences, and that CARE achieves state-of-the-art performance on empathetic response generation.
Our main contributions are three-fold: 1) We propose to reason all plausible causalities in empathetic conversation interdependently and simultaneously for a deep understanding of the user's feelings and experiences.

2) We turn causality reasoning into a conditional graph generation task, and introduce CVGAE, which uses latent variables for conditional structure prediction, to achieve the reasoning. (In the causal graph, each node is a word representing the user's feelings and experiences, and each edge indicates a causal relationship between two nodes.)

3) We design CARE, which augments empathetic response generation with causality reasoning, and demonstrate its outstanding performance on the EMPATHETICDIALOGUES benchmark.

Related Work
Since empathy is a critical characteristic of social chatting systems (Sharma et al., 2020; Pérez-Rosas et al., 2017), many studies have contributed to empathetic response generation. Earlier work mainly focuses on the affective aspect of empathy. MoEL (Lin et al., 2019) adopts a mixture-of-experts architecture to combine outputs from different decoders, each of which represents one emotion. Building on the idea of emotion mixture, MIME (Majumder et al., 2020) takes emotion polarity (positive or negative) into account. Moreover, it uses emotion stochastic sampling and emotion mimicry to generate empathetic responses. Li et al. (2020a) propose to capture nuances of emotion at the token level for decoding. Moreover, an adversarial learning framework is leveraged to involve user feedback.
Having realized that ignoring cognition impedes empathy in conversation, some recent methods involve both affection and cognition by incorporating emotion causes and effects. Wang et al. (2021) incorporate emotion causes into empathetic response generation by multi-hop reasoning from emotion causes to emotion states. Gao et al. (2021) identify emotion causes from the dialogue context, and use gates at the decoder to control the involvement of these emotion causes in the response generation. Kim et al. (2021) emphasize emotion causes in the dialogue context via a rational speech act framework. These three methods identify emotion causes via a classifier, which each time detects whether there is a causal relationship between a conversation fragment and an emotion statement or word. CEM (Sabour et al., 2022) uses COMET, an if-then commonsense generator, to generate causes and effects of user experiences, and refines the dialogue context with them for response generation. However, all these methods obtain causalities independently.

Transformer-based Response Generation
The response generation model is built upon the vanilla transformer (Vaswani et al., 2017), which generates the response R given the dialogue context C as input in an encoder-decoder manner. The encoder encodes the dialogue context and produces the context hidden state:

E_out = TRS_enc(C), (1)

where E_out ∈ R^{|C|×d} and d is the hidden size. The decoder takes the right-shifted response as input and generates the response. Typically, the whole decoder includes L_dec decoder layers, each consisting of three sub-layers. The first one, i.e., the self-attention sub-layer, computes a representation of the input sequence:

H_self = MHA(H_in, H_in, H_in), (2)

where MHA(Q, K, V) denotes multi-head attention, and H_in is the embedded right-shifted response for the first decoder layer and the output of the (l − 1)-th decoder layer for the l-th decoder layer. Then the decoder attends to the dialogue context through a cross-attention sub-layer:

H_cross = MHA(H_self, E_out, E_out). (3)

The output of the l-th decoder layer is obtained by the feed-forward sub-layer:

H^l_out = FFN(H_cross). (4)

Finally, we apply a linear transformation and a softmax operation to the output of the L_dec-th decoder layer to predict the token probability distribution at each token position t:

P_t = softmax(H^{L_dec}_{out,t} W_o + b_o), (5)

where H^{L_dec}_{out,t} is the final output for the t-th token; W_o ∈ R^{d×d_vocab} and b_o ∈ R^{d_vocab} are parameters, and d_vocab is the vocabulary size.
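To make the data flow of Equations (2)-(3) concrete, here is a minimal numpy sketch of one decoder layer's self-attention and cross-attention steps. It is single-head, uses a simple residual, and omits the feed-forward sub-layer and learned projections; all shapes and names are illustrative rather than the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def decoder_layer(H_in, E_out):
    """One decoder layer: self-attention over the shifted response,
    then cross-attention over the encoder output E_out (Eq. 2-3).
    Learned projections and the FFN sub-layer are omitted here."""
    H_self = attention(H_in, H_in, H_in)        # self-attention sub-layer
    H_cross = attention(H_self, E_out, E_out)   # cross-attention sub-layer
    return H_cross + H_self                     # simplified residual

rng = np.random.default_rng(0)
H_in = rng.normal(size=(4, 8))    # 4 response tokens, hidden size 8
E_out = rng.normal(size=(6, 8))   # encoder output for a 6-token context
H_out = decoder_layer(H_in, E_out)
print(H_out.shape)  # (4, 8)
```

Each attention call keeps the query length, so the layer maps a (4, 8) input to a (4, 8) output regardless of the context length.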

Variational Graph Auto-Encoder
Our proposed causality reasoning module, i.e., CVGAE, is based on VGAE (Kipf and Welling, 2016). Given an undirected graph G = (V, E) with its adjacency matrix A, VGAE generates graph latent variables with an inference model, and reconstructs the adjacency matrix with a generative model.

Inference Model
The inference model encodes G and generates graph latent variables Z = {z_1, . . ., z_|V|} via a recognition net q(Z|V, A). Each graph latent variable z_i is obtained by:

z_i ~ N(μ_i, diag(σ_i^2)), (6)

where N is a sampling function following the Gaussian distribution; μ is the matrix of the mean vectors μ_i, and log σ is the matrix of the log-variance vectors log σ_i:

μ = GCNLayer_μ(H_V, A), log σ = GCNLayer_σ(H_V, A). (7)

In particular, H_V is a shared hidden state obtained by a graph convolutional layer over the node embeddings E_V: H_V = GCNLayer(E_V, A).

Generative Model The generative model reconstructs the adjacency matrix by an inner product between latent variables:

Â_ij = sigmoid(z_i^T z_j). (8)

Inference Stage At the inference stage, the adjacency matrix A is unavailable. Therefore, we replace q(Z|V, A) with a prior net p(Z), which is parameterized by a Gaussian distribution, p(z_i) = N(z_i | 0, 1), to infer Z. Then, we use the same generative model to generate the adjacency matrix.
Objective VGAE is optimized by maximizing:

L = E_{q(Z|V,A)}[log p(A|Z)] − KL[q(Z|V, A) || p(Z)], (9)

where KL[q(·)|p(·)] is the Kullback-Leibler divergence between q(·) and p(·).

The prior causal graph contains causalities explicitly mentioned in previous user utterances, while the posterior one contains additional causalities in the next user utterance. Then, CARE infuses the causalities in the reasoned causal graph into response generation by multi-source attention at the decoder.
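The KL term in the objective has a well-known closed form when q is a diagonal Gaussian and the prior is p(z_i) = N(0, I); a small self-contained sketch:

```python
import numpy as np

def kl_diag_gaussian_vs_standard(mu, logvar):
    """KL[N(mu, diag(exp(logvar))) || N(0, I)] per node, summed over
    latent dimensions: 0.5 * sum(exp(logvar) + mu^2 - 1 - logvar)."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

# When q already equals the standard-normal prior, the KL is zero.
mu = np.zeros((3, 4))
logvar = np.zeros((3, 4))
print(kl_diag_gaussian_vs_standard(mu, logvar))  # [0. 0. 0.]
```

Shifting the means away from zero makes the divergence strictly positive, which is what penalizes the recognition net for drifting from the prior.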

Graph Construction
As mentioned, we need a prior causal graph G_prior = (V, E_prior) and a posterior causal graph G_post = (V, E_post) for causality reasoning at the inference and training stages, respectively. We construct them with the assistance of a causal knowledge graph, i.e., the Cause Effect Graph (CEG) (Li et al., 2020b). The two graphs share the same node set, which theoretically contains all nodes in CEG.
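A toy sketch of how the two edge sets might be collected from a CEG, under the rule described in this paper (an edge joins E_prior if both endpoints are covered by the emotion label word and previous-utterance words; E_post additionally admits next-utterance words). The CEG fragment and word lists below are hypothetical, loosely echoing the Figure 1 example.

```python
def build_edge_sets(ceg_edges, emotion_word, prev_words, next_words):
    """Collect E_prior and E_post from a causal knowledge graph (CEG).
    ceg_edges: iterable of (head, tail) cause-effect word pairs."""
    prior_vocab = {emotion_word} | set(prev_words)
    post_vocab = prior_vocab | set(next_words)
    e_prior = {(h, t) for h, t in ceg_edges
               if h in prior_vocab and t in prior_vocab}
    e_post = {(h, t) for h, t in ceg_edges
              if h in post_vocab and t in post_vocab}
    return e_prior, e_post

# hypothetical CEG fragment
ceg = [("game", "killing"), ("killing", "impressive"),
       ("brother", "game"), ("rain", "sad")]
e_prior, e_post = build_edge_sets(
    ceg, "impressive",
    prev_words=["brother", "game", "killing"],
    next_words=["player"])
print(sorted(e_prior))
```

By construction E_prior is always a subset of E_post, which is exactly the asymmetry CVGAE exploits: the posterior graph reveals next-utterance causalities that the prior graph cannot see.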

Conditional Variational Graph Auto-Encoder (CVGAE)
We design a novel structure, CVGAE, to generate a (posterior) causal graph for causality reasoning. As an extension of VGAE, CVGAE works in a similar manner (§ 3.2). In particular, it generates graph latent variables for graph reconstruction under a set of conditions, including a context condition, an emotion condition, and a context latent variable.

Context and Emotion Conditions
The context condition is expected to provide information about the dialogue context C, so it is derived from the encoder output. Following Wang and Wan (2019), we use multi-head attention to compute it:

c_ctx = MHA(v_rand, E_out, E_out), (10)

where E_out ∈ R^{|C|×d} is the encoder output computed by TRS_enc(C) in Equation (1), and v_rand ∈ R^{1×d} is a randomly initialized vector regarded as a single query for the multi-head attention.
The emotion condition is expected to provide information about the user emotion e. Accordingly, we define an emotion embedding table E_emo which converts an emotion label into an embedding in R^d. The emotion condition is formulated as:

c_emo = E_emo(e). (11)

Context Latent Variable We use a context latent variable z_c to provide information from the future dialogue. This variable is generated by a contextual recognition net (R-Net C in Figure 2) with the dialogue context C and the golden response R as input:

z_c ~ N(μ_c, diag(σ_c^2)). (12)

Here, μ_c = MLP_μ(c_lat) is the mean vector and log σ_c = MLP_σ(c_lat) is the log-variance vector, where c_lat is obtained similarly to Equation (10), with the encoding of C and R as key and value:

c_lat = MHA(v'_rand, E^{C,R}_out, E^{C,R}_out). (13)

Graph Latent Variables We generate graph latent variables Z_g by a recognition net (R-Net G in Figure 2): q_g(Z_g|V, A_post, c_cond), where A_post is the adjacency matrix of G_post and c_cond is the concatenation of c_ctx, c_emo, and z_c. This process is similar to that of VGAE, i.e., Equations (6) and (7):

μ_g = GCNLayer_μ(H_V, A_post), (14)
log σ_g = GCNLayer_σ(H_V, A_post). (15)

The shared hidden state H_V is generated with attention to the concatenation of c_ctx, c_emo, and z_c:

H_V = MHA(E_V, c_cond, c_cond), (16)

where E_V denotes the node embeddings.

Causal Relation Generation With the graph latent variables Z_g, we reconstruct the posterior causal graph, i.e., the adjacency matrix Â, by Equation (8). Then we select the top-k relationships from the reconstructed graph according to their probabilities, denoted as R = (r_1, . . ., r_k), where r_i is the sum of the head and tail node embeddings.

Inference Stage During inference, R (the golden response) and A_post are unavailable, so we use a prior net p_g(Z'_g|V, A_prior, c'_cond) (P-Net G in Figure 2) to approximate q_g(Z_g), following Equations (15) and (16). A_prior is the adjacency matrix of G_prior, and c'_cond is the concatenation of c_ctx, c_emo, and z'_c, where z'_c is obtained by a contextual prior net (P-Net C in Figure 2) that takes only the dialogue context C as input.
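The top-k relation selection described above can be sketched in a few lines of numpy. The reconstructed adjacency matrix and node embeddings are random stand-ins; the point is the selection and the r_i = head embedding + tail embedding representation.

```python
import numpy as np

def select_top_k_relations(A_hat, node_emb, k):
    """Pick the k most probable edges from the reconstructed adjacency
    matrix A_hat and represent each as head + tail node embedding."""
    n = A_hat.shape[0]
    flat = np.argsort(A_hat, axis=None)[::-1][:k]  # top-k flat indices
    heads, tails = np.unravel_index(flat, (n, n))
    R = node_emb[heads] + node_emb[tails]          # r_i = e_head + e_tail
    return list(zip(heads.tolist(), tails.tolist())), R

rng = np.random.default_rng(1)
A_hat = rng.uniform(size=(6, 6))      # stand-in for the reconstructed graph
node_emb = rng.normal(size=(6, 8))    # stand-in node embeddings
edges, R = select_top_k_relations(A_hat, node_emb, k=4)
print(len(edges), R.shape)  # 4 (4, 8)
```

The selected edges come out sorted by probability, so the downstream decoder always attends to the k most confident causal relations.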

Graph-Infused Response Generation
To infuse the reasoned relations R into generation, we enable the decoder to attend to both the dialogue context and the causal graph (Multi-Source Decoder in Figure 2). In particular, we slightly modify the cross-attention sub-layer of the original decoder, i.e., Equation (3), with our multi-source attention mechanism. The output of this modified sub-layer is computed by:

H_cross = [MHA(H_self, E_out, E_out); MHA(H_self, R, R)] W_multi, (17)

where E_out ∈ R^{|C|×d} is the encoder output, [·; ·] denotes concatenation, W_multi ∈ R^{2d×d} is a group of linear transformation parameters, and H_self is the output of the self-attention sub-layer of the decoder computed by Equation (2). Notably, the rest of the original decoder, i.e., Equations (2), (4) and (5), remains the same. In this way, we generate the final response.
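A numpy sketch of the multi-source cross-attention sub-layer: attend to the encoder output and to the causal-relation embeddings separately, concatenate the two results, and project back to the hidden size with W_multi. Single-head, with illustrative shapes, not the paper's exact implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_source_cross_attention(H_self, E_out, R_emb, W_multi):
    """Attend to two sources, concatenate along the feature axis,
    then project (|T|, 2d) back to (|T|, d) with W_multi."""
    ctx = attend(H_self, E_out, E_out)   # dialogue-context source
    rel = attend(H_self, R_emb, R_emb)   # causal-graph source
    return np.concatenate([ctx, rel], axis=-1) @ W_multi

d = 8
rng = np.random.default_rng(7)
H_self = rng.normal(size=(4, d))    # decoder self-attention output
E_out = rng.normal(size=(6, d))     # encoder output
R_emb = rng.normal(size=(5, d))     # k = 5 selected relation embeddings
W_multi = rng.normal(size=(2 * d, d))
H_cross = multi_source_cross_attention(H_self, E_out, R_emb, W_multi)
print(H_cross.shape)  # (4, 8)
```

The 2d-to-d projection is what lets the rest of the decoder stay unchanged: downstream sub-layers still see a d-dimensional representation.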

Training Objective
We optimize the model with multi-task learning to further integrate the causality reasoning and the graph-infused response generation. For the causality reasoning, we consider the graph reconstruction accuracy and the similarity between the posterior and prior distributions. Similar to Equation (9), the corresponding loss can be calculated by:

L_g = E_{q_g(Z_g|V, A_post, c_cond)}[log p(A_post|Z_g)] − KL[q_g(Z_g|V, A_post, c_cond) || p_g(Z'_g|V, A_prior, c'_cond)]. (18)

The response generation loss is the log-likelihood of the golden response:

L_r = Σ_{t=1}^{|R|} log P_t(w_t), (19)

where P_t is obtained by Equation (5) and w_t is the t-th golden token. Finally, we train CARE by maximizing (L_r + L_g).

Dataset
We conduct our experiments on EMPATHETICDIALOGUES (Rashkin et al., 2019). It contains 25k crowdsourced one-on-one conversations, each of which is developed based on a particular emotion. There are 32 emotion categories distributed in a balanced way. Following the original division, we use approximately 80%, 10%, and 10% of the dataset for training, validation, and testing.

Comparison Models
We select seven models for comparison. Three models that merely consider the affective aspect of empathy are selected: MoEL (Lin et al., 2019), MIME (Majumder et al., 2020), and EmpDG (Li et al., 2020a). In addition, four models that consider both the affection and cognition of empathy are selected:

KEMP (Li et al., 2022): This model leverages external commonsense knowledge and an emotional lexicon to understand and express emotion for empathetic response generation.
CEM (Sabour et al., 2022): This model generates causes and effects of the user's most recently mentioned experiences, and uses them to refine the context encoding for a better understanding of the user's situation and feelings.

RecEC soft (Gao et al., 2021): This model pays more attention to emotion causes, detected from the dialogue context, at the word level via a soft gated attention mechanism in the decoder.

GEE MIME (Kim et al., 2021): This model uses a rational speech act framework to update the response generated by MIME, obtaining a final response that focuses more on the emotion cause words in the dialogue context.
All of the above models, as well as ours, are built upon the transformer backbone for a fair comparison.

Implementation Details
Our Model: We implemented our model in PyTorch, and trained it on an Nvidia GeForce RTX 3090 GPU. The token embeddings are initialized with 300-dimensional pre-trained GloVe vectors (Pennington et al., 2014), and are shared among the encoder, the CVGAE module, and the decoder. The hidden size d is set to 300. The node number |V| is 800, and the number of selected relationships k is 512 (0.16%). Both the encoder and the decoder have 2 layers. The batch size is set to 16.
When training the model, we use the Adam optimizer (Kingma and Ba, 2015) and vary the learning rate following Vaswani et al. (2017).
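The Vaswani et al. (2017) schedule increases the learning rate linearly during warmup and then decays it with the inverse square root of the step. A sketch follows; the warmup value is illustrative, since the paper does not state its warmup steps.

```python
def noam_lr(step, d_model=300, warmup=8000):
    """Learning-rate schedule of Vaswani et al. (2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5).
    warmup=8000 is an assumed value for illustration only."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

lrs = [noam_lr(s) for s in (1, 4000, 8000, 100000)]
print(lrs)
```

The rate peaks exactly at the warmup step and decreases monotonically afterwards, which stabilizes the early updates of a freshly initialized transformer.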
Comparison Models: We implement GEE MIME following its official instructions, since the authors provide only testing code and instructions. For the rest of the comparison models, we use their official code released on GitHub.

Automatic Evaluation
Metrics: Three kinds of metrics are applied for automatic evaluation: (1) Perplexity (PPL), which measures the model's confidence in the response generation. (2) BLEU (Papineni et al., 2002), which estimates the n-gram overlap between the generated response and the golden response; we adopt BLEU-3 and BLEU-4. (3) BERTScore (Zhang et al., 2020), which computes the similarity of each token in the generated response with those in the golden response; we use its matching precision, recall, and F1 score (P_BERT, R_BERT, and F_BERT). For perplexity, a lower score indicates better performance, while for the other metrics, higher scores indicate better performance.

Results: The table does not present the perplexity score of GEE MIME. This is because its generated token probability distribution depends on the intermediate results of MIME and its emotion cause detector, and therefore PPL is less relevant to its core structure, i.e., the rational speech act framework. The highest BLEU and BERTScore results indicate that our approach can generate more human-like responses by incorporating causality reasoning. Moreover, all the above advantages are significant and stable, evidenced by high degrees of statistical significance and small standard deviations, respectively.
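For reference, sentence-level BLEU reduces to modified n-gram precisions combined with a brevity penalty. The sketch below is simplified (standard toolkits add smoothing and corpus-level statistics, and use reference clipping across multiple references), so it is not a drop-in replacement for the evaluation scripts used in the paper.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with uniform n-gram weights
    and a brevity penalty; floors zero precisions to avoid log(0)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())     # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = 1.0 if len(candidate) > len(reference) else \
         math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

cand = "that is so cool i bet he is great".split()
ref = "that is so cool i bet he is a great player".split()
score = bleu(cand, ref)
print(round(score, 4))
```

A candidate identical to the reference scores exactly 1.0; shorter candidates are penalized by the brevity-penalty factor.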

Human Ratings
Metrics: Although the automatic evaluation provides useful information about the models' performance, it cannot capture some features, such as empathy expression and contextual relevance. Therefore, following previous practice, we randomly sample 128 conversations together with the corresponding responses generated by different models for human ratings. We ask three human annotators to score each generated response on the following three aspects: (1) Empathy (Emp.), which measures whether the response understands the user's feelings and experiences. (2) Relevance (Rel.), which measures whether the response is on-topic and appropriate given the previous conversation. (3) Fluency (Flu.), which measures whether the response is fluent and its language is accurate. Each is on a 5-point Likert scale, where 5 is the best. We then compute the average value for each metric.
Annotation Statistics: Table 2 displays the human rating results; the highest scores are in bold. We calculate Fleiss' kappa to measure inter-evaluator agreement of the human ratings. The result is 0.41, indicating a moderate level of agreement among the three annotators.
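Fleiss' kappa compares the observed per-item agreement with the agreement expected from the marginal category proportions. A self-contained sketch (the toy ratings below are illustrative, not our annotation data):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for ratings[item][rater] = category, with the
    same number of raters per item."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    cats = sorted({c for row in ratings for c in row})
    # per-item counts of each category
    counts = [[Counter(row)[c] for c in cats] for row in ratings]
    # observed agreement per item
    P_i = [(sum(c * c for c in row) - n_raters)
           / (n_raters * (n_raters - 1)) for row in counts]
    P_bar = sum(P_i) / n_items
    # expected agreement from marginal category proportions
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(cats))]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# three annotators, toy 5-point scores for four items
ratings = [[4, 4, 5], [3, 3, 3], [2, 4, 4], [5, 5, 5]]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))  # 0.529
```

Perfect agreement yields kappa = 1, and values around 0.4-0.6 are conventionally read as moderate agreement, matching the 0.41 reported above.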
Results: From these results, we can draw two conclusions. First, compared with previous models, CARE achieves the highest scores in terms of Emp. and Rel., and obtains relatively high Flu. This indicates that our causality reasoning, performed in an interdependent and simultaneous way, indeed benefits empathetic expression and content relevance as we expect. Thanks to the reasoned causalities, CARE improves the understanding of the user's feelings and experiences. In addition, the reasoning process enables the model to identify some reasonable feelings and experiences of the user that are not explicitly mentioned in the previous conversation. With such information, the model can show strong empathy in its responses, which is manifest in the case study. Second, models considering both affection and cognition (bottom half of the table) do not always outperform models merely considering affection (upper half of the table). This is also evident in Table 1, i.e., the automatic evaluation results. Although causality reasoning intuitively contributes to the understanding of the user's feelings and experiences, inconsiderate reasoning can lead to one-sided understanding and low empathy.

Model Analysis
In § 5.4 and § 5.5, CARE has shown its superior performance. For a deeper analysis of our model, we investigate its inner structure and functions.
Ablation Study We propose two variant models to verify the contribution of reasoning and the reasoning condition in CARE:

• w/o reasoning: We remove the CVGAE structure, and directly incorporate the prior causal graph into response generation.

• w/o condition: We replace CVGAE with VGAE to eliminate the effect of the reasoning condition.

Results are shown in Table 3 and Table 4, respectively. From Table 3, both variants achieve relatively high automatic evaluation scores. Moreover, the variant models surpass the previous comparison models in Table 1. This indicates that causalities can help models respond more like humans, given that both variants consider additional causalities between the user's experiences. However, both variants' performances in terms of human evaluation are relatively low. Accordingly, we can draw the following conclusions:

• Not all information in the golden response contributes to empathy. Although the two variants have high automatic evaluation scores, they fail to achieve equally high human ratings. Such a phenomenon can also be clearly observed when comparing the performance of EmpDG and KEMP.

Analysis of #SelectedRelationships k As shown in Figure 3, the performance of CARE with regard to BLEU-4 first rises and then drops as we increase k, the number of relationships infused in response generation. This indicates that sufficient causalities benefit empathetic expression, but excess ones can introduce noise and hurt empathy.

Case Study
Table 5 presents a case along with the responses generated by our model and the comparison models. As shown in the table, CARE responds more empathetically to the user than the other models. Notably, CARE shows a deep and considerate comprehension of the user's feelings and experiences in its response. For instance, it understands that the "apprehensive" emotion comes from a lack of confidence, and that the user has already proposed a quite effective solution (great idea).

Conclusion
In this paper, we propose to reason all plausible causalities in conversation interdependently and simultaneously for a deep understanding of the user's feelings and experiences in empathetic dialogue. Further, we turn the causality reasoning problem into a conditional graph generation task. Correspondingly, we design CVGAE, which uses latent variables to predict graph structure conditioned on the user's emotion, dialogue history, and predicted future conversation content, to implement the reasoning. The reasoned causalities are infused into response generation by a multi-source attention mechanism in the decoder to produce the final empathetic responses. The whole framework is named CARE (CAusality Reasoning for Empathetic conversation). Experimental results show that CARE outperforms prior methods in terms of both automatic and manual evaluations.

Limitations
In this paper, we improve the model's empathy from the aspects of affection and cognition, especially the latter. For this purpose, we incorporate reasoned causal knowledge into response generation. However, other knowledge, such as sentiment knowledge and commonsense knowledge, can also contribute to affection and cognition. KEMP (Li et al., 2022), one of the comparison models in our experiment, has explored incorporating commonsense knowledge and sentiment knowledge into response generation. However, by its model design, its use of knowledge is universal to chitchat conversation and is not aimed at empathetic expression; hence its low Emp. score, as shown in Table 2. It is therefore worth exploring the connection between empathy and different types of knowledge. Besides, how to fuse different kinds of knowledge in one model for more empathetic responses is also a valuable problem.

Ethical Considerations
The widely used open-source EMPATHETICDIALOGUES (Rashkin et al., 2019) benchmark used in our experiments is collected through interaction with Amazon Mechanical Turk (MTurk). In this process, user privacy is protected, and no personal information is contained in the dataset. Therefore, we believe that our research work meets the ethics of EMNLP.

A Appendix
A.1 Automatic Evaluation For our automatic evaluation, we modify the code for dialogue evaluation provided by Csáky et al. (2019). In addition, Table 6 shows the full automatic evaluation results.

A.2 Human Evaluation
We implemented a system, as shown in Figure 4, for fair human ratings. For each case, we provide the previous dialogue turns and the user emotion to the annotator. In addition, all responses generated by different models are displayed in a random order, so the annotator cannot identify the source of any single response. Since human ratings are subjective, we provide some statements and classic examples as references for human evaluation.

• Empathy. We prefer responses with the following features: (1) emotions, e.g., care, concern, and encouragement; (2) content that shows interest in what the user cares about. For instance, we prefer "Did you call the police?" over "What movie?" when the user says "It was stolen after the movie.".

• Relevance. We prefer responses from which we can infer the topics of the previous dialogue content.

• Fluency. We reduce the marks if a response contains inappropriate repetition, grammar mistakes (e.g., misuse of personal pronouns and tense), or conflicting content.

Figure 4: The user interface of the system for human ratings.
My brother got a custom-made bowling ball, and ever since then he's been killing the game! It is awesome to see.

Figure 2

Figure 2 presents an overview of our proposed model CARE. It first reasons all plausible causalities interdependently by generating a causal graph. Specifically, we use CVGAE to generate this graph under the condition of the user's emotion, dialogue history, and predicted future dialogue content. Notably, CVGAE works differently at the training and inference stages: it reconstructs a posterior causal graph (by R-Net G) with this posterior causal graph as input during training, while it generates a posterior causal graph (by P-Net G) with a prior causal graph as input during inference.

Figure 2 :
Figure 2: The overview of our proposed framework. Solid lines represent modules or data used for both posterior and prior computation, while dotted lines represent modules or data used only for posterior computation.

Figure 3 :
Figure 3: Model performance (BLEU-4) as we gradually increase the number of selected relationships k. The solid line and the dotted line represent BLEU-4 and its two-period moving average, respectively. For each k, we repeat five runs and compute the average BLEU-4.
However, for efficiency, we only consider relations among a certain set of nodes V, which contains the emotion label word, words appearing in previous user utterances, and one-hop neighbors of the above two kinds of words. The edge sets of the two graphs are different: E_prior contains causal relationships in previous user utterances, while E_post also contains those in the next utterance. Specifically, we collect E_prior and E_post according to the following rules. For any pair of nodes in V having a relationship in CEG, if both nodes are covered by the user emotion label word and words in previous user utterances, we add the relationship to E_prior; if both nodes are covered by the user emotion label word and words in previous and next user utterances, we add the relationship to E_post.

Table 1 :
Table 1 presents the automatic evaluation results, and the highest score in terms of each metric is in bold. For each model, we repeat five runs with different seeds, and compute the average values and standard deviations. In addition, values that are statistically significant with p < 0.05 are marked with *. According to Table 1, our proposed model CARE outperforms the other models in terms of all metrics. The lowest perplexity score suggests that our proposed architecture is more confident in its generated responses than the other models.

Automatic evaluation results in terms of PPL, BLEU, and BERTScore. For each method, we repeat five runs with different seeds, and display the average values along with the standard deviations. Values marked with * are statistically significant with p < 0.05. The highest score in terms of each metric is in bold. The full automatic evaluation results can be found in Appendix A.1.

Table 2 :
Results of human ratings in terms of Empathy, Relevance, and Fluency on a 5-point Likert scale, where 5 is the best. The highest scores are in bold. Fleiss' kappa is 0.41, indicating a moderate level of agreement.

Table 3 :
Automatic evaluation results of the ablation study for CARE. The metrics are the same as those in Table 1. Similarly, we repeat five runs with different seeds, and display the average values. The full automatic evaluation results can be found in Appendix A.1.

Table 4 :
Human rating results of the ablation study for CARE. The metrics are the same as those in Table 2.
User Emotion Type: apprehensive
User: I had some hesitations when I was at the gym today; I thought I wouldn't be able to handle the weights.
Bot: How'd it turn out? Congrats on deciding to lead a healthier life, by the way.

Table 5 :
Responses generated by our method and previous empathetic response generation models. The content showing comprehension of feelings and experiences is highlighted in italics.

Table 6 :
(1) Inappropriate repetition (an obvious artifact of bad training), such as "I am sorry. I am sorry. I am sorry. I am". (2) Grammar mistakes, e.g., misuse of personal pronouns and tense. (3) Conflicting content, such as "I can understand you. I cannot understand you.".

All automatic results from different methods with seeds 0, 42, 1024, 1234, and 4096.