Reasoning before Responding: Integrating Commonsense-based Causality Explanation for Empathetic Response Generation

Recent approaches to empathetic response generation try to incorporate commonsense knowledge or reasoning about the causes of emotions to better understand the user’s experiences and feelings. However, these approaches mainly focus on understanding the causalities of context from the user’s perspective, ignoring the system’s perspective. In this paper, we propose a commonsense-based causality explanation approach for diverse empathetic response generation that considers both the user’s perspective (user’s desires and reactions) and the system’s perspective (system’s intentions and reactions). We enhance ChatGPT’s ability to reason for the system’s perspective by integrating in-context learning with commonsense knowledge. Then, we integrate the commonsense-based causality explanation with both ChatGPT and a T5-based model. Experimental evaluations demonstrate that our method outperforms other comparable methods on both automatic and human evaluations.


Introduction
Empathy is a desirable capacity of humans to place themselves in another's position to show understanding of his/her experience and feelings and respond appropriately. Empathy involves both cognitive and affective aspects (Davis, 1983), including the ability to perceive the user's situation and express appropriate emotions.
Previous work on empathetic response generation has primarily focused on the affective aspect of emotional expression through emotion detection (Lin et al., 2019; Majumder et al., 2020), without sufficient consideration of context understanding. Recently, there has been growing interest in exploring context understanding by leveraging external commonsense knowledge to reason about emotion causes and effects or the user's desires, such as Sabour et al. (2022) and Wang et al. (2022b,a).

Figure 1: Two examples of producing a response from different perspectives. The blue solid box contains "xReact" and "xWant," representing the user's emotional reaction and desires. The green dotted box comprises "xReact" and "xIntent," representing the emotional reaction and intention of the actual responder.
However, these approaches focus on understanding the causalities from the user's perspective.
Exploring the causality within the user's context and reasoning his/her desires can be helpful so that the system's intention is aligned with the user's desires, and the response is generated from the user's perspective (Figure 1(a)). However, in real human communication, the responder's intention is not always confined to the user's desires, as shown in Figure 1(b). Relying solely on the user's desire to generate a response may not fully understand the user's experience, and leads to weak empathy, as shown in Figure 1(a). Therefore, it is necessary to incorporate both the user's perspective (exploring his/her desire and reaction) and the system's perspective (reasoning its intention and reaction to mimic humans) for empathetic response generation.
Through the utilization of COMET (Bosselut et al., 2019), a pre-trained GPT-2 model (Radford et al., 2018) fine-tuned on the if-then reasoning graph from ATOMIC, the system's possible intentions can be predicted to align with the user's desires. However, the system's intention should not be constrained by the user's desires. Therefore, we do not adopt COMET for reasoning about the system's intention.
ChatGPT has shown its efficacy on several tasks (Zhao et al., 2023). Bang et al. (2023) demonstrated ChatGPT's potential in causal reasoning on the human-annotated explainable CAusal REasoning dataset (E-CARE) (Du et al., 2022). However, that evaluation only tests whether the model can judge correct causes or effects, rather than generate causality explanations. In this paper, we propose to enhance this capability by incorporating in-context learning with commonsense reasoning for causality explanation. Our main contributions are as follows:
• We propose to integrate commonsense-based causality reasoning into empathetic response generation, which takes into account the system's intention and reaction, along with the user's desire and reaction.
• We propose to enhance ChatGPT's capability for causality explanation through the integration of in-context learning with commonsense knowledge (desire, reaction, and intention).
• We present experimental results demonstrating that both ChatGPT and a T5-based model, integrated with the proposed commonsense-based causality explanation, outperform other competitive methods on both automatic and human evaluations.

Sabour et al. (2022) and Wang et al. (2022b) utilized ATOMIC-2020 (Hwang et al., 2021), a collection of commonsense reasoning inferences about everyday if-then events, to enrich context understanding with information on the user's reactions, intentions, effects, needs, and desires. However, these approaches only focus on understanding the causalities within the context from the user's perspective for empathetic response generation, ignoring the system's perspective.

Large Language Models for Empathetic Response Generation
With the development of large language models such as GPT-3 (Brown et al., 2020) and ChatGPT, many studies have shown their abilities on various NLP tasks in either a few-shot or zero-shot setting (Madotto et al., 2021; Lee et al., 2022; Zhao et al., 2023). Lee et al. (2022) introduced two selection methods that choose in-context examples based on emotion and situation information to generate empathetic responses with GPT-3. Zhao et al. (2023) showed ChatGPT's ability on empathetic response generation. In this study, we enhance ChatGPT with a commonsense-based causality explanation prompt for empathetic response generation.

Knowledge Acquisition
In order to generate commonsense inferences for given events, we adopt a modified BART-based variation of COMET, which was trained on the ATOMIC-2020 dataset (Hwang et al., 2021). This model is suitable for inferring knowledge regarding unseen events (Hwang et al., 2021), like the events in the EmpatheticDialogue dataset (Rashkin et al., 2018).
In the training process, we leverage this model to infer the relations of xWant and xReact for each user's utterance in the training set and the relations of xIntent and xReact for the system's utterance, which are inferred from the ground-truth response in training. In the testing, we only infer the relations of xWant and xReact for the user's utterance. The system's xIntent and xReact will be inferred by the proposed causality reasoning module.
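As a hedged sketch, the COMET inference step above might look like the following. The `"<head> <relation> [GEN]"` query format follows the public COMET-ATOMIC-2020 release; the checkpoint name is an assumption (a community mirror of the AI2 BART weights), not necessarily the exact model used in this work.

```python
from functools import lru_cache

def build_comet_query(event: str, relation: str) -> str:
    """COMET-ATOMIC-2020 is queried as "<head event> <relation> [GEN]"."""
    return f"{event} {relation} [GEN]"

@lru_cache(maxsize=1)
def _load_comet(name: str = "mismayil/comet-bart-ai2"):
    # Assumed checkpoint: a community mirror of the AI2 COMET-ATOMIC-2020
    # BART weights on the Hugging Face hub.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    return (AutoTokenizer.from_pretrained(name),
            AutoModelForSeq2SeqLM.from_pretrained(name))

def infer_relation(event: str, relation: str, k: int = 5) -> list:
    """Return k candidate tails, e.g. relation="xWant" for the user's desire
    or "xIntent"/"xReact" for the ground-truth system response."""
    tokenizer, model = _load_comet()
    inputs = tokenizer(build_comet_query(event, relation), return_tensors="pt")
    out = model.generate(**inputs, num_beams=k,
                         num_return_sequences=k, max_length=24)
    return [tokenizer.decode(o, skip_special_tokens=True).strip() for o in out]
```

At training time this would be called with `xWant`/`xReact` on user utterances and `xIntent`/`xReact` on ground-truth responses; at test time only the user-side relations are inferred.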

In-Context Example Selection
We enhance ChatGPT's causality explanation under the few-shot setting. Given the sensitivity of large language models such as ChatGPT to in-context examples (Liu et al., 2021; Lee et al., 2022), we adopt a method similar to Lee et al. (2022) to select the top-k examples from the training set based on the similarity between the test conversation and the training conversations. Specifically, we adopt Sentence-BERT (Reimers and Gurevych, 2019) to encode the sentence semantics of the conversation. In this study, we compute the cosine similarity between the situation utterances of the training set and the test sample, which are annotated in the dataset. The top-k samples are chosen from the training set for each test sample as in-context few-shot examples for ChatGPT.

Figure 2 shows an overview of our proposed method. It consists of three components: (1) a causality reasoning module, which enhances the ChatGPT or T5 decoder with a causality explanation for empathetic response generation; (2) enhanced ChatGPT-based response generation; and (3) T5-based response generation, based on a trained T5 encoder-decoder, to be compared with other approaches that have developed their own models on the EmpatheticDialogue dataset (Lin et al., 2019; Majumder et al., 2020; Sabour et al., 2022; Majumder et al., 2022).
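The selection step described above reduces to a cosine-similarity ranking over situation embeddings, which can be sketched as follows. The Sentence-BERT model name in the comment is an assumption, not necessarily the encoder used in the paper.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def select_top_k(test_emb: np.ndarray, train_embs: np.ndarray, k: int) -> list:
    """Indices of the k training situations most similar to the test one."""
    sims = cosine_sim(test_emb[None, :], train_embs)[0]
    return np.argsort(-sims)[:k].tolist()

# The embeddings would come from Sentence-BERT, e.g. (assumed model name):
#   from sentence_transformers import SentenceTransformer
#   encoder = SentenceTransformer("all-MiniLM-L6-v2")
#   train_embs = encoder.encode(train_situations)
#   test_emb = encoder.encode(test_situation)
```

The returned indices identify the (context, response) pairs used as few-shot examples.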

Causality Reasoning Module based on ChatGPT
As outlined in Algorithm 1, this module consists of four steps. Initially, for a test input c, we employ the method outlined in Section 3.2 to select the top-k relevant training samples, denoted as S, for in-context learning, such as (context1, response1) and (context2, response2), as exemplified in Table 13 in Appendix B.
In the second step, for each selected sample (c_n, r_n) ∈ S, we leverage the COMET model to infer the xWant (c_nWant) and xReact (c_nReact) knowledge corresponding to the user's utterance c_n. Additionally, we extract the xIntent (r_nIntent) and xReact (r_nReact) knowledge pertaining to the ground-truth system response r_n. This information is then concatenated as few-shot examples (Table 13 in Appendix B), denoted as M_prompt.
Thirdly, for the test input c, we obtain the xWant (c_Want) and xReact (c_React) knowledge using COMET. Finally, they are appended to M_prompt to form the prompt to ChatGPT, which reasons the intent (r_Intent) and reaction (r_React) from the system's perspective based on few-shot learning.
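The prompt assembly of Steps 2 to 4 can be sketched as plain string construction. The line labels follow the prompt format in Appendix B, but the exact template wording here is illustrative rather than the paper's verbatim template.

```python
def build_causality_prompt(examples, test_context, test_want, test_react):
    """Assemble the few-shot prompt M+_prompt of Algorithm 1.

    `examples` holds, for each selected training pair, the context, the
    COMET-inferred knowledge, and the ground-truth response. Field names
    are illustrative; the paper's template is in Table 13 of Appendix B.
    """
    parts = []
    for ex in examples:
        parts.append(
            f"user: {ex['context']}\n"
            f"user wants to: {ex['want']}\n"
            f"user reacts to: {ex['react']}\n"
            f"sys's intent: {ex['intent']}\n"
            f"sys reacts to: {ex['sys_react']}\n"
            f"sys: {ex['response']}\n"
        )
    # The test sample carries only the user-side knowledge; ChatGPT is asked
    # to continue with the system-side intent, reaction, and response.
    parts.append(
        f"user: {test_context}\n"
        f"user wants to: {test_want}\n"
        f"user reacts to: {test_react}\n"
        "sys's intent:"
    )
    return "\n".join(parts)
```

In practice the assembled prompt would be sent to the "gpt-3.5-turbo" chat endpoint with temperature 0, as described in the experimental settings.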

Enhanced ChatGPT-based Response Generation
The prompt provided to ChatGPT encompasses two components: causality explanation from the user's perspective, predicted by COMET, and causality explanation from the system's perspective, derived through the causality reasoning module described in Section 4.1. These components, along with the few-shot examples, are integrated into ChatGPT to generate empathetic responses.
Algorithm 1: Commonsense-based causality explanation prompt

Require: a training set D = {(c_n, r_n)}_{n=1}^N, where N is the number of training samples and c_n, r_n denote a context and its ground-truth response; a test input c; the COMET model f_θ(·)

/* Step 1: Select the top-k in-context examples S from D (Section 3.2) */
/* Step 2: Get the commonsense knowledge for the selected examples */
M_prompt ← empty list
for each (c_n, r_n) ∈ S do
    Get the causality information (desire and reaction of the user; intent and reaction of the system) inferred by COMET:
    k_n = c_nWant + c_nReact + r_nIntent + r_nReact
    M_prompt.append(c_n, k_n, r_n)
end for
/* Step 3: Get the commonsense knowledge for the test sample */
Get the causality information (desire and reaction of the user) for the test input c: c_Want, c_React
/* Step 4: Prompt ChatGPT and output the reasoned intent and reaction for generating an empathetic response */
Input: M+_prompt = M_prompt + c + c_Want + c_React
Output: r_Intent, r_React, r_ChatGPT

T5-Based Response Generation
Context and Causality Encoding For a test input c, we use the COMET model to infer the user's causality information, i.e., the desire and reaction of the user (k_user: c_Want and c_React), and use the ChatGPT-based causality reasoning module to infer the system's causality information, i.e., the intention and reaction of the system (k_sys: r_Intent and r_React). We utilize three T5 encoders for encoding the input context, the user's causality information, and the system's causality information.
Emotion Classification In order to detect the user's affective state, we concatenate the context representation and the user's causality information, and then pass them through a linear layer followed by a softmax operation to produce the emotion category distribution:

P(e | c) = softmax(W_e [h_ctx; k_user]),

where W_e is the weight matrix of the linear layer. Given the ground-truth emotion label e* for each conversation, the cross-entropy loss is computed to optimize emotion classification:

L_emo = −log P(e* | c).

Response Generation We fuse and feed the information of the user's context and the corresponding causality explanations of the user and the system to a fully-connected (FC) layer.
Subsequently, the target response r_T5 = [y_1, ..., y_T] with length T is generated by the T5 decoder token by token:

P(y_t | y_<t, c) = T5-Decoder(E_{y<t}, h),

where E_{y<t} denotes the embeddings of the tokens that have been generated and h is the fused representation. The negative log-likelihood loss for generation is defined as:

L_gen = −Σ_{t=1}^{T} log P(y_t | y_<t, c).

The combined loss is defined as:

L = L_emo + L_gen.
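A minimal PyTorch sketch of the emotion head and combined training objective described above is given below. The hidden size, the concatenation fusion, the 32 emotion categories, and an unweighted sum of the two losses are assumptions; the softmax is folded into `F.cross_entropy`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionHead(nn.Module):
    """Linear layer over the fused context / user-causality representation.

    The softmax over emotion categories is applied implicitly inside
    F.cross_entropy during training.
    """
    def __init__(self, hidden: int, num_emotions: int = 32):
        super().__init__()
        self.proj = nn.Linear(2 * hidden, num_emotions)

    def forward(self, h_ctx, h_user):
        # Concatenation fusion of context and user-causality states (assumed)
        return self.proj(torch.cat([h_ctx, h_user], dim=-1))

def combined_loss(emo_logits, emo_labels, gen_logits, target_ids):
    """L = L_emo + L_gen: emotion cross-entropy plus token-level NLL."""
    l_emo = F.cross_entropy(emo_logits, emo_labels)
    l_gen = F.cross_entropy(gen_logits.reshape(-1, gen_logits.size(-1)),
                            target_ids.reshape(-1))
    return l_emo + l_gen
```

In the full model, `gen_logits` would come from the T5 decoder conditioned on the fused encoder outputs.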

Evaluation of Causality Explanation based on ChatGPT
We first evaluate how the output of the causality reasoning module is matched with the reaction and intention of the actual (ground-truth) response.

Dataset
The EmpatheticDialogues dataset of 25k empathetic conversations is used. The ratio for training/validation/test is 8:1:1.

Setting
For the experiments based on ChatGPT, we used the "gpt-3.5-turbo" engine version with a temperature of 0. We used 10% of the EmpatheticDialogue test set for this evaluation (250 samples each for the single-turn and multi-turn settings).

Automatic Metrics
(Macro-averaged) F1 score (Rajpurkar et al., 2016), precision, and recall are computed by matching the portion of words in the generation and ground truth that overlap after removing stopwords. BLEU (Papineni et al., 2002) evaluates the matching between n-grams of the generated response to the ground truth. We utilize BLEU-2, BLEU-3, and BLEU-4 scores.
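The word-overlap precision/recall/F1 computation can be sketched as follows. The stopword list here is illustrative; in practice a standard list (e.g. NLTK's) would presumably be used.

```python
import string
from collections import Counter

# Illustrative stopword list (assumption); the paper does not specify which.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "i", "you", "it"}

def _tokens(text: str) -> list:
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOPWORDS]

def overlap_f1(prediction: str, reference: str):
    """SQuAD-style token-overlap precision/recall/F1 after stopword removal
    (Rajpurkar et al., 2016)."""
    pred, ref = _tokens(prediction), _tokens(reference)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Macro-averaging over the test set then yields the reported scores.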
BERTScore (Zhang et al., 2019) is a BERT-based evaluation measure for text generation, which focuses on the lexical semantic similarity between the generated response and the ground truth. We adopt its precision, recall, and F1 score (PBERT, RBERT, FBERT), using the RoBERTa-Large version.

Results
We evaluate the performance of the system's intention/reaction reasoning under different numbers of in-context examples. Experimental results in Table 1 show that increasing the value of k allows ChatGPT to generate reactions and intentions that are more closely aligned with those inferred by COMET from the ground-truth response.

Evaluations on ChatGPT-Based Response Generation
Then, we evaluate the responses generated by ChatGPT.

Evaluation Models
ChatGPT: The prompt given to ChatGPT includes only the chosen in-context raw examples S from the training set, along with the test sample.
ChatGPT+Causality_user,sys: The commonsense-based causality explanation prompt M+_prompt is utilized to generate a response by ChatGPT, as illustrated in Algorithm 1.
Evaluation Metrics

Automatic Metrics

EMOACC: Following Welivita and Pu (2020) and Lee et al. (2022), we utilize EMOACC to measure the emotion accuracy of the generated responses; it is a BERT-base model (Devlin et al., 2018) fine-tuned on the EmpatheticDialogue dataset.
EMPTOME (Sharma et al., 2020): It consists of three empathy metrics. Interpretations (IP) represent expressions of acknowledgment or understanding of the interlocutor's emotion or situation; for example, a response like "I also worked hard for the math exam, which made me anxious" is considered a stronger interpretation than "I understand how you feel." Explorations (EX) represent expressions of active interest in the interlocutor's situation; for instance, a statement like "Are you feeling terrified right now?" exhibits stronger exploration than "What happened?" Emotional Reactions (ER) represent expressions of explicit emotions. These metrics are computed by pre-trained empathy identification models: RoBERTa models are separately fine-tuned for each metric to score the generated response as 0, 1, or 2, where a higher value means stronger empathy.
Coherence: We leverage BERTScore (Zhang et al., 2019) to quantify coherence by computing the semantic similarity between the generated response and the input context.

Human A/B Test
We also conducted an A/B test to compare the performance of ChatGPT+Causality_user,sys and ChatGPT. For each comparison, three crowd-workers were asked to choose the better response or select "Tie" based on three aspects: Empathy, Coherence, and Informativeness (Sabour et al., 2022). (1) Empathy (Emp.) measures whether the generated response understands the user's feelings and experiences.
(2) Coherence (Coh.) measures whether the response is coherent/relevant in context. (3) Informativeness (Inf.) evaluates whether the generated response conveys more information corresponding to the context.

Number of In-context Examples
We investigate the effect of the number of in-context examples using our proposed commonsense-based causality explanation prompt. Table 2 shows that setting k to 4 results in the highest emotion accuracy, while setting k to 2 yields better exploration and emotional reactions. Therefore, we select k values of 2 and 4 for the experiments.

Table 3 and Table 4 present the results of ChatGPT and ChatGPT+Causality_user,sys with k set to 2 and 4, under the single-turn and multi-turn settings, respectively. In the single-turn setting, a test sample consists of one utterance, while in the multi-turn setting, a test sample contains multiple turns. From the four comparisons, we observe that ChatGPT+Causality_user,sys outperforms ChatGPT on at least 5 out of 7 evaluation metrics. Notably, ChatGPT+Causality_user,sys significantly outperforms ChatGPT on EMOACC and ER, indicating that it can generate responses with appropriate emotions. This can be attributed to the inclusion of inferred user emotions and reasoned system emotions, which provide appropriate affective information for generating empathetic responses. This improvement addresses the limitation of ChatGPT in emotion recognition highlighted by Zhao et al. (2023).

Experimental Results
ChatGPT+Causality_user,sys performs better when k is set to 2 under the single-turn setting.
Overall, the performance of ChatGPT+Causality_user,sys is superior in the single-turn setting compared to the multi-turn setting. This discrepancy can be attributed to COMET, which is trained on events rather than contexts, making it less effective at predicting causality for long contexts. Addressing this limitation of COMET is left for future work.
The results of the human A/B test in Table 5 show that ChatGPT+Causality_user,sys is better than ChatGPT on the aspects of Empathy and Informativeness, owing to the knowledge enriched by the commonsense-based causality explanations.

Experiments on T5-Based Response Generation
Finally, we evaluate the responses generated by the T5-based model.

Evaluation Models
Affection-based Methods: MoEL (Lin et al., 2019), MIME (Majumder et al., 2020), and EmpDG.
COMET-based Method: CEM (Sabour et al., 2022), which employs commonsense knowledge, such as the user's reactions, intentions, desires, needs, and effects, to enhance its understanding of the interlocutor's situations and emotions.
T5-based Method: LEMPEx (Majumder et al., 2022), which adopts T5 as the encoder-decoder and utilizes a combination of exemplar-based retrieval, a response generator, and an empathy control module to generate empathetic responses.
T5 (Raffel et al., 2020): We utilize the T5 model as our base encoder-decoder architecture, integrated with the emotion classifier, and train it from scratch on the EmpatheticDialogue dataset.
T5+Causality_user: The T5 model extended with an additional T5 encoder for the user's desires/reactions.
T5+Causality_user,sys: The T5 model extended with two T5 encoders for the user's causality attributes (desires/reactions) and the system's causality attributes (intentions/reactions), respectively.

Settings
We trained T5-small (Raffel et al., 2020) from scratch on the EmpatheticDialogues dataset. The learning rate is set to 0.00001 and the batch size to 8. For decoding, we utilize top-k sampling with k set to 20, a temperature of 0.2, and a maximum generation length of 40.
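These hyperparameters could be collected as follows, assuming the Hugging Face `generate` API; whether the length limit of 40 maps to `max_new_tokens` or `max_length` is an assumption.

```python
# Decoding and training hyperparameters from this section, as they would be
# passed to a Hugging Face T5 model (a sketch, not the authors' exact code).
GEN_KWARGS = dict(
    do_sample=True,      # sampling combined with top-k filtering
    top_k=20,            # top-k decoding with k = 20
    temperature=0.2,
    max_new_tokens=40,   # assumed interpretation of "max generation length 40"
)
TRAIN_KWARGS = dict(
    learning_rate=1e-5,  # 0.00001
    per_device_train_batch_size=8,
)

# Usage (assumed):
#   from transformers import T5ForConditionalGeneration
#   model = T5ForConditionalGeneration.from_pretrained("t5-small")
#   output_ids = model.generate(**inputs, **GEN_KWARGS)
```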

Results and Analysis
Previous studies (Sabour et al., 2022; Majumder et al., 2022) have shown that CEM and LEMPEx outperform MoEL, MIME, and EmpDG. Therefore, we compared our method with CEM and LEMPEx in the human A/B test. The automatic evaluation results in Table 6 and the human A/B test results in Table 7 demonstrate the effectiveness of the proposed commonsense-based causality explanation (Causality_user,sys). The performance comparison presented in Table 8 demonstrates the superiority of our method over the baselines in terms of emotion accuracy (EMOACC), interpretation (IP), and exploration (EX) when compared to the ground truth.

Comparison between T5-based and ChatGPT-based Response Generation
We conducted a performance comparison between the T5-based and ChatGPT-based response generation, as presented in Table 9. In terms of "Empathy," ChatGPT+Causality_user,sys outperforms T5+Causality_user,sys on EMOACC, EX, and ER, but performs worse on IP. Stronger interpretation (IP), which involves understanding and empathizing through shared experiences (Sharma et al., 2020), is more frequently observed in the T5-based model, which was trained on the ground truth.
In contrast, ChatGPT-based generation is not constrained by the ground truth and tends to respond from the perspective of a machine. In terms of "Diversity" and "BLEU," it is evident that ChatGPT+Causality_user,sys exhibits larger diversity but a higher degree of mismatch with the ground truth (lower BLEU scores), indicating a potential need to balance response diversity and accuracy in generating empathetic responses.
Comparative case studies between T5-based and ChatGPT-based models with corresponding baselines can be seen in Appendix C.

Conclusions and Future Work
We have proposed a commonsense-based causality explanation approach for diverse empathetic response generation that considers the system's intentions and reactions as well as the user's desires and reactions. Specifically, we enhance ChatGPT's ability to reason about the system's intentions and reactions by integrating in-context learning with commonsense knowledge (desire, reaction, and intention). We have integrated the commonsense-based causality explanation with both ChatGPT and a trained T5 model. The experimental results demonstrate that our method outperforms other competitive methods on both automatic and human evaluations.
In the future, we will explore fine-grained approaches for causality explanation from the perspective of both the user and the system.

B Introduction in the Prompt for ChatGPT

The introduction in the prompt for ChatGPT is shown in Table 12, and the few-shot example construction is in Table 13. The introduction reads as follows:

Introduction: Assuming that you are sys, who is a friend of the user. You are empathetic sometimes. In this task, you are given the user's input and the information of "user wants to:" and "user reacts to:": "user wants to:", which means what the user wants to do after the input; "user reacts to:", which means how the user reacts to the input.
After that, please reason about the following two parts: "sys's intent:", which means what the sys wants to do after the input, or what the intent of sys is in responding to the input; "sys reacts to:", which means how the sys reacts to the input.
Then you respond (should be concise, no more than 30 words) to the input based on the information of the user's input, "user wants to:", "user reacts to:", "sys's intent:", "sys reacts to:". "sys:", which means the response of sys.
Please generate the following three parts in the format below: sys's intent: sys reacts to: sys:

C Case Studies and Error Analysis

Table 14 shows a comparison between ChatGPT and ChatGPT+Causality_user,sys and illustrates the impact of our proposed commonsense-based causality explanation. We can see that the responses of both ChatGPT and ChatGPT+Causality_user,sys show emotional reactions to the user's context. However, ChatGPT+Causality_user,sys outperforms ChatGPT by providing detailed suggestions that align with the user's desires based on the reasoned intentions. As discussed in Section A, COMET is not always reliable in its predictions. This sensitivity is evident in Table 15, where the user's inferred desires mislead the reasoned intentions of the system. Table 16 further shows comparative case studies between T5-based and ChatGPT-based models with the corresponding baselines.

Context
user: I'm so grateful for my family due to an incident with my friend.

ChatGPT
sys: What happened with your friend?

Desires (user)
user wants: to make amends to their friend; to make amends for the incident.

Intentions (sys)
sys's intent: to help; to support; to be a good friend; to be there for user.

ChatGPT+Causality_user,sys
sys: I'm here for you if you need any help making amends with your friend.

Response
sys: I know how you feel. I have been there before and it is hard to find a new one.

Baseline ChatGPT
sys: I'm sorry to hear that. It's tough when we have feelings for someone who doesn't feel the same way.

Ours: ChatGPT+Causality_user,sys
sys: I'm sorry to hear that. Maybe it's time to move on and find someone who appreciates you.