Harnessing the Power of Large Language Models for Empathetic Response Generation: Empirical Investigations and Improvements

Empathetic dialogue is an indispensable part of building harmonious social relationships and contributes to the development of helpful AI. Previous approaches are mainly based on fine-tuning small-scale language models. With the advent of ChatGPT, the application of large language models (LLMs) in this field has attracted great attention. This work empirically investigates the performance of LLMs in generating empathetic responses and proposes three improvement methods: semantically similar in-context learning, two-stage interactive generation, and combination with a knowledge base. Extensive experiments show that LLMs can significantly benefit from our proposed methods and are able to achieve state-of-the-art performance in both automatic and human evaluations. Additionally, we explore the possibility of GPT-4 simulating human evaluators.


Introduction
Empathetic dialogue plays an essential role in building harmonious social relationships (Zech and Rimé, 2005). The task of empathetic response generation involves understanding the user's experiences and feelings, and generating appropriate responses (Keskin, 2014; Rashkin et al., 2019). Using dialogue systems to provide empathetic responses has advantages such as easy access and no time constraints (Sharma et al., 2020). Figure 1 shows an example of empathetic dialogue from the benchmark dataset.
Most previous researchers have established elaborately designed models based on reliable theoretical knowledge (Lin et al., 2019; Majumder et al., 2020; Li et al., 2020; Sabour et al., 2022; Li et al., 2022; Zhou et al., 2022). However, the underlying models are mostly small in scale. Recently, large language models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Touvron et al., 2023) have been widely used in natural language processing (NLP) with superior performance. In particular, the emergence of ChatGPT has elicited substantial attention and interest in academia and industry, and it has demonstrated extraordinary performance in a variety of tasks, especially dialogue generation. These LLMs are trained on large corpora encompassing a wealth of knowledge. On specific tasks, even without fine-tuning, outstanding performance can be achieved by adopting gradient-free techniques such as in-context learning (ICL) (Brown et al., 2020; Wei et al., 2022). Therefore, it is necessary to empirically explore the performance of LLMs on specific domains, as the methods of solving problems may undergo significant changes. There have been some initial attempts (Roller et al., 2021; Lee et al., 2022) to apply LLMs to empathetic response generation. However, these approaches mainly focus on pre-training or fine-tuning on the training data, or simply explore the capability of a single model.
To investigate the capability of LLMs in empathetic response generation, this work empirically studies their performance on the empathetic dialogue benchmark dataset. We first compare LLMs in the zero-shot and few-shot ICL settings with a large number of baseline models. Surprisingly, the performance of the GPT-3.5 series of LLMs with in-context learning has comprehensively surpassed state-of-the-art models. This reveals that the paradigm shift brought by LLMs also applies to empathetic dialogue. Furthermore, based on the best-performing LLM setting, we propose three methods to improve its performance: semantically similar in-context learning, two-stage interactive generation, and combination with a knowledge base. Extensive automatic and human evaluation experiments show that LLMs can benefit from our proposed methods and generate more empathetic, coherent, and informative responses. In addition, although human evaluation is crucial in empathetic dialogue, its associated costs and time consumption are enormous. In view of the outstanding performance of LLMs on empathetic response generation, we attempt to use GPT-4 (OpenAI, 2023) to simulate human evaluators. The Spearman and Kendall-Tau correlation results indicate that GPT-4 has the potential to be a substitute for human evaluators.
Our contributions are summarized as follows: (1) To the best of our knowledge, this is the first comprehensive empirical investigation of the performance of LLMs, represented by ChatGPT, on empathetic dialogue.
(2) We construct a unified prompt template for the empathetic response generation, and LLMs guided by the template achieve outstanding performance.
(3) We propose three targeted improvement methods, and sufficient experiments demonstrate their effectiveness.
(4) We explore the possibility of GPT-4 simulating human evaluators.

Empathetic Response Generation
Empathy is a complex multi-dimensional construct in psychology and has rich forms in practice (Davis et al., 1980). At present, the two main forms of modeling empathy are affective empathy and cognitive empathy (Davis, 1983). Affective empathy oriented methods include mixture of experts (Lin et al., 2019), emotion mimicry (Majumder et al., 2020), and multi-resolution user feedback (Li et al., 2020). Cognitive empathy oriented methods include emotion causes (Gao et al., 2021; Kim et al., 2021; Qian et al., 2023a), empathetic intents (Welivita and Pu, 2020; Chen et al., 2022), and external knowledge (Li et al., 2022; Sabour et al., 2022; Zhou et al., 2022; Cai et al., 2023). Besides, Wang et al. (2022) model the interaction between knowledge and emotion, Zhao et al. (2022) consider self-other awareness, and Bi et al. (2023) and Kim et al. (2022) propose multi-grained and fine-grained levels, respectively. However, most researchers design elaborate small-scale models, and the application of LLMs represented by ChatGPT to empathetic dialogue has not been fully empirically explored.

Large Language Models
Large language models (LLMs) such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), and LLaMA (Touvron et al., 2023) are pre-trained on massive amounts of data, and their tens or hundreds of billions of parameters encode a wealth of knowledge. Recently, in combination with new training techniques such as reinforcement learning from human feedback (RLHF) and instruction tuning (Ouyang et al., 2022), the capabilities of LLMs have made a qualitative leap. For example, the emergence of ChatGPT has aroused great interest in academia and industry, demonstrating extraordinary capabilities in a variety of tasks. GPT-4 (OpenAI, 2023) has given some researchers a glimpse of the spark of artificial general intelligence (AGI) (Bubeck et al., 2023). The powerful in-context learning capabilities of LLMs have also led to a paradigm shift.
There are some preliminary attempts to apply LLMs to empathetic dialogue. Blenderbot (Roller et al., 2021) can properly demonstrate empathy through the introduction of the blended skill talk (BST) setup and the correct choice of generation strategies. However, its empathy is achieved mainly through pre-training with high-quality data and does not utilize the emerging ICL capabilities of LLMs. Lee et al. (2022) explore the ability of GPT-3 to generate empathetic responses with prompt-based in-context learning. However, they only explore GPT-3, and the capabilities of LLMs have greatly improved with the emergence of new training technologies.

Overview
Formally, the dialogue context is a sequence of alternating utterances between the speaker and the listener, defined as C = {U_1, U_2, ..., U_{n−1}}, where U_i represents the i-th utterance and n denotes the number of utterances in a dialogue. Our goal is to play the role of the listener and generate the empathetic, coherent, and informative response Y, which is U_n.
The overview of our proposed methods is illustrated in Figure 2, which includes the devised unified template of empathetic response generation and three improvement methods.The left part describes the improvement via two-stage interactive generation, the middle part displays the components of the devised unified template and the improvement via semantically similar in-context learning, and the right part illustrates details of improvement via the knowledge base.

Preliminary Exploration
LLMs possess the ability of in-context learning (ICL) (Brown et al., 2020): by providing task instructions and some examples, they can perform related tasks without fine-tuning. This capability significantly alleviates the demand for training data. We first investigate the performance of LLMs with zero-shot and few-shot ICL in empathetic response generation. Since different prompts may affect performance, we strive to maintain a consistent style when designing prompts. The devised prompt template for empathetic dialogue consists of the following components: Task Definition + Guideline Instruction + Exemplars (optional) + Dialogue Context. Among them, Task Definition is the researchers' standard definition of the task. Guideline Instruction is the instruction we expect the model to follow. Exemplars are complete instances of dialogues used to help models better understand the task. Dialogue Context is the historical dialogue between the speaker and the listener, whose last utterance belongs to the speaker. Our goal is to let the dialogue system generate the next round of the listener's utterance. An example of the prompt template is listed in Appendix A.
In the preliminary experimental exploration, we use three groups of settings.
0-shot. This represents the most straightforward way to leverage LLMs for empathetic response generation: no Exemplars are provided.
1-shot.We randomly sample a complete dialogue from the training set as the Exemplar.
5-shot.We randomly sample five complete dialogues from the training set as the Exemplars.
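The assembly of the unified template can be sketched as follows; the joining format and the "Instance i:" exemplar labels are illustrative assumptions, not the paper's exact formatting.

```python
def build_prompt(task_definition, guideline, dialogue_context, exemplars=None):
    """Assemble the unified prompt: Task Definition + Guideline Instruction
    + Exemplars (optional) + Dialogue Context."""
    parts = [task_definition, guideline]
    for i, exemplar in enumerate(exemplars or [], start=1):
        parts.append(f"Instance {i}:\n{exemplar}")  # k exemplars -> k-shot ICL
    parts.append("The following is the existing dialogue context:\n" + dialogue_context)
    return "\n\n".join(parts)
```

With an empty exemplar list this yields the 0-shot prompt; passing one or five sampled dialogues yields the 1-shot and 5-shot settings.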

Advanced Exploration
In this section, we gradually introduce three methods to improve the performance of LLMs in generating empathetic responses.

Improvement via Semantically Similar In-Context Learning

As Liu et al. (2022) argue, a small amount of carefully selected data can greatly improve the performance of LLMs. We reasonably speculate that, in addition to the number of in-context instances, their quality also has an impact on the model's performance. Therefore, when choosing in-context instances, we select the few instances from the training set whose dialogue context semantics are closest to those of the test instance. Specifically, we concatenate the dialogue context of each instance into a long sentence and use a sentence encoder to obtain its vector representation, which represents the semantics of the instance's dialogue context. For the sentence encoder, we adopt the "all-mpnet-base-v2" version of sentence-transformers (Reimers and Gurevych, 2019).1 It maps sentences to a 768-dimensional dense vector space and was trained on very large sentence-level datasets using a self-supervised contrastive learning objective. The similarity between semantics is measured by the cosine similarity between the vector representations of two sentences:

Sim(E_train, E_test) = (E_train · E_test) / (‖E_train‖ ‖E_test‖),

where E_train and E_test are the sentence encodings of the dialogue contexts in the training and test set, respectively, and Sim(·) calculates the similarity of two sentence vectors.
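Assuming each dialogue context has already been encoded (in the paper, with all-mpnet-base-v2), exemplar selection reduces to a cosine-similarity ranking. A minimal sketch:

```python
import numpy as np

def top_k_similar(train_embs, test_emb, k=5):
    """Return indices of the k training contexts whose embeddings are
    closest (by cosine similarity) to the test context's embedding."""
    train = np.asarray(train_embs, dtype=float)
    test = np.asarray(test_emb, dtype=float)
    # cosine similarity = dot product / product of norms
    sims = train @ test / (np.linalg.norm(train, axis=1) * np.linalg.norm(test))
    return np.argsort(-sims)[:k].tolist()
```

The returned indices pick the training dialogues used as Exemplars in the prompt template.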

Improvement via Two-stage Interactive Generation
In the setting of the empathetic dialogue task, the speaker's emotion label and situation are invisible to the listener. We therefore let the LLM first infer the user's emotion and situation, and then generate the final response conditioned on this inference. The model's thought process during the intermediate step is a basis for generating the final response, enhancing the model's interpretability. At the same time, it also facilitates the analysis of the impact of different key factors (such as emotions and situations) on the final result. Moreover, clearer error analysis is possible when generated responses do not work well.
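The two-stage interaction can be sketched as a pair of prompt builders; the instruction wording below is illustrative rather than the paper's exact prompt text.

```python
def stage_one_prompt(dialogue_context):
    """Stage 1: elicit the model's inference of the speaker's emotion and situation."""
    return (dialogue_context
            + "\n\nBefore replying, first infer the user's emotion and the "
              "situation the user is describing.")

def stage_two_prompt(dialogue_context, thoughts):
    """Stage 2: generate the listener's response conditioned on the
    model's own stage-1 inference."""
    return (dialogue_context
            + "\n\nYour thoughts: " + thoughts
            + "\nNow combine your thoughts with the existing dialogue context "
              "and give your response.")
```

The stage-1 output is fed verbatim into stage 2, so the intermediate "thoughts" remain inspectable for error analysis.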

Improvement via Knowledge Base
Merely inferring the speaker's emotions and situation from the historical dialogue is insufficient. Direct evidence is that, in the benchmark dataset, the response has almost no non-stopword overlap with the dialogue history (Li et al., 2022). Dialogue systems need more external knowledge to conduct empathetic dialogue. LLMs store a large amount of knowledge in their weights, so when performing specific tasks, how to better stimulate the use of relevant knowledge is crucial for improving the effect. An alternative solution is to fine-tune LLMs for specific tasks, but this process usually requires expensive hardware, time, and training data. Inspired by recent work on empathetic dialogue (Sabour et al., 2022), we resort to the commonsense knowledge base ATOMIC-2020 (Hwang et al., 2021), which contains knowledge not readily available in pre-trained language models and can generate accurate and representative knowledge for unseen entities and events. The ATOMIC-2020 knowledge base takes the form of (event, relation type, inferred knowledge) triples. We adopt the BART version of COMET (Hwang et al., 2021), trained on this knowledge base, to generate commonsense inferences of five relations (xIntent, xNeed, xWant, xEffect, xReact) for dialogue contexts. We also design an algorithm to construct a suitable prompt, which dynamically concatenates the corresponding commonsense inferences according to the dialogue context, enriching the input representation so as to stimulate the relevant knowledge of LLMs more accurately and generate more appropriate responses.
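The dynamic concatenation of COMET inferences can be sketched as below; the bracketed-relation layout is an assumption about the prompt format, and `inferences` stands in for the COMET model's per-relation outputs.

```python
RELATIONS = ["xIntent", "xNeed", "xWant", "xEffect", "xReact"]

def knowledge_prompt(dialogue_context, inferences):
    """Append commonsense inferences (one line per relation) to the
    dialogue context, skipping relations with no usable inference."""
    lines = [f"[{r}] {'; '.join(inferences[r])}"
             for r in RELATIONS if inferences.get(r)]
    return dialogue_context + "\n\nCommonsense inferences:\n" + "\n".join(lines)
```

Only non-empty relations are included, so the knowledge block adapts to each dialogue context.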

Evaluation Metrics
We follow previous related studies, conducting both automatic and human evaluations, and choose as many metrics as possible.
Automatic Evaluation We adopt Distinct-n (Dist-1/2) (Li et al., 2016), BERTScore (P_BERT, R_BERT, F_BERT) (Zhang et al., 2020), and BLEU-n (B-2/4) (Papineni et al., 2002) as the main automatic metrics for response generation. Distinct-n measures the proportion of distinct n-grams in the response, which is used for diversity evaluation in open-domain dialogue. BERTScore leverages pre-trained embeddings from BERT and matches words in candidate and reference sentences by cosine similarity; we report matching precision, recall, and F1 score. BLEU-n measures the similarity and relevance between the generated and golden responses. We do not employ Perplexity (PPL) because the vocabularies of the models differ. Additionally, since some baseline models perform emotion classification as part of their training process, we also report their emotion prediction accuracy (Acc).
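Dist-n is straightforward to compute; a minimal sketch over whitespace-tokenized responses:

```python
def distinct_n(responses, n):
    """Dist-n: number of unique n-grams divided by total n-grams across
    all generated responses (higher means more diverse)."""
    ngrams = []
    for response in responses:
        tokens = response.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

Dist-1 and Dist-2 correspond to n=1 and n=2, respectively.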

Human Evaluation
In human evaluation, we randomly sample 100 dialogues from the test set. Considering both the human labor cost and the reliability of the experiment, we select competitive models from the past year (including the state of the art) and BlenderBot as representative baselines. Given the dialogue context and these models' generated responses, we recruit three annotators (majority rule) to assign a score from 1 to 5 (1: not at all, 3: OK, 5: very good) to the generated responses based on the aspects of Empathy, Coherence, Informativity, and Fluency. The four aspects are: 1) Empathy (Emp): whether the response shows an understanding of the user's feelings and experiences, and expresses them appropriately; 2) Coherence (Coh): whether the response is coherent and relevant to the context; 3) Informativity (Inf): whether the response contains valuable information; 4) Fluency (Flu): whether the response is readable. More details about the human evaluation can be found in Appendix C.
Furthermore, we conduct a human A/B test to directly compare different models, taking into account the variation among individuals. Following Sabour et al. (2022), we conduct the pairwise preference test based on aspects. Given the context, we pair the responses generated by two different methods and ask annotators to choose the better response based on the context and the above four aspects. If the difference is not significant, a tie is allowed.

Implementation Details
We use OpenAI's GPT family as our LLMs. More specifically, we use the gpt-3.5-turbo model provided in the OpenAI API, which is the base model of ChatGPT. We also test with GPT-3 davinci and another version. We set temperature to 0 to make the outputs mostly deterministic in the experiment. We divide the dataset into training, validation, and test sets according to the original paper (Rashkin et al., 2019) with an 8:1:1 ratio. For a fair comparison, the parameter settings of all SOTA models are consistent with those recommended in their original papers or code.

Preliminary Exploration Results
Table 1 shows the automatic evaluation results of LLMs and baselines. LLMs significantly outperform existing SOTA baselines on all automatic metrics, especially diversity (Dist-1/2), which demonstrates a significant advantage of LLMs in diverse language expression (mainly unigrams and bigrams). In terms of BERTScore and BLEU, LLMs achieve average improvements of 2.1% [=(2.6+1.6+2.1)/3] and 26.95% [=(18.6+35.3)/2], respectively. This highlights the power of LLMs' in-context learning capability, which can be quickly applied to unseen specific tasks. In addition, we observe that the number of exemplars is positively correlated with diversity performance, which suggests that the addition of exemplars can influence the linguistic habits of LLMs.

In the human evaluation, we select ChatGPT (+ 5-shot), which leads on most automatic metrics, as the representative of LLMs. The human ratings and the human A/B test results are listed in Table 2 and Table 4, respectively. We observe that ChatGPT also outperforms baselines by a large margin on all aspects, which further demonstrates the outstanding performance of LLMs in generating empathetic, coherent, and informative responses. Additionally, we note that the scores of the baselines are lower than in previous studies. This is due to the superior performance of ChatGPT in empathetic dialogue, which relatively raises the standards. It is corroborated by the fact that in over 70% of cases in the A/B test, human annotators prefer responses generated by ChatGPT. For the Fluency aspect, there is no significant difference between models, since the responses generated by existing models are already fluent; therefore, we do not compare Fluency separately in the A/B test.

Advanced Exploration Results
Experimental results of the advanced exploration on LLMs are shown in Table 3 and the lower part of Table 4. Overall, the responses generated by our three improvement methods are more favorably received by human annotators in all aspects of the human A/B test, compared to the responses generated by the original ChatGPT. These results verify the effectiveness of the choice of in-context exemplars, two-stage interactive generation, and enhancement with context-related knowledge. In the automatic evaluation, the Similar ICL improvement method obtains the best performance; this is attributed to the fact that most automatic metrics favor responses closer to the ground truth, whereas introducing more thoughts or more information can deviate from the ground truth, even for an appropriate response that humans like. Besides, to verify whether the correctly inferred emotion or situation has an impact on the generated responses, we conduct two variant experiments of two-stage interactive generation. Separately replacing the model's first-stage thinking outputs with the ground-truth emotion and situation enhances both the BERTScore and BLEU metrics. However, this causes a loss in diversity.

Case Study
The generated responses from five competitive baselines and our proposed methods of LLMs are listed in Table 5 and excerpted below. It can be observed that most baselines understand the user's feeling but only provide simple comforting responses ("will be fine").

Emotion: Sad
Context:
Speaker: I think that the girl of my dreams likes somebody else. I feel very sad about it.
Listener: Ooh, I am so sorry about that. Have you tried to talk to her?
Speaker: It's tough because she has been out of the country for a month, so I will likely discuss it when she returns.

EmpHi: I'm sure she'll be fine.
KEMP: I am sure you will do great!
CEM: Well, you will be fine.
CASE (SOTA): I am sure she will be fine.
Blenderbot: That's a good idea. I'm sure you'll be able to figure it out.
ChatGPT: That sounds like a good plan. It's important to communicate your feelings and be honest with her. I hope everything works out for you.
+ SS ICL: That sounds like a good plan. It's important to communicate your feelings and concerns with her. In the meantime, try to focus on things that make you happy and take your mind off of it.
+ Two-stage: I'm sorry to hear that you're feeling sad about the situation with the girl you like. It can be tough to navigate these kinds of situations, especially when the person is out of the country. Have you thought about how you might approach the conversation when she returns?
+ Knowledge: It sounds like a tough situation to be in. It's good that you're planning to talk to her when she returns. I hope that the conversation goes well and that you can find some clarity and peace of mind.

Blenderbot generates a response with more information, but it only supports the user's idea without giving reasons or suggestions. Compared with the baselines, our proposed methods fully understand the user's feeling and generate more empathetic, coherent, and informative responses.
We then analyze the performance of the improvement methods in this case. The method of semantically similar ICL provides additional suggestions to alleviate the user's sadness ("focus on things that make you happy", "take your mind off") by learning from relevant instances. The method of two-stage interactive generation reflects the inferred user's emotion and situation more specifically in the response. The method of combination with the knowledge base generates a relevant and empathetic response based on the commonsense inference ("talk to her") of [xWant]. More cases can be found in Appendix D.

Analysis of LLM Simulating Human Evaluators
LLMs have shown outstanding performance in generating empathetic responses. Naturally, we wonder if it is possible to use LLMs to simulate human evaluators when evaluating the performance of other models. Compared to human evaluators, LLM evaluators have lower costs and shorter time consumption. Therefore, we adopt GPT-4 as the evaluator to conduct the A/B test under the same settings. Following Zhong et al. (2022), we use Spearman and Kendall-Tau correlations to assess the agreement between human evaluators and GPT-4. The results are shown in Table 6. We observe that GPT-4 achieves fairly good Spearman and Kendall-Tau correlations with human evaluators on all aspects (refer to Zhong et al. (2022)), and the best correlation on the aspect of empathy. This indicates the potential of LLMs to simulate human evaluators.
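For reference, the Kendall-Tau correlation over paired evaluator scores can be computed as below (this is tau-a, without tie correction; scipy.stats offers tie-aware variants):

```python
def kendall_tau(x, y):
    """Kendall-Tau (tau-a): (concordant - discordant) pairs over all pairs,
    e.g. comparing GPT-4 scores against human scores on the same items."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:       # both evaluators rank the pair the same way
                concordant += 1
            elif s < 0:     # the evaluators disagree on the pair's order
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value near 1 indicates the LLM evaluator orders items like the human annotators; near -1 indicates reversed orderings.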

Conclusion and Future Work
In this work, we empirically study the performance of LLMs on empathetic response generation and propose three improvement methods. Automatic and human evaluation results show that LLMs significantly outperform state-of-the-art models and verify the effectiveness of our proposed improvements.
In the future, our work can contribute to deeper comprehension and the application of LLMs for empathetic dialogue, and provide some insights for similar tasks.

Task Definition
This is an empathetic dialogue task: The first worker (Speaker) is given an emotion label and writes his own description of a situation when he has felt that way. Then, Speaker tells his story in a conversation with a second worker (Listener). The emotion label and situation of Speaker are invisible to Listener. Listener should recognize and acknowledge others' feelings in a conversation as much as possible.

Guideline Instruction
Now you play the role of Listener; please give the corresponding response according to the existing context. You only need to provide the next round of response of Listener.

Exemplars
The following is the existing dialogue context:
Instance 1: (the complete dialogue from the training set...)

Dialogue Context
Speaker: U_1
Listener: U_2
......
Speaker: U_{n−1}

Others
The additional contents for improvement methods.

... implicit and explicit intent representations to produce empathetic responses. (6) KEMP (Li et al., 2022): A model that leverages external knowledge, including commonsense and emotional lexical knowledge, to explicitly understand and express emotions in empathetic dialogue generation.
(8) CASE (Zhou et al., 2022): A model that aligns users' cognition and affection at coarse-grained and fine-grained levels through a commonsense cognitive graph and an emotional concept graph.

C Human Evaluation Details
For each case, we provide the annotators with the dialogue context and the responses of the compared models. The responses in each case are randomly shuffled, so the evaluators are not affected by the order of the models.

D Additional Case Study
Table 8 shows additional case studies of generated responses from SOTA baselines and our proposed methods.
Speaker: I wish I had money. Then she'd sit with me instead of him in the club.
Generated response: The club is not a great place to find the right kind of girl anyways. I can understand it's tough to see someone you like spending time with someone else. But I don't think money is the only way to win someone's attention. Maybe you could try talking to her and getting to know her better.

Figure 1 :
Figure 1: An example of empathetic dialogue from the EMPATHETICDIALOGUES dataset.
the user's emotion, ... situation ...... Now combine your thoughts with the existing dialogue context and give your response.
[xIntent] the intent before e; [xNeed] the need for e to happen; [xWant] the want after e; [xEffect] the effect of e on sb.

Figure 2 :
Figure 2: The overall architecture and flow of our proposed methods for LLMs in empathetic dialogue generation.
where r represents the relation type, r ∈ R, and R = {xIntent, xNeed, xWant, xEffect, xReact}. CS_kno is the concatenated external knowledge.

EMPATHETICDIALOGUES (Rashkin et al., 2019) is a large-scale benchmark dataset of multi-turn empathetic dialogue in English. Each dialogue in the dataset has an emotion label (32 types in total) and the situation corresponding to the emotion label.
where T, G, and C are the Task Definition, Guideline Instruction, and Dialogue Context, respectively, and e and s represent the inferred emotion and situation. With the prompts we designed for the two stages, we can formally express the generation as:

P(Y | T, G, C) = P(e, s | T, G, C) P(Y | e, s),   (5)

1 https://huggingface.co/sentence-transformers/all-mpnet-base-v2

Table 1 :
Results of automatic evaluation between LLMs and baselines.

Table 3 :
Results of automatic evaluation on the advanced exploration.

Table 4 :
Results of human A/B test on aspects (the statistical significance (t-test) with p-value < 0.01).

Table 5 :
Generated responses from baselines and LLMs. The bold contents show the effect of the improvement methods.

Table 6 :
Spearman and Kendall-Tau correlations of different aspects between human evaluators and GPT-4.

Table 7 :
The example of the prompt template.