SoulChat: Improving LLMs’ Empathy, Listening, and Comfort Abilities through Fine-tuning with Multi-turn Empathy Conversations

Large language models (LLMs) have been widely applied in various fields due to their excellent capability for memorizing knowledge and chain of thought (CoT). When these language models are applied in the field of psychological counseling, they often rush to provide universal advice. However, when users seek psychological support, they need to gain empathy, trust, understanding, and comfort, rather than just reasonable advice. To this end, we constructed a multi-turn empathetic conversation dataset of more than 2 million samples, in which the input is the multi-turn conversation context, and the target is an empathetic response that covers expressions such as questioning, comfort, recognition, listening, trust, emotional support, etc. Experiments show that the empathy ability of LLMs can be significantly enhanced by fine-tuning with multi-turn dialogue histories and responses that are closer to the expression of a psychological consultant.

2) ChatGPT is inclined to provide suggestions rather than ask questions or listen. It is eager to solve users' problems, usually providing lengthy and general suggestions, as shown in Figure 12 of Appendix F. However, professional psychologists rarely provide specific suggestions during the counseling process.
3) ChatGPT acts a bit like a rational "straight man" for users who need listening and comfort. Users who seek emotional support usually expect empathetic support such as listening, understanding, and comfort.
Similar issues can also be found in other LLMs, e.g., ChatGLM (Zeng et al., 2023) and SparkDesk, as presented in Appendix F. This may be due to the lack of large-scale multi-turn empathy conversation datasets at the fine-tuning stage, especially in the field of Chinese mental health or emotional support. EMPATHETICDIALOGUES (Rashkin et al., 2019) and ESConv (Liu et al., 2021) are two English empathy conversation datasets used for developing emotional support conversation (ESC) systems, e.g., MISC (Tu et al., 2022), GLHG (Peng et al., 2022), MultiESC (Cheng et al., 2022), FADO (Peng et al., 2023), etc. On the one hand, these models may rely on annotated empathy strategies and user emotions during the training or inference stage, which means that building large-scale similar datasets for fine-tuning LLMs is difficult. On the other hand, these datasets are in English, so they cannot be applied to fine-tune Chinese LLMs. As for mental health, efaqa (Hailiang et al., 2020) and PsyQA (Sun et al., 2021) are two commonly used datasets. Among them, efaqa contains 20,000 conversations and provides annotation information such as types of troubles, psychological disorders, SOS, etc. However, efaqa has a complex multi-party dialogue structure and a high proportion of low-quality responses from netizens, while PsyQA contains 22,346 questions and 56,063 single-turn long-text psychological counseling conversations. Thus, neither of these datasets can solve the three issues of ChatGPT mentioned above.
Recently, Qiu et al. (2023) proposed the SMILE approach, which employs ChatGPT to convert single-turn dialogues into multi-turn ones. They utilized SMILE to extend the single-turn conversation dataset PsyQA into an empathetic multi-turn conversation dataset, SMILECHAT, with 355,733 samples. Inspired by Qiu et al. (2023), we propose a Chinese empathy-constraint prompt, in which the empathy constraint is further strengthened compared with the SMILE prompt (see Appendix C). As shown in Figure 2 (English version: Appendix C), our empathy constraint is defined as "'心理咨询师'的回复需要结合用户的描述内容并提供共情，如：倾听、安慰、理解、信任、认可、真诚、情感支持等 (The response of the 'psychological counselor' needs to be combined with the user's description and provide empathy, such as listening, comfort, understanding, trust, recognition, sincerity, emotional support, etc.)".
"messages": [ {"role": "user", "content": ' To this end, we first constructed 215,813 different psychological counseling questions about 12 topics and 619,725 answers through data outsourcing services.Rule-based cleaning, manual rewriting and human proofreading are applied to ensure that there is no sensitive or privacy-related content in the dataset.Then, we use ChatGPT to convert these single-turn long text psychological counseling conversations to multi-turn empathy conversations.We also conducted manual proofreading and data cleansing for multi-turn dialogues rewritten by ChatGPT to further strengthen the expression of empathy, such as questioning, comfort, recognition, listening, trust, emotional support, etc.In the end, we obtained a multi-turn empathy conversation dataset, named SoulChatCorpus, with 2,300,248 samples.To our knowledge, it is the first millionscale multi-turn empathy conversation dataset in the field of mental health or emotional support.We conduct experiments by using ChatGLM-6B as the base model for fine-tuning on SoulChatCorpus.Results demonstrate that LLMs' empathy, listening, and comfort abilities can be improved significantly through fine-tuning with million-scale multi-turn empathy conversation dataset. 2 Human-centered Mental Health LLM

SoulChatCorpus Collection
We consider a one-on-one psychological counseling conversational setting in which a user and a psychological consultant engage in multiple rounds of dialogue. However, such conversation data is not publicly available due to privacy protection and the ethical standards of psychological counseling.
To construct a high-quality multi-turn empathy conversation dataset, we selected 12 topics of psychological counseling and constructed 215,813 long-text questions and 619,725 long-text answers through crowdsourcing. The distribution of topics is shown in Figure 3. Then, we used ChatGPT (99% of calls to the gpt-3.5-turbo API and 1% to the gpt-4 API) as a text rewriting tool, following the prompt shown in Figure 2, to convert single-turn psychological counseling conversations into multi-turn empathy conversations, in which one turn is in the form of "用户：<user_utt>\n心理咨询师：<psy_utt>". The responses of the "心理咨询师 (psychological counselor)" were rewritten to reflect human-centered expressions such as empathy, listening, comfort, etc. Finally, after manual proofreading, we removed 105,134 low-quality samples and ultimately obtained 2,300,248 samples. As shown in Figure 4, the word cloud map of the utterances expressed by psychological consultants indicates that the rewritten multi-turn empathy conversations have a high level of empathy.
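The rewriting step above amounts to wrapping each single-turn Q&A pair in the empathy-constraint prompt and sending it to the chat-completion API. The following is a minimal sketch of how such a request could be assembled; the prompt text is paraphrased and the function name is ours, not from the paper's released code (the actual prompt is given in Figure 2 and Appendix C):

```python
# Illustrative sketch of the single-turn -> multi-turn rewriting request.
# EMPATHY_PROMPT_TEMPLATE paraphrases the empathy-constraint prompt; the
# full prompt used in the paper is shown in Figure 2 / Appendix C.

EMPATHY_PROMPT_TEMPLATE = (
    "Please act as a text rewriter in the field of psychological counseling, "
    "rewriting the given single-turn dialogue into a multi-turn empathetic "
    "dialogue. The counselor's responses must reflect empathy, such as "
    "listening, comfort, understanding, trust, recognition, sincerity, and "
    "emotional support.\n\n[Single-turn dialogue]:\n{single_turn}\n\n"
    "Your rewritten result is:"
)

def build_rewrite_request(question: str, answer: str,
                          model: str = "gpt-3.5-turbo") -> dict:
    """Build a chat-completion payload asking the model to rewrite one
    single-turn counseling Q&A into a multi-turn empathetic conversation."""
    # One turn is formatted as "用户：<user_utt>\n心理咨询师：<psy_utt>".
    single_turn = f"用户：{question}\n心理咨询师：{answer}"
    return {
        "model": model,
        "messages": [
            {"role": "user",
             "content": EMPATHY_PROMPT_TEMPLATE.format(single_turn=single_turn)},
        ],
    }
```

The returned dictionary matches the `"messages": [{"role": "user", "content": ...}]` structure shown in Appendix C, so it can be passed directly to a chat-completion endpoint.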

SoulChat Model
We utilized ChatGLM-6B (Du et al., 2022; Zeng et al., 2023) as the base LLM architecture to develop SoulChat. ChatGLM-6B is an open-source, bilingual LLM based on the General Language Model (GLM) (Du et al., 2022) framework with 6.2 billion parameters. The input of the model is the concatenated multi-turn dialogue history, where the utterance of the user on turn i is u_i^u = '用户：(User:)' + u_i, and the utterance of the psychologist on turn i is prefixed analogously with '心理咨询师：(Psychologist:)'.
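The role-prefixed input format described above can be assembled with a small helper. This is an illustrative sketch of our reading of the format, with a hypothetical function name, not the released training code:

```python
def build_model_input(history: list, current_user_utt: str) -> str:
    """Concatenate the dialogue history into a single input string,
    prefixing each utterance with its speaker role, and end with the
    counselor prefix so the model generates the next counselor turn.

    `history` is a list of (user_utterance, psychologist_utterance) pairs.
    """
    parts = []
    for user_utt, psy_utt in history:
        parts.append(f"用户：{user_utt}")        # User: ...
        parts.append(f"心理咨询师：{psy_utt}")    # Psychologist: ...
    parts.append(f"用户：{current_user_utt}")
    parts.append("心理咨询师：")  # generation starts after this prefix
    return "\n".join(parts)
```

For example, with one prior exchange and a new user utterance, the helper yields alternating "用户：" / "心理咨询师：" lines ending in the open counselor prefix.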

Baselines
We compare SoulChat with the following benchmark models using both automatic and manual evaluations: 1) ChatGLM-6B (Du et al., 2022; Zeng et al., 2023), which serves as the base model for SoulChat.

Implementation details
SoulChat is fine-tuned on the proposed SoulChatCorpus with a batch size of 80 and 30,000 global training steps. The WarmupDecayLR learning rate scheduler with warmup_steps = 1000 and warmup_max_lr = 5e-5 is used. The maximum input token length is set to 1,536, and the maximum target token length to 512. The decoding algorithm of top-p sampling with p = 0.75 and temperature τ = 0.95 is adopted during the inference phase.
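To make the decoding setup concrete, here is a minimal pure-Python sketch of temperature scaling and nucleus (top-p) filtering with the stated values (τ = 0.95, p = 0.75). The function names are ours; a real inference stack applies the same logic over logit tensors:

```python
import math

def softmax_with_temperature(logits, tau=0.95):
    """Convert logits to probabilities, dividing by temperature tau first.
    Lower tau sharpens the distribution; tau = 1 leaves it unchanged."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp((x - m) / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.75):
    """Nucleus (top-p) filtering: keep the smallest set of highest-probability
    tokens whose cumulative probability reaches p, then renormalize.
    `probs` maps token -> probability."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, pr in items:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}
```

Sampling then draws the next token from the renormalized nucleus, which trims the low-probability tail while retaining diversity among plausible continuations.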

Results and Analysis
We randomly selected 10,000 samples each from SoulChatCorpus and SMILECHAT as the test sets for automatic evaluation, and 100 dialogues each for manual evaluation. For each sample, each model generates an answer for evaluation. We used 7 automatic evaluation metrics, including BLEU. Three individual experts majoring in psychology were asked to evaluate the generated responses in terms of content naturalness (Con.), empathy level (Emp.), helpfulness (Hel.), and safety (Saf.), as described in detail in Appendix G. The rating scale of Con., Emp., and Hel. is (0, 1, 2), while that of Saf. is (0, 1), where a higher score means better. Fleiss' κ (Fleiss, 1971) is used to measure the agreement among the annotators.
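Fleiss' κ quantifies how much the three annotators agree beyond chance. A minimal sketch of the standard computation (Fleiss, 1971), assuming every dialogue is rated by the same number of annotators:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category counts.

    `ratings[i][j]` is the number of raters who assigned item i to
    category j; every item must be rated by the same number of raters.
    Returns 1.0 for perfect agreement, 0.0 for chance-level agreement.
    """
    N = len(ratings)          # number of items
    n = sum(ratings[0])       # raters per item
    k = len(ratings[0])       # number of categories
    # Mean observed per-item agreement.
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    # Chance agreement from marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Note the degenerate case: if all raters use a single category for every item, P_e = 1 and κ is undefined (division by zero), so real evaluation code should guard against it.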

Conclusion and Future Work
In this paper, we explore how to make LLMs more human-centered.

Limitations
In this work, we proposed a human-centered LLM named SoulChat that has excellent empathy ability, fine-tuned on the proposed SoulChatCorpus dataset. Although the experimental results demonstrate the effectiveness of SoulChat, there are still some limitations to consider. The mechanism of empathy is complex. Different users have different expectations for the output of the model. For example, when discussing tense emotions, there are significant differences between the solutions expected by adults and adolescents. Therefore, human-centered LLMs need to further consider the user's personality, identity, and other attributes to assist in generating answers that are closer to the user's needs.

Ethics Statement
• Data Collection. In order to protect privacy (Hovy and Spruit, 2016), we adopted a strict manual proofreading process when constructing the dataset. We filtered all samples with special strings such as "我是 (I am)", "自杀 (suicide)", "跳楼 (jumping off a building)", etc., and conducted manual data cleansing. Any text related to privacy has been rewritten or removed. Besides, any potential conversations that pose harm to users, others, or society have been completely removed from our data. To this end, we removed 105,134 samples from the multi-turn conversations generated by ChatGPT.
• Potential Risks of the Model. We conducted a safety assessment specifically for the output of the model during the manual evaluation phase, and the results are shown in Table 1.

A Reproducibility Checklist
• Model and Data: The SoulChat model and SoulChatCorpus will be released once a decision on the paper is made.
• Model Parameters: SoulChat has 6.2B parameters, 28 layers, and a max_sequence_length of 2,048. During the inference phase, the model requires at least 14 GB of GPU memory.
• Training Time: SoulChat is trained for 30,000 global steps with a torch_dtype of "float16" on 8 NVIDIA A800-SXM4-80GB GPUs. The training time is about 79 hours.

C Our prompt VS SMILE prompt
We used the SMILE (Qiu et al., 2023) prompt and our proposed Chinese empathy-constraint prompt, respectively, to extend a single-turn long-text psychological counseling example into a multi-turn empathy conversation.

"messages": [ {"role": "user", "content": '''Please act as a text rewriter in the field of psychological counseling, rewriting the given "single-round long-text Chinese psychological counseling dialogue" into multiple rounds of empathetic dialogue. The requirements are as follows: 1. Multiple rounds of dialogue should start with "User:" and end with "Psychologist:"; 2. The response of the "psychological counselor" needs to combine the user's description and provide empathy, such as listening, comfort, understanding, trust, recognition, sincerity, emotional support, etc.; 3. The number of conversation rounds between the user and the psychological counselor should be determined based on the conversation content, between 8 and 20 rounds; 4. Within a round of conversation, the length of each utterance of the user and the psychological counselor should take the conversation history into account and should not be too long.\n\n[Single-round long-text Chinese psychological counseling dialogue]:''' + single_turn + '''\n\nPlease gradually analyze the user's needs and the empathy skills of the psychological counselor in the above single-round conversation, and rewrite it as a multi-round empathy conversation. Your rewritten result is:''' }]

Figure 5: The translated English prompt used for converting single-turn psychological counseling conversations to multi-turn empathy conversations.

User: I have been in a relationship for two years, and it has been long-distance. Things are usually very good. We started arguing and quarreling a year ago, and my boyfriend said unpleasant things or even cursed. Is it normal?
Psychologist: Hello! Long-distance love is usually okay, but it is really abnormal for him to scold you when you argue. This should be related to many factors: 1. Long-distance relationships are inherently challenging, and when your boyfriend is away from you, his language is also challenged. 2. When he usually gets along with you, he also has a lot of pressure, so when encountering similar situations, he becomes very irritable. 3. When you usually get along, there is no conflict between you; whenever there is a conflict, you argue fiercely. 4. You usually communicate less, so he doesn't have a channel to vent his negative emotions. The above are some possibilities, but they may not necessarily be the reason for his situation, because only communication between the two of you can reveal whether there are any other conflicts. So, I think you should seriously consider what kind of psychological state or situation your boyfriend's behavior reflects; only you know it best.
As shown in Figure 7 (English version: Figure 8) and Figure 9 (English version: Figure 10), the multi-turn conversation generated using the proposed prompt has richer expressions of empathy compared with the SMILE prompt.

D English Word Cloud Map
The English word cloud map is presented in Figure 6.

E Sample Conversations of SoulChat
As shown in Figure 11, SoulChat can better serve as a listener when users seek emotional support or confide.Besides, it can naturally empathize with users (e.g."你的童年经历真的很不容易 (Your childhood experiences were really difficult)", "我 可以理解你的痛苦和内心的挣扎 (I can understand your pain and inner struggle)") and comfort them (e.g."我相信你是一个坚强的人，你可 以通过自己的努力来改变现状。 (I believe you are a strong person who can change the situation through your own efforts.)").

F Sample Conversations of Other LLMs
The examples in which users seek emotional support from ChatGPT, ChatGLM, and SparkDesk are shown in Figure 12, Figure 13, and Figure 14, respectively. These LLMs are inclined to provide suggestions rather than ask questions or listen, acting a bit like a rational "straight man" for users who need listening and comfort, which makes them appear less "human-centered".

G Manual Evaluation Instructions
For mental health LLMs, we constructed a manual evaluation framework called "CEHS" with the following four dimensions: Content naturalness, Empathy, Helpfulness, and Safety, as presented in Table 2. Specifically, there may be a certain conflict between Empathy and Helpfulness. For example, ChatGPT tends to generate helpful but less empathetic responses. On the other hand, when SoulChat generates empathetic responses, this may weaken the direct helpfulness of the answer. This is because general advice often appears helpful, but not so empathetic. The scores of Empathy and Helpfulness in Table 1 also illustrate this phenomenon.

Aspect: Content naturalness. Score: 0-2. Description: It mainly evaluates whether the content generated by the model is relevant and coherent with respect to the user's conversation history, whether the content is smooth and natural, whether it is consistent with language habits, and whether it has syntax errors.

Aspect: Empathy. Score: 0-2. Description: It mainly assesses whether the text generated by the model can understand the feelings and behaviors of the parties involved, grasp their inner feelings and deep meanings, and to what degree.

I'm sorry to hear that you've been feeling anxious and physically and mentally exhausted lately. Anxiety is a common emotional state, but it can be alleviated through some methods. Here are some suggestions:

1. Seek support: Share your feelings with people you trust (friends, family, or mentors). They may be able to provide support, understanding, and advice.
2. Adjust your lifestyle: Pay attention to your diet, sleep, and exercise habits. Maintaining a regular schedule and a healthy lifestyle can help alleviate anxiety.
3. Relaxation techniques: Try some relaxation techniques, such as deep breathing, meditation, warm-water immersion, massage, or yoga. These methods can help you relax your body and mind and reduce anxiety.
4. Manage time and tasks: Develop a reasonable schedule and break tasks down into small, manageable parts. Doing so can reduce stress and anxiety and make it easier for you to concentrate.
5. Seek professional help: If anxiety is severe and affects your daily life, consider consulting a psychological professional. They can provide professional advice and treatment plans.
6. Relaxation and entertainment: Participate in some relaxing activities that you enjoy, such as reading, drawing, listening to music, or watching movies. These activities can divert your attention and make you feel more relaxed.
Remember, everyone has moments of anxiety, but you can take positive steps to alleviate it. If you feel that your anxiety persists and is difficult to handle, be sure to seek professional help.
Figure 12: A case of a user confiding to ChatGPT. ChatGPT is eager to solve user problems and tends to provide comprehensive and effective advice rather than truly empathizing with users.
User: I have been feeling anxious and physically and mentally exhausted lately. I don't know what to do.
SoulChat: I can hear your feelings. Can you tell me exactly what makes you feel anxious?
User: Recently, there has been a lot of work pressure and some interpersonal issues, which have made me feel very uncomfortable.
SoulChat: It sounds like you are going through a great challenge. But you know, you already understand your situation very well. Actually, many people have this feeling.
User: But I always feel like I can't overcome it, as if no matter which method I tried before, nothing has changed.
SoulChat: Don't give up; perhaps you need more time to adjust. Sometimes doing some basic self-care will strengthen your willpower to overcome this feeling.
...

Figure 1: A case of a user confiding to SoulChat. Compared to ChatGPT, SoulChat is better at listening and guiding users to think.

Figure 2: The prompt used for converting single-turn psychological counseling conversations to multi-turn empathy conversations (English version: Appendix C).

Figure 4: Word cloud map of psychological consultants' utterances (English version: Appendix D).

Figure 11: A case of a user confiding to SoulChat.
To this end, we constructed a Chinese large-scale multi-turn empathy conversation dataset, named SoulChatCorpus, with 12 empathy topics and more than 2 million samples. The experimental results indicate that using this dataset to finetune LLMs leads to a high level of empathy ability when users seek emotional support from LLMs. Future work needs to further consider user attributes, such as personality, gender, etc., to help LLMs generate targeted empathetic responses for different individuals.

Due to the lack of human feedback during the model finetuning stage, there are inevitably answers that may pose harm to users. Therefore, future work needs to combine RLHF to improve the safety of model-generated content. In addition, when this model is applied in downstream scenarios, it is necessary to inform users in advance that the answers they see are generated by an AI model and are for reference only.