Empathetic and Emotionally Positive Conversation Systems with an Emotion-specific Query-Response Memory



Introduction
Most conversation models capture only the correlation between queries and responses and may overlook the speakers' emotional states in the conversation. Emotional conversation models (Zhou et al., 2018; Song et al., 2019) consider speakers' emotions during dialogues, where the speakers can be either users or chatbots. These models make chatbots aware of the user's emotions and enable them to respond empathetically. There are two directions for emotional conversation models. (1) Controllable response generation enables chatbots to respond conditioned on a certain emotion (Shin et al., 2020; Liu et al., 2021) or style (Zhou and Wang, 2018; Dathathri et al., 2019). These methods require an explicit emotion label as input, and the label dominates the chatbot's response. (2) Empathetic response generation detects the user's emotion from the query so that the chatbot can respond empathetically in light of that emotion (Lin et al., 2019, 2020).

Figure 1: An example of the differences among controllable response generation (in red), empathetic response generation (in blue), and our chatbot (in a mixture of red and blue). When responding to sad news, blunt cheering up might cause reinjury, while a plain show of empathy tends to aggravate negative feelings. A suitable response should be both empathetic and positive.
The above two directions cater to emotions in conversation but still have some weaknesses. For the first direction, if the given emotion label mismatches the user's emotion, controllable response generation leads to an emotional drift (Deng et al., 2020) in the dialogue, that is, the emotions of the query and the response are inconsistent and incoherent. As shown in the example in Fig. 1, if the user tells a sad story and the chatbot aggressively pushes the emotion toward happiness, the chatbot's responses may be crude, resulting in reinjury to the user. As for empathetic response generation, without any additional inductive bias, the user's emotion does not necessarily hint at how to respond empathetically (Shin et al., 2020), so considering the user's emotion usually pushes chatbots to imitate it. If the user is sad (Fig. 1), simply imitating the user's emotion may aggravate the user's negative feelings.
Existing research hardly explores how to balance empathy and positive emotions in response generation. To achieve that goal, we should adopt different responding strategies in different situations and ultimately lead the conversation in a positive direction. For example, to respond to a sad story, we should first show empathy and then gently introduce a positive emotion; for a happy utterance, we can directly congratulate the user and even share other good news.
In this paper, we propose a conversation model that generates empathetic responses and gently guides the mood of the conversation in a positive direction. To customize responding strategies for different situations, we extract implicit responding strategies for specific emotions and topics by abstracting the training corpus. We store the extracted strategies in an emotion-specific query-response memory to assist response generation. To lead toward positive emotions, we employ a sentiment evaluator to encourage positive responses. These operations constrain the strategies and emotions during generation and thus naturally decrease the diversity of the generated responses. To encourage diverse responses and fully utilize the memory outputs, inspired by the conditional variational autoencoder (CVAE) (Sohn et al., 2015), we model the memory outputs with a Gaussian mixture distribution and mix the memory outputs (i.e., responding strategies) by sampling a strategy vector z from the distribution. Finally, the strategy vector z controls a conventional conversation model to generate responses. Our experiments verify that our model not only exceeds the compared methods in quality and diversity but also encourages positive conversation. Our contributions are threefold: (1) we propose a conversation model that balances users' emotions and positive emotions; (2) we propose to abstract the corpus into an emotion-specific query-response memory that carries responding strategies for different emotions and topics, together with a strategy mixer that fuses the memory outputs; (3) our model surpasses several recent emotional chatbots in the appropriateness, diversity, and positivity of responses.

Conversation Systems
Large-scale corpora have led to great success in data-driven conversation systems, including retrieval-based (Isbell et al., 2000; Ji et al., 2014) and generation-based methods (Shang et al., 2015). Generation-based methods achieve end-to-end response generation. To improve conversation systems, Serban et al. (2016) propose to consider the historical utterances, Li et al. (2016) diversify the generated responses, and Xing et al. (2017) strengthen the topic coherence among conversational utterances. Conversation systems also consider speakers' states, including speakers' personalities (Zhang et al., 2018) and roles (Hu et al., 2019).

Emotional Conversation Systems
Emotional conversation systems can be divided into two directions: catering to the user's emotions and expressing the chatbot's emotions. The first direction aims to capture the user's emotions and make empathetic responses, i.e., suitable responses consistent with the user's states (Lin et al., 2019, 2020; Li et al., 2020; Gao et al., 2021; Li et al., 2022). Empathy refers to the capacity to respond with an appropriate emotion to another's mental states (Zhong et al., 2020). Rashkin et al. (2019) design a pipeline system, where a classifier predicts emotion words describing the user's query. Lin et al. (2019) propose an end-to-end neural conversation model that detects the emotion with an encoder and generates responses with decoders. Cao et al. (2020) and Zheng et al. (2021) apply GPT (generative pre-training) (Radford et al., 2018) to empathetic chatbots, and Zhong et al. (2020) employ BERT (Devlin et al., 2019) for emotional response generation. Liu et al. (2021) construct a dataset to simulate the dialogues between a psychologist and a help-seeker. Those methods mainly cater to users' emotions without planning emotions in future conversations, which differs from our model.
The second direction aims to enable the chatbot to respond conditioned on a given emotion, which is a sub-domain of controllable text generation (Lubis et al., 2018a,b; Dathathri et al., 2019; Colombo et al., 2019; Xu et al., 2021). Lubis et al. (2018a) achieve it with an HRED (hierarchical recurrent encoder-decoder) (Serban et al., 2016). Song et al. (2019) equip a Sequence-to-Sequence model (Seq2Seq) (Sutskever et al., 2014) with lexicon-based attention and encourage the model to express emotion implicitly. Shin et al. (2020) lead the positive emotion in conversation via reinforcement learning. ECM (emotional chatting machine) (Zhou et al., 2018) embeds the given emotion, models the emotion expression, and generates emotional words with an external memory. Inspired by dual learning, Shen and Feng (2020) enable two chatbots to learn by chatting with each other under a specific emotion. Jiang et al. (2021) lead multi-turn conversations toward a happy ending. Conversation systems that explicitly consider both of the directions above are underexplored, which is the goal of this paper.

Variational Autoencoder
Kingma and Welling (2013) propose the variational autoencoder (VAE) to reconstruct a sample x through a latent variable z. VAE estimates the intractable posterior P(z|x) with a recognition network. Sohn et al. (2015) extend VAE to the conditional VAE (CVAE), which estimates the latent z with a condition on the prior and posterior distributions. Zhao et al. (2017) apply CVAE to conversation to diversify the generated utterances, where queries act as the conditions and responses act as x. Gao et al. (2019b) equip CVAE with interpretable latent variables. The differences between those methods and our model are: 1) the estimation of our posterior relies on the memory instead of the response x as in CVAE; 2) our posterior follows a Gaussian mixture to carry a mixture of potential responses.

Model Architecture
Our model understands the user's query, determines a responding strategy, and generates a response conditioned on that strategy. It has five modules (Fig. 2):
• Encoder represents the query q with a vector e.
• Emotion Detector (ED) detects the emotion of the query q.
• Responding Strategy Generator (RSG), the core module of this paper, generates a responding-strategy vector z to guide generation. RSG contains an emotion-specific query-response memory and a strategy mixer to determine a suitable responding strategy.
• Conditional Conversation Model (CCM) is a transformer-based conversation model that generates a response for the given query q under the strategy vector z.
• Pre-trained Sentiment Evaluator (PSE) evaluates the sentiment of generated responses and provides feedback to the modules above.
Our training procedure consists of two phases. The first phase (pre-training phase) pre-trains ED, CCM, and PSE separately. The second phase learns Encoder and RSG while freezing PSE and CCM. Encoder provides the query representation for RSG, ED detects the query emotion for RSG, and then RSG generates a responding-strategy vector to guide CCM to generate responses. PSE and CCM provide feedback to supervise the training of Encoder and RSG.

Encoder
To represent the user's query, we employ a GRU-based encoder to embed the query into a vector. The Encoder consists of a word embedding layer, a GRU (Cho et al., 2014) layer, and a multilayer perceptron. The word embedding layer projects each word of the query into a vector. Then, the GRU receives the sequence of word embeddings and outputs its last hidden state. The multilayer perceptron is composed of two fully connected layers with a non-linear function; it transfers the GRU's last hidden state into another vector, denoted e, which is the output of Encoder carrying the query's information. Encoder is involved in both training phases. We give more details about the training in Sec. 3.7.
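As a concrete sketch, the pipeline above (word embeddings → GRU → two-layer MLP) can be mocked up as follows. The toy dimensions and randomly initialised parameters are illustrative assumptions, not the paper's actual settings (which use 512-d vectors; see the Appendix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes standing in for the real 512-d model.
VOCAB, EMB, HID, OUT = 10, 4, 6, 6

W_emb = rng.standard_normal((VOCAB, EMB)) * 0.1          # word embedding table
W_z, U_z = rng.standard_normal((HID, EMB)) * 0.1, rng.standard_normal((HID, HID)) * 0.1
W_r, U_r = rng.standard_normal((HID, EMB)) * 0.1, rng.standard_normal((HID, HID)) * 0.1
W_h, U_h = rng.standard_normal((HID, EMB)) * 0.1, rng.standard_normal((HID, HID)) * 0.1
W1, W2 = rng.standard_normal((OUT, HID)) * 0.1, rng.standard_normal((OUT, OUT)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x):
    z = sigmoid(W_z @ x + U_z @ h)                       # update gate
    r = sigmoid(W_r @ x + U_r @ h)                       # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h))
    return (1 - z) * h + z * h_tilde

def encode(token_ids):
    h = np.zeros(HID)
    for t in token_ids:                                  # embed each word, run the GRU
        h = gru_step(h, W_emb[t])
    return W2 @ np.tanh(W1 @ h)                          # two-layer MLP -> query vector e

e = encode([1, 5, 3, 7])
print(e.shape)  # (6,)
```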

Emotion Detector (ED)
To explicitly represent the emotion of the user's query, we pre-train an emotion detector (ED) to classify the query's emotion. The output is an explicit emotion category (e.g., happy). Following the conventional use of BERT for text classification (Devlin et al., 2019), we initialize ED's parameters with a pre-trained BERT model and then fine-tune ED on an emotion classification dataset with 7 emotion categories. This module is only trained in the first training phase and provides emotion categories to RSG in the second phase.

Responding Strategy Generator (RSG)
The responding strategy generator (RSG) aims to bridge the user's query with the responding strategies. Given the query vector e and the query's emotion label, RSG generates a strategy vector z that guides the chatbot to respond suitably with respect to the user's emotion and to lead toward positive emotions.
The RSG is trained only in the second training phase and has two sub-modules: 1. Emotion-specific Query-Response Memory receives the query representation e from Encoder and outputs several vectors v carrying information about potential responses; 2. Strategy Mixer mixes the memory outputs v to obtain a strategy vector z for the chatbot.

Emotion-specific Query-Response Memory
This module is designed to carry different implicit responding strategies for different user emotions. It consists of several key-value memories, and each memory corresponds to a specific emotion. Each emotion-specific memory has K slots, each of which is a key-value vector pair ⟨k_i, v_i⟩. The key vector k_i represents a group of similar queries; the value vector v_i carries the information of a group of potential responses. A key-value slot thus memorizes a mapping from queries to information on how to respond for a specific emotion and topic. The memory module is constructed by memory write and contributes to the other modules by memory read.
• Memory Read: The inputs of memory read are the user's query representation e from Encoder and the detected query emotion from ED. We first locate the emotion-specific memory according to the detected emotion. Then, within that memory, the read operation searches for memory slots by the similarity between the query vector e and every memorized key vector k_i, where we measure similarity via the dot product. Memory read fetches values in one of two ways: Hard Read fetches the value vector whose key is most similar to the query vector e, using the single most suitable piece of response information. Soft Read fetches the several most similar value vectors together with their similarity scores, i.e., it reads several potentially useful pieces of information. The output of memory read acts as the input of Strategy Mixer. In this way, each key vector gathers similar user queries and memorizes their representative information (i.e., cluster centers). Each value vector v_i learns to extract the common responding characteristics for a group of user queries on a similar topic with the same emotion. Hence, the ⟨k_i, v_i⟩ pair memorizes the mapping from the user's query to the responding strategies for a specific emotion, so our model can generate suitable responses catering to the user's specific emotion.
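The read operation can be sketched as follows. The toy sizes, random memory contents, and function name are our assumptions; only the dot-product scoring and the hard/soft top-c selection follow the description above:

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 8, 4                                  # slots and vector size (toy; the paper uses K=1000, 512-d)
keys = rng.standard_normal((K, D))           # k_i: one per group of similar queries
values = rng.standard_normal((K, D))         # v_i: pooled response information per group

def memory_read(e, c=1):
    """Read the emotion-specific memory already selected by the detected emotion.
    c=1 reproduces hard read; c>1 soft read (top-c values plus similarity scores)."""
    sims = keys @ e                          # dot-product similarity to every key
    top = np.argsort(sims)[::-1][:c]         # indices of the c most similar slots
    return values[top], sims[top]

e = rng.standard_normal(D)
v_hard, s_hard = memory_read(e, c=1)         # single most suitable value vector
v_soft, s_soft = memory_read(e, c=3)         # several potentially useful vectors
print(v_hard.shape, v_soft.shape)
```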

Strategy Mixer
To fully utilize the memory outputs and encourage diversity, inspired by CVAE, we propose to mix responding strategies (i.e., memory outputs) with the query and obtain the final strategy vector z by sampling. Like our memory module, each emotion has its own strategy mixer. All strategy mixers share the same structure but have their own parameters. For each sample, the model first locates the specific strategy mixer according to the emotion from ED and then uses that mixer. In the following, we describe a single strategy mixer.
The differences between our model and the vanilla CVAE are that 1. our posterior Q_ϕ(z|M, q) accesses the memory M instead of r; 2. our memory may output one or multiple vectors instead of the single input r in CVAE. Using the memory M is reasonable because M carries the mapping from users' queries to potential responses and the memory output is information about responses, so the model can infer the response r by reading the memory. We use multiple vectors from the memory M by constraining the vectors to follow the same class of distribution.
In Eq. 1, Q_ϕ(z|M, q) is the approximate posterior estimating the true posterior P(z|r, q), while P_φ(z|q) acts as the prior. The query q acts as the condition and the response r is the model output (see the derivation in Appendix D):

L_SM = −E_{Q_ϕ(z|M,q)}[log P_θ(r|z, q)] + KL(Q_ϕ(z|M, q) ∥ P_φ(z|q))   (1)

The recognition network and the prior network model the posterior Q_ϕ(z|M, q) and the prior P_φ(z|q), respectively. The decoder P_θ(r|z, q) is the CCM that generates the responses (introduced in Sec. 3.5). The actual inputs of the strategy mixer are the query vector e and the memory-read output, while its output is the final strategy vector z.
• Recognition Network models the posterior Q_ϕ(z|M, q), which generates z by accessing the memory output and the query vector e. Since the vector z is hard to model from the memory directly, we construct a mixture of Gaussians over the memory. As each memory slot covers several similar samples, we assume each memory slot corresponds to a specific Gaussian distribution and all sample vectors in that slot follow it. As in a Gaussian mixture, each query e may be similar to multiple slots, and its corresponding vector z may come from a mixture of memory slots. Hence, z's posterior Q_ϕ(z|M, q) approximately follows a mixture of Gaussians.
Particularly, there are K Gaussian distributions, and each corresponds to a memory slot. For the i-th slot, the value vector v_i acts as the mean of the i-th Gaussian. The variance of the Gaussian is a learnable scalar λ_i multiplied by an identity vector I, as used in (Yang et al., 2019). Each Gaussian is denoted as N(v_i, λ_i I).
The posterior Q_ϕ(z|M, q) describes the memory-read output with a mixture of c Gaussian distributions, weighted by the probability π_i of the sample belonging to each Gaussian. Notice that the memory has K slots in total, and memory read only fetches c (c ≪ K) at a time. The probability π_i is the normalized similarity between the key k_i and e.
Q_ϕ(z|M, q) = Σ_{i=1}^{c} π_i N(v_i, λ_i I),

where c is the number of slots read from the memory. Hard read (c = 1) fetches a single slot (distribution); soft read (c > 1) leads to a trade-off between quality and emotional effect.
• Prior Network models the prior P_φ(z|q), which generates z from the query vector e without accessing the memory M. As in CVAE, we assume the prior follows a Gaussian distribution. We employ two fully connected layers W_µ, W_σ to transfer the query vector e into the mean and variance of the prior distribution P_φ(z|q) = N(W_µ e, W_σ e).
The motivation for estimating z without accessing the memory is that generating z from the memory cannot be learned by gradient descent since the memory read is not differentiable.
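A minimal sketch of the two networks, under our assumptions: the posterior samples z by first picking a slot with probability π_i (a softmax over key/query similarities) and then drawing from N(v_i, λ_i I); for the prior we add a softplus to keep the variance positive, which the paper does not specify:

```python
import numpy as np

rng = np.random.default_rng(2)
c, D = 3, 4                                   # slots fetched by soft read; toy dims

v = rng.standard_normal((c, D))               # value vectors of the c read slots (means)
lam = np.abs(rng.standard_normal(c)) + 0.1    # learnable per-slot variances lambda_i
sims = np.array([2.0, 1.0, 0.5])              # key/query similarities from memory read

def softmax(x):
    x = x - x.max()
    ex = np.exp(x)
    return ex / ex.sum()

# Posterior Q(z|M,q): mixture of c Gaussians N(v_i, lambda_i I) with weights pi_i.
pi = softmax(sims)
i = rng.choice(c, p=pi)                       # pick a mixture component...
z = v[i] + np.sqrt(lam[i]) * rng.standard_normal(D)   # ...then sample from N(v_i, lambda_i I)

# Prior P(z|q) = N(W_mu e, softplus(W_sigma e)); the softplus is our assumption --
# the paper only writes N(W_mu e, W_sigma e).
e = rng.standard_normal(D)
W_mu, W_sigma = rng.standard_normal((D, D)), rng.standard_normal((D, D))
mu_p = W_mu @ e
var_p = np.log1p(np.exp(W_sigma @ e))         # softplus keeps the variance positive
print(z.shape, pi.sum())
```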

Conditional Conversation Model (CCM)
The conditional conversation model (CCM) generates a response for a given query constrained by a specific condition. We first build a conventional conversation model with a transformer (Vaswani et al., 2017) that transfers queries to responses. Based on the transformer, we incorporate a condition vector by prepending it to the transformer's sequence of input word embeddings. CCM is trained only in the first training phase. In the second phase, CCM's parameters are frozen, while CCM and PSE provide feedback to supervise RSG's training.
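Conditioning by prepending can be sketched in a few lines. The toy dimensions are assumptions, and a real implementation would feed this sequence into the transformer:

```python
import numpy as np

rng = np.random.default_rng(3)
T, D = 5, 8                                   # query length and model dim (toy)

word_emb = rng.standard_normal((T, D))        # transformer input word embeddings
z = rng.standard_normal(D)                    # condition: strategy vector from RSG
                                              # (the first training phase uses e instead)

# Prepend the condition as an extra "token" before the word embeddings,
# so every self-attention layer can attend to it.
inputs = np.vstack([z[None, :], word_emb])
print(inputs.shape)  # (6, 8): one condition slot plus T word positions
```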

Pre-trained Sentiment Evaluator (PSE)
The pre-trained sentiment evaluator (PSE) evaluates the sentiment of the responses generated by CCM and provides feedback to the other modules. PSE is a BERT-based sentiment classifier. Starting from the initial parameters of a pre-trained BERT (Devlin et al., 2019), we fine-tune PSE on a sentiment classification dataset. PSE is trained in the first phase (pre-training phase), and its training does not involve other modules. After training, we freeze its parameters and employ PSE to provide feedback in the second phase.

Model Training and Inference
Our training consists of two phases. The first phase, the pre-training phase, pre-trains ED, PSE, and CCM. ED and PSE are trained alone on their own datasets. The left branch in Fig. 2 shows CCM's pre-training. We pre-train CCM assisted by Encoder, where Encoder's output vector e acts as CCM's input condition. The output vectors cover a variety of conditions since they come from various utterances.
The second training phase (middle branch in Fig. 2) trains Encoder and RSG while freezing ED, PSE, and CCM, since these are well trained in the first phase. Encoder encodes the query and feeds its output e to RSG. Then, RSG learns to generate the responding-strategy vector z. CCM takes z, instead of e as in the first phase1, as its input condition and generates the final response r. PSE evaluates the sentiment of r and feeds back to the other modules to optimize Encoder and RSG. CCM is not optimized by PSE's feedback, to avoid complicated back-propagation through every time step of CCM. The loss function in this phase is a combination of the strategy mixer's loss L_SM (Eq. 1) and the sentiment-score loss L_sent from PSE. L_SM leads to appropriate and diverse responses since it imitates the emotion-specific query-response pairs in the training samples; L_sent encourages positive conversation.
L = L_SM + α · L_sent,

where α denotes the weight balancing the two losses, and PSE's sentiment score is identical to the probability of the generated response being positive.
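A hedged sketch of the combined objective. The closed-form diagonal-Gaussian KL is standard; approximating the mixture's KL by a π-weighted sum of per-component KLs, and realising L_sent as a negative log-likelihood of PSE's "positive" probability, are our own simplifications, since the paper leaves both unspecified:

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) )."""
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def total_loss(nll, pi, v, lam, mu_p, var_p, p_positive, alpha=0.9):
    # L_SM = reconstruction NLL + KL(posterior || prior).  The posterior is a
    # Gaussian mixture, so we approximate its KL by the pi-weighted sum of
    # per-component KLs (our simplification).
    kl = sum(w * kl_diag_gauss(v_i, lam_i * np.ones_like(v_i), mu_p, var_p)
             for w, v_i, lam_i in zip(pi, v, lam))
    l_sm = nll + kl
    # L_sent penalises non-positive responses; here a negative log-likelihood
    # of PSE's positive probability (one plausible choice).
    l_sent = -np.log(p_positive + 1e-9)
    return l_sm + alpha * l_sent

# Sanity check: identical posterior component and prior give zero KL.
mu, var = np.zeros(4), np.ones(4)
print(kl_diag_gauss(mu, var, mu, var))  # 0.0
```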
The inference phase is the same as the second training phase except that inference omits the recognition network. Solid arrows in Fig. 2 show the inference phase.

Experimental Setting
Following Shang et al. (2015), we use the conversation dataset from weibo.com with 1.25M samples. We give details about hyper-parameters, code, and datasets in the Appendix. We conduct experiments on the following methods.
• Emotion Independent Models. Seq2Seq (S2S) (Shang et al., 2015) and Transformer (Trs) (Vaswani et al., 2017) are conventional conversation models that do not consider emotions.
• Emotional Conversation Models. To encourage positive conversations, S2S+ECM follows the Seq2Seq-based ECM (Zhou et al., 2018), which learns to generate a response with a given emotion label. We train the ECM on a corpus with emotion labels (positive and negative) and feed the positive label to the conversation model at inference. As the transformer is more powerful than Seq2Seq, we also implement a transformer-based ECM (Zhou et al., 2018), Trs+ECM. Trs+RL (Shin et al., 2020) leads the positive emotion via reinforcement learning on a transformer, and in Trs+Dual (Shen and Feng, 2020) two chatbots learn by chatting with each other under a specific emotion.

We evaluate all methods with both automatic and human evaluations. The automatic evaluations cover three aspects. First, we evaluate Appropriateness with Bleu-N (Papineni et al., 2002) and Nist-N (Doddington, 2002). Second, we measure Diversity with Dist-N (Li et al., 2016) and Entropy (Ent) (Mou et al., 2016). Third, we measure how positive the generated responses are (Emotion). Sent is the average sentiment score of the generated responses given by a sentiment classifier, which scores a sentence from 0 (negative) to 1 (positive). Posi% denotes the percentage of responses recognized as positive by the sentiment classifier.
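For reference, Dist-N (Li et al., 2016) is simple to compute: the ratio of distinct n-grams to all n-grams over the generated responses. A minimal implementation (the toy replies below are our own):

```python
def dist_n(sentences, n):
    """Dist-N: distinct n-grams divided by total n-grams across all responses."""
    ngrams = []
    for toks in sentences:
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

replies = [["i", "am", "fine"], ["i", "am", "happy", "today"]]
print(round(dist_n(replies, 1), 3))  # 5 unique unigrams / 7 total -> 0.714
```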
The human evaluations involve quality (Qual), diversity (Div), and the emotion reflected in the generated responses (Emo). The Emo score covers: 1. whether the response is emotionally positive; 2. whether the response is empathetic (consistent with the query's emotion). We hire five annotators, and each annotator evaluates 250 randomly selected test samples. The annotation scores range from 1 to 5 (see details of the setting in Appendix C).

Overall Performance
Table 1 shows the performance of our models and the baselines in the automatic evaluations. In our application, the transformer framework is much more suitable than Seq2Seq according to the comparisons among rows 1∼4, so we implement our model and most baselines on the transformer framework for a fair comparison.
Row 2 and rows 4 to 6 of Table 1 show the baselines that consider emotion. The ECM models (Zhou et al., 2018) (S2S+ECM and Trs+ECM) are more skilled at generating positive responses than the emotion-independent models (S2S and Trs). Trs+RL (Shin et al., 2020) outperforms Trs but slightly underperforms Trs+ECM in Emotion; the reason is that RL encourages positive utterances but is hard to train due to the non-differentiability of the generation model (Yu et al., 2017). In terms of appropriateness and diversity, we observe slight performance drops for S2S+ECM, Trs+ECM, and Trs+RL compared to their emotion-independent variants, and a similar phenomenon is reported in their original papers (Shin et al., 2020). Trs+Dual (Shen and Feng, 2020), the strongest baseline, further improves performance on all aspects.
Our proposed models surpass all the baselines on most metrics, making clear improvements in Appropriateness and Emotion.

Table 3: The results of our models with different c in memory read. c = K means the model reads all K memory slots. c = 10 and c = 1 are our model with soft and hard read as shown in Table 1.

Even though improving Diversity is not our goal, and our constraints on emotions and responding strategies naturally limit diversity, our model still performs well on diversity owing to the Strategy Mixer. Ours (soft) does better in Emotion, while Ours (hard) performs better in Appropriateness and Diversity; we analyze this further in Sec. 4.4. Our two variants also exceed the baselines on the human evaluations. Note that Emo considers not only positive emotion but also empathy (correlation with the query's emotion). Ours (soft) outperforming Ours (hard) in Emo indicates that a mixture of emotions is better than a single one.

Ablation Study
Table 2 shows the ablation study of our proposed components. Ours − SM denotes our model's variant without the strategy mixer, where the model directly feeds the memory outputs to CCM. This variant underperforms our model (Ours (hard)), verifying the effectiveness of the strategy mixer proposed in Sec. 3.4. The reason is that our strategy mixer learns to mix the strategies via gradient descent, while the strategy vector z of Ours − SM comes from the memory directly, which is non-differentiable for end-to-end training.
Ours − fixing CCM means CCM's (Sec. 3.5) parameters are not frozen in the second training phase. It optimizes the transformer by treating it as a policy model and regarding text generation as a sequence of actions. This strategy inevitably faces a large action space and complicated back-propagation through the transformer, so the results of Ours − fixing CCM remain suboptimal. As the training strategy of Ours − fixing CCM is similar to that of our baseline Trs+RL (Shin et al., 2020), these results confirm that Trs+RL's strategy is not very effective.
The variant without the sentiment-score loss (Ours − L_sent) behaves similarly to our full model except in Emotion. This shows that the supervision from PSE (Sec. 3.6) is necessary to lead to positive conversations and does not affect appropriateness and diversity much. Ours − Gaussian and Ours − EmoMemory show the importance of the Gaussian distributions and of the emotion-specific memory module in SM.

Analysis about the Memory Reading
We can control the way the query-response memory is read by varying c. Table 3 shows the model performances with different c. c = 1 and c = 10 are identical to Ours (hard) and Ours (soft). c = K indicates an extreme setting in which our model reads all memory slots (K = 1000 in our experiments). In general, a larger c tends to result in higher diversity, since the model reads information from more varied resources (more memory slots). A smaller c leads to higher performance on Appropriateness and Emotion, because it leverages information from a few relevant memory slots.
In the extreme setting c = K, the variant reads the whole memory, and the important information in the memory is overwhelmed. The results with c = 1 show that a responding strategy mixing several emotions, rather than using a single one, slightly harms the overall quality but helps the emotional generation.

Case Study
We conduct case studies on two cases (cases and detailed analyses in Appendix E). Our model achieves visible improvements over the baselines: its generated responses show empathy by agreeing with users and guide the conversation in a positive direction.

Conclusion
In this paper, we propose an emotional chatbot able to make empathetic and emotionally positive responses. We construct an emotion-specific query-response memory to abstract and memorize the correlation between users' queries and the potential responses. The model fuses the memory outputs into a latent distribution and samples an implicit responding strategy from it. These operations are supervised by a sentiment evaluator to encourage positive emotions. Our experiments show our model's strengths in the appropriateness, diversity, and positivity of the generated responses. In the future, we will extend our model to multi-turn conversation scenarios.

Ethical Considerations
The target of this paper is to build an empathetic and emotionally positive dialogue system. There are several concerns about applying this work in real use that may lead to ethical issues. First, the applications of this work should be chosen carefully. Our model is designed to behave as a chatbot, not a therapeutic system, and cannot replace psychologists in conducting any treatment. We remind potential users that a patient with a psychological illness should see a psychologist instead of regarding our model as a treatment.
Second, training a conversation model may lead to privacy disclosure because conversation samples sometimes come from personal conversations. In this work, we try to avoid these issues: our data source is a published dataset whose collection does not involve privacy issues. We emphasize that other researchers who want to follow and re-implement our model should choose their training data carefully with privacy concerns in mind.
Third, our work validates the proposed method and the baseline models with human evaluation, which involves manual labor. We hire five annotators to score the generated sentences, and the hourly pay is set to 15 US$ per person, which is higher than the local statutory minimum wage.

For the methods based on the transformer (including ours), we implement and follow the "transformer base" setting of the original paper (Vaswani et al., 2017). The model dimension and embedding dimension are 512, the number of stacked layers is 6, and the number of heads is 8. We use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, and ϵ = 10−9. The maximum length of queries and responses is limited to 80 for all the methods. We do not explicitly restrict the priority of the empathetic and positive emotions, and the weight α in the loss function is 0.9. The batch size is 64. The memory size K is 1000, and c for memory soft reading is 10. The dimension of z and of the keys and values in the memory is 512. We infer responses with top-K sampling with a size of 20. PSE and ED utilize the Chinese version of pre-trained BERT6. We train the models on a GPU (V100) with 32GB memory. The running time of our model ranges from 1.5 to 3 days. For the hyper-parameters, we tuned the memory size K by trying K = 1, 10, 100, 500, and N. For the top-K sample size, we tried 10, 20, and 40. For the weight α of the loss function, we tried 0.5, 0.7, 0.9, 1, 1.2, 1.5, and 5.

C Details about the Human Evaluation
Following previous papers (Gao et al., 2019b,a; Tian et al., 2019), we hired five annotators from a language-labeling company that is professional in annotation for both industry and academia. The company assigned us annotators specialized in evaluating samples from the perspectives of linguistics and psychology. Each annotator evaluates 250 randomly selected samples. The number of annotators and the quantity of samples are in the same range as related papers (Gao et al., 2019b,a; Tian et al., 2019; Song et al., 2020). Each sample's outputs from the different models are shuffled, and the model names are not accessible to the annotators. The annotators are required to view all information about the current sample, including the input query and the ground-truth output. Note that our experimental setting is single-turn conversation, so contexts from previous turns are not available in either the automatic or the human evaluation.
When we conduct the human annotation, we ask the annotators to obey the following instructions.
• "Qual" score: A response coherent with the conversational context and appropriate to the current topic, without typos, should be marked 5 points. A valid response that only satisfies the user's query should be marked 3 points. An irrelevant response should be scored 1. An incomplete response from which the annotators cannot get the speaker's meaning should also be assigned 1. Points 2 and 4 are for decision dilemmas.

6 huggingface.co/bert-base-chinese
• "Div" score: 5 points stand for a response with at least two clauses from different topics, where the topic may be transferred from the current conversation to another scenario (for example, for the query "How's the weather?", the response "It's fine today, let's play basketball" transfers the weather topic to sports and should be scored 5 points). A normal response with a single clause or a single topic should be marked 3 points. 1 point is for a universal reply (e.g., "I do think so") or a response containing no more than three unique words (e.g., "That is OK"). Points 2 and 4 are for decision dilemmas.
• "Emo" score: A response that satisfies both aspects (empathetic and emotionally positive) should be assigned the full mark (5 points). A response that satisfies only one aspect should be scored as 3 points. A response that satisfies neither aspect should be scored as 1 point. Points 2 and 4 are for decision dilemmas. For example, when the input query is "Kobe Bryant had a plane crash. We lost him, so sad", 5 points are for the response "Sorry to hear about that crash. He's a legend and will be my idol forever. Hope his family and fans can recover soon"; 3 points are for "Yes, we lost him. I'm extremely heartbroken these days"; and 2 points are for "Come on, cheer up. We still love NBA".
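All three rubrics share the same 5/3/1 skeleton. As a minimal sketch of the "Emo" rubric (the boolean inputs are hypothetical stand-ins for the annotators' judgments of each aspect):

```python
def emo_score(empathetic: bool, positive: bool) -> int:
    """Map the two 'Emo' aspects to the 5/3/1 rubric.

    Both aspects satisfied -> 5; exactly one -> 3; neither -> 1.
    Points 2 and 4 (decision dilemmas) are left to human judgment.
    """
    satisfied = int(empathetic) + int(positive)
    return {2: 5, 1: 3, 0: 1}[satisfied]
```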

D Derivation of the Loss in Strategy Mixer
The detailed derivation of the L_SM loss mentioned in the Strategy Mixer section is as follows. L_SM is the evidence lower bound (ELBO) of the log-likelihood of P(r|q), where q denotes the query, r denotes the response, M denotes the memory used to estimate r, and z is the latent variable.
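The derivation itself is missing from this extract. The following reconstruction assumes the standard conditional-VAE factorization implied by the definitions above, with Q(z|q, r) the recognition network and P(z|q) the prior:

```latex
\begin{aligned}
\log P(r \mid q)
  &= \log \int_z P(r \mid z, q, M)\, P(z \mid q)\, dz \\
  &\ge \mathbb{E}_{z \sim Q(z \mid q, r)}\!\left[ \log P(r \mid z, q, M) \right]
     - \mathrm{KL}\!\left( Q(z \mid q, r) \,\big\|\, P(z \mid q) \right)
  \triangleq \mathcal{L}_{\mathrm{SM}}.
\end{aligned}
```

The inequality follows from Jensen's inequality after multiplying and dividing the integrand by Q(z|q, r); the reconstruction term uses the memory M while the KL term regularizes the recognition network toward the prior.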

E Case Study
Fig. 3 shows two cases mentioned in Section 4.5, comparing the performance of all the methods. In the two queries, the users express the emotions of disappointment and anger, respectively. S2S and Trs offer generic responses that merely follow the user's queries submissively. S2S+ECM, Trs+ECM, and Trs+Dual incorporate the positive emotion into the responses but sacrifice the consistency between queries and responses. Interestingly, Trs+RL generates an ironic response, indicating that the explicit reward feedback has a large impact on the emotion of the response. Our methods achieve the best performance among all the methods, especially Ours (Soft). The generated responses agree with the user's points and show empathy by continuing the user's topic; they then guide the conversation to be positive.

Figure 2: Overview of our model. The solid arrows indicate operations used in both training and inference, while the dashed arrows indicate operations used only in training. The left side shows the first and second training phases; the right side shows our core module, the Responding Strategy Generator, which works in the second phase. Colors represent various groups of similar samples. q, r, and r̂ denote the query, the ground-truth response, and the generated response, respectively.
• Trs+RL (Shin et al., 2020) uses RL to encourage a transformer to output positive responses.
• Ours. Ours (Hard) and Ours (Soft) denote our proposed model with hard read (c = 1 in the recognition network) and soft read (c = 10).
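For intuition, hard versus soft memory reading can be sketched as top-c attention over memory slots (the slot layout and similarity scoring below are our assumptions for illustration, not the paper's exact recognition network):

```python
import numpy as np

def read_memory(query: np.ndarray, memory: np.ndarray, c: int = 1) -> np.ndarray:
    """Read from a query-response memory.

    memory: (num_slots, dim) matrix of stored slot representations.
    c = 1 reproduces a hard read (the single best-matching slot);
    larger c (e.g. 10) gives a soft read over the top-c slots.
    """
    scores = memory @ query                     # similarity score per slot
    top = np.argsort(scores)[::-1][:c]          # indices of the top-c slots
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                    # softmax over the top-c scores
    return weights @ memory[top]                # weighted average of slot vectors
```

With c = 1 the softmax collapses to a single weight of 1, so the read returns exactly one memory slot; with larger c the read blends several nearby slots.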

Table 1: Performance of the comparing methods on automatic metrics and human evaluation. The first six rows show the performance of the baselines. Ours (Hard) and Ours (Soft) denote our model with hard memory reading and soft memory reading, respectively. For the automatic metrics, underlined results indicate that our improvements over the baselines are statistically significant under a t-test (p < 0.05). For the human evaluation, the Fleiss' kappa (Fleiss, 1971) among the annotators is 0.42, indicating moderate agreement.
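Fleiss' kappa over the annotators' ratings can be computed from a per-item category-count matrix; a self-contained sketch:

```python
from typing import List

def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa (Fleiss, 1971) for a ratings matrix.

    counts[i][j] = number of raters who assigned item i to category j.
    Every row must sum to the same number of raters n.
    """
    N = len(counts)            # number of items
    n = sum(counts[0])         # raters per item
    k = len(counts[0])         # number of categories

    # Per-item observed agreement P_i, averaged into P_bar
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N

    # Chance agreement P_e from the category marginals
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)
```

A value of 0.42, as reported above, falls in the 0.41–0.60 band conventionally read as moderate agreement.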
• Trs+Dual applies dual learning to controllable response generation.

Table 2: Performance of our model variants. Ours − SM, Ours − L_sent, and Ours − CCM fixing denote our variants without the strategy mixer (SM), without the sentiment loss L_sent, and without fixing the CCM's parameters in the second training phase, respectively; the remaining variants remove the Gaussian distribution in SM and the emotion-specific memory.