MISC: A Mixed Strategy-Aware Model integrating COMET for Emotional Support Conversation

Applying existing methods to emotional support conversation, which provides valuable assistance to people in need, has two major limitations: (a) they generally employ a conversation-level emotion label, which is too coarse-grained to capture the user's instant mental state; (b) most of them focus on expressing empathy in the response(s) rather than gradually reducing the user's distress. To address these problems, we propose a novel model, MISC, which first infers the user's fine-grained emotional status and then responds skillfully using a mixture of strategies. Experimental results on the benchmark dataset demonstrate the effectiveness of our method and reveal the benefits of fine-grained emotion understanding as well as mixed-up strategy modeling.


Introduction
Empathy is the ability to perceive what others feel, think in their place, and respond properly. Endowing machines with the ability of empathy has broad application scenarios, including automatic psycho-therapists, intelligent customer service, empathetic conversational agents, etc. (Fitzpatrick et al., 2017; Shin et al., 2019; Ma et al., 2020).
In this work, we focus on a special kind of human-computer empathetic conversation, i.e., emotional support conversation. Distinctively, an emotional support conversation happens between a seeker and a supporter, where the supporter aims to gradually reduce the seeker's distress as the conversation goes on. This makes existing approaches unsuitable for our setting for at least two reasons. Firstly, existing work on emotional chatting learns to predict user emotion using a conversation-level emotion label, which is coarse-grained and static with respect to the conversation context (Lin et al., 2019c; Li et al., 2020a). However, emotion is complex, and the user's emotion intensity changes as the conversation develops. It is thus necessary to infer the seeker's fine-grained mental state at each utterance. Secondly, most empathetic chatbots are trained to respond emotionally in accordance with the predicted coarse-grained emotion class, without considering how to address the seeker's emotional problem (De Graaf et al., 2012; Majumder et al., 2020; Xie and Park, 2021). Hence, they are inadequate for emotional support conversation, whose goal is to help others work through the challenges they face.

To tackle these issues, we propose a novel approach, MISC, a.k.a. MIxed Strategy-aware model integrating COMET for emotional support conversation. For the first issue, we introduce COMET, a pre-trained generative commonsense reasoning model (Bosselut et al., 2019a), and devise an attention mechanism to selectively adopt the COMET knowledge tuples for fine-grained emotion understanding. As shown in Figure 1, this allows us to capture the seeker's instantaneous mental state using different COMET tuples. For the second issue, we propose to also consider the response strategy when generating empathetic responses.
Instead of modeling the response strategy as a one-hot indicator, we formulate it as a probability distribution over a strategy codebook, and guide the response generation using a mixture of strategies. Finally, our MISC produces supportive responses based on both COMET-enhanced mental information and the distributed strategy representation. The unique design of mixed strategies not only helps to increase the expressed empathy, but also facilitates learning the gradual transition in a long response, as in the last utterance in Figure 1, which in turn makes the conversation smoother.
To evaluate our model, we conduct extensive experiments on the ESConv benchmark and compare with 5 state-of-the-art empathetic chatbots. Based on both automatic metrics and manual judgments, we demonstrate that the responses generated by our model MISC are more relevant and empathetic. Besides, additional experimental analyses reveal the importance of response strategy modeling, and shed light on how to learn a proper response strategy as well as how the response strategy can influence the empathy of the chatbot.
In brief, our contributions are as follows: (1) We present a Seq2Seq model, MISC, which incorporates commonsense knowledge and a mixed response strategy into emotional support conversation; (2) We conduct experiments on the ESConv dataset, and demonstrate the effectiveness of the proposed MISC by comparing with other SOTA methods; (3) We implement different ways of strategy modeling and give some hints on strategy-aware emotional support conversation.
Related Work

Emotion-aware Response Generation

As suggested in , emotion-aware dialogue systems can be categorized into three classes: emotional chatting, empathetic responding, and emotional support conversation. Early work targets emotional chatting and relies on emotional signals (Li et al., 2017; Zhou et al., 2018a; Wei et al., 2019; Zhou and Wang, 2018; Song et al., 2019). Later, some researchers shifted their focus towards eliciting the user's specific emotion (Lubis et al., 2018; Li et al., 2020b). Recent work begins to incorporate extra information for deeper emotion understanding and empathetic responding (Lin et al., 2020; Li et al., 2020a; Roller et al., 2021). Li et al. (2021a) and Zhong et al. (2021) exploit ConceptNet to enhance emotion reasoning for response generation. Different from them, our work exploits a generative commonsense model, COMET (Bosselut et al., 2019b), which enables us to capture the seeker's mental states and facilitates strategy prediction in emotional support conversation.

Commonsense Knowledge for NLP
Recently, a large body of literature has injected commonsense knowledge into various NLP tasks, including classification (Chen et al., 2019; Paul and Frank, 2019), question answering (Mihaylov and Frank, 2018; Bauer et al., 2018; Lin et al., 2019a), story and language generation (Guan et al., 2019; Ji et al., 2020), and also dialogue systems (Zhou et al., 2018b; Zhang et al., 2020; Li et al., 2021a; Zhong et al., 2021). These dialogue systems often utilize ConceptNet (Speer et al., 2017), aiming to complement conversation utterances with physical knowledge. Distinguished from ConceptNet, ATOMIC (Sap et al., 2019) covers social knowledge, including event-centered causes and effects as well as person-related mental states. ATOMIC is thus expected to be beneficial for emotion understanding and to contribute to response empathy. In this work, we leverage COMET (Bosselut et al., 2019b), a commonsense reasoning model trained over ATOMIC, for emotional support conversation.

Strategy-aware Conversation Modeling
Conversation strategy can be defined using different notions from different perspectives. A majority of research is conducted under the notion of dialog acts, where a plethora of dialog act schemes have been created (Mezza et al., 2018). Dialog acts are empirically validated as beneficial in both task-oriented dialogue systems and open-domain social chatbots (Zhao et al., 2017; Xu et al., 2018; Li et al., 2020c). As for empathetic dialogues, conversation strategy is often defined using the notion of response intention or communication strategy, inspired by theories of empathy in psychology and neuroscience (Lubis et al., 2019; Li et al., 2021b). Whereas Welivita and Pu (2020) define a taxonomy of 15 response intentions through which humans empathize with others,  define a set of 8 support strategies that humans utilize to reduce others' emotional distress. This partially reveals that response strategy is complex, which motivates us to condition on a mixture of strategies when generating supportive responses.

ESConv Dataset
In this paper, we use the Emotional Support Conversation dataset, ESConv . Before a conversation starts, the seeker determines their emotion type and tells the supporter the situation they are dealing with. Besides, the strategy of every supporter utterance is annotated, which is most important to our work. In total, there are 8 kinds of strategies, and they are almost evenly distributed. More details are given in the Appendix.

Problem Formulation
For general dialogue response generation, the target is to estimate the probability distribution p(r^(i) | c^(i)), where the dialogue context c^(i) consists of a sequence of n_i utterances in the dialogue history, and r^(i) is the target response. For the sake of brevity, we omit the superscript (i) when denoting a single example in the remainder of the paper.
In the setting of emotional support conversation, the seeker's situation s is considered as an extra input, which describes the seeker's problem in freeform text. We also denote the seeker's last post (utterance) as x. Consequently, the target becomes to estimate the probability distribution p(r|c, s, x).

Model: MISC
The overview of our approach is shown in Figure 2. Based on blenderbot-small (Roller et al., 2021), our model MISC consists of three main components: (1) a mental state-enhanced encoder (Bosselut et al., 2019a); (2) a mixed strategy learning module; and (3) a multi-factor-aware decoder.

Mental State-Enhanced Encoder
Following common practice, we first represent the dialogue context using the encoder E:

C = E(c),  c = CLS ⊕ u_1 ⊕ EOS ⊕ … ⊕ u_n ⊕ EOS,  (1)

where CLS is the start token and EOS is the separation token between two utterances. To better understand the seeker's situation, we exploit COMET (Bosselut et al., 2019a), a commonsense knowledge generator, to supply mental state information related to the conversation. Concretely, we treat the situation s as an event and feed it with different relations into COMET:

B_s = ⋃_{j=1}^{N_r} COMET(s, rel_j),

where N_r is the number of pre-defined relations in COMET, and rel_j stands for the j-th specific relation, such as xAttr and xReact. Note that, given a certain event-relation pair, COMET is able to generate multiple "tails" of free-form mental state information, so B_s is a set of N_s mental state blocks, i.e., B_s = {b_j^s}_{j=1}^{N_s}. Similarly, we obtain the set of mental state blocks B_x using the seeker's last post x.
Then, all of the free-form blocks are transformed into dense vectors using our encoder E, and the hidden state of each block's first token is used to represent the corresponding block.
However, the COMET blocks are noisy, and many of them are irrelevant to the context. We therefore apply an attention mechanism to refine the strongly relevant blocks, using the context representation C as the query over the block vectors, followed by a LayerNorm module LN (Ba et al., 2016). Similarly, we transform x into H_x following the same method used to transform s into H_s. In the end, we obtain the conversation-level and utterance-level representations of the seeker's mental state, H_s and H_x, which are enhanced with commonsense information.
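The block refinement above can be sketched as dot-product attention with the context representation as the query over the COMET block vectors. The following pure-Python sketch uses toy 2-dimensional vectors; the scaling factor and the omission of LayerNorm are simplifying assumptions, not the paper's exact implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def refine_blocks(context_vec, block_vecs):
    """Attend from the context query over COMET block vectors and
    return (attention weights, weighted summary of the blocks)."""
    d = len(context_vec)
    scores = [sum(q * k for q, k in zip(context_vec, b)) / math.sqrt(d)
              for b in block_vecs]
    weights = softmax(scores)
    summary = [sum(w * b[i] for w, b in zip(weights, block_vecs))
               for i in range(d)]
    return weights, summary

# A block aligned with the context (e.g. an (xReact, hurt) inference)
# should receive more attention than an off-topic one.
ctx = [1.0, 0.0]
blocks = [[1.0, 0.0],   # relevant inference
          [0.0, 1.0]]   # irrelevant inference
w, h = refine_blocks(ctx, blocks)
```

The relevant block dominates the summary, which is how noisy COMET outputs get filtered without any hard selection step.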

Mixed Strategy Learning Module
One straightforward way to predict the response strategy is to train a classifier upon the CLS state of the context representation C from Eq. (1):

p_g = softmax(MLP(C[CLS])),

where MLP is a multi-layer perceptron, and p_g records the probability of each strategy to be used.
To model the complexity of response strategies as discussed before, we propose to use the distribution p_g to model a mixture of strategies for response generation. Here, we borrow the idea of the codebook in VQ-VAE (Oord et al., 2017) to represent strategies. The strategy codebook T ∈ R^{m×d} contains m strategy latent vectors (here m = 8) with dimension size d. By weighting T with p_g, we obtain a comprehensive strategy representation h_g:

h_g = p_g^T T.

Our codebook-based method has two benefits: (1) It is beneficial when long responses are needed to skillfully reduce the seeker's distress, which is common in emotional support conversation. (2) It is flexible to learn. Intuitively, if a strategy has a higher probability in p_g, it should take greater effect in guiding the support conversation. In the extreme case of a sharp distribution, a single strategy takes over control.
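The mixed-strategy computation can be sketched as weighting the codebook rows by the predicted distribution. The logits and codebook values below are toy stand-ins for the MLP output and the learned parameters T:

```python
import math

def softmax(xs):
    # Numerically stable softmax.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

M, D = 8, 4  # m = 8 strategies, codebook dimension d = 4 (toy size)

# Strategy codebook T: one d-dimensional latent vector per strategy.
# In MISC these are trained parameters; here each row is a constant toy vector.
T = [[0.1 * (i + 1)] * D for i in range(M)]

# Toy classifier logits favoring strategy index 2
# (e.g. something like "Reflection of feelings").
logits = [0.0] * M
logits[2] = 2.0

p_g = softmax(logits)                                   # strategy distribution
h_g = [sum(p_g[i] * T[i][j] for i in range(M))          # h_g = p_g^T T
       for j in range(D)]
```

Because h_g is a convex combination of all codebook rows, every strategy contributes in proportion to its probability, while a sharp p_g degenerates gracefully toward a single strategy vector.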

Multi-Factor-Aware Decoder
What remains is to properly utilize the inferred mental states and the strategy representation. To inform the decoder of this information, we modify the backbone's cross-attention module so that the decoder attends over the context representation C, the mental-state representations H_s and H_x, and the strategy representation h_g. The resulting decoder hidden states O produce the final response by interacting with these multiple factors.
Based on blenderbot-small (Roller et al., 2021), we jointly train the model to predict the strategy and produce the response:

L_g = -log p_g[g],
L_r = -Σ_{t=1}^{n_r} log p(r_t | r_{<t}, c, s, x),
L = L_g + L_r,

where n_r is the length of the response, g is the true strategy label, L_g is the loss of predicting the strategy, L_r is the loss of predicting the response, and L is the combined objective to minimize.
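Under standard assumptions (cross-entropy for the strategy label, token-level negative log-likelihood for the response, and an unweighted sum of the two), the joint objective can be sketched as follows; the probabilities are toy stand-ins for model outputs:

```python
import math

def strategy_loss(p_g, gold):
    # L_g: negative log-probability of the gold strategy label g.
    return -math.log(p_g[gold])

def response_loss(token_probs):
    # L_r: sum over the response of per-token negative log-likelihoods
    # p(r_t | r_<t, c, s, x), for a response of length n_r.
    return -sum(math.log(p) for p in token_probs)

# Toy strategy distribution over m = 8 strategies, and toy per-token
# probabilities for a 3-token response.
p_g = [0.1, 0.6, 0.1, 0.05, 0.05, 0.05, 0.025, 0.025]
L_g = strategy_loss(p_g, gold=1)
L_r = response_loss([0.5, 0.25, 0.8])
L = L_g + L_r  # combined objective to minimize
```

In practice both terms would be computed by the model's classification head and decoder respectively and backpropagated jointly.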

Experimental Setups
We evaluate our approach and the compared ones on the ESConv dataset . For preprocessing, we truncate the conversation examples every 10 utterances, and randomly split the dataset into train, valid, and test sets with a ratio of 8:1:1. The statistics are given in Table 1.

Evaluation Metrics
We adopt a set of automatic and human evaluation metrics to assess the model performances: Automatic Metrics.
(1) We take the strategy prediction accuracy ACC. as an essential metric. A higher ACC. indicates that the model has a better capability to choose the response strategy. (2) We then report the conventional PPL (perplexity), B-2 (BLEU-2), B-4 (BLEU-4) (Papineni et al., 2002), R-L (ROUGE-L) (Lin, 2004) and M (Meteor) (Denkowski and Lavie, 2014) metrics to evaluate the lexical and semantic aspects of the generated responses.
(3) For response diversity, we report D-1 (Distinct-1) and D-2 (Distinct-2), which assess the ratios of unique n-grams in the generated responses (Li et al., 2016). Human Judgments. Following See et al. (2019), we recruit 3 professional annotators with linguistic and psychological backgrounds and ask them to rate the generated responses on the Fluency, Knowledge and Empathy aspects on a {0, 1, 2} scale. For fair comparison, the annotators do not know which model each response is from. Note that these 3 annotators are paid, and the results are proof-checked by 1 additional person.
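Distinct-n can be computed as the ratio of unique n-grams to total n-grams over the generated responses. A minimal sketch follows; pooling counts over the whole corpus (rather than averaging per response) is an assumption about the exact evaluation script:

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over tokenized responses."""
    ngrams = []
    for tokens in responses:
        ngrams.extend(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Two toy generated responses, already tokenized.
resps = [["i", "am", "here", "for", "you"],
         ["i", "am", "sorry", "to", "hear", "that"]]
d1 = distinct_n(resps, 1)  # 9 unique unigrams out of 11 total
d2 = distinct_n(resps, 2)  # 8 unique bigrams out of 9 total
```

Higher values indicate less repetitive, more diverse generations, which is why bland responses like "i am sorry" across the board depress D-1/D-2.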

Compared Models
Transformer is a vanilla Seq2Seq model trained based on the MLE loss (Vaswani et al., 2017).
MT Transformer is the Multi-Task transformer, which considers emotion prediction as an extra learning task (Rashkin et al., 2018). Specifically, we use the conversation-level emotion label provided in ESConv to learn emotion prediction.
MoEL softly combines the output states from multiple listeners (decoders) to enhance the response empathy for different emotions (Lin et al., 2019b).

MIME considers polarity-based emotion clusters and emotional mimicry for empathetic response generation (Majumder et al., 2020).

BlenderBot-Joint is the SOTA model on the ESConv dataset, which prepends a special strategy token before the response utterances.

Implementation Details
We implement our approach based on blenderbot-small (Roller et al., 2021) using the default sizes of the vocabulary and the hidden states. For the last post x and the situation s, we set the maximum number of retrieved COMET blocks to 30 and 20, respectively. The inferred COMET blocks are sent to the encoder with a maximum of 10 words each. To be comparable with the SOTA model in , we fine-tune MISC based on blenderbot-small with 90M parameters on a Tesla-V100 GPU. The batch sizes for training and evaluation are 20 and 50, respectively. We initialize the learning rate to 2e-5 and schedule it during training with a linear warmup of 120 warmup steps. We use AdamW as the optimizer (Loshchilov and Hutter, 2018) with β_1 = 0.9, β_2 = 0.999 and ε = 1e-8. After training for 8 epochs, the checkpoint with the lowest perplexity on the validation set is selected for testing. Following , we also adopt Top-p and Top-k sampling for decoding, with p = 0.3, k = 30, temperature τ = 0.7 and repetition penalty 1.03. We will release the source code to facilitate future work.

Overall Results

As shown in Table 2, the vanilla Transformer performs the worst according to its relatively poor PPL, BLEU-n and distinct-n scores. This is not surprising because it has no specific optimization objective for learning the ability of empathy, and it is observed to be deficient in capturing long contexts such as those in the ESConv dataset.
The performances of MT Transformer, MoEL and MIME are also disappointing. Even though all three are equipped with empathetic objectives, such as emotion prediction and listener ensembling, they are based on the conversation-level static emotion label, which is not adequate for fine-grained emotion understanding. More importantly, these three empathetic models lack the ability to strategically console the seekers in the setting of emotional support conversation.
By comparing with the SOTA model BlenderBot-Joint, we can see that our model MISC is more effective, especially in predicting a more accurate response strategy. Whereas BlenderBot-Joint predicts one single strategy at the first decoding step, our method MISC models mixed response strategies using a strategy codebook and allows the decoder to learn smooth transitions and exhibit empathy more naturally. The comparison suggests that it is beneficial to predict the response strategy as an extra task and to take the complexity of strategies into consideration for emotional support conversation.
The human evaluation results in Table 3 are consistent with the automatic results. Thanks to the pre-trained LM blenderbot-small (Roller et al., 2021), BlenderBot-Joint and our MISC significantly outperform the other models on the Fluency aspect. Notably, our MISC yields the highest Knowledge score, which indicates that the responses produced by our approach contain much more specific information related to the context. We conjecture that our multi-factor-aware decoder successfully learns to utilize the mental state knowledge from COMET together with the mixture of predicted strategies. Overall, MISC performs the best on almost every metric. This strongly demonstrates the effectiveness of our approach, and highlights the importance of fine-grained mental state modeling and mixed response strategy incorporation.

Analysis
Our method MISC has two novel designs: considering the fine-grained mental states and incorporating a mixture of response strategy. To investigate more, we conduct extra experiments, and the analysis results give us hints of how to develop better emotional support conversational agents.

Ablation Study
In order to verify the improvement brought by each added part (g, s, x), we drop each of the three parts from MISC and check the performance changes. As shown in Table 4, the scores on all metrics decrease dramatically when g is ablated. We therefore suppose the strategy attention is vital for guiding the semantics of the response. In addition, the scores also decline when we remove the situation s or the seeker's last query x. According to the above experiments, each main part of MISC is proven effective.

Case Study
In Table 5, an example is presented to compare the responses generated by MISC and the other models. Various problems appear in the compared models, such as inconsistency, repetition, and contradiction. In contrast, our model achieves the best performance. Besides, we present a visualization in Figure 4 to interpret how MISC organizes the response under the combined effect of the COMET blocks and the mixture of strategies.

Fine-grained Emotion Understanding
As discussed before, one limitation of previous approaches is that they rely solely on a conversation-level emotion label, which is too coarse to guide the chatbot to respond strategically and to help the emotional conversation progress healthily. To remedy this issue, we exploit the commonsense knowledge generator COMET to supplement fine-grained information about the seeker's mental state.
In order to fairly examine the effects of different emotional information, we discard the COMET blocks and implement a variant of our method, MISE, a.k.a. MIxed-Strategy-aware model integrating Emotion, where an extra emotion classification objective is added to the main architecture, as in Rashkin et al. (2018). The scores drop when the fine-grained mental information is replaced with the coarse-grained emotion label. To depict the advantage of fine-grained mental state information, we visualize the attended COMET blocks of the example in Table 5. As shown in Figure 4, our chatbot MISC pays much attention to the inferred knowledge that is beneficial for fine-grained emotion understanding and strategy-aware empathetic responding.
More specifically, the attended COMET blocks (xReact, hurt) and (xAttr, sad) permit our chatbot MISC to utter the words "it was painful", which reflects its understanding of the seeker's feeling. Besides, note that the COMET blocks with a white background are retrieved using the situation information s, and the grey ones are collected using the seeker's last post x. Despite some overlap, the white and grey attended blocks do contain distinct and crucial mental state knowledge. This partially validates that s and x are complementary to each other, and that both are useful sources of information for emotional support conversation.

Mixed-Strategy-Aware Empathetic Responding
Meanwhile, the mixture of response strategies also plays a vital role in emotional support conversation. By analyzing the aforementioned case in depth, we find some hints on why our way of modeling conversation strategy is preferable in the setting of emotional support conversation.
Hint 1: Mixed strategy is beneficial for smooth emotional support. In Figure 4, we visualize the predicted strategy representation and the generated support response from Table 5. After understanding the seeker's situation of a break-up and feelings of sadness, our MISC reasons that it might be proper to employ the strategies of Self-disclosure and Reflection of feelings to emotionally reply and effectively console the seeker. Then, MISC organizes the response by first revealing that it has had similar experiences and knows what the feelings are like. Moreover, the chatbot also supplements detailed information about moving on from a relationship to suggest that life will go on. These added-up words could be regarded as using the strategy of Information or Others, which is useful for transitioning the conversation to the next step smoothly. This case vividly shows how response generation is guided by the mixed strategies, and how skillful our chatbot MISC is.
Hint 2: Mixed strategy is more effective than single strategy. In addition to the case study, we also attempt to quantitatively assess the benefit of mixed strategy modeling. To do so, we implement another variant of our chatbot, Single, where the mixed representation is replaced with a one-hot representation. Specifically, we pick the strategy dimension with the largest probability value as the one-hot output. The comparison results are given in Table 7. Although yielding slightly better distinct-n scores, the single-strategy variant lags far behind on the lexical and semantic scores.
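The contrast between the Single variant and the mixed representation can be sketched as follows; the codebook values and the toy distribution are illustrative, not learned parameters:

```python
def to_one_hot(p_g):
    # Single variant: keep only the strategy with the largest probability.
    top = max(range(len(p_g)), key=lambda i: p_g[i])
    return [1.0 if i == top else 0.0 for i in range(len(p_g))]

def strategy_repr(weights, codebook):
    # h_g: weighted sum of codebook rows.
    d = len(codebook[0])
    return [sum(w * row[j] for w, row in zip(weights, codebook))
            for j in range(d)]

# Toy codebook with 3 strategies in a 2-dimensional latent space.
codebook = [[1.0, 0.0],
            [0.0, 1.0],
            [0.5, 0.5]]
p_g = [0.5, 0.3, 0.2]

h_mixed = strategy_repr(p_g, codebook)                # blends all strategies
h_single = strategy_repr(to_one_hot(p_g), codebook)   # top strategy only
```

The one-hot variant discards the 0.3 and 0.2 probability mass entirely, whereas the mixed representation retains the secondary strategies that can guide later parts of a long response.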
Recall that the SOTA model BlenderBot-Joint can also be regarded as a single-strategy model, where a special strategy token is first decoded at the beginning of the response generation. We then compare its way of strategy modeling with our mixed strategy representation. As shown in Figure 5, the top-k strategy prediction accuracy of our MISC always surpasses that of BlenderBot-Joint, and the top-5 accuracy of our model reaches over 80%. This again proves the success of our strategy modeling.

Hint 3: Mixed strategy is suitable for the ESC Framework. The emotional support conversations in the ESConv dataset are guided by the ESC Framework, which suggests that emotional support generally follows a certain order of strategy flow. Similar to , here we also visualize the strategy distributions learned by different models, and compare them with the "ground-truth" strategy distribution in the original dataset. As shown in Figure 3, we find: (1) Comparing our model with the SOTA model BlenderBot-Joint, our MISC better mimics the skill of strategy adoption in emotional support conversation.
(2) At almost all stages of the conversation, our model is less likely to predict the strategy of Others (the grey part) than BlenderBot-Joint. This indicates that the strategies acquired by our model are more discriminative than those of BlenderBot-Joint.
(3) Overall, the strategy distribution from our model shares very similar patterns with the ground-truth distribution. This implies that our way of modeling strategy learning is suitable for the ESC framework.

Conclusions
In this paper, we propose MISC, a novel framework for emotional support conversation, which introduces COMET to capture the user's instant mental state, and devises a mixed strategy-aware decoder to generate supportive responses. Through extensive experiments, we demonstrate the superiority and rationality of our model. In the future, we plan to learn the mixed response strategy in a dynamic way.

Ethical Considerations
Finally, we discuss the potential ethical impacts of this work: (1) The ESConv dataset is a publicly available, well-established benchmark for emotional support conversation; (2) Privacy: the original providers have filtered sensitive information such as personally identifiable information; (3) Nevertheless, due to the limitation of filtering coverage, the conversations might still contain some language that is emotionally triggering. Note that our work focuses on building emotional support conversational agents. For risky situations, such as self-harm-related conversations, we do not claim any treatment or diagnosis.