CoMAE: A Multi-factor Hierarchical Framework for Empathetic Response Generation

The capacity for empathy is crucial to the success of open-domain dialog systems. Because empathy is multi-dimensional, various factors relate to its expression, such as communication mechanism, dialog act, and emotion. However, existing methods for empathetic response generation usually either consider only one empathy factor or ignore the hierarchical relationships between different factors, which weakens their ability to model empathy. In this paper, we propose a multi-factor hierarchical framework, CoMAE, for empathetic response generation, which models the above three key factors of empathy expression in a hierarchical way. We show experimentally that our CoMAE-based model can generate more empathetic responses than previous methods. We also highlight the importance of hierarchically modeling different factors through both an empirical analysis on a real-life corpus and extensive experiments. Our code and data are available at https://github.com/chujiezheng/CoMAE.


Introduction
Empathy, which refers to the capacity to understand or feel what another person is experiencing (Rothschild, 2006; Read, 2019), is a critical capability for open-domain dialog systems (Zhou et al., 2018b). As shown in previous research, empathetic conversational models can improve user satisfaction and receive more positive feedback in numerous domains (Klein, 1998; Liu and Picard, 2005; Brave et al., 2005; Fitzpatrick et al., 2017). Recently, numerous works have also been devoted to improving dialog models' ability to understand the feelings of interlocutors (Rashkin et al., 2019; Majumder et al., 2020), which makes dialog models more empathetic to a certain extent.

Figure 1: Our proposed hierarchical framework, CoMAE (right). The directed arrows denote dependencies. We also present the framework (left) of EmpTransfo (Zandie and Mahoor, 2020) for comparison.
However, empathy is a multi-dimensional construct (Davis et al., 1980) rather than merely recognizing the interlocutor's emotion or responding emotionally (Zhou et al., 2018a). It consists of two broad aspects related to cognition and affection (Omdahl, 2014; Paiva et al., 2017). The cognitive aspect requires understanding and interpreting the situation of the interlocutor (Elliott et al., 2018), which is reflected in the dialog act taken in the conversation (De Vignemont and Singer, 2006), such as questioning (e.g., What's wrong with it?), consoling (e.g., You'll get through this), etc. The affective aspect relates to properly expressing emotion in reaction to the experiences and feelings shared by the interlocutor, such as admiration (e.g., Congratulations!), sadness (e.g., I am sorry to hear that), etc. Very recently, Sharma et al. (2020) further characterize text-based expressed empathy, based on the above two aspects, as three communication mechanisms, a higher-level and more abstract factor related to empathy expression.
In this paper, we propose a novel framework named CoMAE for empathetic response generation (Section 3), which covers the aforementioned three key factors of empathy expression: Communication Mechanism (CM), dialog Act (DA) and Emotion (EM). Specifically, when modeling these empathy factors simultaneously, we adopt a hierarchical approach instead of following previous works that treat multiple factors independently, such as EmpTransfo (Zandie and Mahoor, 2020), which considers both DA and EM (see Figure 1 for comparison). Such approaches assume that different factors are independent of each other, which is intuitively unreasonable. In fact, our empirical analysis (Section 4) on a Reddit corpus (Zhong et al., 2020) shows that there are obvious hierarchical relationships between different factors, which confirms the soundness and necessity of hierarchical modeling.
We then devise a CoMAE-based model on top of the pre-trained language model GPT-2 (Radford et al., 2019) (Section 5), and compare the model performance with different combinations of empathy factors and hierarchical modeling. Automatic evaluation (Section 6.3) shows that combining all the three factors hierarchically can achieve the best model performance. Manual evaluation (Section 6.4) demonstrates that our model can generate more empathetic responses than previous methods. Extensive experiments (Section 6.5) further highlight the importance of hierarchical modeling in terms of the selection and realization of empathy factors.
The contributions of this paper are three-fold:
• Based on the multi-dimensional nature of empathy expression, we propose a novel framework, CoMAE, for empathetic response generation. It hierarchically models three key factors of empathy expression: communication mechanism, dialog act and emotion.
• On top of GPT-2, we devise a CoMAE-based model. Experimental results show that our model can generate more empathetic responses than previous methods.
• We empirically analyze the necessity of hierarchical modeling, and highlight its importance especially in terms of the selection and realization of different empathy factors.
Related Work

Factors Related to Empathy Expression
Empathy is a complex, multi-dimensional construct (Davis et al., 1980) consisting of two broad aspects related to cognition and affection (Omdahl, 2014; Paiva et al., 2017). As shown in Section 1, the two aspects are reflected in the dialog act (DA) taken and the emotion (EM) expressed in the conversation, respectively. Based on the theoretical definition of empathy, Sharma et al. (2020) characterize text-based expressed empathy as three communication mechanisms (CM): emotional reaction (ER) (e.g., I feel really sad for you), interpretation (IP) (e.g., This must be terrifying, I also have similar situations), and exploration (EX) (e.g., Are you still feeling alone now?). These communication mechanisms are also applied in the recently proposed task of empathetic rewriting (Sharma et al., 2021).
Besides, Zhong et al. (2020) note that persona, which refers to the social face an individual presents to the world (Jung, 2016), is highly correlated with personality (Leary and Allen, 2011), which in turn influences empathy expression (Richendoller and Weaver III, 1994; Costa et al., 2014). While Zhong et al. (2020) do not explain the explicit connection between persona and empathy expression, they suggest that different speakers may have different "styles" of expressing empathy.

Empathetic Response Generation
In recent years, empathetic response generation has attracted much research interest (Rashkin et al., 2019; Majumder et al., 2020; Zandie and Mahoor, 2020; Sun et al., 2021). Rashkin et al. (2019) suggest that dialog models can generate more empathetic responses by recognizing the interlocutor's emotion. Another line of work designs a dedicated decoder to respond to each emotion of the interlocutor, which makes the generation process more interpretable. Majumder et al. (2020) adopt the idea of emotional mimicry (Hess and Fischer, 2014) to make the generated responses more empathetic. Inspired by the advances in generative pre-trained language models (Radford et al., 2018, 2019), EmpTransfo (Zandie and Mahoor, 2020) uses GPT (Radford et al., 2018) to generate empathetic responses.
Unlike previous works that consider only the EM factor in empathy modeling, EmpTransfo takes both DA and EM into account. The fundamental differences between EmpTransfo and our work lie in two points: (1) our work further considers the communication mechanism in modeling empathy, and (2) we analyze and explore in depth the importance of hierarchical modeling of these empathy factors.

CoMAE Framework and Formulation
Our proposed CoMAE framework is shown in Figure 1. CoMAE uses CM as a high-level factor that provides coarse-grained guidance for empathy expression, and then uses DA and EM to achieve the fine-grained realization. Formally, given the context x, CoMAE divides the generation of the empathetic response y into four steps: (1) predict the CM C_y conditioned on the context, (2) predict the DA A_y conditioned on both the context and the CM, (3) predict the EM E_y based on all the previous conditions, and (4) generate the final response y. The whole process is formulated as:

P(y, C_y, A_y, E_y | x) = P(C_y | x) P(A_y | x, C_y) P(E_y | x, C_y, A_y) P(y | x, C_y, A_y, E_y).

Note that EM is conditioned on DA because we regard the expressed emotion as the effect rather than the cause of taking a dialog act; in other words, one does not usually adopt a dialog act merely for the purpose of expressing some emotion. Hence, realizing the emotion expression as expected is also important in our task, which motivates our analysis of the realization of different factors in Section 6.5.
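The factorization above can be checked with a toy numeric sketch (the sizes and distributions below are made up for illustration): multiplying randomly initialized conditional tables P(C | x), P(A | x, C) and P(E | x, C, A) yields a proper joint distribution over factor triplets.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(*shape):
    """Random conditional distribution, normalized over the last axis."""
    p = rng.random(shape)
    return p / p.sum(axis=-1, keepdims=True)

# Toy sizes for one fixed context x: 2 CM states, 3 DAs, 4 EMs.
p_c = random_dist(2)                 # P(C | x)
p_a_given_c = random_dist(2, 3)      # P(A | x, C)
p_e_given_ca = random_dist(2, 3, 4)  # P(E | x, C, A)

# CoMAE's hierarchical factorization of the factor triplet:
#   P(C, A, E | x) = P(C | x) P(A | x, C) P(E | x, C, A)
joint = p_c[:, None, None] * p_a_given_c[:, :, None] * p_e_given_ca

# Chaining conditionals yields a valid distribution over all triplets.
print(round(joint.sum(), 6))  # -> 1.0
```

The response term P(y | x, C, A, E) then conditions decoding on the selected triplet, which is what the model in Section 5 implements.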
It is also worth noting that while CoMAE contains only these three factors, the hierarchical framework can be naturally extended to further factors related to empathy expression. For instance, Zhong et al. (2020) suggest that persona plays an important role in empathetic conversations. Since persona may encode a speaker's style of adopting DAs or expressing EMs, conditioning DA and EM on persona when integrating it into empathetic response generation may lead to better performance.

Data Preparation and Analysis
While no existing corpus of empathetic conversations provides annotations of diverse empathy factors, there are abundant publicly available resources that make automatic annotation feasible. In this section, we first introduce the corpus we used and the resources and tools used for automatic annotation; we then present our empirical analysis verifying the hierarchical relationships between different empathy factors.

Zhong et al. (2020) propose a large-scale empathetic conversation corpus crawled from Reddit. It covers two domains: Happy and Offmychest. The posts in the Happy domain mainly carry positive sentiments, while those in the Offmychest domain are usually negative. We adopted their corpus for two major reasons: (1) the corpus is real-life, scalable and naturalistic rather than acted (Rashkin et al., 2019), and (2) the manual annotation in (Zhong et al., 2020) shows that most of the last responses are empathetic (73% and 61% for Happy and Offmychest respectively).

Annotation Resources
Communication Mechanism (CM). Sharma et al. (2020) provide two corpora annotated with CM: TalkLife (talklife.co) and Reddit (reddit.com). Only the latter is publicly accessible, so we used the Reddit part. Note that in the original paper, each mechanism is differentiated into three classes: "no", "weak", or "strong". Due to the unbalanced distribution of the three classes, we merged "weak" and "strong" into "yes", so that each mechanism is finally differentiated into two classes: "no" or "yes".

Dialog Act (DA). Welivita and Pu (2020) propose a taxonomy of DAs (referred to as "intents" in the original paper) for empathetic conversations. They first annotate 15 initial types of DA on the ED corpus (Rashkin et al., 2019), and finally obtain 8 high-frequency types of DA with the other types merged as others (8+others), which are shown in Figure 2.

Emotion (EM). We considered the taxonomy proposed in (Demszky et al., 2020), which contains 27 emotions and a neutral one, because: (1) it has a wide coverage of emotion categories with clear definitions, and (2) the annotated corpus is large-scale and also crawled from Reddit. However, the original emotion distribution is unbalanced, and such a fine-grained taxonomy may lead to sparsity for some emotions. Considering the task scenario of empathetic conversation, we adopted the clustering results in (Demszky et al., 2020) and modified the original taxonomy into 9 emotions and a neutral one (9+neutral), which are also shown in Figure 2. The mapping between our adopted emotions and the original emotions is given in Appendix A.
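Coarse-graining a fine-grained emotion taxonomy by clusters amounts to a simple label mapping. The sketch below is illustrative only: the groupings are hypothetical and do not reproduce the actual mapping given in Appendix A.

```python
# Hypothetical coarse-graining of fine-grained emotion labels into
# cluster labels. The real mapping used in this paper is in Appendix A;
# these groupings are invented for illustration.
CLUSTER = {
    "joy": "joy", "amusement": "joy", "excitement": "joy",
    "sadness": "sadness", "grief": "sadness", "disappointment": "sadness",
    "caring": "caring",
    "neutral": "neutral",
}

def coarse_label(fine_emotion: str) -> str:
    """Map a fine-grained emotion to its cluster; unknown labels
    fall back to neutral."""
    return CLUSTER.get(fine_emotion, "neutral")

print(coarse_label("grief"))  # -> sadness
```

Merging this way trades emotion granularity for denser per-class training data, which is the motivation stated above.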

Classifiers
We fine-tuned RoBERTa (Liu et al., 2019) (https://huggingface.co/roberta-base) classifiers for CM, DA and EM, whose performance is summarized in Table 1. They all achieve reasonable performance, ensuring the quality of the automatic annotation. However, the source domain (Rashkin et al., 2019) of the DA classifier differs from the target domain (Reddit). To verify the quality of the DA annotation, we recruited three workers from Amazon Mechanical Turk to judge whether each utterance is consistent with its annotated DA. From the utterances not annotated as others, we randomly sampled 25 utterances per DA (200 in total) to avoid the impact of the unbalanced distribution. The ratio of utterances judged as consistent is 0.78 with Fleiss' kappa κ = 0.621 (Fleiss, 1971), which indicates substantial agreement (0.6 < κ < 0.8) and that the automatic annotation of DA is also reliable.
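The agreement statistic used here, Fleiss' kappa, can be computed directly from per-utterance label counts. A minimal self-contained implementation (the rating data in the example are made up):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item rating lists.

    `ratings[i]` holds the category labels the raters assigned to
    item i; every item must be rated by the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()  # overall category counts across all items
    p_bar = 0.0         # mean per-item agreement

    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Per-item agreement: fraction of agreeing rater pairs.
        p_i = (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
        p_bar += p_i / n_items

    # Chance agreement from the marginal category proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement among 3 raters on 2 items gives kappa = 1.
print(fleiss_kappa([["yes", "yes", "yes"], ["no", "no", "no"]]))  # -> 1.0
```

In practice one would feed in the three Turkers' consistent/inconsistent judgments per sampled utterance.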

Data Filtering and Annotation
Following the original data split of (Zhong et al., 2020), we first filtered out the conversations with more than two speakers (about 15%) to ensure that the last utterance is related to the post. We used the aforementioned classifiers to automatically annotate each utterance with DA and EM, and to additionally annotate each final response with CM. We found that the last responses that are not annotated with any CM are more likely to be non-empathetic.

Analysis
To verify the hierarchical relationships between the three factors, we counted the frequency of each (X, Y) pair, where (X, Y) is one of the three factor pairs: (CM, DA), (CM, EM), (DA, EM). We approximated the statistical frequency of (X, Y) as their joint probability distribution P(X, Y), and then normalized P(X, Y) along the X dimension to obtain the conditional distribution of Y given X: P(Y | X). Figure 2 shows the heat maps of the conditional distributions of the three factor pairs. The heat maps reveal obvious patterns in the occurrence of Y given X. For instance, when one adopts the DA encouraging, one usually expresses the EM caring instead of approval or joy. If one expresses empathy with the CM exploration (EX), one almost always adopts the DA questioning and expresses the EM surprise. Hence, considering the hierarchical relationships between different empathy factors is reasonable and natural, and is also necessary for better empathy modeling.

Model

Our devised CoMAE-based model uses GPT-2 as the backbone (Radford et al., 2019). The overall architecture is shown in Figure 3. The model takes the dialog context x as input, where x is the concatenation of the history utterances: x = (u_1, u_2, ..., u_N), with N the number of history utterances. Adjacent utterances are separated by the special token [EOS]. Each history utterance u_i is a sequence of tokens u_i = (u_{i,1}, u_{i,2}, ..., u_{i,l_i}), where l_i is the length of u_i. Each utterance u_i is labeled with its speaker k_{u_i} ∈ {0, 1} (there are only two speakers). We denote the annotated DA and EM of each utterance u_i as A_{u_i} ∈ [0, 9) and E_{u_i} ∈ [0, 10) respectively.
Suppose the token id and the position id of u_{i,j} are w_{u_{i,j}} ∈ [0, |V|) (V is the vocabulary) and p_{u_{i,j}} ∈ [0, 1024) (the maximum input length is 1024) respectively. The representation of each token u_{i,j} is the summation of the following embeddings:

e_{u_{i,j}} = M_W[w_{u_{i,j}}] + M_P[p_{u_{i,j}}] + M_S[k_{u_i}] + M_A[A_{u_i}] + M_E[E_{u_i}],

where M_W ∈ R^{|V|×d}, M_P ∈ R^{1024×d}, M_S ∈ R^{2×d}, M_A ∈ R^{9×d}, M_E ∈ R^{10×d} denote the embedding matrices of word, position, speaker, DA and EM respectively, and [·] denotes the indexing operation. We denote the output hidden states after feeding x into the model as H_x ∈ R^{l_x×d}, where l_x is the total length of the context x.
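The input representation just described, i.e. the sum of word, position, speaker, DA and EM embeddings, can be sketched as follows (toy sizes and randomly initialized matrices, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size (the real model uses GPT-2's 768)

# Randomly initialized toy embedding matrices.
M_W = rng.normal(size=(100, d))    # word: toy vocabulary of 100 tokens
M_P = rng.normal(size=(1024, d))   # position: max input length 1024
M_S = rng.normal(size=(2, d))      # speaker: two speakers
M_A = rng.normal(size=(9, d))      # dialog act: 8 + others
M_E = rng.normal(size=(10, d))     # emotion: 9 + neutral

def token_repr(w, p, speaker, da, em):
    """Input representation of one context token: the sum of its
    word, position, speaker, DA and EM embeddings."""
    return M_W[w] + M_P[p] + M_S[speaker] + M_A[da] + M_E[em]

e = token_repr(w=42, p=0, speaker=1, da=3, em=7)
print(e.shape)  # -> (16,)
```

Note that the utterance-level labels (speaker, DA, EM) are broadcast to every token of that utterance.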
Next, we use the hidden state at the last position of the context, h_x = H_x[−1] ∈ R^d, to hierarchically predict the CM, DA and EM of the target response. (In the mathematical notation of this paper, we distinguish the ground truth and the predicted value of a variable X with the symbols X* and X respectively.) We first separately predict C^(i)_y ∈ {0, 1} for each i ∈ {ER, IP, EX}, which indicates whether to adopt the CM i:

C^(i)_y ∼ P(C^(i)_y | x) = softmax(M^(i)_C F^(i)_C(h_x)),

where each F^(i)_C is a non-linear layer activated with tanh, and each M^(i)_C ∈ R^{2×d} denotes the embedding matrix of the CM i ∈ {ER, IP, EX}. We denote the summed embedding of the predicted CMs as e_C = Σ_i M^(i)_C[C^(i)_y]. Based on the context x and the predicted CMs C_y, we next predict the DA:

A_y ∼ P(A_y | x, C_y) = softmax(M_A F_A([h_x; e_C])),

where [·; ·] denotes vector concatenation and F_A is a non-linear layer. Note that we share the parameters of the DA embeddings M_A with the DA classification head, which is consistent with GPT-2 (Radford et al., 2019), where the parameters of the word embeddings are shared with the LM head. The EM is predicted similarly, but conditioned additionally on the predicted DA A_y:

E_y ∼ P(E_y | x, C_y, A_y) = softmax(M_E F_E([h_x; e_C; M_A[A_y]])),

where F_E is also a non-linear layer. Finally, we add all the factors to obtain the fused embedding e_CoMAE that controls the empathy expression of the response:

e_CoMAE = e_C + M_A[A_y] + M_E[E_y].

The embedding of each input token y_t in the response is the summation of its word, position and speaker embeddings and e_CoMAE. Suppose the output hidden state corresponding to y_t is s_t; we then predict the next token y_{t+1} through the LM head:

y_{t+1} ∼ P(y_{t+1} | y_{≤t}; x, C_y, A_y, E_y) = softmax(M_W s_t),

where the parameters of the LM head are shared with the word embedding matrix M_W.
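The hierarchical prediction chain can be sketched end-to-end with toy numpy parameters. Everything below is randomly initialized for illustration; in the actual model h_x is a GPT-2 hidden state, the F layers are learned tanh layers, and the factor embedding matrices are shared with the classification heads as described above:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy hidden size

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

MECHS = ("ER", "IP", "EX")
M_C = {m: rng.normal(size=(2, d)) for m in MECHS}  # CM embeddings / heads
M_A = rng.normal(size=(9, d))                      # DA embeddings / head
M_E = rng.normal(size=(10, d))                     # EM embeddings / head
W_C = {m: rng.normal(size=(d, d)) for m in MECHS}  # F_C^(i) weights
W_A = rng.normal(size=(d, 2 * d))                  # F_A over [h_x; e_C]
W_E = rng.normal(size=(d, 3 * d))                  # F_E over [h_x; e_C; e_A]

h_x = rng.normal(size=d)  # hidden state at the last context position

# 1) Separately predict whether to adopt each communication mechanism.
C_y = {m: int(softmax(M_C[m] @ np.tanh(W_C[m] @ h_x)).argmax())
       for m in MECHS}
e_C = sum(M_C[m][C_y[m]] for m in MECHS)

# 2) Predict the DA conditioned on the context and the predicted CMs.
A_y = int(softmax(M_A @ np.tanh(W_A @ np.concatenate([h_x, e_C]))).argmax())

# 3) Predict the EM, conditioned additionally on the predicted DA.
E_y = int(softmax(
    M_E @ np.tanh(W_E @ np.concatenate([h_x, e_C, M_A[A_y]]))).argmax())

# 4) Fuse all factors into one control embedding that is added to every
#    response-token embedding during decoding.
e_comae = e_C + M_A[A_y] + M_E[E_y]
print(e_comae.shape)  # -> (16,)
```

Using argmax here stands in for sampling; at inference time the paper samples DA and EM from the (top-p filtered) predicted distributions.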

Training
The optimization objective contains two parts. One part is the negative log-likelihood loss L_NLL of the target response:

L_NLL = − Σ_{t=1}^{l_y} log P(y*_t | y*_{<t}; x, C*_y, A*_y, E*_y),

where l_y is the length of the golden response. The other part consists of the prediction losses of CM, DA and EM:

L_C = − Σ_{i∈{ER, IP, EX}} log P(C*^(i)_y | x),
L_A = − log P(A*_y | x, C*_y),
L_E = − log P(E*_y | x, C*_y, A*_y).

The complete optimization objective is the summation of the above losses: L = L_NLL + λ(L_C + L_A + L_E), where λ is the weight of the prediction losses. We set λ to 1.0 in our experiments.
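The combined objective L = L_NLL + λ(L_C + L_A + L_E) reduces to summing negative log-probabilities of the ground-truth targets. A small numeric sketch (the probabilities below are invented for illustration):

```python
import math

# Toy probabilities the model assigns to the ground-truth targets.
p_tokens = [0.5, 0.25, 0.8]  # P(y*_t | ...) for each response token
p_cm = [0.9, 0.7, 0.6]       # P(C*^(i) | x) for ER, IP, EX
p_da, p_em = 0.4, 0.5        # P(A*_y | ...), P(E*_y | ...)

nll = -sum(math.log(p) for p in p_tokens)   # L_NLL
loss_c = -sum(math.log(p) for p in p_cm)    # L_C (one term per mechanism)
loss_a = -math.log(p_da)                    # L_A
loss_e = -math.log(p_em)                    # L_E

lam = 1.0  # weight of the factor-prediction losses (1.0 in the paper)
loss = nll + lam * (loss_c + loss_a + loss_e)
print(round(loss, 4))  # -> 4.8849
```

In an actual implementation each term would be a cross-entropy over logits, computed with teacher forcing on the ground-truth factors.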

Discussion
It is worth noting that the supervision signals of the factor predictions, combined with hierarchical modeling, enable the model to establish connections between the embeddings of different factors. Hence, consider two models, one using hierarchical modeling and one not (predicting each factor separately). When the two models are fed with the same, validly designated triplet of empathy factors (C_y, A_y, E_y), we can expect the former to outperform the latter. This conjecture is verified in the automatic evaluation (Section 6.3).

Compared Models
We investigated the model performance with different combinations of empathy factors and hierarchical modeling: (1) Vanilla: the GPT-2 model directly fine-tuned on the corpus without adding any empathy factor; (2) +CM, +DA, +EM: the GPT-2 models equipped with one of the three factors; (3) CM || DA, CM || EM, DA || EM, CM || DA || EM: the models equipped with two or all of the three factors, but predicting each factor separately without hierarchical modeling; (4) CM → DA, CM → EM, DA → EM, CM → DA → EM: the models that are similar to (3) but utilize the hierarchical relationships, where → denotes dependency.
Note that the baseline DA || EM is consistent with EmpTransfo (Zandie and Mahoor, 2020), and CM → DA → EM is exactly our devised model described in Section 5.1.

Implementation Details
All the models were implemented with PyTorch (Paszke et al., 2019) and the Transformers library (Wolf et al., 2020). We used the pre-trained GPT-2 with 117M parameters (hidden size 768, 12 heads, 12 layers) for all the models. The responses were decoded by top-p sampling with p = 0.9 and temperature τ = 0.7 (Holtzman et al., 2019). We trained all the models with the Adam optimizer (Kingma and Ba, 2014) with β_1 = 0.9 and β_2 = 0.999. The learning rate was 10^-4 and was dynamically adjusted using linear warmup (Popel and Bojar, 2018) with 4000 warmup steps. All the models were fine-tuned for 5 epochs with batch size 16 on one NVIDIA RTX 2080Ti GPU. For each model, we selected the checkpoint with the lowest perplexity on the Valid set.
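Top-p (nucleus) sampling keeps the smallest set of highest-probability candidates whose cumulative mass reaches p, then renormalizes before sampling (Holtzman et al., 2019). A minimal sketch of the filtering step:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Nucleus (top-p) filtering: keep the smallest set of highest-
    probability entries whose cumulative mass reaches p, renormalize."""
    order = np.argsort(probs)[::-1]          # indices, highest prob first
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1  # smallest prefix with mass >= p
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
out = top_p_filter(probs, p=0.9)
# The 0.05 tail is dropped; the kept 0.95 mass is renormalized to 1.
print(out)
```

The temperature τ is applied beforehand by dividing the logits by τ; the filtered distribution is then sampled from at each decoding step.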

Automatic Evaluation
The automatic evaluation uses the golden responses as references to evaluate the responses generated by the models. However, when the responses are generated based on predicted CM / DA / EM, it is not appropriate to compare them with the reference responses (Liu et al., 2016). Thus, in the automatic evaluation we only considered the setting where the models are fed with the ground truth empathy factors. The results where the generated responses are based on predicted factors are analyzed in the later experiments. The automatic metrics we adopted include perplexity (PPL), BLEU-2 (B-2) (Papineni et al., 2002), ROUGE-L (R-L) (Lin, 2004), and the BOW embedding-based greedy matching score (Liu et al., 2016). The metrics except PPL were calculated with an NLG evaluation toolkit (Sharma et al., 2017; https://github.com/Maluuba/nlg-eval), where the generated responses were tokenized with NLTK (Loper and Bird, 2002; https://www.nltk.org/).
Results are shown in Table 2. We analyze them from three perspectives.

General Performance: Our model achieves the best performance on all metrics on both domains, and most of its advantages over the competitors are statistically significant.

Impact of Empathy Factors: Model performance varies with different combinations of empathy factors. First, considering more empathy factors always leads to better performance (e.g., CM → DA → EM > CM → EM > +EM > Vanilla). Second, EM brings the most gains among the three factors, perhaps because emotion is the most explicit factor influencing empathy expression (Sharma et al., 2020). In contrast, CM brings fewer gains than DA and EM. The reason may be that CM provides high-level but coarse-grained guidance for empathetic response generation, lacking a fine-grained control like DA or EM. While the responses in the corpus of (Zhong et al., 2020) are not very long (≤ 30 words), we believe that CM plays an important role in generating longer empathetic responses, which may require the planning of multiple mechanisms and a more diverse usage of DAs and EMs.

Impact of Hierarchical Modeling: For almost all the models that adopt multiple empathy factors, hierarchical modeling leads to better performance (e.g., CM → DA → EM > CM || DA || EM, DA → EM > DA || EM). This phenomenon is not trivial, because the models with and without hierarchical modeling are fed with the same empathy factors as the reference responses. It confirms our conjecture in Section 5.2 that hierarchical modeling can establish connections between the embeddings of different factors, leading to a better capacity for empathy modeling. However, (CM, EM) is an exception, perhaps because this pair has a weaker correlation (the lowest mutual information, Section 4.5) than the other pairs.

Manual Evaluation
In the manual evaluation, the models generate responses based on the empathy factors sampled from the predicted probability distributions. When sampling DA or EM, we used top-p filtering with p = 0.9 (Holtzman et al., 2019) to ensure the validity of the sampled results.
The manual evaluation is based on pair-wise comparison. The metrics include: Fluency (which response has better fluency and readability), Coherence (which response has better coherence and higher relevance to the context), and Empathy (which response shows better understanding of the partner's experiences and feelings, and which response expresses empathy in the way that the annotators prefer).

Table 4: Results of Hits@1/3 of predicting Y given that X is predicted rightly. "Prop." denotes the proportion of cases where both models X || Y and X → Y predict X rightly. Scores that are significantly improved by hierarchical modeling are marked with * (sign test, p-value < 0.001).

The pair-wise comparison is conducted between three pairs of models; results are shown in Table 3. For all three pairs, we find that the responses generated by these GPT-2-based models have similar fluency. The results of (1) indicate that further considering CM can significantly improve the empathy of the generated responses, while the coherence may slightly decrease. This may be because communication mechanisms like interpretation sometimes lead to responses that are less relevant to the contexts (especially those sharing experiences). The results of (2) and (3) indicate that hierarchical modeling improves the coherence of the generated responses: the more empathy factors are modeled, the larger the improvement.

Further Analysis of Hierarchical Modeling
To give further insight into the superiority of hierarchical modeling, we analyzed (1) the prediction and (2) the realization of empathy factors.

Prediction: For each pair (X, Y) in {(CM, DA), (CM, EM), (DA, EM)}, we paired the models X || Y and X → Y for comparison. Our purpose is to observe whether the prediction of X improves that of Y after using hierarchical modeling. Note that when taking the ground truth as reference, it is not appropriate to directly judge the prediction accuracy by comparing Y and Y* if X ≠ X*. We thus computed the conditional probability that Y is predicted rightly given that X is predicted rightly:

P(Y = Y* | X = X*).

Results are shown in Table 4. While the accuracy of predicting X is close for X || Y and X → Y, the prediction of Y is significantly enhanced by hierarchical modeling. The results demonstrate that hierarchical modeling enables the model to select more proper empathy factors.

Realization: Recall that in the manual evaluation, the models generate a response based on the sampled empathy factors C_y, A_y, E_y. To verify whether these factors are well realized, we used the classifiers in Section 4.3 to identify the empathy factors displayed in the generated responses. Denoting the identification result of factor Z ∈ {C, A, E} as Z'_y, we computed the ratio of Z'_y = Z_y as the realization score of Z.
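Both analysis quantities, the conditional accuracy P(Y predicted rightly | X predicted rightly) in its Hits@1-style form and the realization score, are simple ratios over the evaluated cases. A minimal sketch with made-up label ids:

```python
def conditional_accuracy(pairs):
    """P(Y predicted rightly | X predicted rightly).

    `pairs` is a list of ((x_pred, x_gold), (y_pred, y_gold)) tuples,
    one per evaluated case.
    """
    # Restrict to the cases where X was predicted rightly.
    x_right = [(yp, yg) for (xp, xg), (yp, yg) in pairs if xp == xg]
    if not x_right:
        return 0.0
    return sum(yp == yg for yp, yg in x_right) / len(x_right)

def realization_score(sampled, identified):
    """Ratio of generated responses whose classifier-identified factor
    matches the factor the generation was conditioned on."""
    return sum(s == i for s, i in zip(sampled, identified)) / len(sampled)

pairs = [((1, 1), (2, 2)), ((1, 1), (0, 2)), ((0, 1), (2, 2))]
print(conditional_accuracy(pairs))              # -> 0.5
print(realization_score([1, 2, 3], [1, 2, 0]))  # -> 0.6666666666666666
```

The Hits@3 variant in Table 4 would instead check whether y_gold appears among the top 3 predicted labels.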
Results are shown in Table 5. The realization of all the factors is significantly improved by hierarchical modeling. This is intuitive, because hierarchical modeling can avoid cases where the sampled factors are inappropriate or even conflicting, thus reducing the noise of the empathy factors in response generation.

Case Study
We show the generated responses with different empathy factors in Figure 4. Adopting the CM emotional reaction leads our model to express the same EM admiration (i'm proud of you!) as DA → EM (good for you, man!), and the two models generate the same sentence (keep it up!) when taking the DA encouraging. However, additionally adopting the CM interpretation leads our model to also share its own experiences and feelings (i have been sober for about 10 years, and it's the best feeling ever). As a result, with multiple empathy factors combined, the response generated by our model is more engaging and empathetic while maintaining coherence.
Besides, we noticed another phenomenon occurring when all three CMs are adopted: the three CMs are usually realized separately in different sentences (e.g., I am so happy for you! I also had tried to be sober but failed. How did you make it?), which is consistent with the results of empathetic rewriting (Sharma et al., 2021). Recall that during generation we add the same CoMAE embeddings to all the tokens in the response (Section 5.1). Such a uniform operation seems suboptimal for the non-uniform realization of different CMs, especially when generating a longer empathetic response that contains multiple sentences with different CMs, DAs or EMs. We believe there is still much room for improvement when applying our CoMAE framework to longer response generation, such as combining CoMAE's multi-factor hierarchical modeling with planning-based generation.

Figure 4 (the dialog of the example): Post: you might remember me posting here when i had less than a month sober a little while back. well, yesterday i hit 100 days without alcohol and celebrated by solo hiking my state's tallest mountain! Golden: ok that is an awesome pic! love it and the story. thank you!

Conclusion
In this paper, we present a multi-factor hierarchical framework CoMAE for empathetic response generation. It contains three key factors of empathy expression: communication mechanism, dialog act and emotion, and models these factors in a hierarchical way. With our devised CoMAE-based model, we empirically demonstrate the effectiveness of these empathy factors, as well as the necessity and importance of hierarchical modeling.
As future work, the CoMAE framework can be naturally extended to more factors that relate to empathy expression, such as persona (Zhong et al., 2020), by exploring the hierarchical relationships between different factors.

A Emotion Mapping
In the original paper of Demszky et al. (2020) (https://arxiv.org/abs/2005.00547v2), the authors provide hierarchical clustering results for the 27 emotions (Figure 2 in their paper), which reflect the nested structure of their proposed emotion taxonomy. Based on the clustering results, we merged the emotions that are highly correlated with each other; the mapping between our adopted emotions and the original ones is shown in Table 6.

Table 6: Mapping between our adopted emotions and the original emotions in (Demszky et al., 2020).

B Statistics of Annotation
We computed the proportions of the last responses annotated with ER / IP / EX. In the Happy domain, the proportions are 76.0% / 10.2% / 18.7%, while in the Offmychest domain they are 57.1% / 21.4% / 27.9%. The statistics of DA and EM are shown in Figure 5. We can find several differences between the two domains. In terms of communication mechanism, the responses in the Offmychest domain prefer interpretation and exploration, while emotional reaction occupies a larger proportion in the Happy domain. In terms of DA, the acts that provide support (such as agreeing, consoling, suggesting, and sympathizing) are more frequently adopted in the Offmychest domain. It is similar for emotion: emotions such as approval and caring are displayed more commonly when responding to posts with negative sentiments. We also observed that the responses in the Offmychest domain may display emotions like anger and sadness, indicating that the responders do understand the experiences and feelings of their conversation partners.