CASE: Aligning Coarse-to-Fine Cognition and Affection for Empathetic Response Generation

Empathetic conversation is psychologically understood to result from conscious alignment and interaction between the cognition and affection of empathy. However, existing empathetic dialogue models usually consider only the affective aspect or treat cognition and affection in isolation, which limits empathetic response generation. In this work, we propose the CASE model for empathetic dialogue generation. It first builds upon a commonsense cognition graph and an emotional concept graph, and then aligns the user's cognition and affection at both the coarse-grained and fine-grained levels. Through automatic and manual evaluation, we demonstrate that CASE outperforms state-of-the-art baselines for empathetic dialogue and generates more empathetic and informative responses.


Introduction
Human conversations naturally involve empathetic interactions, which allow both parties to recognize and understand each other's experiences and feelings from the other's perspective (Keskin 2014). As a trait of human conversation, empathy is a crucial factor in establishing seamless relationships (Zech and Rimé 2005). Previous research attempting to cultivate empathy in dialogue systems (Wang et al. 2021) confirmed that empathy is also integral to building a warm conversational AI (Huang, Zhu, and Gao 2020).
In social psychology, empathy is commonly considered to consist of two aspects, i.e., cognition and affection (Davis 1983). The cognitive aspect aims to understand the user's current situation (Cuff et al. 2016). The affective aspect involves the emotional simulation in reaction to the observed user's experiences (Elliott et al. 2018). Although both aspects are considered in generating empathetic responses, existing works still have issues. On the one hand, most works (Lin et al. 2019; Majumder et al. 2020; Li et al. 2020, 2022) rely solely on the affective aspect, detecting and utilizing contextual emotion to enhance empathy in response generation. On the other hand, previous research (Sabour, Zheng, and Huang 2022) models cognition and affection as two relatively independent aspects to improve the understanding and expression of empathy. However, empathetic responses in human conversation often result from conscious alignment and interaction between the cognition and affection of empathy (Westbrook, Kennerley, and Kirk 2011). For one thing, the emotional state manifested by different contexts affects how cognitive situations are expressed in responses; the alignment of cognition and affection enables the response to express empathetic cognition under the appropriate emotional state. As in case-1 of Figure 1, aligning the cognitive situation, i.e., the intent "to go to the beach", with the emotional states "excited" vs. "disappointed" produces empathetic expressions that suit the different contexts, i.e., "love" and "which beach are you going to go to" vs. "hate" and "waiting for the beach". For another, different cognitive situations give rise to different emotional reactions; the alignment of cognition and affection encourages building associations between cognition and emotional reaction to generate highly empathetic responses. As in case-2 of Figure 1, building distinct patterns of association between cognition and emotional reaction, i.e., "to give up" and "frustrated" vs. "to try harder" and "hopeful", yields cognitively distinct but equally empathetic responses, i.e., response-2a vs. response-2b. These cases highlight the necessity of aligning cognition and affection when modeling empathy for response generation.
To this end, we propose to align Cognition and Affection for reSponding Empathetically (CASE), integrating commonsense knowledge from COMET (Bosselut et al. 2019) and concept knowledge from ConceptNet (Speer, Chin, and Havasi 2017). Commonsense knowledge infers the user's situation, as cognition, and the emotional reaction to that situation, both of which are implied in the dialogue. Concept knowledge serves to extract the emotional state manifested in the dialogue context. To encode the two types of knowledge, we first construct two heterogeneous graphs, i.e., a commonsense cognition graph and an emotional concept graph, where the initially independent representations of cognitions and emotional concepts are adjusted by the dialogue context using graph transformers. Then, we design a two-level strategy to align cognition and affection using mutual information maximization (MIM) (Hjelm et al. 2019). The coarse-grained level considers the overall cognition and affection manifested in the dialogue context, aligning the contextual cognition and the contextual emotional state, which are extracted with a knowledge discernment mechanism. The fine-grained level builds an association between the cognition and affection implied in the dialogue, aligning each specific cognition with the corresponding emotional reaction. Further, an empathy-aware decoder is devised for generating empathetic expressions.
Our contributions are summarized as follows: (1) We devise a unified framework to model the interaction between cognition and affection for integrated empathetic response generation. (2) We construct two heterogeneous graphs involving commonsense and concept knowledge to aid the modeling of cognition and affection. (3) We propose a two-level strategy to align coarse-grained and fine-grained cognition and affection adopting mutual information maximization. (4) Extensive experiments demonstrate the superiority of CASE in terms of automatic and human evaluation.

Related Work

Emotional & Empathetic Conversation
Emotional conversation generates responses conditioned on a manually specified emotion label (Zhou et al. 2018; Wei et al. 2019; Peng et al. 2022). Instead of presetting an emotion label, empathetic conversation involves cognitive and affective empathy (Davis 1983; Zheng et al. 2021) and aims to fully understand the interlocutor's situation and feelings and respond empathetically (Keskin 2014). On the one hand, most existing works only focus on the affective aspect of empathy and make efforts to detect contextual emotion (Lin et al. 2019; Majumder et al. 2020; Li et al. 2020, 2022) while ignoring the cognitive aspect. On the other hand, some research leverages commonsense as cognition to refine empathetic considerations (Sabour, Zheng, and Huang 2022). However, modeling the two aspects (i.e., cognition and affection) relatively independently violates their interrelated nature.

Commonsense & Concept Knowledge
As a commonsense knowledge base, ATOMIC (Sap et al. 2019) focuses on inferential knowledge organized as typed if-then relations. Six commonsense reasoning relations are defined for the person involved in an event, four of which are used to reason about commonsense cognitions of a given event: PersonX's intent before the event (xIntent), what PersonX needs to do before the event (xNeed), what PersonX wants after the event (xWant), and the effect of the event on PersonX (xEffect). In our approach, each commonsense cognition is aligned with the user's emotional reaction to the situation implied in the dialogue, inferred via xReact (i.e., PersonX's reaction to the event). To obtain inferential commonsense knowledge, we use COMET (Bosselut et al. 2019), a pretrained generative model, to generate rich commonsense statements.
As the concept knowledge base, ConceptNet (Speer, Chin, and Havasi 2017) provides word-level human knowledge and is widely used in various NLP tasks (Zhong et al. 2021). Following Li et al. (2022), we use NRC VAD (Mohammad 2018) to assign emotion intensities to concepts in ConceptNet (processing details are in Li et al. (2022)), which serve to extract the contextual emotional state manifested in the dialogue context and align it with the contextual cognition.

Mutual Information Maximization
Mutual information maximization (MIM) aims to measure the dependence between two random variables $X$ and $Y$. The mutual information (MI) between them is defined as $MI(X, Y) = D_{KL}\big(P(X, Y) \,\|\, P(X)P(Y)\big)$. However, maximizing MI directly is normally intractable. A successful practice to estimate MI with a lower bound is InfoNCE (Kong et al. 2020). Given two different views $x$ and $y$ of an input, InfoNCE is defined by:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\,\mathbb{E}\Big[ f_\theta(x, y) - \log \sum_{\tilde{y} \in \tilde{Y}} e^{f_\theta(x, \tilde{y})} \Big], \quad (1)$$

where $f_\theta$ is a learnable function with parameters $\theta$. The set $\tilde{Y}$ draws samples from a proposal distribution $Q(\tilde{Y})$ and comprises $|\tilde{Y}| - 1$ negative samples and one positive sample $y$. One insight is that when $\tilde{Y}$ always consists of all values of $Y$ and they are uniformly distributed, maximizing InfoNCE is analogous to maximizing the cross-entropy

$$\mathbb{E}\Big[ \log \frac{e^{f_\theta(x, y)}}{\sum_{y' \in Y} e^{f_\theta(x, y')}} \Big].$$

This shows that InfoNCE amounts to maximizing $P_\theta(y \mid x)$ while approximating the summation over elements in $Y$ (i.e., the partition function) by negative sampling (Zhou et al. 2020). Based on this formulation, we replace $X$ and $Y$ with specific cognition and affection to maximize the MI between them.
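To make the estimator concrete, the following is a minimal PyTorch sketch of an InfoNCE-style objective with a learnable critic; the bilinear form of $f_\theta$ and the use of in-batch negatives are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCE(nn.Module):
    """Minimal InfoNCE estimator with a learnable bilinear critic f_theta.

    Assumes in-batch negatives: for each x_i, the paired y_i is the positive
    and all other y_j in the batch act as negatives (a common approximation
    of the proposal distribution Q)."""

    def __init__(self, dim: int):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)  # f_theta(x, y) = x^T W y + b

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # scores[i, j] = f_theta(x_i, y_j); diagonal entries are positives.
        b = x.size(0)
        scores = self.bilinear(
            x.unsqueeze(1).expand(b, b, -1).reshape(b * b, -1),
            y.unsqueeze(0).expand(b, b, -1).reshape(b * b, -1),
        ).view(b, b)
        # Maximizing InfoNCE == minimizing cross-entropy over the partition.
        labels = torch.arange(b, device=x.device)
        return F.cross_entropy(scores, labels)

# Usage: loss = InfoNCE(300)(cognition_vecs, affection_vecs); loss.backward()
```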

Approach

Architecture Overview
The overall framework of CASE is shown in Figure 2. The dialogue context $X = [x_1, \ldots, x_N]$ contains $N$ utterances, where $x_i$ denotes the $i$-th utterance. CASE contains three stages: (1) the graph encoding stage constructs and encodes the heterogeneous commonsense cognition graph $G^{CS}$ and emotional concept graph $G^{EC}$ from the dialogue context $X$; (2) the coarse-to-fine alignment aligns cognition and affection at the coarse-grained level (between the contextual cognition and the contextual emotional state) and the fine-grained level (between each specific cognition and the corresponding emotional reaction) adopting MIM; (3) the empathy-aware decoder integrates the aligned cognition and affection to generate the response $Y = [y_1, y_2, \ldots, y_M]$ with empathetic and informative expressions.

Graph Encoding
Commonsense Cognition Graph Construction Given the last utterance $x_N$ of the dialogue context $X$, we segment it into the sub-utterances $U = [u_0, u_1, u_2, \ldots, u_t]$, where we prepend the whole $x_N$ as $u_0$ to maintain the global information of $x_N$. Similar to Sabour, Zheng, and Huang (2022), we use COMET to infer $l$ commonsense cognition knowledge statements $K_i^r = [k_{i,1}^r, k_{i,2}^r, \ldots, k_{i,l}^r]$ for each $u_i \in U$, where $r$ is one of the four commonsense relations $R = \{xIntent, xNeed, xWant, xEffect\}$. The intuition is that human responses tend to inherit content from the preceding context while shifting the topic, and different sub-utterances differ in topic and connotation, which also affects what listeners attend to when making empathetic responses.
To construct the heterogeneous commonsense cognition graph $G^{CS}$, we use the utterance set $U$ and the commonsense cognition knowledge set $K^{CS} = \bigcup_{i=0}^{t}\bigcup_{r \in R} K_i^r$ as vertices, i.e., the vertex set is $V^{CS} = U \cup K^{CS}$. There are seven relations of undirected edges that connect the vertices: (1) the global relation between the whole $x_N$ (i.e., $u_0$) and its sub-utterances $u_i$ ($i \ge 1$); (2) the self-loop relation of each vertex; (3) the temporal relation between any two successive sub-utterances $u_j$ and $u_{j+1}$; and (4)-(7) the four commonsense relations, i.e., xIntent, xNeed, xWant, and xEffect, between each utterance $u_i \in U$ and the corresponding $K_i^r$. To encode the vertices $V^{CS}$ of the graph $G^{CS}$, we use a Transformer-based sentence encoder (cognition encoder). For each $v_i^{CS} \in V^{CS}$, we prepend a special token [CLS] and, following Devlin et al. (2019), collect the representation of [CLS] as the initial embedding matrix for $G^{CS}$.
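As a rough illustration of the construction, the sketch below assembles the vertex and typed-edge lists of $G^{CS}$. The `comet.generate(text, relation, k)` wrapper is hypothetical, standing in for whatever interface wraps the COMET checkpoint.

```python
from itertools import product

RELATIONS = ["xIntent", "xNeed", "xWant", "xEffect"]

def build_cognition_graph(sub_utterances, comet, l=5):
    """Assemble the vertices and typed edges of G_CS.

    `sub_utterances` is U = [u_0, ..., u_t] with u_0 the whole last
    utterance; `comet.generate(text, relation, k)` is a hypothetical
    wrapper returning k inferred statements from a COMET checkpoint.
    """
    vertices = list(sub_utterances)
    edges = []  # (src_index, dst_index, relation_name)

    # (1) global relation: u_0 <-> every sub-utterance u_i (i >= 1)
    edges += [(0, i, "global") for i in range(1, len(sub_utterances))]
    # (3) temporal relation: successive sub-utterances
    edges += [(j, j + 1, "temporal") for j in range(len(sub_utterances) - 1)]
    # (4)-(7) commonsense relations: utterance -> its COMET inferences
    for i, r in product(range(len(sub_utterances)), RELATIONS):
        for statement in comet.generate(sub_utterances[i], relation=r, k=l):
            vertices.append(statement)
            edges.append((i, len(vertices) - 1, r))
    # (2) self-loop relation on every vertex, added last so the
    # knowledge vertices are covered too
    edges += [(v, v, "self") for v in range(len(vertices))]
    return vertices, edges
```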

Emotional Concept Graph Construction
We concatenate the utterances in the dialogue context $X$ to obtain the token sequence $T = x_1 \oplus \ldots \oplus x_N = [w_1, \ldots, w_n]$, where $n$ is the total number of tokens in the utterances of $X$. Following Li et al. (2022), we use ConceptNet to infer the related concepts for each token $w_i \in T$, among which only the top $N$ emotional concepts (according to the emotion intensity $\omega(c)$) are used for constructing $G^{EC}$. The vertices $V^{EC}$ of the heterogeneous emotional concept graph $G^{EC}$ thus contain a [CLS] token, the dialogue context tokens $T$, and the emotional concepts obtained above. There are four relations of undirected edges that connect the vertices: (1) the global relation between the [CLS] token and all other vertices; (2) the self-loop relation of each vertex; (3) the temporal relation between any two successive tokens; and (4) the emotional concept relation between a token and its related emotional concepts.
We initialize the vertex embedding matrix for $G^{EC}$ by summing up the token embedding, the positional embedding, and the type embedding for each vertex (signaling whether or not it is an emotional concept).
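A minimal sketch of this vertex initialization might look as follows; the vocabulary handling and the 300-dimensional size (matching the GloVe vectors used later) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConceptVertexEmbedding(nn.Module):
    """Initial vertex embeddings for G_EC: word + positional + type
    embeddings, summed as described above."""

    def __init__(self, vocab_size, dim=300, max_len=512):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        self.vtype = nn.Embedding(2, dim)  # 0 = dialogue token, 1 = concept

    def forward(self, token_ids, type_ids):
        # token_ids, type_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word(token_ids) + self.pos(positions) + self.vtype(type_ids)
```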
Graph Encoder Given the commonsense cognition graph $G^{CS}$, to capture the semantic relationships between vertices, we adopt a Relation-Enhanced Graph Transformer for graph encoding. It employs a relation-enhanced multi-head attention mechanism (MHA) to encode the vertex embedding $\hat{v}_{v_i}$ for each vertex $v_i$ (we omit the superscript CS for simplicity):

$$\hat{v}_{v_i} = \mathrm{MHA}\big(v_{v_i}, \{v_{v_k}\}_{v_k \in \mathcal{N}(v_i)}\big),$$

where the semantic relations between vertices are injected into the query and key vectors when computing the attention scores:

$$e_{ik} = \frac{\big(W_Q\, v_{v_i} + l_{v_i \to v_k}\big)^\top \big(W_K\, v_{v_k} + l_{v_k \to v_i}\big)}{\sqrt{d}},$$

where $l_{v_i \to v_k}$ and $l_{v_k \to v_i}$ are learnable relation embeddings between vertices $v_i$ and $v_k$. The self-attention is subsequently followed by a residual connection and a feed-forward layer, as in the standard Transformer encoder (Vaswani et al. 2017). Finally, we obtain the commonsense cognition embedding $cs_i$ for each $v_i^{CS} \in V^{CS}$. To encode the emotional concept graph $G^{EC}$, we adopt a vanilla Graph Transformer (i.e., the above Relation-Enhanced Graph Transformer without the relation enhancement). By superimposing the emotion intensity of each token, we obtain the emotional concept embedding $ec_i$ for each $v_i^{EC} \in V^{EC}$.
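For intuition, here is a single-head PyTorch sketch of the relation-enhanced attention described above; the adjacency masking and the way relation types are indexed are our assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationEnhancedAttention(nn.Module):
    """Single-head sketch of relation-enhanced self-attention: learnable
    relation embeddings l_{i->k} / l_{k->i} are added to the query and key
    before the dot product, and a mask restricts attention to actual edges
    (self-loops guarantee every row has at least one edge)."""

    def __init__(self, dim, num_relations):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.rel = nn.Embedding(num_relations, dim)

    def forward(self, x, rel_ids, adj_mask):
        # x: (n, dim); rel_ids: (n, n) relation type per ordered pair;
        # adj_mask: (n, n) bool, True where an edge exists
        q = self.q(x).unsqueeze(1) + self.rel(rel_ids)    # q_i + l_{i->k}
        k = self.k(x).unsqueeze(0) + self.rel(rel_ids.T)  # k_k + l_{k->i}
        scores = (q * k).sum(-1) / math.sqrt(x.size(-1))  # (n, n)
        scores = scores.masked_fill(~adj_mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ self.v(x)
```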

Coarse-to-Fine Alignment
Context Encoding Following previous works (Majumder et al. 2020; Sabour, Zheng, and Huang 2022), we first concatenate all the utterances in the dialogue context $X$ and prepend a [CLS] token: $[CLS] \oplus x_1 \oplus \ldots \oplus x_N$. This sequence is fed into a standard Transformer encoder (context encoder) to obtain the representation $S^X$ of the dialogue context. We denote the representation of [CLS] as $s^X$.
Coarse-grained Alignment To reproduce the interaction of cognition and affection manifested in the dialogue context, we align contextual cognition and contextual emotional state at an overall level. They are separately acquired by cognitive and emotional knowledge discernment mechanisms, which select golden-like knowledge guided by the response.
To obtain the contextual cognitive representation $r^{cog}$, the knowledge discernment calculates the prior cognitive distribution $P^{CS}(cs_i \mid X)$ over the commonsense cognition knowledge (that is, only $K^{CS}$ rather than all the vertices $V^{CS}$ in $G^{CS}$; we thus use $1 \le i \le |K^{CS}|$ for simplicity):

$$P^{CS}(cs_i \mid X) = \frac{\exp\big(\phi^{CS}(s^X \oplus cs_i)\big)}{\sum_{j=1}^{|K^{CS}|}\exp\big(\phi^{CS}(s^X \oplus cs_j)\big)}, \qquad r^{cog} = \sum_{i=1}^{|K^{CS}|} P^{CS}(cs_i \mid X)\, cs_i,$$

where $\phi^{CS}(\cdot)$ is an MLP layer activated by tanh. Similarly, we calculate the prior emotional distribution $P^{EC}(ec_i \mid X)$ ($1 \le i \le |V^{EC}|$) and obtain the contextual emotional representation $r^{emo}$. During training, we use the ground-truth response $Y$ to guide the learning of the knowledge discernment mechanisms. We feed $Y$ into the cognition encoder (used above for initializing the embeddings of $G^{CS}$) and the context encoder to get the hidden states $S_Y^{cog}$ and $S_Y^{ctx}$, whose [CLS] representations are $s_Y^{cog}$ and $s_Y^{ctx}$, respectively. The posterior cognitive distribution $P^{CS}(cs_i \mid Y)$ and the posterior emotional distribution $P^{EC}(ec_i \mid Y)$ are calculated analogously, with $s_Y^{cog}$ and $s_Y^{ctx}$ in place of $s^X$. We then optimize the KL divergence between the prior and posterior distributions during training:

$$\mathcal{L}_{KL} = D_{KL}\big(P^{CS}(\cdot \mid Y) \,\|\, P^{CS}(\cdot \mid X)\big) + D_{KL}\big(P^{EC}(\cdot \mid Y) \,\|\, P^{EC}(\cdot \mid X)\big).$$

To further ensure the accuracy of the discerned knowledge, similar to Bai et al. (2021), we employ a BOW loss to force the relevancy between the cognitive / emotional knowledge and the target response:

$$\mathcal{L}_{BOW} = -\frac{1}{|B|}\sum_{w \in B}\Big(\log \eta(r^{cog})_w + \log \eta(r^{emo})_w\Big),$$

where $\eta(\cdot)$ is an MLP layer followed by softmax whose output dimension is the vocabulary size, and $B$ denotes the word bag of $Y$. Finally, we align the coarse-grained representations of the contextual cognition $r^{cog}$ and the contextual emotional state $r^{emo}$ using mutual information maximization (MIM). Specifically, we adopt the binary cross-entropy (BCE) loss $\mathcal{L}_{coarse}$ as the mutual information estimator that maximizes the mutual information between $r^{cog}$ and $r^{emo}$:

$$\mathcal{L}_{coarse} = -\Big(\log f_{coarse}(r^{cog}, r^{emo}) + \log\big(1 - f_{coarse}(\tilde{r}^{cog}, r^{emo})\big) + \log\big(1 - f_{coarse}(r^{cog}, \tilde{r}^{emo})\big)\Big),$$

where $\tilde{r}^{emo}$ and $\tilde{r}^{cog}$ are encoded negative samples, and $f_{coarse}(\cdot, \cdot)$ is a scoring function implemented with a bilinear layer activated by the sigmoid function, i.e., $f_{coarse}(a, b) = \sigma\big(a^\top W_{coarse}\, b\big)$.
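A compact sketch of the discernment mechanism might look as follows; the exact shape of the tanh-activated MLP $\phi(\cdot)$ and the pooling into $r^{cog}$ are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeDiscernment(nn.Module):
    """Score each knowledge embedding against a query vector (s_X for the
    prior; the response [CLS] state for the posterior) with a tanh MLP,
    then pool the knowledge under the resulting distribution."""

    def __init__(self, dim):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(),
                                 nn.Linear(dim, 1))

    def forward(self, query, knowledge):
        # query: (dim,); knowledge: (num_knowledge, dim)
        q = query.expand(knowledge.size(0), -1)
        logits = self.phi(torch.cat([q, knowledge], dim=-1)).squeeze(-1)
        dist = F.softmax(logits, dim=-1)   # P(cs_i | query)
        pooled = dist @ knowledge          # r_cog (or r_emo)
        return dist, pooled

# Training-time sketch of the KL term that pulls the prior toward the
# response-aware posterior:
#   prior, r_cog = discern(s_X, cs);  post, _ = discern(s_Y_cog, cs)
#   l_kl = F.kl_div(prior.log(), post, reduction="sum")  # KL(post || prior)
```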
Fine-grained Alignment To simulate the interaction of fine-grained cognition and affection implied in the dialogue while humans express empathy, the fine-grained alignment builds an association between each inferred specific cognition and the corresponding emotional reaction.
For each $u_i \in U$, we infer the commonsense knowledge about the emotional reaction $K_i^{xReact} = [k_{i,1}^{xReact}, \ldots, k_{i,l}^{xReact}]$ using COMET, which is regarded as the user's possible emotional reaction to the current cognitive situation. Since each $k_{i,j}^{xReact} \in K_i^{xReact}$ is usually an emotion word (e.g., happy, sad), we concatenate $K_i^{xReact}$ and feed it into a Transformer-based encoder (reaction encoder) to get the representation $H_i^{er}$ of the emotional reaction. Similar to Majumder et al. (2020) and Sabour, Zheng, and Huang (2022), we use average pooling to represent the reaction sequence, i.e., $h_i^{er} = \mathrm{Average}(H_i^{er})$. To avoid over-alignment of out-of-context emotional reactions with cognition, we inject contextual information into the representation of the reaction. We first connect $h_i^{er}$ with the context representation $S^X$ at the token level, i.e., $S_i^{er}[j] = S^X[j] \oplus h_i^{er}$. Then another Transformer-based encoder takes $S_i^{er}$ as input and outputs the fused representation $\hat{S}_i^{er}$. We take the hidden representation of [CLS] in $\hat{S}_i^{er}$ as the emotional reaction representation $er_i$ of $u_i$.
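The context-aware reaction encoding could be sketched as below; the encoder depth, head count, and the projection back to the model dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReactionFusion(nn.Module):
    """Context-aware reaction encoding: the pooled reaction vector h_er is
    concatenated to every context token state and re-encoded; the [CLS]
    position (index 0) of the output yields er_i."""

    def __init__(self, dim, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(2 * dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, s_x, h_er):
        # s_x: (seq_len, dim) context states with [CLS] at position 0
        # h_er: (dim,) average-pooled reaction representation
        fused = torch.cat([s_x, h_er.expand(s_x.size(0), -1)], dim=-1)
        out = self.encoder(fused.unsqueeze(0)).squeeze(0)
        return self.proj(out[0])  # er_i from the [CLS] position
```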
Finally, we build the association between each inferred specific cognition $cs_{i,j}^r$ from $u_i$ (for $r \in R = \{xIntent, xNeed, xWant, xEffect\}$ and $1 \le j \le l$) and the emotional reaction $er_i$ using MIM. Recall that $\bigcup_{i=0}^{t}\bigcup_{r \in R}\{cs_{i,j}^r\}_{j=1}^{l}$ exactly corresponds to the commonsense cognition knowledge set $K^{CS}$. The fine-grained BCE loss $\mathcal{L}_{fine}$ is defined as:

$$\mathcal{L}_{fine} = -\sum_{i=0}^{t}\sum_{r \in R}\sum_{j=1}^{l}\Big(\log f_{fine}(cs_{i,j}^r, er_i) + \log\big(1 - f_{fine}(\widetilde{cs}_{i,j}^r, er_i)\big) + \log\big(1 - f_{fine}(cs_{i,j}^r, \widetilde{er}_i)\big)\Big),$$

where $\widetilde{er}_i$ and $\widetilde{cs}_{i,j}^r$ are encoded negative samples, and $f_{fine}(\cdot, \cdot)$ is implemented as $f_{fine}(a, b) = \sigma\big(a^\top W_{fine}\, b\big)$. Altogether, the coarse-to-fine alignment module is jointly optimized by the loss:

$$\mathcal{L}_{align} = \mathcal{L}_{KL} + \mathcal{L}_{BOW} + \mathcal{L}_{coarse} + \alpha\, \mathcal{L}_{fine},$$

where $\alpha$ is a hyper-parameter.
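Both alignment levels rely on the same kind of sigmoid-bilinear discriminator trained with BCE; a sketch follows, where building negatives by shuffling within the batch is our illustrative choice.

```python
import torch
import torch.nn as nn

class BilinearMIEstimator(nn.Module):
    """BCE-based MI estimator shared in spirit by both alignment levels:
    a bilinear scorer classifies aligned (cognition, affection) pairs as
    positive and pairs with shuffled partners as negative. The sigmoid
    from the text is folded into BCEWithLogitsLoss."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Bilinear(dim, dim, 1)
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, cog, aff):
        # cog, aff: (batch, dim) aligned pairs
        perm = torch.randperm(cog.size(0), device=cog.device)
        pos = self.score(cog, aff).squeeze(-1)
        neg_c = self.score(cog[perm], aff).squeeze(-1)  # negative cognition
        neg_a = self.score(cog, aff[perm]).squeeze(-1)  # negative affection
        logits = torch.cat([pos, neg_c, neg_a])
        labels = torch.cat([torch.ones_like(pos),
                            torch.zeros_like(neg_c),
                            torch.zeros_like(neg_a)])
        return self.bce(logits, labels)
```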

Emotion Prediction
We fuse the contextual emotional state and the emotional reaction to distill the affective representation, where we use $er_0$ as the distillation signal of the emotional reaction, because $er_0$ is derived from the speaker's whole last utterance and represents the overall emotional reaction. A gating mechanism is designed to capture the affective representation $r^{aff}$:

$$g = \sigma\big(W_g(r^{emo} \oplus er_0)\big), \qquad r^{aff} = g \odot r^{emo} + (1 - g) \odot er_0.$$

We then project $r^{aff}$ to predict the user's emotion $e$:

$$P(e \mid X) = \mathrm{softmax}\big(W_e\, r^{aff}\big),$$

which is supervised by the ground-truth emotion label $e^*$ using the cross-entropy loss $\mathcal{L}_{emo} = -\log P(e^* \mid X)$.
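A minimal sketch of the gate and the emotion head follows, assuming a standard sigmoid gate over the concatenated inputs (the paper's exact parameterization may differ); the 32 emotion classes match the dataset described later.

```python
import torch
import torch.nn as nn

class AffectiveGate(nn.Module):
    """Gated fusion of the contextual emotional state r_emo and the overall
    emotional reaction er_0 into r_aff, plus the emotion classifier."""

    def __init__(self, dim, num_emotions=32):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, num_emotions)

    def forward(self, r_emo, er_0):
        g = torch.sigmoid(self.gate(torch.cat([r_emo, er_0], dim=-1)))
        r_aff = g * r_emo + (1 - g) * er_0
        emotion_logits = self.classifier(r_aff)  # trained w/ cross-entropy
        return r_aff, emotion_logits
```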

Empathy-aware Response Generation
We employ a Transformer-based decoder to generate the response. To improve empathy perception in response generation, we devise two strategies to fuse the two parts of empathy (i.e., cognition and affection). First, we concatenate the cognitive and affective signals $r^{cog}$ and $r^{aff}$ with the dialogue context representation $S^X$ at the token level, and process the result with an MLP layer activated by ReLU to integrate cognition and affection into the dialogue context:

$$\hat{S}^X[j] = \mathrm{MLP}\big(S^X[j] \oplus r^{cog} \oplus r^{aff}\big).$$

Second, we modify the original Transformer decoder layer by adding two new cross-attention sublayers to integrate the commonsense cognition knowledge, inserted between the self-attention and the cross-attention over $\hat{S}^X$. The decoder then predicts the next token $y_m$ given the previously decoded tokens $y_{<m}$, as in the standard Transformer decoder.
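The modified decoder layer could be sketched as follows; which knowledge memory feeds each of the two extra cross-attention sublayers, and the post-norm residual placement, are our assumptions.

```python
import torch
import torch.nn as nn

class EmpathyDecoderLayer(nn.Module):
    """Sketch of the modified decoder layer: two extra cross-attention
    sublayers over encoded knowledge sit between self-attention and the
    usual cross-attention over the fused context states."""

    def __init__(self, dim, heads=8):
        super().__init__()
        attn = lambda: nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn, self.cog_attn, self.emo_attn, self.ctx_attn = (
            attn(), attn(), attn(), attn())
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(5))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, y, cog_mem, emo_mem, ctx_mem, causal_mask):
        # y: (batch, tgt_len, dim) decoder states; *_mem: knowledge/context
        y = self.norms[0](y + self.self_attn(y, y, y,
                                             attn_mask=causal_mask)[0])
        y = self.norms[1](y + self.cog_attn(y, cog_mem, cog_mem)[0])
        y = self.norms[2](y + self.emo_attn(y, emo_mem, emo_mem)[0])
        y = self.norms[3](y + self.ctx_attn(y, ctx_mem, ctx_mem)[0])
        return self.norms[4](y + self.ffn(y))
```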
We adopt the standard negative log-likelihood loss to optimize the response generation module:

$$\mathcal{L}_{gen} = -\sum_{m=1}^{M} \log P(y_m \mid y_{<m}, X).$$

Finally, we jointly optimize the coarse-to-fine alignment loss, the emotion prediction loss, the generation loss, and the diversity loss proposed by Sabour, Zheng, and Huang (2022):

$$\mathcal{L} = \gamma_1\, \mathcal{L}_{align} + \gamma_2\, \mathcal{L}_{emo} + \gamma_3\, \mathcal{L}_{gen} + \gamma_4\, \mathcal{L}_{div},$$

where $\gamma_1$, $\gamma_2$, $\gamma_3$, and $\gamma_4$ are hyper-parameters.

Experiments

Experimental Setup
Dataset We conduct our experiments on the widely used EMPATHETICDIALOGUES dataset (Rashkin et al. 2019), a crowdsourced multi-turn empathetic conversation corpus comprising 25k open-domain conversations between speakers and listeners. The dataset covers 32 emotional situations, and each conversation is grounded in a single emotional situation. In a conversation, the speaker confides personal experiences and feelings, and the listener infers the speaker's current situation and emotion and responds empathetically. Following Rashkin et al. (2019), we split the train/valid/test sets 8:1:1.
Baselines We compare our CASE with several competitive baselines, including KEMP (Li et al. 2022) and CEM (Sabour, Zheng, and Huang 2022).

Implementation Details
We implement all models in PyTorch. We initialize the word embeddings with pretrained GloVe word vectors (Pennington, Socher, and Manning 2014). The dimensionality of embeddings is set to 300 for all corresponding modules. We set the hyper-parameters $l = 5$, $N = 10$, $\alpha = 0.2$, $\gamma_1 = \gamma_2 = \gamma_3 = 1$, and $\gamma_4 = 1.5$. We use the Adam optimizer (Kingma and Ba 2015) with $\beta_1 = 0.9$ and $\beta_2 = 0.98$. The batch size is 16 and early stopping is adopted. The initial learning rate is set to 0.0001 and is varied during training following Vaswani et al. (2017). The maximum decoding step is set to 30 during inference. All models are trained on a GPU-P100 machine. The training of CASE is split into two phases: we first minimize $\mathcal{L}_{BOW}$ to pretrain the knowledge discernment mechanisms, and then minimize $\mathcal{L}$ to train the overall model.
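For reference, a sketch of the optimizer setup under these hyper-parameters; the warmup length and the normalized-multiplier form of the Vaswani-style schedule are assumptions, since the paper only states the initial rate and the beta values.

```python
import torch

def make_optimizer(model, warmup=8000):
    """Adam (lr=1e-4, betas=(0.9, 0.98)) as stated above, with an
    inverse-square-root warmup schedule in the spirit of Vaswani et al.
    (2017)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.98))

    def rate(step: int) -> float:
        step = max(step, 1)
        # multiplier on the base lr: linear warmup, then step^-0.5 decay,
        # normalized to peak at 1.0 when step == warmup
        return min(step / warmup, (warmup / step) ** 0.5)

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=rate)
    return opt, sched

# Usage: call sched.step() once per training step, not per epoch.
```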

Automatic Evaluation
To evaluate the generative performance of the models, we adopt the widely used Perplexity (PPL) and Distinct-1/2 (Dist-1/2) (Li et al. 2016). Perplexity evaluates the general generation quality of a model. Distinct-1/2 evaluates generation diversity by measuring the ratio of unique unigrams/bigrams in the response. To evaluate the emotion classification performance, we measure the accuracy (Acc) of emotion prediction. Following KEMP and CEM, we do not report word-overlap-based automatic metrics (Liu et al. 2016), e.g., BLEU (Papineni et al. 2002).

The automatic evaluation results are shown in Table 1. Our model outperforms all compared models and achieves a significant improvement on all metrics. First, our model achieves about a 4.0% reduction in PPL compared to the best baseline, which indicates that CASE is more likely to generate the ground-truth responses. Second, our model achieves about 15.6% and 41.2% improvements on Dist-1 and Dist-2 over CEM, which indicates the superiority of CASE in generating informative responses at the unigram and bigram levels. This is attributed to the coarse-to-fine alignment, which allows CASE to inject more informative commonsense cognition while keeping the perplexity of the generated responses low. Third, our model achieves about 17.9% and 7.8% improvements in emotion prediction accuracy over KEMP and CEM, respectively. This verifies that CASE considers both aspects of affection (i.e., the contextual emotional state and the emotional reaction) more effectively than focusing on a single aspect as KEMP and CEM do.
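For clarity, Dist-n is the ratio of unique to total n-grams over the generated responses; a small sketch follows, where corpus-level counting is assumed.

```python
from typing import List

def distinct_n(responses: List[List[str]], n: int) -> float:
    """Dist-n: ratio of unique n-grams to total n-grams over all generated
    responses (Li et al. 2016)."""
    total, unique = 0, set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)

# dist1 = distinct_n(tokenized_responses, 1)
# dist2 = distinct_n(tokenized_responses, 2)
```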

Human Evaluation
In human evaluation, 200 contexts are randomly sampled, and each context is associated with two responses, one generated by our model CASE and one by a baseline. Following Sabour, Zheng, and Huang (2022), three crowdsourcing workers are asked to choose the better response (Win) of the two along three aspects: (1) Coherence (Coh.): which model's response is more fluent and context-related? (2) Empathy (Emp.): which model's response expresses a better understanding of the user's situation and feelings? (3) Informativeness (Inf.): which model's response incorporates more information related to the context? We use Fleiss' kappa (κ) (Fleiss 1971) to measure inter-annotator agreement. As shown in Table 2, CASE outperforms the three most competitive baselines on all three aspects. In particular, CASE outperforms the baselines significantly in terms of empathy and informativeness, which demonstrates the benefit of modeling the interaction between the cognition and affection of empathy and supports the observations from the automatic evaluation.

Overall-to-Part Ablation Study
We conduct an overall-to-part ablation study in Table 3 by removing key components of CASE for further dissection.
In the overall ablation, first, we remove the commonsense cognition graph and the emotional concept graph, called "w/o Graph". The emotion prediction accuracy decreases significantly, which indicates that the two heterogeneous graphs make a remarkable contribution to detecting emotion. Second, we remove the coarse-to-fine alignment, called "w/o Align". The diversity of generation decreases significantly and the emotion prediction accuracy drops distinctly. This supports our motivation that the alignment of cognition and affection leads to informative and highly empathetic expression.
In the part ablation, first, we remove the two heterogeneous graphs separately, called "w/o CSGraph" and "w/o ECGraph", respectively. From the results, we find that the commonsense cognition graph mainly contributes to the diversity of generation (i.e., Dist-1/2), while the emotional concept graph mainly contributes to the recognition of emotion (i.e., Acc). This also supports the motivation behind our graph construction. Second, we remove the coarse-grained and fine-grained alignments, called "w/o CGAlign" and "w/o FGAlign", respectively. We observe that the fine-grained alignment contributes more than the coarse-grained one overall. This matches our intuition that building the fine-grained association between cognition and affection is closer to the conscious interaction process when humans express empathy.

Case Study
Two cases with responses generated by six models are presented in Table 4. Compared to the baselines, CASE has two main advantages. (1) CASE is more likely to accurately identify the conversational emotion, consistent with the "Acc" results. This is due to the complementary effect of considering both the emotional concept and the emotional reaction. For instance, in the first case, "scared" in the emotional reaction alone is insufficient to identify the current conversational emotion, i.e., "Terrified", while "frighten, terrify, etc." in the emotional concept complement it. The opposite holds in the second case: for the conversational emotion "Embarrassed", the emotional reaction offers "embarrassed, ashamed, etc." while the emotional concept offers only "bad".
(2) CASE is more likely to express informative cognition in a highly empathetic tone. For example, in the first case, CASE uses the established association between cognition, e.g., "to be safe", and affection, e.g., "good", to appease the user's "Terrified" experience, i.e., "to stay safe" and "get a little better" in the response. In the second case, given the identified "Embarrassed" emotional state, CASE expresses empathetic affection with "it is not too bad" and makes an informative cognitive statement, i.e., "get it fixed", in the response.

Conclusion and Future Work
In this paper, we propose CASE, which aligns cognition and affection for responding empathetically by simulating their conscious alignment and interaction in human conversation, inspired by CBT (Westbrook, Kennerley, and Kirk 2011). Extensive experiments verify the superiority of CASE in both automatic and human evaluation. In the future, we hope this work will inspire subsequent research to borrow psychological theories for empathetic dialogue and other promising tasks (e.g., emotional support conversation).