E-CORE: Emotion Correlation Enhanced Empathetic Dialogue Generation

Achieving empathy is a crucial step toward humanized dialogue systems. Current approaches for empathetic dialogue generation mainly perceive an emotional label and generate an empathetic response conditioned on it. This treats emotions independently and ignores the intrinsic emotion correlation in dialogues, resulting in inaccurate emotion perception and unsuitable response generation. In this paper, we propose a novel emotion correlation enhanced empathetic dialogue generation framework, which comprehensively realizes emotion correlation learning, utilization, and supervision. Specifically, a multi-resolution emotion graph is devised to capture context-based emotion interactions at different resolutions, further modeling emotion correlation. We then propose an emotion correlation enhanced decoder, with a novel correlation-aware aggregation and a soft/hard strategy, which improve emotion perception and response generation, respectively. Experimental results on the benchmark dataset demonstrate the superiority of our model in both empathetic perception and expression.


Introduction
Empathy is a desirable human trait that improves emotional perceptivity in emotion-bonding social activities, helping to achieve a humanized dialogue system (Smith, 2006; Singer and Lamm, 2009). Empathetic dialogue generation (EmpDG), which aims at perceiving the emotional expressions in a dialogue to generate appropriate responses rich in empathy, has attracted extensive attention for its ability to improve user experience and satisfaction in multiple domains (Fitzpatrick et al., 2017; Wang et al., 2021). Most existing methods follow a multi-task learning paradigm, jointly training an emotion classification task and a dialogue generation task to achieve response generation with empathetic constraints. Recent works focus on two aspects. The first improves emotion perception, for example by introducing external knowledge (Li et al., 2020b, 2022b; Sabour et al., 2022), mining emotion causes (Kim et al., 2021; Gao et al., 2021), or more fine-grained emotion modeling (Li et al., 2020a; Kim et al., 2022). The other promotes the generation strategy, based on a mixture of experts (Lin et al., 2019), different emotion look-ahead reward functions (Shin et al., 2020), emotional mimicry (Majumder et al., 2020), and so on. In general, these methods first perform main-emotion prediction with a single-label emotion classifier, then inject the predicted emotion into generation to achieve empathetic expression.
The above paradigm implicitly introduces an independence assumption on different emotions, both in modeling and utilization, which stems respectively from maximizing the separation between different emotions in classification and from abandoning secondary emotions in generation. However, studies in social psychology (Vansteelandt et al., 2005; Martinent et al., 2012) suggest that human emotions are not completely independent but intrinsically correlated: dialogues and responses are typically accompanied by the co-occurrence of multiple emotions (Martinent et al., 2012). Ignoring the emotion correlation directly impairs main-emotion perception, as the main emotion, as a holistic feature, should be co-determined by the emotions occurring in context (Fig. 1-a: the common correlation weights of grateful and terrified help distinguish the true emotion afraid). Moreover, this assumption also harms response generation, as a model dominated by one emotion lacks the ability to recognize emotional transitions (Fig. 1: a transition from afraid for the accident to grateful for survival), resulting in unsuitable responses (Fig. 1-b: only "sorry to hear" for survival). Therefore, considering the emotion correlation is necessary for precise emotion perception and better empathetic expression. Statistical results on the benchmark dataset in Fig. 2 (details in Appendix A), calculated from the quantity of other-emotion-related words in samples, further suggest that emotion correlation learning is significant for the EmpDG task, with the proportion of samples containing secondary emotions reaching 84.04%.
As annotating all subtle emotions in dialogues is hard and inefficient, we propose to mine and incorporate this intrinsic emotion correlation into single-labeled EmpDG. There are three challenges: 1) modeling and learning the multi-emotion correlation; 2) utilizing the correlated co-occurring emotions without biasing toward the labeled emotion; 3) providing supervision to avoid excessive or erroneous introduction of multi-emotion information.
To this end, we propose a novel Emotion CORrelation Enhanced empathetic dialogue generation framework, namely E-CORE, with three tailored modules addressing the above challenges. Specifically, we propose a novel directed weighted graph, which captures the subtle emotion interactions in context at different resolutions, further encoding the intrinsic emotion correlation. Then we design an emotion correlation enhanced decoder, which adopts a correlation-aware aggregation and a soft/hard strategy, incorporating the correlated co-occurring emotions to improve emotion perception and response generation, respectively. Meanwhile, an emotion correlation loss is constructed to provide multi-emotion regularization constraints.
Our contributions are summarized as follows: 1) We propose breaking the emotion independence assumption in current methods and modeling the intrinsic emotion correlation. To the best of our knowledge, this is one of the first frameworks in EmpDG that explicitly models and utilizes emotion correlation to enhance emotion perception and response generation. 2) We propose a distinctive method with three tailored modules respectively addressing emotion correlation learning, utilization, and supervision, which effectively and accurately capture the correlated co-occurring emotions in dialogues even under single labels, enhancing empathy perception and expression. 3) Extensive experiments verify the superiority of our method in both emotion prediction (8.34% in accuracy) and response generation (8.53% in perplexity). Ablation studies and specialized experiments on a constructed multi-emotion annotated sub-dataset also validate the fidelity of our emotion correlation learning.

Emotional Dialogue Generation
In recent years, open-domain dialogue systems have achieved great progress (Li et al., 2016; Liu et al., 2016; Zhong et al., 2019; Zhang et al., 2020a; Shen et al., 2021; Zhu et al., 2022). As the combination of emotion and personality leads to a more human-like system, the emotional dialogue generation task, which aims to generate emotional responses according to a specified emotion label, was proposed and developed (Song et al., 2019; Dong et al., 2021; Ide and Kawahara, 2021; Liang et al., 2021; Li et al., 2021; Tu et al., 2022). Some works (Firdaus et al., 2021) also address multi-emotion guided generation; however, these works rely on manually annotated emotions and focus on encoding the provided multiple emotions. Our work more closely simulates real dialogue scenarios, imitating the listener's perception of and inference about context emotions, and focuses on multi-emotion learning with emotion correlation.

Empathetic Dialogue Generation
Unlike emotional dialogue generation, the empathetic dialogue generation task aims at generating empathetic responses based on perceived emotions instead of definite annotated emotions. Rashkin et al. (2019) first proposed the task and contributed a new task benchmark and a large-scale empathetic dialogue dataset. Several works (Majumder et al., 2020; Li et al., 2020a; Kim et al., 2021; Gao et al., 2021) then made efforts to enhance empathy perception. Lin et al. (2019) proposed a multi-decoder model combining the emotional responses of appropriate listeners, treating every listener as independent. Kim et al. (2022) proposed a feature transition recognizer for identifying feature shifts between utterances, enhancing semantic understanding. Li et al. (2022b) and Sabour et al. (2022) introduced commonsense knowledge to improve situation understanding. Li et al. (2022a) further proposed a serial encoding and emotion-knowledge interaction method which effectively utilizes fine-grained emotion features and commonsense knowledge to enhance empathetic responses. However, these works mostly rely on single-emotion prediction to capture empathy signals, ignoring the emotion co-occurrence existing in dialogues. In this work, we investigate correlation-based emotion co-occurrence to enhance empathetic perception and expression.

Proposed Approach
Given a dialogue context U = [u_1, u_2, . . ., u_m] of m utterances, empathetic dialogue generation aims to generate the next empathetic response y with emotional consistency and informative expression. Optionally, the task performs emotion prediction based on context semantic understanding to achieve empathetic constraints. In this section, we give a detailed introduction to our proposed E-CORE, which explicitly mines and incorporates the emotion correlation to enhance empathetic perception and expression. The framework consists of three phases: context encoding, the multi-resolution emotion graph network, and emotion correlation enhanced decoding, as illustrated in Fig. 3.

Context Encoding
Following previous methods (Sabour et al., 2022; Li et al., 2022b), we first concatenate the dialogue context U into a long word sequence and insert a special [CLS] token at the start, i.e., X = [CLS, x_1, x_2, . . ., x_{M−1}], where M − 1 is the total number of words in U and x_0 denotes [CLS]. Then we represent the context embedding as the sum of three embeddings: a word embedding, a position embedding (Vaswani et al., 2017), and a dialog-state embedding, where the dialog state indicates whether each word comes from the speaker or the listener. The context embedding x is fed into a transformer encoding layer (Vaswani et al., 2017) to obtain the contextual representation:

x = e_w(X) + e_p(X) + e_d(X),  (1)
h_X = Enc_trans(x),  (2)

where h_X ∈ R^{M×D} and D is the feature dimension.
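Eq. (1) amounts to three table lookups summed elementwise. A minimal sketch with hypothetical sizes and random stand-in tables (the real embeddings are trained):

```python
import numpy as np

rng = np.random.default_rng(0)
V, M, D, S = 100, 6, 16, 3           # vocab size, sequence length, dim, dialog states

E_w = rng.normal(size=(V, D))        # word embedding table e_w
E_p = rng.normal(size=(M, D))        # position embedding table e_p
E_d = rng.normal(size=(S, D))        # dialog-state table e_d: 0=[CLS], 1=speaker, 2=listener

tokens = np.array([0, 5, 7, 9, 11, 13])   # [CLS] followed by context word ids
states = np.array([0, 1, 1, 2, 2, 1])     # which party produced each word

# Eq. (1): x = e_w(X) + e_p(X) + e_d(X), one D-dim vector per token
x = E_w[tokens] + E_p[np.arange(M)] + E_d[states]
print(x.shape)  # this (M, D) tensor is what Enc_trans consumes in Eq. (2)
```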

Multi-resolution Emotion Graph Network
Inspired by methods in social psychology studies (Vansteelandt et al., 2005; Scherer, 2013) which explore emotion correlated co-occurrence through emotion-word interactions, we construct a multi-resolution emotion graph based on word emotion intensities, to capture the context-based emotion interactions at different resolutions for further emotion correlation learning. Similar to Li et al. (2022b), we obtain emotion intensity annotations from SKEP (Tian et al., 2020), which serve as the bridge for emotion graph modeling. As SKEP outputs a score η(x_i) ∈ [0, 1] identifying the positivity of word x_i (0.5 means neutral), the emotion intensity of each word is defined as c_i = (η(x_i) − 0.5)^2, and c = [c_1, . . ., c_{M−1}] collects the emotion intensities of all context words.
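The mapping from SKEP scores to intensities can be written directly; note that a fully negative word (score 0) and a fully positive word (score 1) both get the maximal intensity 0.25, since intensity measures distance from neutral rather than polarity:

```python
import numpy as np

def emotion_intensity(eta):
    """Squared distance of a SKEP positivity score from the neutral value 0.5."""
    eta = np.asarray(eta, dtype=float)
    return (eta - 0.5) ** 2

# Intensities for a strongly negative, a neutral, and a fairly positive word.
c = emotion_intensity([0.0, 0.5, 0.9])
print(c)
```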
Graph Construction. Specifically, the multi-resolution emotion graph is composed of two kinds of nodes, i.e., M word nodes V_w for the M context words (including [CLS]) and P emotion nodes V_e for the P emotions, and two kinds of edges, i.e., interacted connections for word nodes and correlated connections for emotion nodes.
For word nodes, the emotion graph is required to capture the subtle emotion interactions existing in the context for correlation learning. Starting from global interaction, as different emotional transitions lead to different response emotions, we devise a basic interacted connection, i.e., a word node connects to all previous word nodes and all emotion nodes. Further, to capture more direct emotion interactions, since the emotion intensity c preliminarily indicates a word's emotional importance, by setting different thresholds and screening out relatively unimportant word nodes, the basic graph is extended to refined interacted graphs that attend to emotional information at multiple resolutions.
For emotion nodes, the emotion graph is required to model the intrinsic emotion correlation, so we construct correlated connections, i.e., edges from emotion nodes to each other, combined with a global learnable matrix R ∈ R^{P×P} that simply yet effectively encodes the correlation weights.
Considering the symmetry of emotion correlation, we adopt a re-parameterization trick to replace the direct training of R, representing R as the inner product of a re-parameterization matrix S ∈ R^{P×P}, i.e., R = S^T S, where the diagonal values are always set to 1.
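A sketch of this re-parameterization (random S stands in for the trained parameter): symmetry of R then holds by construction rather than needing a constraint during training.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 32                                   # number of emotion categories

S = rng.normal(size=(P, P))              # trainable re-parameterization matrix
R = S.T @ S                              # correlation weights, symmetric by construction
np.fill_diagonal(R, 1.0)                 # self-correlation fixed to 1
```

A side effect of R = S^T S is that R (before the diagonal reset) is also positive semi-definite, which is a natural property for a correlation-like matrix.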
Logically, we define the initial edge weights of each node by normalizing over its corresponding neighboring nodes. Additionally, all nodes are connected to the [CLS] node with weight 1 for context interaction. The initial features h^0 of the word nodes V_w and emotion nodes V_e are defined as the word embeddings x and the emotion embeddings e_w(V_e) (Eq. 1), respectively.

Graph Updating.
We design a novel multi-resolution attention mechanism that realizes independent updating and layer-out fusion of graph features at different resolutions, without increasing complexity. Specifically, the node and edge features of layer l on the k-th graph are updated by attention, where α_{ij}^{k,l} denotes the attention score of node i to neighboring node j on the k-th graph at the l-th layer, ⊙ denotes element-wise multiplication, Π_l is an MLP network, and || denotes concatenation over graphs. This design smoothly promotes multi-head attention into multi-resolution updating, where node features and edge weights are independently updated with the corresponding connections at each resolution (head), and node features are then fused layer by layer for global feature sharing.
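To make the multi-head analogy concrete, here is a simplified numpy sketch, not the paper's exact update rule: each resolution k gets its own adjacency mask derived from an intensity threshold (hypothetical values), attention is computed independently per resolution, and the per-resolution outputs are concatenated like the heads of multi-head attention.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, D, K = 8, 12, 3                        # nodes, feature dim, resolutions
h = rng.normal(size=(N, D))               # node features
c = rng.random(N)                         # word emotion intensities

# One boolean adjacency per resolution: higher thresholds keep only the
# emotionally salient nodes (threshold values are illustrative).
thresholds = [0.0, 0.3, 0.6]
masks = np.stack([np.outer(c >= t, c >= t) for t in thresholds])  # (K, N, N)

outs = []
for k in range(K):
    scores = h @ h.T / np.sqrt(D)                  # (N, N) attention logits
    scores = np.where(masks[k], scores, -1e9)      # restrict to the k-th graph
    alpha = softmax(scores, axis=-1)               # per-resolution attention
    outs.append(alpha @ h)                         # (N, D) updated node features
h_new = np.concatenate(outs, axis=-1)              # (N, K*D) layer-out fusion
print(h_new.shape)
```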
After several rounds of graph updating and a summation over the K graphs' edge weights, we obtain the representation of the emotion graph: word-to-emotion edge weights E_{w-e} ∈ R^{M×P}, emotion-to-emotion edge weights E_{e-e} ∈ R^{P×P}, and word node features h_node ∈ R^{M×D}, used for subsequent emotion perception and response generation.

Emotion Correlation Enhanced Decoding
With the sample-specific emotion correlation captured by the graph, we detail the utilization of the correlation-based emotion co-occurrence to enhance emotion signal perception and empathetic response generation, respectively.

Emotion Signal Perceptron.
We adopt a correlation-aware aggregation to enhance emotion perception. Specifically, as the edge weights of the graph intuitively reflect the attention to emotions, we define the global perception signal by aggregating them. This processing refers to Fig. 1-a, where the column-summation of E_{w-e} fuses the attention weights of the M words to each emotion. E_{e-e} is initialized by R and updated with the sample context, equivalent to the sample-specific emotion correlation weights with diagonal values reset to 1. This design smoothly achieves an attention-based correlated aggregation over the co-occurring emotions in the context. The global perception signal is then combined with the contextual representation h_X (Eq. 2), followed by a linear layer and a softmax layer to obtain the emotion category distribution, where h̄_X ∈ R^D is the mean-pooled feature of h_X, and W_ϵ ∈ R^{P×2P} and W_x ∈ R^{P×D} are the weight matrices of the linear layers. h^m_emo is the obtained main perception signal. Our model minimizes the cross-entropy loss between the predicted main emotion ϵ and the ground-truth emotion ϵ* for optimization.

Soft/Hard Gated Generator. Main-emotion signal perception provides annotated emotion supervision, but may also suppress other emotions, impairing subsequent generation. Thus, we design both soft and hard gated strategies to capture the meaningful co-occurring emotions, combined with the emotion graph, to pay more attention to meaningful emotions and further achieve co-occurring-emotion guided generation.
Specifically, to avoid the supervised suppression that may be caused by directly using h^m_emo, a gated attention mechanism is adopted to extract meaningful emotion features from the global and main emotion perception signals, which both contain rich emotional information; here W_e is a weight matrix and h_emo ∈ R^P indicates the final attention features over the P emotions. With the final emotion attention, soft and hard strategies are proposed, respectively, to refine the graph for effective utilization of correlated co-occurring emotions. The straightforward way is the soft strategy, which treats the attention features as an emotional soft label, serving as the new initial edge weights for the emotion nodes. However, the soft strategy may introduce redundant emotional information, resulting in noise interference. Therefore, we further propose a hard strategy to directly screen emotions. As context-irrelevant and context-relevant emotions show a great distinction in attention features, we divide emotions into irrelevant and relevant categories based on the principle of maximizing the variance between the two categories, also known as the OTSU algorithm (Otsu, 1979) (details in Appendix B). By removing the nodes and connected edges of irrelevant emotions, the hard strategy realizes comprehensive attention to the important emotions.
In summary, the soft strategy is more flexible while the hard strategy is more stable; both achieve an adaptive selection and utilization of co-occurring emotions.
Finally, after carrying out the soft or hard strategy so that the emotion graph focuses on significant emotions, we obtain the improved graph features through another forward pass of the parameter-shared graph network. As the node features preserve not only emotional information but also emotion-interacted semantic information, we feed the improved node features ĥ_node into a modified transformer decoder (details in Appendix C) for generation, where y_{<t} = [y_0, ..., y_{t−1}] is the masked response and h_X is the contextual representation. As in most dialogue generation tasks, the negative log-likelihood loss is used as the optimization objective.

Emotion Correlation Loss
Finally, to avoid excessive or erroneous introduction of emotional information, we construct an emotion correlation loss as a regularization constraint, where V′ is the set of learned co-occurring emotions, taking the top-3 emotions for the soft strategy and V_e^relevant for the hard strategy. Minimizing L_eco prevents the introduction of multiple emotions with low correlation weights, as low weights indicate the emotions are unlikely to occur in the same context.
Considering all the above components, a joint loss function is adopted as the overall optimization objective to achieve end-to-end learning.


Experiment Settings

Datasets
We evaluate our E-CORE on the EMPATHETICDIALOGUES (Rashkin et al., 2019) dataset, which was collected via Amazon Mechanical Turk and contains about 25k open-domain dyadic conversations. Each conversation involves a speaker and a listener: the speaker is asked to talk about personal feelings, and the listener responds empathetically. We split the train/val/test sets into 19,533/2,770/2,547 conversations.
In addition, to further validate the fidelity of E-CORE in emotion correlation modeling, we also construct a sub-dataset with multi-emotion annotations (detailed in Appendix H). This sub-dataset is obtained by: 1) emotion annotation with the large-scale language models ChatGPT (OpenAI, 2022) and ChatLLaMa (Nebuly-AI, 2023) on the above test set; 2) screening the samples that identify the ground-truth emotion and contain multiple emotion labels; 3) filtering mistaken annotations with manual inspection. The sub-dataset is composed of 739 samples, with an average of 2.93 emotion labels per sample.

Baselines
We conduct experiments to compare our E-CORE with the following state-of-the-art baselines: 1) Transformer (Vaswani et al., 2017)

Evaluation Metrics
Automatic Evaluation.
Following previous works, for response generation we adopt perplexity (PPL) (Serban et al., 2015) and distinct-n (Dist-n) (Li et al., 2015) as the main automatic metrics, which measure the quality and diversity of generated responses, respectively. For emotion perception, we employ emotion accuracy (Acc) to measure the consistency between the predicted main emotion and the ground-truth emotion.

Human Evaluation. To test the models' ability to generate human-like responses, we conduct human ratings of the generated responses on three aspects: Fluency (fluency of responses), Relevance (relevance to the dialogue context), and Empathy (empathetic expression of responses). We randomly select 100 dialogues, paired with the dialogue context and the responses from the baselines and our E-CORE. Three human annotators are asked to score the selected instances on the three metrics in the range [1, 5], the higher the better. The average scores of all annotators are the human rating results. In addition, for a more direct model comparison, we also conduct a human A/B test with the best-performing SOTAs. Three annotators carry out pairwise response comparisons, selecting the better response for each instance; a tie is allowed if both are good or both are bad. More details of the human evaluations are covered in Appendix I.

(Table 1: automatic metrics PPL ↓, Dist-1, Dist-2, Acc and human ratings Fluency, Relevance, Empathy for Transformer (Vaswani et al., 2017) and the other models.)
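As a concrete reference for the diversity metric, Dist-n is commonly computed as the ratio of distinct n-grams to the total number of n-grams across all generated responses; a minimal implementation under that common definition:

```python
from collections import Counter

def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over a set of responses."""
    ngrams = Counter()
    for resp in responses:
        toks = resp.split()
        ngrams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

responses = ["i am so sorry to hear that", "i am glad to hear that"]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```

Higher values mean fewer repeated phrases; a model that always emits the same generic response scores near zero.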

Results and Analysis
We conduct experiments on the benchmark dataset to verify the promise of emotion correlation learning in both emotion perception and empathetic generation. We then investigate the ability to recognize co-occurring emotions on the multi-emotion annotated subset, to further validate the essence of the emotion correlation learning in our method E-CORE.

Comparison with State-of-the-Art
Automatic Evaluation. As the SOTAs are mainly trained from scratch, we report results trained from scratch for comparison fairness in Tab. 1 (results with pre-trained models are supplied in Appendix E). Our proposed E-CORE outperforms the SOTAs on all automatic metrics, verifying the effectiveness of emotion correlation modeling for empathetic understanding. The significant improvements in response quality (8.53% relative, 3.08 absolute in PPL) and diversity (11.5% relative, 0.36 absolute in Dist-2) further confirm the benefit of correlation-enhanced generation.

Results on Sub-dataset. The above evaluation confirms that, even for EmpDG under single-label guidance, our model effectively achieves co-occurring-emotion learning with correlation modeling. To further validate our multi-emotion learning, we conduct extensive experiments on the multi-emotion annotated sub-dataset, using the metric Recall@k for quantitative evaluation, which measures how many of the ground-truth emotions are covered by the top-k predicted emotions (or by the predicted relevant emotions for the hard strategy). The great improvement shown in Tab. 3 (44.1% in R@3, 31.3% in R@5) reflects the significant superiority of E-CORE in multi-emotion learning. In addition, a greater improvement in the original metrics (11.6% in Acc) further proves that E-CORE has a stronger learning ability than the SOTAs for complex samples with multiple emotions. Further, we visualize the emotion correlation of the dataset and of E-CORE for a more intuitive comparison. The correlation weights for the dataset are calculated from the co-occurrence counts of emotion pairs, while for E-CORE we directly use the learned weights R after model training. As shown in Fig. 4, our model exhibits an emotion correlation very close to the real distribution, proving the accuracy of our emotion correlation modeling.
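The paper does not spell out the exact averaging behind Recall@k; a common per-sample definition, assumed here for illustration, is the fraction of gold emotions covered by the top-k predictions, averaged over samples:

```python
import numpy as np

def recall_at_k(scores, gold, k):
    """Fraction of gold emotion labels covered by the top-k predicted
    emotions, averaged over samples."""
    covered = []
    for s, g in zip(scores, gold):
        topk = set(np.argsort(s)[::-1][:k])    # indices of the k highest scores
        covered.append(len(topk & set(g)) / len(g))
    return float(np.mean(covered))

# Two toy samples over 4 emotion categories with multi-label ground truth.
scores = [[0.1, 0.6, 0.2, 0.1], [0.5, 0.1, 0.3, 0.1]]
gold = [[1, 2], [0, 3]]
print(recall_at_k(scores, gold, 2))  # sample 1 covers 2/2, sample 2 covers 1/2
```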

Ablation Study
To fully examine the contribution of each design in our E-CORE for addressing the corresponding challenges, we conduct ablation studies. As reported in Tab. 4, all modules make reasonable contributions to E-CORE. For learning, replacing the emotion graph with a transformer causes significant performance degradation, verifying the effectiveness of the multi-resolution emotion graph for emotion correlation learning. For utilizing, models without correlation utilization in the perceptron or the generator perform weakly in emotion accuracy and response quality, respectively, indicating that our designed aggregation and soft/hard strategies effectively incorporate the correlated co-occurring emotions to enhance empathetic perception and expression. Finally, the result without the correlation loss proves its importance for global supervision. More ablation studies and analyses are in Appendix J.

(Table 5: a sample dialogue context with generated responses.)
Speaker: I went through some of my old stuff yesterday, and I found my security blanket that I used when I was a kid!
MIME: I am sure you will do great.
EmpDG: That is so sweet.
SEEK: That is so great.
KEMP: I am glad you are able to get it fixed.
CEM: I am sure you will get a good time.
Ours (soft): That is so nice of you to go back memories.
Ours (hard): I also love those moments.
(Relevant Emotions: Sentimental, Nostalgic)
Gold: Awww I bet that brought back memories.

Case Study
While the case in Fig. 1 shows the ability of E-CORE to jointly guide generation with very different captured co-occurring emotions (afraid and grateful), Tab. 5 exhibits a case with similar co-occurring emotions for a comprehensive qualitative analysis. As the speaker expresses sentimental about the "old stuff", relying on the significant correlation between sentimental and nostalgic, our E-CORE successfully identifies the auxiliary emotion nostalgic, generating the more relevant phrases "go back memories" and "those moments", while the baseline models only produce generic responses. In general, both similar and distant co-occurring emotions are significant for EmpDG, as they help with global and detailed empathetic expression. Our E-CORE, with its emotion correlation learning, provides sufficient emotion guidance, yielding more humanized responses rich in empathy.

Conclusion
In this paper, we propose to exploit the intrinsic emotion correlation in dialogues to enhance empathetic dialogue generation. A distinctive framework is designed with three effective modules respectively addressing emotion correlation learning, utilization, and supervision. Extensive experiments on the benchmark dataset prove the significant advantages of our framework in improving emotion perception and empathetic generation. Specific analysis further demonstrates the accuracy of our emotion correlation learning. In the future, our work can inspire other approaches to explore emotion-related tasks with multi-emotion correlation learning, without being limited by a single emotion label.
Limitations

1) Firstly, as analyzed in the introduction, almost all dialogues are accompanied by subtle emotions besides the main emotion. However, it is almost impossible to annotate all subtle emotions, let alone their weights, for a dialogue. Although our method based on emotion correlation modeling effectively achieves multi-emotion learning for EmpDG under single-label guidance, how to improve the network to utilize existing information to provide more effective supervision for multi-emotion learning still needs to be considered. This is a common problem faced by many emotion-related generation tasks. The ablation study on the model without response reconstruction loss supervision, shown in Appendix J, indicates that the supervision for multiple emotions partly comes from the empathetic response, which may serve as an inspiration for improvement. 2) Secondly, all existing methods are evaluated on the single benchmark dataset EMPATHETICDIALOGUES (Rashkin et al., 2019). As empathetic dialogue generation is an emerging task, only one relevant English dataset has been proposed, lacking datasets in more languages and categories for reference. 3) Finally, we observed in the experiments that existing models tend to generate generic responses, especially for complex hard samples where it is difficult to capture the key points. Therefore, learning from hard samples is also a developing direction for the empathetic dialogue generation task.

A Statistics for Dialogue Emotions
To verify the importance of multi-emotion correlation for the empathetic dialogue generation task, we compute statistics on the quantity of emotion-related words of other emotions contained in the dialogue samples of the benchmark dataset EMPATHETICDIALOGUES (Rashkin et al., 2019), to preliminarily observe the emotion co-occurrence situation in the dataset. Specifically, our annotators provide annotations of high-frequency emotion-related words for all 32 emotions in the dataset. By counting the number and frequency of emotion-related words of different emotions in a dialogue, we can roughly infer the co-occurrence of the various emotions in the dataset. The annotated emotion-related words of the 32 emotions are shown in Tab. 12.

B Details of Hard Strategy
In this section, we elaborate on the hard strategy, which adopts the OTSU algorithm, i.e., maximizing the variance between the two categories to divide emotions into irrelevant and relevant categories. Taking the final emotion attention h_emo as the attention values of the emotion nodes V_e, a threshold splits V_e into two sets V and V̄, and the between-category variance to be maximized is

σ_b^2 = (|V| / |V_e|) (µ_V − µ)^2 + (|V̄| / |V_e|) (µ_V̄ − µ)^2,

where µ, µ_V, µ_V̄ are the mean attention values of the emotion nodes in V_e, V, V̄, respectively. The formula itself does not distinguish V from V̄; we take the part with the larger attention values as V, the relevant category.
For P emotions, there are at most P segmentation thresholds, as the emotions with attention values higher than the threshold are regarded as the relevant category. Based on the statistical results shown in Fig. 2, dialogues with five or more emotions are rare in the dataset. To simplify the computation, only the first five segmentation thresholds are considered, and their between-category variances are compared to obtain the optimal division of emotions.
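A runnable sketch of this screening, assuming h_emo is given as a vector of attention values (the equivalent form w1·w2·(µ1 − µ2)^2 of the between-class variance is used, which ranks thresholds identically):

```python
import numpy as np

def otsu_split(att, max_thresholds=5):
    """Split emotion attention values into relevant/irrelevant categories by
    maximizing the between-class variance over candidate thresholds (Otsu, 1979).
    Returns the set of relevant emotion indices."""
    att = np.asarray(att, dtype=float)
    order = np.argsort(att)[::-1]            # emotions by descending attention
    best_var, best_cut = -1.0, 1
    # The k-th candidate threshold makes the top-k emotions the relevant set.
    for k in range(1, min(max_thresholds, len(att)) + 1):
        rel, irr = att[order[:k]], att[order[k:]]
        if len(irr) == 0:
            break
        w1, w2 = len(rel) / len(att), len(irr) / len(att)
        var = w1 * w2 * (rel.mean() - irr.mean()) ** 2
        if var > best_var:
            best_var, best_cut = var, k
    return set(order[:best_cut].tolist())

# Two clearly dominant emotions should be separated from the long tail.
att = np.array([0.70, 0.65, 0.05, 0.04, 0.03])
print(otsu_split(att))
```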

C Modified Transformer Decoder
In this section, we provide a detailed introduction to the modified transformer decoder used in the soft/hard gated generator. It takes the masked response y_{<t} = [y_0, ..., y_{t−1}], the contextual representation h_X, and the improved graph node features ĥ_node as input, where MH-ATT and FFN denote the multi-head attention layer and the feed-forward network, respectively. Through a simple concatenation operation, the modified transformer decoder effectively introduces the graph node features, which are rich in emotion-interacted semantic information, into the decoding process.
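As a sketch of the concatenation idea only (not the full decoder), a single cross-attention step over the concatenated memory [h_X; ĥ_node] with hypothetical shapes shows how one attention layer can see both the context semantics and the graph node features:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, M, D = 5, 8, 16                       # response length, context length, dim
h_y = rng.normal(size=(T, D))            # masked-response features y_{<t}
h_X = rng.normal(size=(M, D))            # contextual representation
h_node = rng.normal(size=(M, D))         # improved graph node features

# Concatenating the two memories lets a single cross-attention layer attend
# jointly to context semantics and emotion-interacted node features.
memory = np.concatenate([h_X, h_node], axis=0)         # (2M, D)
alpha = softmax(h_y @ memory.T / np.sqrt(D), axis=-1)  # (T, 2M) attention
out = alpha @ memory                                   # (T, D) decoder input to FFN
print(out.shape)
```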

D Implementation
The model is implemented in PyTorch (Paszke et al., 2017) on a single NVIDIA GeForce RTX 3090 GPU, and trained for about 10 epochs with batch size 16 and dropout rate 0.2. The training time of E-CORE is about 3 hours for around 26,000 iterations. The vocabulary size is 23,714, and we use pre-trained GloVe vectors (Pennington et al., 2014) for word embedding initialization. Our model is optimized with the Adam optimizer (Kingma and Ba, 2014), and the learning rate is varied during training following Vaswani et al. (2017), with a final learning rate of 3.5e-4. We also adopt the commonsense knowledge and label smoothing strategy used in the SOTA model (Li et al., 2022b) as tricks to improve performance, without losing comparative fairness.


E Results on Pre-trained Models

We also compare E-CORE with SOTAs using the pre-trained language model DialoGPT (Zhang et al., 2020b). As shown in Tab. 6, all models improve significantly when using the pre-trained language model, indicating that the huge amount of pre-training dialogue data is beneficial for the empathetic dialogue generation task. Moreover, our E-CORE consistently shows superior performance over the SOTAs, demonstrating the advantages of our model whether a pre-trained model is used or not.

F Experiments on Stability Testing
To verify the stability of the model, we evaluate the variance and statistical significance of E-CORE. Specifically, we run experiments with 5 different random seeds, for the soft and hard strategies respectively. Tab. 7 reports the variance of all evaluation metrics, verifying the performance stability of the model.

G Statistical Significance
For statistical significance, we conduct a one-sided Student's t-test, showing that our model (soft) significantly outperforms the best-performing baseline CEM with p = 5.68e-5 (p < 0.05 indicates that the hypothesis that A (E-CORE) outperforms B (CEM) is significantly valid). These conclusions also hold for the hard strategy.
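As a minimal sketch of this kind of test (the per-seed scores below are placeholders, not the actual results), the one-sided two-sample t-statistic can be computed as:

```python
import statistics

def welch_t(a, b):
    """Welch's t-statistic for the one-sided hypothesis mean(a) > mean(b).
    The p-value is the upper-tail probability under Student's t distribution
    with Welch-Satterthwaite degrees of freedom (not computed here)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

# Placeholder per-seed scores for two systems (illustrative only).
t = welch_t([42.6, 42.4, 42.7, 42.5, 42.6], [39.3, 39.5, 39.2, 39.4, 39.3])
```

A large positive t rejects the null hypothesis that system B performs at least as well as system A.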

H Sub-dataset with Multi-emotion Annotation
We construct a sub-dataset with multi-emotion annotations as an auxiliary test set of the benchmark dataset EMPATHETICDIALOGUES, to verify the accuracy of emotion correlation modeling in our E-CORE. In this section, we provide a detailed explanation of this sub-dataset.
Firstly, based on the large-scale pre-trained language models ChatGPT (OpenAI, 2022) and ChatLLaMa (Nebuly-AI, 2023), we conduct emotion annotation for all 2,547 conversations of the test set of EMPATHETICDIALOGUES, obtaining an intermediate dataset of 1,536 samples labeled with multiple emotions. Secondly, we screen a total of 1,097 samples that successfully identify the ground-truth emotion, which are proven to have higher annotation quality. Finally, a manual examination is conducted to filter out mistaken annotations. The final dataset is composed of 739 samples, with an average of 2.93 emotion labels per sample. Tab.11 shows some dialogue samples of this auxiliary dataset.
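The second screening step — keeping only samples whose multi-emotion annotation contains the ground-truth label — can be sketched as follows (the sample field names are hypothetical, chosen only for illustration):

```python
def screen(samples):
    """Keep samples whose model-annotated emotion set contains the
    ground-truth label; the 'gold'/'annotated' field names are illustrative."""
    return [s for s in samples if s["gold"] in s["annotated"]]

samples = [
    {"gold": "afraid", "annotated": {"afraid", "terrified"}},   # kept
    {"gold": "proud", "annotated": {"grateful", "hopeful"}},    # discarded
]
```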

I Details of Human Evaluation
To evaluate the model's ability to generate human-like responses, we conduct human evaluation from three aspects: Fluency, Relevance, and Empathy. Three human annotators are asked to score the instances on these three aspects in the range of [1,5]. We use Spearman's rank correlation coefficients to evaluate the agreement among the annotators. The coefficients between any two annotators are all near 0.6, with an average of 0.64, which shows the consistency and reliability of the human evaluation scores.
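The annotator-agreement computation can be sketched as follows (a simplified Spearman coefficient that assumes no tied scores, which real 1-to-5 ratings would require handling; the score lists in the test are placeholders):

```python
from itertools import combinations

def spearman(a, b):
    """Spearman's rank correlation via Pearson correlation of the ranks
    (simplification: assumes no tied scores)."""
    def ranks(x):
        order = sorted(range(len(x)), key=lambda i: x[i])
        r = [0.0] * len(x)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

def mean_pairwise_agreement(annotators):
    """Average Spearman coefficient over all annotator pairs."""
    pairs = list(combinations(annotators, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)
```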
In the following, we further provide guidelines on how to judge the quality of the model's result on these three aspects in terms of different features.

I.1 Fluency
This metric measures the fluency of the model's result. The definitions of the different scores are:
• [5]: The generated responses are human-like, grammatically correct, fluent, and very easy to understand.
• [4]: Choose this score when you are hesitant between score 3 and score 5.
• [3]: The generated responses have a few grammar errors, but they do not hinder understanding.
• [2]: Choose this score when you are hesitant between score 1 and score 3.
• [1]: The generated responses have numerous grammar errors and are difficult to understand.

I.2 Relevance
This metric measures the informativeness and relevance of the model's result. The definitions of the different scores are:
• [5]: The generated responses are perfectly related to the dialogue context.
• [4]: Choose this score when you are hesitant between score 3 and score 5.
• [3]: The generated responses are to some extent related to the dialogue context.
• [2]: Choose this score when you are hesitant between score 1 and score 3.
• [1]: The generated responses are completely unrelated to the dialogue context.

I.3 Empathy
This metric measures the empathy of the model's result. The definitions of the different scores are:
• [5]: The generated responses are rich in emotional expression, and the expressed emotions perfectly correspond to the dialogue context.
• [4]: Choose this score when you are hesitant between score 3 and score 5.
• [3]: The generated responses contain some emotional expression, and the expressed emotions correspond to the dialogue context to some extent.
• [2]: Choose this score when you are hesitant between score 1 and score 3.
• [1]: The generated responses contain no emotional expression, or the expressed emotions do not correspond to the dialogue context.

J More Ablation Study
To further examine our E-CORE, more ablation studies are conducted with the following variants: 1) w/o gate-g: to test the sub-module of the soft/hard gated generator, the model without gated attention, directly using the main-emotion feature for guidance. 2) w/o gen-loss: to test the sources of emotional supervision, the model without the response generation loss L_gen.
As shown in Tab.8, all sub-modules contribute substantially to the whole model. The gated attention mechanism has a great impact on generation, suggesting that gated attention helps extract meaningful emotions, which in turn provides more sufficient emotional guidance to enhance expression. It is worth noting that the model without the response generation loss (w/o gen-loss) not only shows a decline in generation, but also performs poorly in emotion prediction (42.57 to 39.37 in Acc), indicating that emotion supervision comes not only from the single-emotion label but also from the empathetic responses. This further proves the reliability of our emotion correlation learning, which is modeled by the emotion graph with context-based interaction capturing and multi-sample joint transfer learning.

J.1 More Ablation Study on Emotion Intensity
We also conduct additional ablation studies to evaluate the effect of emotion intensity, by comparing the variants without emotion intensity or with different emotion intensity labeling, on both the soft and hard strategies. The different emotion intensity values are obtained by four different emotion analysis models: SentiWordNet (Sebastiani and Esuli, 2006), VADER (Hutto and Gilbert, 2014), VAD (Zhong et al., 2019), and SKEP (Tian et al., 2020).
As we can see from Tab.9, the model without emotion intensity, which degenerates into a single-resolution emotion graph, performs weakly, indicating the great impact of multi-resolution graph modeling on correlation learning. Furthermore, the model exhibits strong robustness to different emotion intensity labelings, also indicating that emotion correlation learning relies more on graph training than on the quality of the original emotion intensity.
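As a minimal sketch of how intensity values could be bucketed into resolution levels — using the thresholds [0, 0.075, 0.15] for K = 3 from our experimental setup, though the exact mapping below is an illustrative assumption:

```python
import bisect

def resolution_level(intensity, thresholds=(0.0, 0.075, 0.15)):
    """Map a non-negative emotion intensity score to a resolution level
    in 1..K, where level k means the intensity exceeds the k-th threshold."""
    return bisect.bisect_right(thresholds, intensity)
```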

K More Case Studies
Typically, two more generation cases of single-emotion guidance and double-emotion guidance are shown in Tab.10. In the first case, E-CORE extracts the key information "proud" and "single mother" from the context, and generates the more detailed and accurate phrases "great parent" and "proud", while the baseline models only predict a generic phrase "great". In the second case, the speaker expresses her husband's being "faithful" and her "health problems". Our E-CORE successfully detects both emotions in the dialogue context, generating the praise "great" for faithful and the wishes "get through" and "great life" for hopeful, while the other baselines either misunderstand the context or only notice one emotion. In general, the emotion correlation learning enhances emotion-interacted semantic understanding, resulting in more humanized responses rich in information and empathy, in both simple and complex dialogue contexts.

L Visualization Analysis
To further explore the working mechanism of our emotion graph, we visualize the edge features from the dialogue word nodes to the 32 emotion nodes for the case in Fig. 1. As shown in Fig. 5, Ours(Soft) puts the highest attention on the words carrying informative meaning: the words "terrified", "hit" and "drunk driver" contribute to the emotions terrified and afraid, while the words "glad" and "alive" pay more attention to grateful. We can conclude that our multi-resolution emotion graph effectively learns diverse emotional information.
In addition, we also explore the effect of correlation-aware aggregation on emotion perception, and visualize the working mechanism of Equation 8 based on the same case as above. As we can see from Fig. 6, for the initial graph word-emotion edge fusion features, the similar emotion terrified is mistakenly selected while the ground-truth emotion is afraid. Although afraid is already very close, existing methods would discard it once it is recognized as a secondary emotion. In our E-CORE, however, after carrying out correlation-aware aggregation, the greater weight of grateful for afraid assists in identifying afraid as the main emotion, achieving full utilization of all co-occurrence emotions. This further confirms the effectiveness of correlation-based emotion co-occurrence learning in enhancing emotional perception, especially for hard samples with similar emotions.
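As a toy illustration of this effect (the scores and correlation weights below are invented for exposition, and the real Equation 8 may differ in form), correlation-aware aggregation can let a correlated co-occurring emotion tip the main-emotion decision:

```python
emotions = ["terrified", "afraid", "grateful"]

# Hypothetical per-emotion scores from the initial word-emotion edge fusion:
# "terrified" narrowly beats the ground truth "afraid".
scores = [0.50, 0.45, 0.40]

# Hypothetical learned correlation weights C[i][j]: how much emotion i's
# evidence supports emotion j (identity plus one off-diagonal entry,
# grateful -> afraid, mirroring the case discussed above).
C = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.8, 1.0],
]

# Correlation-aware aggregation: each emotion absorbs correlated evidence.
refined = [sum(scores[i] * C[i][j] for i in range(3)) for j in range(3)]

before = emotions[max(range(3), key=scores.__getitem__)]
after = emotions[max(range(3), key=refined.__getitem__)]
```

Before aggregation the argmax is "terrified"; after absorbing the correlated evidence from "grateful", the argmax flips to the ground-truth "afraid".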

Figure 1: A real empathetic dialogue generation case based on our method (right) and existing methods (left), which is divided into two stages: (a) main-emotion perception, (b) response generation. For a more detailed visualization of (a), refer to Fig.6.

Figure 2: Statistics of the secondary emotion proportion in the EMPATHETICDIALOGUES dataset samples.
Figure 3: (a) The overview of the proposed E-CORE, which consists of three phases: 1) context encoding: encoding the dialogue context and all emotions into embedding features and a contextual representation; 2) multi-resolution emotion graph network: capturing the context-based emotion interactions from different resolutions to encode the emotion correlation; 3) emotion correlation enhanced decoding: incorporating the emotion correlation to enhance emotion signal perception and response generation. (b) The design of the soft/hard gated generator used in phase 3.
We compare E-CORE with the following SOTAs: 1) Transformer (Vaswani et al., 2017): a transformer-based model for response generation. 2) MIME (Majumder et al., 2020): a model considering polarity-based emotion clusters and emotional mimicry. 3) EmpDG (Li et al., 2020a): a model exploiting multi-resolution emotions. 4) KEMP (Li et al., 2022b): a model introducing external knowledge. 5) CEM (Sabour et al., 2022): a model leveraging commonsense to draw more information. 6) SEEK (Li et al., 2022a): a model exploiting serial encoding and emotion-knowledge interaction. For a fair and clear comparison, unless otherwise stated, all models and model variants of our E-CORE and the SOTAs are trained from scratch based on dialogue-level emotion annotations. Our model is explored with the soft and hard strategies respectively, as introduced in Sec.3.3, denoted as Ours(Soft) and Ours(Hard). The model is based on the transformer (Vaswani et al., 2017) framework with 4 blocks and 3 heads, with an emotion graph of layer L = 2 and resolution level K = 3, corresponding to the thresholds [0, 0.075, 0.15]. The parameters for the loss function are γ1 = γ2 = 1. More implementation details are covered in appendix D.

Figure 4: Visualizations of emotion correlation for the dataset and E-CORE. Displayed edges are between emotions with correlation weights greater than 0.3 after maximum-value normalization. The same emotions in (a) & (b) are highlighted in the same color, which is assigned based on the emotion distribution in the dataset.
We examine the following variants: 1) w/o graph: the model without the multi-resolution emotion graph, which directly implements the other modules on the vanilla transformer framework. 2) w/o co-p: the model without correlation-aware aggregation in the emotion perceptron. 3) w/o co-g: the model without correlated co-occurrence emotion guidance in the generator, which does not use the soft/hard strategy and generates responses with the main emotion only. 4) w/o co-loss: the model without the emotion correlation loss.

Figure 5: The visualization of the fusion of word-to-emotion edge features for E-CORE.
Figure 6:

Table 1: Comparisons with SOTAs. ↓ indicates that a lower score is better.

Table 2: Comparisons with SOTAs on the human A/B test.

Table 3: Comparisons with SOTAs on the sub-dataset.
E-CORE generates more relevant comments rich in diversity, as more sufficient emotional information is provided. The large gain in emotion accuracy (8.34% relative, 3.28 absolute in Acc) proves that adopting correlation-aware aggregation, rather than simply treating emotions separately, is beneficial for main-emotion perception.

Table 4: Results of ablation studies.

Table 5: Case study of the responses generated by our E-CORE and the baselines.

Table 6: Comparison results based on the pre-trained language model.

Table 7: Results of variance on all evaluation metrics.

Table 8: More results of the ablation study.

Table 9: More ablation studies on the performance of emotion intensity, where SKEP is used as the original labeling method.
S1: Lately I have felt proud of my success as a newly single mother. It gets lonely sometimes, but I can honestly say I have been doing everything I can and more. My husband is the most faithful man.
L1: That is great to hear! A faithful spouse is a blessing.
S2: I have so many health problems and he is always there for me no matter what, being loving and caring.
Gold: I am sorry to hear about that! I hope everything gets better for you!
MIME: Oh no! That is terrified for you.
EmpDG: That is great for you!
SEEK: I hope you can get better.
KEMP: I agree with you. I am sure he will have a great relationship.
CEM: That is good. I hope he gets better.
Ours(Soft): That is great! Hope you can get through that!
Ours(Hard): That is great! I hope you have a great life! (Relevant Emotion: Faithful, Hopeful)

Table 10: More case studies of the responses generated by our E-CORE and the baselines. Key words in the context and responses of different emotions are highlighted in different colors. S_i and L_i respectively correspond to the i-th sentence from the speaker or listener.