Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes

Empathy is a complex cognitive ability based on reasoning about others' affective states. To better understand others and express stronger empathy in dialogues, we argue that two issues must be tackled at the same time: (i) identifying which words in the other's utterance are the cause of their emotion and (ii) reflecting those specific words in the response generation. However, previous approaches for recognizing emotion cause words in text require sub-utterance level annotations, which can be demanding. Taking inspiration from social cognition, we leverage a generative estimator to infer emotion cause words from utterances with no word-level labels. We also introduce a novel method based on pragmatics that makes dialogue models focus on targeted words in the input during generation. Our method is applicable to any dialogue model on the fly, with no additional training. We show our approach improves multiple best-performing dialogue agents at generating more focused empathetic responses, in terms of both automatic and human evaluation.


Introduction
Empathy is one of the hallmarks of social cognition. It is an intricate cognitive ability that requires high-level reasoning about others' affective states. The intensity of expressed empathy varies depending on the depth of reasoning. According to Sharma et al. (2020), weak empathy is accompanied by generic expressions such as "Are you OK?" or "It's just terrible, isn't it?", while stronger empathy reflects the other's specific situation: "How is your headache, any better?" or "You must be worried about the job interview". In order to respond with stronger empathy, two issues must be tackled: reasoning (i) where to focus in the interlocutor's utterance (i.e. the words expressing the reason behind the emotion) and (ii) how to generate utterances that focus on such words.
Firstly, which words should we focus on when empathizing with others? As empathy relates to the other's emotional states, the reasons behind emotions (emotion causes) should be identified. Imagine you are told "I got a gift from a friend last vacation!" with a joyful face. The likely words that can be the causes of his/her happiness are "gift" and "friend". On the other hand, "vacation" has less to do with the emotion. If you respond "How was your vacation?", the interlocutor may think you are not interested; rather, it is better to say "Wow, what was the gift?" or "Your friend must really like you." by focusing on the emotion cause words.
We humans do not rely on word-level supervision for such affective reasoning. Instead, we put ourselves in the other's shoes and simulate what it would be like. Perspective-taking is this act of considering an alternative point of view for a given situation. According to cognitive science, perspective-taking and simulation are key components in empathetic reasoning (Davis, 1983; Batson et al., 1991; Ruby and Decety, 2004). Taking inspiration from these concepts, we propose to train a generative emotion estimator for simulating the other's situation and identifying emotion cause words.
Secondly, after reasoning about which words to focus on, the problem of how to generate focused responses still remains. Safe responses that could fit any situation might hurt the other's feelings. Generated utterances need to convey that the speaker attends to the interlocutor's specific situation. Such communicative reasoning is studied in the field of computational pragmatics. The Rational Speech Acts (RSA) framework (Frank and Goodman, 2012) formulates communication between speaker and listener as probabilistic reasoning. It has been applied to many tasks to increase the informativeness of generated text grounded on inputs (Andreas and Klein, 2016; Fried et al., 2018; Cohn-Gordon and Goodman, 2019; Shen et al., 2019). That is, RSA allows the input to be more strongly reflected in the generated output.
However, controlling the RSA framework to reflect specific parts of the input remains understudied. We introduce a novel method for the RSA framework that makes models focus on targeted words in the interlocutor's utterance during generation. In summary, we recognize emotion cause words in dialogue utterances with no word-level labels and generate stronger empathetic responses focused on them without additional training. Our major contributions are as follows: (1) We identify emotion cause words in dialogue utterances by leveraging a generative estimator. Our approach requires no emotion cause labels other than the emotion label on the whole sentence, and outperforms other baselines.
(2) We introduce a new method of controlling the Rational Speech Acts framework (Frank and Goodman, 2012) to make dialogue models better focus on targeted words in the input context to generate more specific empathetic responses.
(3) For evaluation, we annotate emotion cause words in emotional situations from the validation and test set of EmpatheticDialogues dataset (Rashkin et al., 2019). We publicly release our EMOCAUSE evaluation set for future research.
(4) Our approach improves model-based empathy scores (Sharma et al., 2020) of three recent dialogue agents, MIME (Majumder et al., 2020), DodecaTransformer, and Blender (Roller et al., 2021), on EmpatheticDialogues. User studies also show that our approach improves human-rated empathy scores and is more preferred in A/B tests.

Related Work
Empathetic dialogue modeling. Incorporating user sentiment is one of the early attempts at empathetic conversation generation (Siddique et al., 2017; Shi and Yu, 2018). Rashkin et al. (2019) collect a large-scale English empathetic dialogue dataset named EmpatheticDialogues. The dataset is now adopted in other dialogue corpora such as DodecaDialogue and BST (Smith et al., 2020). As a result, pretrained large dialogue agents such as DodecaTransformer and Blender (Roller et al., 2021) now show empathizing capabilities. Empathy-specialized dialogue models are another stream of research. Diverse architectures have been adopted, including emotion recognition (Lin et al., 2020), mixture of experts (Lin et al., 2019), emotion mimicry (Majumder et al., 2020), and persona (Zhong et al., 2020). Li et al. (2020) use a lexicon to extract emotion-related words from utterances and feed them to a GAN-based agent.
We aim to improve both pretrained large dialogue agents and empathy-specialized ones by making them focus on emotion cause words in context.
Compared to prior emotion cause recognition tasks, we recognize emotion cause words with no word-level labels, using a generative estimator. Our method requires no labels other than the emotion label of the whole sentence. We then generate more specific empathetic responses focused on the recognized words.
Compared to previous uses of RSA, we propose an approach that can control models to focus on targeted words in the given input.

Identifying Emotion Cause Words with Generative Emotion Estimation
Our approach consists of two steps: (i) recognizing emotion cause words from utterances with no word-level labels ( §3), and (ii) generating empathetic responses focused on those words ( §4). In this section, we first train a generative emotion estimator to identify emotion cause words.

Why Generative Emotion Estimator?
We leverage a generative model by taking inspiration from perspective-taking (i.e. simulating oneself in the other's shoes) to reason about emotion causes without requiring word-level labels. Our idea is to estimate the emotion cause weight of each word in the utterance while satisfying the following three desiderata.
(1) Do not require word-level supervision for learning to identify emotion cause words in the utterances. Humans do not need word-level labels to infer the probable causes associated with the other's emotion during conversation.
(2) Simulate the observed interlocutor's situation within the model. Simulation theory (ST) from cognitive science explains that this mental imitation helps understanding the internal mental states of others (Gallese et al., 2004). Much evidence for ST is found from neuroscience including mirror neurons (Rizzolatti and Craighero, 2004), action-perception coupling (Decety and Chaminade, 2003), and empathetic perspective-taking (Ruby and Decety, 2004).
(3) Reason other's internal emotional states in Bayesian fashion. Studies from cognitive science argue that human reasoning of other's affective states and minds can be described via Bayesian inference (Griffiths et al., 2008;Ong et al., 2015;Saxe and Houlihan, 2017;Ong et al., 2019).
Interestingly, a generative emotion estimator (GEE), which models P(C, E) = P(E)P(C|E) with text sequence (e.g. context) C and emotion E, satisfies all the above conditions. First, the generative estimator computes the likelihood of C by generating C given E, which can be viewed as a simulation of C. Second, it estimates P(E|C) via Bayes' rule. Finally, the association between the emotion estimate and each word comes for free via the likelihood of each word, without using any word-level supervision. We use BART (Lewis et al., 2020) to implement the GEE.

Training to Model Emotional Situations
Dataset. To train our GEE, we leverage EmpatheticDialogues (Rashkin et al., 2019), a multi-turn English dialogue dataset where the speaker talks about an emotional situation and the listener expresses empathy. An example is shown in Table 1. The emotion and the situation sentence are only visible to the speaker. Situations are collected beforehand by asking annotators to recall experiences related to a given emotion label. The dataset includes a rich suite of 32 emotion labels that are evenly distributed.

Training. Given an emotion label E, GEE is trained to generate its corresponding emotional situation C = {w_1, ..., w_T}, where w_i is a word. As a result, our GEE learns the joint probability P(C, E). The trained GEE shows a perplexity of 13.6 on the test situations of EmpatheticDialogues.
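The training setup above can be sketched as preparing (emotion, situation) pairs for a seq2seq model, where the emotion label is the encoder input and the situation is the decoder target. This is a minimal sketch of the data formatting only; the `<emotion>` marker format is an assumption of this sketch, not the paper's exact scheme, and tokenizer/BART specifics are omitted.

```python
def make_gee_training_pairs(examples):
    # Encoder input: the emotion label (as a hypothetical "<emotion>" marker);
    # decoder target: the situation sentence. Training on such pairs lets a
    # seq2seq model estimate P(C|E).
    return [("<{}>".format(ex["emotion"]), ex["situation"]) for ex in examples]

pairs = make_gee_training_pairs([
    {"emotion": "joyful", "situation": "I got a gift from a friend last vacation!"},
])
# pairs[0] -> ("<joyful>", "I got a gift from a friend last vacation!")
```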

Recognizing Emotions
Once trained, GEE can predict P(E|C = c) for a word sequence c (e.g. an utterance) using Bayes' rule:

P(E | C = c) = P(C = c | E) P(E) / Σ_e P(C = c | E = e) P(E = e).   (1)

We compute the likelihood P(C = c|E) with GEE's generative ability as described in §3.1. Since emotions in EmpatheticDialogues are almost evenly distributed, we set the prior P(E) to a uniform distribution. Finally, we find the emotion with the highest likelihood of the given sequence c.
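Eq. (1) with a uniform prior can be sketched as a softmax over per-emotion sequence log-likelihoods. The toy log-likelihood numbers below are hypothetical stand-ins for GEE's actual log P(c|E) scores.

```python
import math

def recognize_emotion(loglik_by_emotion):
    """Posterior P(E|c) from log P(c|E) per emotion, assuming a uniform
    prior P(E) as in Eq. (1)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(loglik_by_emotion.values())
    z = sum(math.exp(v - m) for v in loglik_by_emotion.values())
    return {e: math.exp(v - m) / z for e, v in loglik_by_emotion.items()}

# Hypothetical log-likelihoods, not real GEE outputs.
posterior = recognize_emotion({"joyful": -12.3, "sad": -15.1, "angry": -16.0})
best = max(posterior, key=posterior.get)  # -> "joyful"
```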
We comparatively report the emotion classification accuracy of GEE in Appendix.

Weakly Supervised Emotion Cause Word Recognition
We introduce how GEE can recognize emotion cause words based solely on emotion labels, without word-level annotations. For a given word sequence c = {w_1, w_2, ..., w_T} (e.g. an utterance), GEE can reason about the association of each word w_t in the sequence c with the recognized emotion ê in Bayesian fashion:

P(E = ê | W = w_t) = P(W = w_t | E = ê) P(E = ê) / Σ_{e ∈ E} P(W = w_t | E = e) P(E = e).   (2)

The emotion likelihood is computed as

P(W = w_t | E) = E_{w_<t}[ P(w_t | w_<t, E) ],   (3)

where w_<t is the partial utterance up to time step t − 1. Since computing the expectation over all possible partial utterances w_<t is intractable, we approximate it with a single sample. We build the set E to include ê and the emotions with the two lowest probabilities P(E|C = c) when recognizing the emotion in Eq. (1). We assume the marginal P(W) is uniform, so the posterior in Eq. (2) is proportional to the association P(W = w_t | E = ê). We choose the top-k words scored by GEE as emotion cause words, and focus on them during empathetic response generation.
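The per-word scoring above can be sketched by contrasting each token's likelihood under the recognized emotion against the contrast set E. The `token_probs` table below is a hypothetical stand-in for GEE's single-sample approximation of P(w_t | w_<t, E) in Eq. (3).

```python
def cause_word_scores(tokens, token_probs, target, contrast):
    """Score each token's association with `target` following Eq. (2),
    with a uniform prior over the set E = {target} + contrast emotions.
    token_probs[e][t] stands in for GEE's P(w_t | w_<t, E=e)."""
    emotions = [target] + list(contrast)
    scores = []
    for t in range(len(tokens)):
        z = sum(token_probs[e][t] for e in emotions)
        scores.append(token_probs[target][t] / z)
    return scores

def top_k_cause_words(tokens, scores, k):
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in order[:k]]

# Toy per-token probabilities (hypothetical numbers).
tokens = ["i", "got", "a", "gift", "from", "a", "friend"]
probs = {
    "joyful": [0.1, 0.2, 0.3, 0.40, 0.2, 0.3, 0.50],
    "angry":  [0.1, 0.2, 0.3, 0.05, 0.2, 0.3, 0.05],
    "afraid": [0.1, 0.2, 0.3, 0.05, 0.2, 0.3, 0.05],
}
scores = cause_word_scores(tokens, probs, "joyful", ["angry", "afraid"])
top2 = top_k_cause_words(tokens, scores, 2)  # -> words like "gift", "friend"
```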

Controlling the RSA framework for Focused Empathetic Responses
We introduce how to control the Bayesian Rational Speech Acts (RSA) framework (Frank and Goodman, 2012) to focus on targeted words in the context during response generation. We first preview the basics of RSA for dialogues (§4.1). We then present how to control RSA with word-level focus (§4.2), where our major contribution lies. Figure 1 shows an overview of our method.

The Rational Speech Acts Framework
Applying the RSA framework amounts to computing a posterior over the dialogue agent's output distribution at each time step. Hence, it is applicable to any existing pretrained dialogue agent on the fly, with no additional training. The RSA framework formulates communication as a reference game between speaker and listener. Based on a recursive Bayesian formulation, the speaker (i.e. the dialogue model) reasons about the listener's belief of what the speaker is referring to. We follow the approach of Kim et al. (2020) for adapting RSA to dialogues. Our goal here is to update a base speaker S_0 to a pragmatic speaker S_1 that focuses more on the emotion cause words in the dialogue context c (i.e. the dialogue history).
Base Speaker S_0. Let c and u_t denote the dialogue context and the output word of the model at time step t, respectively. The base speaker S_0 is a dialogue agent that outputs u_t given a dialogue context and partial utterance u_<t: S_0(u_t|c, u_<t). As described, one can use any dialogue model for S_0.
Pragmatic Listener L_0. The pragmatic listener is a posterior distribution over which dialogue context the speaker is referring to. It is defined in terms of the base speaker S_0 and a prior distribution p_t(C) over the contexts in Bayesian fashion:

L_0(C | u_≤t, p_t) ∝ S_0(u_t | C, u_<t)^β · p_t(C).   (4)

The shared world C is a finite set comprising the given dialogue context c and other contexts (coined as distractors) different from c. Our contribution lies in how to build the world C to endow the dialogue agent with the controllability to better focus on targeted words, which we discuss in §4.2. We update the prior p_{t+1}(C) with L_0 from time step t as follows: p_{t+1}(C) = L_0(C|u_≤t, p_t). β is the rationality parameter, which controls how much the base speaker's distribution is taken into account. We note that L_0 is simply a distribution computed in Bayesian fashion, not another separate model.

Pragmatic Speaker S_1. Integrating L_0 with S_0, we obtain the pragmatic speaker S_1:

S_1(u_t | c, u_<t) ∝ L_0(c | u_≤t, p_t)^α · S_0(u_t | c, u_<t).   (5)

Since the pragmatic speaker S_1 is forced to consider how its utterance is perceived by the listener (via L_0), it favors words that have a higher likelihood under the given context c than under the other contexts in the shared world C. Similar to Eq. 4, α is the rationality parameter for S_1.
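One decoding step of the listener-speaker recursion (Eqs. 4-5) can be sketched with toy probability tables. The `s0` table below stands in for the base speaker's next-word distribution per context; the numbers are hypothetical.

```python
def pragmatic_speaker(s0, prior, true_ctx, alpha, beta):
    """One RSA decoding step: compute S1(w | true_ctx) from the base
    speaker distributions s0[ctx][w] and a prior over contexts."""
    s1 = {}
    for w in s0[true_ctx]:
        # L0(true_ctx | w) ∝ S0(w | ctx)^beta * p(ctx), Eq. (4)
        z = sum((s0[c][w] ** beta) * prior[c] for c in s0)
        l0 = (s0[true_ctx][w] ** beta) * prior[true_ctx] / z
        # S1(w) ∝ L0(true_ctx | w)^alpha * S0(w | true_ctx), Eq. (5)
        s1[w] = (l0 ** alpha) * s0[true_ctx][w]
    z = sum(s1.values())
    return {w: v / z for w, v in s1.items()}

# Toy world: the true context and one distractor; "flu" is far more
# likely under the true context than under the distractor.
s0 = {"ctx": {"flu": 0.3, "nice": 0.7},
      "distractor": {"flu": 0.05, "nice": 0.95}}
prior = {"ctx": 0.5, "distractor": 0.5}
s1 = pragmatic_speaker(s0, prior, "ctx", alpha=1.0, beta=1.0)
# s1 boosts "flu" relative to the base speaker's 0.3
```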

Endowing Word-level Control for RSA to Focus on Targeted Words in Context
We aim to make dialogue models focus on targeted words from the input (i.e. the dialogue context) during generation via the shared world C. The shared world C consists of the given dialogue context c and other distractor contexts. It is used for computing the likelihood of the given context c in Eq. 4. Previous applications of RSA in NLP manually (or randomly) select pieces of text (e.g. sentences) entirely different from the given input (Cohn-Gordon and Goodman, 2019; Shen et al., 2019; Kim et al., 2020). In our setting, this means distractors would be totally different contexts from c in the dataset. For example, given the context "I got a gift from my friend.", a distractor might be "Today, I have an exam at school.". Although this type of distractor helps improve the specificity of the model's generated outputs, it is difficult to finely control which words the models should be specific about.
Our core idea is to build distractors by replacing the emotion cause words in c with different words via sampling with GEE. It can enhance the controllability of the RSA by making models focus on targeted words (e.g. emotion cause words recognized by GEE) from the dialogue context.
For a dialogue context c = {w_1, ..., w_T} where w_i is a word, GEE outputs the top-k emotion cause words for the recognized emotion ê from context c, denoted by W_gee. Next, we concatenate the n least likely emotions from GEE with the context c with the top-k emotion cause words removed: [ê_{-1}, ..., ê_{-n}; c − W_gee], which is input to GEE. We then sample different words (w̃_i, w̃_j, ..., w̃_k) from GEE's output in place of W_gee to construct a distractor c̃. For example, given a context c "I was sick from the flu" and "sick, flu" as the top-2 emotion cause words, a sampled distractor c̃ can be "I was laughing from the relief". We use these altered contexts {c̃_1, ..., c̃_i} as distractors for the shared world C in the pragmatic listener L_0 (Eq. 4). We set n and the cardinality of the world C to 3 (i.e. C = {c, c̃_1, c̃_2}). We run experiments to find the best k (= 5) (see Appendix).
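The distractor construction above can be sketched as swapping out the recognized cause words. The `sample_replacement` callable below is a hypothetical stand-in for sampling from GEE conditioned on the least-likely emotions and the remaining context.

```python
import random

def build_distractors(tokens, cause_words, sample_replacement,
                      n_distractors=2, seed=0):
    """Build distractor contexts by replacing each emotion cause word
    with a sampled alternative; all other words are kept unchanged."""
    rng = random.Random(seed)
    distractors = []
    for _ in range(n_distractors):
        distractors.append([sample_replacement(w, rng) if w in cause_words else w
                            for w in tokens])
    return distractors

# Deterministic toy sampler standing in for GEE's sampled replacements.
swap = {"sick": "laughing", "flu": "relief"}
d = build_distractors("i was sick from the flu".split(),
                      {"sick", "flu"},
                      lambda w, rng: swap[w])
# d[0] -> ["i", "was", "laughing", "from", "the", "relief"]
```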
The only difference between the original context c and a sampled distractor c̃ is those emotion cause words. The pragmatic speaker S_1 (Eq. 5) prefers to generate words that have a higher likelihood under the given context c (including the original emotion cause words W_gee) than under the distractor context c̃. As a result, the pragmatic agent can generate responses focused on the emotion cause words.

EmoCause Evaluation Set

Annotators are required to have a minimum of 1,000 HITs, a 95% HIT approval rate, and to be located in one of [AU, CA, GB, NZ, US]. We pay the annotators $0.15 per description. To further ensure quality, only annotators who pass a qualification test are invited to annotate. Nevertheless, speculations about emotion causes are subjective and can vary among annotators. Therefore, we use only unanimously selected words (i.e. those earning all three votes) to ensure maximum objectivity.

Analysis
We analyze the characteristics of the emotion cause words in our EMOCAUSE evaluation set. In Table 3 and Figure 2, we compare the basic statistics of our annotation set with RECCON (Poria et al., 2020), an English dialogue dataset that annotates emotion cause spans on DailyDialog (Li et al., 2017) and IEMOCAP (Busso et al., 2008) with a total of 8 emotions. Since our EMOCAUSE is based on emotional situations from an empathetic dialogue dataset (Rashkin et al., 2019), emotion causes play a more important role than in the casual conversations of RECCON. While 74% of RECCON's labels belong to a single emotion, happy, EMOCAUSE provides a balanced range of 32 emotion labels. Therefore, our evaluation set presents a wider variety than RECCON. Table 4 shows some examples of the annotated emotion cause words. Table 5 reports the most frequent cause words for some emotions. We find "embarrassing" events happen frequently in toilets and in front of people. "Proud" and "disappointed" are closely related to children. Interestingly, phones are associated with "trusting", which may be due to smartphones containing sensitive personal information. More examples and results can be found in the Appendix.

Experiments
We first evaluate our generative emotion estimator (GEE) on weakly-supervised emotion cause word recognition ( §6.2). We then show our new controlling method for the RSA framework can improve best performing dialogue agents to generate more empathetic responses by better focusing on targeted emotion cause words ( §6.3).

Datasets and Experiment Setting
EmpatheticDialogues (ED) (Rashkin et al., 2019). ED is an English empathetic dialogue dataset with 32 diverse emotion types (§3.2). The task is to generate empathetic responses (i.e. responses from the listener's side in Table 1) given only the dialogue context (i.e. history), without emotion labels and situation descriptions. It contains 24,850 conversations partitioned into training, validation, and test sets by 80%, 10%, and 10%, respectively. We additionally annotate cause words for the given emotion for all situations in the validation and test sets of EmpatheticDialogues (§5).
EmoCause (§5). We compare our GEE with four methods that can recognize emotion cause words with no word-level annotations: random, RAKE (Rose et al., 2010), EmpDG (Li et al., 2020), and BERT (Devlin et al., 2019). For random, we randomly choose words as emotion causes. RAKE is an automatic keyword extraction algorithm based on word frequency and degree of co-occurrence. EmpDG leverages a rule-based method for capturing emotion cause words using EmoLex (Mohammad and Turney, 2013), a large-scale lexicon of emotion-relevant words. Finally, we train BERT for emotion classification with the emotion labels in ED. For BERT, we select the words with the largest averaged weight of BERT's last attention heads for the classification token (i.e. [CLS]). More details can be found in the Appendix.
Dialogue models for base speakers. We experiment with our approach on three recent dialogue agents: MIME (Majumder et al., 2020), DodecaTransformer, and Blender (Roller et al., 2021). MIME is a dialogue model explicitly targeting empathetic conversation by leveraging emotion mimicry. We select MIME since it reportedly performs better than other recent empathy-specialized models (Rashkin et al., 2019).

Automatic evaluation metrics. For weakly-supervised emotion cause word recognition, we report the Top-1, 3, 5 recall scores. For EmpatheticDialogues, we report coverage and two scores for specific empathy expressions (Exploration, Interpretation) measured by pretrained empathy identification models (Sharma et al., 2020). The coverage score refers to the average number of emotion cause words included in the model's generated response.
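A minimal sketch of the coverage score, i.e. the average number of emotion cause words included in a response. The simple case-insensitive token match below is an assumption of this sketch; the exact matching rule may differ.

```python
def coverage(response, cause_words):
    """Count how many of the given emotion cause words appear in the
    generated response (case-insensitive whitespace-token match)."""
    tokens = set(response.lower().split())
    return sum(1 for w in cause_words if w.lower() in tokens)

def avg_coverage(responses, cause_word_lists):
    """Coverage averaged over a set of (response, cause words) pairs."""
    return sum(coverage(r, ws) for r, ws in zip(responses, cause_word_lists)) \
        / len(responses)

c = coverage("I'm a big believer in intuition too",
             ["believer", "intuition", "spot"])  # -> 2
```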
The (i) Exploration and (ii) Interpretation are metrics for expressed empathy in text, introduced by Sharma et al. (2020). Both require responses to focus on the interlocutor's utterances and to be specific. (i) Explorations are expressions of active interest in the interlocutor's situation, such as "What happened?" or "So, did you pass the chemistry exam?". The latter is rated as a stronger empathetic response since it asks specifically about the interlocutor's situation. (ii) Interpretations are expressions of acknowledgment or understanding of the interlocutor's emotion or situation, such as "I know your feeling." or "I also had to speak in front of such audience, made me nervous." Expressions of specific understanding are considered to be more empathetic. RoBERTa models (Liu et al., 2019) that are separately pretrained for each metric rate each agent's response by returning values of 0, 1, or 2. Higher scores indicate stronger empathy.

As shown in Table 6, learning-based methods (i.e. BERT and GEE) perform better. Selecting words by BERT's attention weights does not attain better performance at capturing emotion cause words than GEE. The gap between GEE and the other methods widens when the number of words returned by the models is more than one (i.e. Top-3, 5).

Weakly-Supervised Emotion Cause Word Recognition
We also evaluate human performance to measure the difficulty of the task. We randomly sample 100 examples from the test set and ask a human evaluator to select the five best guesses for the emotion causes. As the performance gap between GEE and humans is significantly large, there is much room for further improvement in weakly-supervised emotion cause recognition.

Empathetic Response Generation
Results on Automatic Evaluation. Table 7 reports the performance of different dialogue agents on EmpatheticDialogues (Rashkin et al., 2019) with automatic evaluation metrics. Our Focused S_1 significantly outperforms the base model S_0 in terms of Interpretation and Exploration scores, which measure more focused and specific empathetic expression. We also test the plain pragmatic method (Plain S_1) that uses random distractors as in previous works (Cohn-Gordon et al., 2018; Kim et al., 2020). The Focused S_1 consistently outperforms Plain S_1 on the Interpretation score with similar or better Exploration scores. The Focused S_1 models also show higher coverage scores than the other models, indicating they reflect the context's emotion cause words in responses more. As MIME is only trained on EmpatheticDialogues, its Exploration and Interpretation scores are lower than those of models pretrained on other, larger corpora. As a result, we find our approach is effective for both large pretrained open-domain dialogue models and empathy-specialized ones.

We also finetune DodecaTransformer and Blender with explicit emotion information (S_0+Emotion). Following Rashkin et al. (2019), we concatenate the ground-truth emotion label to the dialogue context during training. At inference, the top predicted emotion from GEE is used. We find the Interpretation or Exploration scores of S_0+Emotion models drop. Thus, simply adding emotion information is insufficient to make models focus more on the interlocutor's emotional event.

Model                                    Win     Lose    Tie
MIME (Majumder et al., 2020)
  Focused S1 vs S0                       46.7%   20.0%   33.3%
DodecaTransformer
  Focused S1 vs S0                       42.1%   28.8%   29.1%
Blender (Roller et al., 2021)
  Focused S1 vs S0                       44.6%   37.4%   18.0%
Table 9: Comparison of our approach (Focused S_1) with base speakers (S_0) on the A/B test for empathetic response generation. The win and lose rates are based on Focused S_1.

Model                                    Win     Lose    Tie
Blender (Roller et al., 2021)
  Focused S1 vs Plain S1                 39.6%   32.3%   28.0%
Table 10: Comparison of our approach (Focused S_1) with the plain pragmatic approach (Plain S_1) on the A/B test for reflecting the given emotion cause words. The win and lose rates are based on Focused S_1.

Emotion: Angry
Context: I just moved to this neighborhood and some dumb criminals shot my neighbors and ran into the woods!
Emotion cause words from GEE: dumb, neighbors, shot, moved, ran
S0: That's terrible. Did you get a chance to talk to them?
Plain S1: Oh my god, that is horrible! Where are they?
Focused S1: Oh my god, that is horrible! Did they get caught?

Emotion: Trusting
Context: I'm a big believer in listening to your gut. Sometimes I might wonder where my intuition is leading me, but it always seems to be spot on.
Emotion cause words from GEE: believer, leading, intuition, wonder, spot
S0: That's great. I'm glad you have a good intuition.
Plain S1: Good for you. I have a gut feeling about a lot of things.
Focused S1: That's great! I'm a big believer in intuition too.

Emotion: Proud
Context: We were living overseas at the times - the most impressive part was crossing the Grand Canal over a pontoon of boats to finish the marathon ;-)
Emotion cause words from GEE: impressive, Canal, Grand, part, finish
S0: Wow, that sounds like a lot of fun! I'm glad you had a good time!
Plain S1: Wow, that sounds amazing! I bet you were so excited!
Focused S1: Wow, that sounds amazing! I've never been to Europe, but I've heard of it.

Table 11: Examples of recognized emotion cause words from our GEE and responses from S_0 and Focused S_1. We underline words where our Focused S_1 reflects the emotion cause words returned by GEE.
Results on Human Evaluation. We conduct a user study and A/B tests via Amazon Mechanical Turk. We randomly sample 100 test examples, each rated by three unique human evaluators. Following previous works (Rashkin et al., 2019; Lin et al., 2019; Majumder et al., 2020), we rate the empathy, relevance, and fluency of generated responses.
Given the dialogue context and the model's generated response, evaluators are asked to rate each criterion on a 4-point Likert scale, where higher scores are better. We also run a human A/B test to directly compare the Focused S_1 and base S_0. We ask three unique human evaluators to vote on which response is more empathetic. They can select tie if both responses are thought to be equal. Tables 8 and 9 summarize the averaged human ratings and A/B test results for MIME (Majumder et al., 2020), DodecaTransformer, and Blender (Roller et al., 2021). Our Focused S_1 agents are rated more empathetic and relevant to the dialogue context than the base agent S_0, with better fluency. Also, users prefer responses from our Focused S_1 agent over those from the base agent S_0. The inter-rater agreement (Krippendorff's α) for the human rating and A/B test is 0.26 and 0.27, respectively, implying fair agreement.
In addition to the coverage score in Table 7, we run an A/B test on Blender (Roller et al., 2021) to compare the Focused S_1 and Plain S_1 for reflecting the given emotion cause words in the responses. We randomly sample 200 test examples and ask three unique human evaluators to vote on which response is more focused on the given emotion cause words from the context. Table 10 shows the result of the A/B test for focused response generation on Blender (Roller et al., 2021). Users rate that responses from Focused S_1 reflect the emotion cause words more than those from the Plain S_1 approach. Thus, both quantitative and qualitative results show that our Focused S_1 approach helps dialogue agents effectively generate responses focused on given target words.
Examples of the recognized emotion cause words from GEE and generated responses are in Table 11. Our Focused S 1 agent's responses reflect the context's emotion cause words returned from our GEE, implicitly or explicitly.

Conclusion
We studied how to use a generative estimator for identifying emotion cause words in utterances based solely on emotion labels, without word-level labels (i.e. weakly-supervised emotion cause word recognition). To evaluate our approach, we introduced the EMOCAUSE evaluation set, in which we manually annotated emotion cause words on situations in EmpatheticDialogues (Rashkin et al., 2019). We release the evaluation set to the public for future research. We also proposed a novel method for controlling the Rational Speech Acts (RSA) framework (Frank and Goodman, 2012) to make models generate empathetic responses focused on targeted words in the dialogue context. Since the RSA framework requires no additional training, our approach is orthogonally applicable to any pretrained dialogue agent on the fly. An interesting direction for future work will be reasoning about how the interlocutor would react to the model's empathetic response. Such reasoning is an essential part of expressing empathy.

A Implementation Details
Weakly-supervised emotion cause word recognition. We use rake-nltk to implement RAKE (Rose et al., 2010), and the official code of EmpDG from the authors (Li et al., 2020). We respectively finetune BERT-base-uncased (Devlin et al., 2019) for BERT-Attention and BART-large (Lewis et al., 2020) for our generative emotion estimator (GEE). We set the learning rate to 3e-5 for BERT-Attention and 1e-5 for GEE. Other than the learning rate, we follow the default hyperparameters in the ParlAI framework (Miller et al., 2017). We select the best performing checkpoint using the Top-1 recall for emotion cause word recognition on the validation set. We run experiments 5 times with different random seeds and report averaged scores in Table 6.
Dialogue models. We use MIME (Majumder et al., 2020), DodecaTransformer, and Blender 90M (Roller et al., 2021) as dialogue models for base speakers. For MIME, we use the code and pretrained weights of the authors' official implementation as is. For DodecaTransformer and Blender, we use the ParlAI framework with the default hyperparameters and finetune them on EmpatheticDialogues (Rashkin et al., 2019). We select the best performing checkpoint via perplexity on the validation set.
During inference, we use greedy decoding and set the RSA parameters α and β to 2.0 and 0.9 for MIME, 3.0 and 0.9 for DodecaTransformer, and 4.0 and 0.9 for Blender. We select the best performing α and β from the candidates [1.0, 2.0, 3.0, 4.0] and [0.5, 0.6, 0.7, 0.8, 0.9, 1.0], with one trial for each. Inference on the test set of EmpatheticDialogues takes 0.4 hours with the Blender 90M base speaker.
Evaluation metrics. To compute the Exploration and Interpretation scores (Sharma et al., 2020), we separately finetune RoBERTa-base for each score using the authors' official code.
Sensitivity to k of top-k emotion cause words. In all experiments, we use k = 5, which is found by validation with k = 1, 2, 4, 8 using Blender (Roller et al., 2021). Experiments for emotion cause word recognition and emotion classification are run on one NVIDIA Quadro RTX 6000 GPU. Experiments for empathetic response generation are run on two GPUs.

B Emotion Classification
We report the classification performance of the emotion classifiers used in empathetic response generation.

Table 15 shows the Top-10 frequent cause words per emotion. Interestingly, the same words can be seen in both positive and negative emotions. For example, we can find the word interview in both "Anxious" and "Confident". "Anticipating" and "Disappointed" are closely related to vacation. This result shows that understanding the context is one of the key prerequisites for emotion cause word recognition.

Emotion: Surprised
We just got a new puppy . My older dog knew to let that one out first when I get home from work .

Emotion: Faithful
My boyfriend is going out with a bunch of people I don't know tonight. But I trust him that he will be a good boy.